Poorly Configured Autoscaling on Databricks Clusters
Nicole Boyd
Service Category
Compute
Cloud Provider
Databricks
Service Name
Databricks Compute
Inefficiency Type
Inefficient Configuration
Explanation

Autoscaling is a core mechanism for aligning compute supply with workload demand, yet it's often underutilized or misconfigured. In older clusters or ad-hoc environments, autoscaling may be disabled by default or configured with min/max worker limits so close together that the cluster has little room to scale. This can lead to persistent overprovisioning (and wasted cost during idle periods) or underperformance due to insufficient parallelism and job queuing. Poor autoscaling settings are especially common in manually created all-purpose clusters, where idle resources often go unnoticed.

Overly wide autoscaling ranges can also introduce instability: Databricks may rapidly scale up to the upper limit if demand briefly spikes, leading to cost spikes or degraded performance. Understanding workload characteristics is key to tuning autoscaling appropriately.
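To make the configuration concrete, the sketch below shows a cluster spec with a bounded autoscale range. The field names mirror the Databricks Clusters API, but the cluster name, runtime version, node type, and range values are illustrative assumptions, not recommendations for any specific workload.

```python
# Hypothetical cluster spec with a narrow, workload-informed autoscale range.
# Field names follow the Databricks Clusters API; values are assumptions.
cluster_spec = {
    "cluster_name": "nightly-etl",
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    # Bounded range: enough headroom for peaks, no room to over-expand.
    "autoscale": {"min_workers": 2, "max_workers": 8},
    # Release idle capacity promptly instead of paying for an idle cluster.
    "autotermination_minutes": 30,
}
```

A fixed-size cluster would instead set `"num_workers": N` and omit the `autoscale` block entirely, which is exactly the pattern the detection steps below look for.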

Relevant Billing Model

Databricks charges based on Databricks Units (DBUs) per node per hour. When autoscaling is misconfigured — such as by fixing the worker count (Min = Max), setting narrow ranges, or disabling it entirely — clusters may remain overprovisioned during idle periods or underprovisioned during peak demand, resulting in inefficient compute spend or job failures.
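The DBU billing model makes the cost of a fixed-size cluster easy to estimate. The arithmetic below compares a cluster pinned at eight workers against one that autoscales to an average of three; the DBU rate, price per DBU, and utilization figures are all hypothetical and chosen only to illustrate the calculation.

```python
# Illustrative cost comparison; all rates and utilization figures are assumed.
DBU_PER_NODE_HOUR = 0.75   # assumed DBU consumption rate for the node type
DOLLARS_PER_DBU = 0.55     # assumed price per DBU for the workload tier

def monthly_cost(avg_nodes, hours=730):
    """Approximate monthly DBU spend for a cluster averaging avg_nodes workers."""
    return avg_nodes * hours * DBU_PER_NODE_HOUR * DOLLARS_PER_DBU

fixed = monthly_cost(8)       # fixed-size cluster pinned at 8 workers
autoscaled = monthly_cost(3)  # autoscaling cluster averaging 3 workers
print(f"fixed: ${fixed:,.2f}  autoscaled: ${autoscaled:,.2f}")
```

Under these assumed rates, the pinned cluster costs roughly 2.7x the autoscaled one for the same month of operation, which is the gap the remediation steps below aim to close.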

Detection
  • Identify clusters with identical min and max worker values (fixed size)
  • Review clusters with low task parallelism or frequent job queuing
  • Check for long idle durations on high-capacity clusters
  • Analyze job failures caused by memory or executor limits
  • Use the Spark UI or Ganglia metrics to monitor how often clusters scale up or down
  • Look for consistent underuse of allocated workers during job execution
  • Review logs or metrics for unexpected or excessive scale-up behavior
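The first detection step — finding clusters with identical min and max workers — can be scripted against cluster metadata. The sketch below works on plain dicts shaped like Databricks Clusters API responses; the cluster names and the exact field availability are assumptions to verify against the API version in use.

```python
# Hedged sketch: flag clusters that cannot scale, from cluster-config dicts
# shaped like Databricks Clusters API responses (an assumption to verify).

def flag_misconfigured(clusters):
    """Return names of clusters with autoscaling disabled or min == max."""
    flagged = []
    for c in clusters:
        autoscale = c.get("autoscale")
        if autoscale is None:
            flagged.append(c["cluster_name"])  # fixed num_workers, no autoscale
        elif autoscale["min_workers"] == autoscale["max_workers"]:
            flagged.append(c["cluster_name"])  # range collapsed to a fixed size
    return flagged

# Hypothetical sample data illustrating the three cases.
sample = [
    {"cluster_name": "etl-fixed", "num_workers": 8},
    {"cluster_name": "adhoc", "autoscale": {"min_workers": 2, "max_workers": 2}},
    {"cluster_name": "reporting", "autoscale": {"min_workers": 2, "max_workers": 10}},
]
print(flag_misconfigured(sample))  # ['etl-fixed', 'adhoc']
```

In practice the input would come from the workspace's cluster list endpoint rather than a hardcoded sample, and flagged fixed-size clusters should be cross-checked against the remediation guidance below — some of them are deliberately fixed for predictable workloads.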
Remediation
  • Use autoscaling for variable workloads, but avoid overly wide min/max ranges that allow clusters to over-expand. Databricks may aggressively scale up if limits are too high, leading to cost spikes and instability.
  • For predictable, recurring jobs with stable compute requirements, consider using fixed-size clusters to avoid the cost and time of scaling transitions.
  • Tune autoscaling thresholds based on real workload behavior. Start narrow and adjust iteratively, based on runtime performance and cluster utilization.
  • Establish cluster policies to enforce sensible autoscaling defaults or require justification for disabling autoscaling entirely.
  • Regularly review cluster usage patterns to refine scaling decisions. While some tools can automate this, most teams can start with basic monitoring of scaling events and job runtimes.
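A cluster policy that enforces autoscaling defaults might look like the following sketch. The attribute paths follow the Databricks cluster-policy definition format, but the specific bounds and the auto-termination window are assumptions that should be tuned to the team's workloads and validated against current policy documentation.

```python
import json

# Hedged sketch of a cluster policy enforcing bounded autoscaling.
# Attribute paths follow the Databricks cluster-policy definition format
# (an assumption to verify); the numeric bounds are illustrative only.
policy = {
    # Require autoscaling with a floor of 1-4 workers...
    "autoscale.min_workers": {"type": "range", "minValue": 1, "maxValue": 4},
    # ...and a ceiling of at most 20, so clusters cannot over-expand.
    "autoscale.max_workers": {"type": "range", "minValue": 2, "maxValue": 20},
    # Cap idle time so forgotten all-purpose clusters terminate themselves.
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
}
print(json.dumps(policy, indent=2))
```

Because the policy constrains `autoscale.min_workers` and `autoscale.max_workers` rather than `num_workers`, clusters created under it must use autoscaling; teams with genuinely stable workloads would need a separate policy permitting fixed sizes.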