On-Demand-Only Configuration for Non-Production Databricks Clusters
Benjamin van der Maas
Service Category: Compute
Cloud Provider: Databricks
Service Name: Databricks Clusters
Inefficiency Type: Suboptimal Pricing Model
Explanation

In non-production environments such as development, testing, and experimentation, many teams default to on-demand nodes out of habit or caution. However, Databricks offers built-in support for using spot instances safely. Its job scheduler and cluster management system are designed to detect spot evictions and automatically replace the lost nodes, falling back to on-demand capacity when necessary, which makes spot compute relatively low-risk for these workloads.

Failing to enable spot for non-critical or short-lived workloads leads to unnecessary overspend. The inefficiency often arises because spot usage is not enabled by default and must be explicitly selected in cluster settings. In teams that don’t revisit infrastructure defaults, or where FinOps guardrails are missing, the result is a persistent gap between actual spend and what the same workloads would cost with spot enabled.

Relevant Billing Model

Databricks clusters are billed based on the underlying virtual machines used for driver and worker nodes. When on-demand instances are selected, VM charges follow standard cloud provider rates. If spot instances are enabled (where available), VM costs can be significantly lower, often 60–90% cheaper. The discount applies to the cloud provider's VM charges only; Databricks DBU charges are billed separately and are not reduced by spot pricing. Databricks includes native failover capabilities that automatically replace preempted spot nodes with on-demand nodes to maintain job continuity, minimizing the impact of eviction risk.
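
To judge whether the saving is material, it can help to estimate it per cluster. The sketch below is a rough model, not a pricing calculator: the function name, the rates, and the 70% discount are all hypothetical, and it assumes the driver stays on-demand while workers run on spot. It also assumes that Databricks DBU charges are billed on top of VM cost and are unaffected by spot pricing, so the overall saving is smaller than the VM discount.

```python
def estimate_hourly_cost(num_workers, vm_rate, dbu_cost_per_node, spot_discount=0.0):
    """Estimate hourly cluster cost with the driver kept on-demand.

    spot_discount is the fraction taken off the worker VM rate (0.0 = on-demand).
    All inputs are illustrative; real rates vary by cloud, region, and instance type.
    """
    driver_vm = vm_rate
    worker_vm = num_workers * vm_rate * (1.0 - spot_discount)
    # DBU charges apply to every node and do not change with spot pricing.
    dbu = (num_workers + 1) * dbu_cost_per_node
    return driver_vm + worker_vm + dbu


# Hypothetical rates, for illustration only.
on_demand = estimate_hourly_cost(4, vm_rate=0.50, dbu_cost_per_node=0.30)
with_spot = estimate_hourly_cost(4, vm_rate=0.50, dbu_cost_per_node=0.30, spot_discount=0.70)
savings_pct = 100 * (on_demand - with_spot) / on_demand
print(f"on-demand ${on_demand:.2f}/h, spot ${with_spot:.2f}/h, saving {savings_pct:.0f}%")
```

With these made-up numbers, a 70% discount on worker VMs works out to roughly a 35% saving on the whole cluster, because the driver and the DBU portion are unchanged.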

Detection
  • Identify Databricks clusters that are not configured to use spot instances
  • Filter for non-production environments (e.g., dev, test, staging) where eviction risk is acceptable
  • Review the duration and criticality of jobs; short-lived or interruptible workloads are ideal candidates
  • Check whether fallback to on-demand is enabled in each cluster's availability setting, or enforced through cluster policies
  • Evaluate whether cost differences between current on-demand usage and spot alternatives are material
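
The first two detection steps can be sketched as a filter over cluster definitions. This is a minimal example under stated assumptions: it operates on dictionaries shaped like the entries returned by the Databricks Clusters API list call, it assumes environments are marked with a custom tag named `environment` (a hypothetical convention), and the helper names are made up. Clusters with no explicit availability value are skipped rather than guessed at, and pool-backed clusters would need a separate check against the pool's settings.

```python
# Availability values that mean "on-demand only", keyed by cloud attribute block.
ON_DEMAND_VALUES = {
    "aws_attributes": "ON_DEMAND",
    "azure_attributes": "ON_DEMAND_AZURE",
    "gcp_attributes": "ON_DEMAND_GCP",
}

# Assumed tag values for non-production environments.
NON_PROD_ENVS = {"dev", "test", "staging"}


def is_on_demand_only(cluster):
    """Return True if the cluster explicitly requests on-demand capacity."""
    for block, on_demand_value in ON_DEMAND_VALUES.items():
        attrs = cluster.get(block) or {}
        if attrs.get("availability") == on_demand_value:
            return True
    return False


def find_on_demand_non_prod(clusters, env_tag="environment"):
    """List names of non-production clusters that are on-demand only."""
    flagged = []
    for cluster in clusters:
        tags = cluster.get("custom_tags") or {}
        env = str(tags.get(env_tag, "")).lower()
        if env in NON_PROD_ENVS and is_on_demand_only(cluster):
            flagged.append(cluster.get("cluster_name", cluster.get("cluster_id")))
    return flagged


# Hand-written sample data; in practice this would come from the Clusters API.
sample_clusters = [
    {"cluster_name": "dev-etl", "custom_tags": {"environment": "dev"},
     "aws_attributes": {"availability": "ON_DEMAND"}},
    {"cluster_name": "dev-notebooks", "custom_tags": {"environment": "dev"},
     "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"}},
    {"cluster_name": "prod-reporting", "custom_tags": {"environment": "prod"},
     "aws_attributes": {"availability": "ON_DEMAND"}},
]

print(find_on_demand_non_prod(sample_clusters))
```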
Remediation
  • Enable spot instance usage for non-production clusters where workloads are resilient to interruption
  • Leverage Databricks’ native fallback-to-on-demand capabilities to preserve job continuity
  • Establish workspace-level defaults or templates that promote spot usage in dev/test clusters
  • Periodically audit compute configurations to detect persistent on-demand usage in non-critical environments
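
As a sketch of the first three remediation steps on AWS, the example below builds a cluster specification that requests spot capacity with fallback to on-demand, plus a cluster policy definition that pins the same setting for dev/test clusters. The `aws_attributes` fields follow the Databricks Clusters API; the function name, cluster name, and `environment` tag are hypothetical, and the Spark version and node type are left as caller-supplied parameters. Azure and GCP use their own attribute blocks and availability values, so this would need adapting there.

```python
import json


def spot_with_fallback_spec(name, spark_version, node_type_id, num_workers):
    """Build a cluster spec whose workers use spot with on-demand fallback."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "aws_attributes": {
            # Keep the first node (the driver) on on-demand capacity.
            "first_on_demand": 1,
            # Use spot for the rest, falling back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
        },
        # Assumed tagging convention for identifying non-production clusters.
        "custom_tags": {"environment": "dev"},
    }


# Cluster policy definition that fixes the availability setting for any
# cluster created under the policy.
DEV_SPOT_POLICY = {
    "aws_attributes.availability": {"type": "fixed", "value": "SPOT_WITH_FALLBACK"},
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 1},
}

print(json.dumps(DEV_SPOT_POLICY, indent=2))
```

Keeping the driver on-demand is a deliberate choice here: losing a worker is usually recoverable, while losing the driver ends the cluster's running work.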
Relevant Documentation
  • https://docs.databricks.com/clusters/configure.html
  • https://docs.databricks.com/clusters/instance-pools.html
  • https://docs.databricks.com/clusters/clusters-manage.html