On-Demand-Only Configuration for Non-Production Databricks Clusters

Benjamin van der Maas

CER:

CER-0098

Service Category

Compute

Cloud Provider

Databricks

Service Name

Databricks Clusters

Inefficiency Type

Suboptimal Pricing Model

Explanation

In non-production environments, such as development, testing, and experimentation, many teams default to on-demand nodes out of habit or caution. However, Databricks has built-in support for running clusters on spot instances safely. Its job scheduler and cluster manager detect spot evictions and replace the lost nodes automatically, falling back to on-demand nodes when necessary, which makes spot compute relatively low-risk for these workloads.

Failing to enable spot for non-critical or short-lived workloads leads to avoidable spend. The inefficiency often arises because spot usage is not enabled by default and must be explicitly selected in cluster settings. In teams that don’t revisit infrastructure defaults, or where FinOps guardrails are missing, the result is a persistent gap between what is actually spent and what the same workloads would cost if safely moved to spot.

Relevant Billing Model

Databricks cluster costs have two parts: Databricks platform charges, metered in Databricks Units (DBUs), and the cloud provider's charges for the virtual machines that run the driver and worker nodes. The choice between on-demand and spot affects the virtual machine portion. When on-demand instances are selected, that portion is billed at standard cloud provider rates. If spot instances are enabled (where available), it can be significantly lower, often 60–90% cheaper, although the actual discount varies by instance type, region, and time. Databricks includes native failover capabilities that automatically replace preempted spot nodes with on-demand nodes to maintain job continuity, which limits the impact of eviction.
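
The arithmetic behind this comparison can be sketched in a few lines of Python. This is a rough estimate under stated assumptions, not a pricing tool: the function name, the hourly rate, and the discount are hypothetical inputs chosen for illustration, and Databricks platform charges (DBUs) are left out on the assumption that the spot choice does not change them.

```python
def estimate_spot_savings(node_hours: float,
                          on_demand_rate: float,
                          spot_discount: float) -> dict:
    """Estimate VM cost on demand vs. on spot for a block of node-hours.

    node_hours     -- total worker node-hours in the period
    on_demand_rate -- on-demand price per node-hour (hypothetical input)
    spot_discount  -- fraction saved on spot, e.g. 0.75 for 75% (hypothetical input)
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    on_demand_cost = node_hours * on_demand_rate
    spot_cost = on_demand_cost * (1.0 - spot_discount)
    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "savings": on_demand_cost - spot_cost,
    }


# Illustrative numbers only: 1,000 node-hours at $0.50/hour with a 75% discount.
example = estimate_spot_savings(node_hours=1000, on_demand_rate=0.50, spot_discount=0.75)
print(example)
```

With these illustrative inputs, 1,000 node-hours cost $500 on demand and $125 on spot. Real discounts fluctuate, so an estimate like this is better run across a range of discount values than treated as a single figure.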

Detection

Identify Databricks clusters that are not configured to use spot instances
Filter for non-production environments (e.g., dev, test, staging) where eviction risk is acceptable
Review the duration and criticality of jobs; short-lived or interruptible workloads are ideal candidates
Check whether fallback to on-demand is enabled in the cluster configuration or in the policies that govern it
Evaluate whether cost differences between current on-demand usage and spot alternatives are material
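
The first two checks above can be sketched as a small Python filter over cluster definitions. This is a minimal sketch under several assumptions. The input is a list of cluster dictionaries shaped like the output of the Databricks Clusters API. The `availability` values are the on-demand settings as recalled from that API and should be verified against current documentation. The `environment` tag and its values are a hypothetical tagging convention. Clusters with no explicit `availability` are skipped rather than guessed at.

```python
from typing import Any

# Availability values that request on-demand capacity only (assumed from the
# Clusters API; verify against current documentation).
ON_DEMAND_VALUES = {"ON_DEMAND", "ON_DEMAND_AZURE", "ON_DEMAND_GCP"}
ATTRIBUTE_KEYS = ("aws_attributes", "azure_attributes", "gcp_attributes")

# Hypothetical tagging convention: custom_tags["environment"] marks non-production.
NON_PROD_ENVIRONMENTS = {"dev", "test", "staging"}


def find_on_demand_non_prod(clusters: list[dict[str, Any]]) -> list[str]:
    """Return names of non-production clusters explicitly set to on-demand only."""
    flagged = []
    for cluster in clusters:
        tags = cluster.get("custom_tags") or {}
        if tags.get("environment", "").lower() not in NON_PROD_ENVIRONMENTS:
            continue  # production or untagged: out of scope for this check
        for key in ATTRIBUTE_KEYS:
            attrs = cluster.get(key) or {}
            if attrs.get("availability") in ON_DEMAND_VALUES:
                flagged.append(cluster.get("cluster_name", "<unnamed>"))
                break
    return flagged


# Hand-written sample data, not real API output.
sample = [
    {"cluster_name": "dev-etl", "custom_tags": {"environment": "dev"},
     "aws_attributes": {"availability": "ON_DEMAND"}},
    {"cluster_name": "test-ml", "custom_tags": {"environment": "test"},
     "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"}},
    {"cluster_name": "prod-etl", "custom_tags": {"environment": "prod"},
     "aws_attributes": {"availability": "ON_DEMAND"}},
]
print(find_on_demand_non_prod(sample))
```

The sketch does not check `first_on_demand`, so a cluster whose on-demand node count covers all of its workers would still pass. Job duration and criticality also need a separate review.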

Remediation

Enable spot instance usage for non-production clusters where workloads are resilient to interruption
Leverage Databricks’ native fallback-to-on-demand capabilities to preserve job continuity
Establish workspace-level defaults or templates that promote spot usage in dev/test clusters
Periodically audit compute configurations to detect persistent on-demand usage in non-critical environments
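
The first two remediation steps can be sketched as a Python helper that rewrites a cluster specification to use spot with fallback. It is illustrative only. The helper name is hypothetical. The `availability` values and the `first_on_demand` field are given as recalled from the Databricks Clusters API for AWS and Azure and should be verified against current documentation before use. Keeping one on-demand node (intended for the driver) is an assumption about a sensible default, not a requirement.

```python
import copy
from typing import Any

# Spot-with-fallback settings per cloud attribute block (assumed from the
# Clusters API; verify against current documentation).
FALLBACK_AVAILABILITY = {
    "aws_attributes": "SPOT_WITH_FALLBACK",
    "azure_attributes": "SPOT_WITH_FALLBACK_AZURE",
}


def with_spot_fallback(cluster_spec: dict[str, Any],
                       cloud_key: str = "aws_attributes",
                       first_on_demand: int = 1) -> dict[str, Any]:
    """Return a copy of cluster_spec that requests spot with on-demand fallback.

    first_on_demand=1 keeps the first node (the driver) on on-demand capacity,
    so an eviction removes workers rather than the whole cluster.
    """
    if cloud_key not in FALLBACK_AVAILABILITY:
        raise ValueError(f"unsupported attribute block: {cloud_key}")
    spec = copy.deepcopy(cluster_spec)  # leave the caller's spec untouched
    attrs = spec.setdefault(cloud_key, {})
    attrs["availability"] = FALLBACK_AVAILABILITY[cloud_key]
    attrs["first_on_demand"] = first_on_demand
    return spec


# Hand-written example spec; names and sizes are placeholders.
original = {
    "cluster_name": "dev-etl",
    "num_workers": 4,
    "aws_attributes": {"availability": "ON_DEMAND"},
}
updated = with_spot_fallback(original)
print(updated["aws_attributes"])
```

The helper only produces the modified specification. Applying it to a live cluster, or encoding the same values in a shared policy or template, is a separate step that depends on how the workspace is managed.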

Relevant Documentation

https://docs.databricks.com/clusters/configure.html
https://docs.databricks.com/clusters/instance-pools.html
https://docs.databricks.com/clusters/clusters-manage.html
