Inefficient Query Design in Databricks SQL and Spark Jobs

Nicole Boyd

CER:

Databricks-Compute-4415

Service Category

Compute

Cloud Provider

Databricks

Service Name

Databricks SQL

Inefficiency Type

Inefficient Configuration

Explanation

Many Spark and SQL workloads in Databricks suffer from micro-optimization issues — such as unfiltered joins, unnecessary shuffles, missing broadcast joins, and repeated scans of uncached data. These problems increase compute time and resource utilization, especially in exploratory or development environments. Disabling Adaptive Query Execution (AQE) can further exacerbate inefficiencies. Optimizing queries reduces DBU costs, improves cluster performance, and enhances user experience.

Relevant Billing Model

Databricks charges by Databricks Units (DBUs), which are billed per-second based on the compute resources used. Inefficient query design leads to longer execution times, increased memory and shuffle usage, and higher DBU consumption without proportional business value.

Detection

Identify queries with long execution times and large shuffle stages
Review query plans for missing filters or non-optimal joins (e.g., lack of broadcast joins)
Check for repeated reads from large datasets without use of caching
Check memory and storage metrics for signs of inefficient caching (e.g., low cache hit rate despite frequent reads)
Evaluate whether Adaptive Query Execution (AQE) is enabled — it is on by default in recent Spark versions, but may be disabled in older clusters or overridden in session or job-level configurations.

Remediation

Enable Adaptive Query Execution to improve join strategies and reduce shuffle
Use broadcast joins for small lookup tables where applicable
Apply filtering and predicate pushdown early in the query
Cache frequently accessed tables or intermediate query results
Refactor inefficient joins, subqueries, or nested aggregations
Promote a culture of reviewing physical query plans during development
Implement partition pruning or Z-ordering to improve scan efficiency for large tables

Relevant Documentation

Optimizing Apache Spark SQL Joins in Databricks
Caching and Persistence
Adaptive Query Execution in Spark

Submit Feedback