Question 4 of 55.
A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job underutilizes the available resources. Executors remain idle most of the time, and the logs show that each stage contains very few tasks. The developer suspects that this is causing suboptimal cluster performance.
Which action should the developer take to improve cluster utilization?
In Spark SQL and DataFrame operations, the configuration parameter spark.sql.shuffle.partitions defines the number of partitions created during shuffle operations such as join, groupBy, and distinct.
The default value (in Spark 3.5) is 200.
If this number is too low relative to the cluster's total cores, Spark creates too few tasks per shuffle stage, leaving executors idle and the cluster underutilized.
Increasing this value lets Spark create more tasks that can run in parallel across executors, making fuller use of the cluster's resources.
Correct approach:
spark.conf.set('spark.sql.shuffle.partitions', 400)
This increases the parallelism level of shuffle stages and improves overall resource utilization.
Why the other options are incorrect:
B: Reducing partitions further would decrease parallelism and worsen the underutilization issue.
C: Dynamic resource allocation scales the number of executors up or down based on workload, but it does not fix low task parallelism: if a shuffle stage only has a handful of tasks, extra executors will simply sit idle.
D: Increasing dataset size is not a tuning solution and doesn't address task-level under-parallelization.
Reference (Databricks Apache Spark 3.5 -- Python / Study Guide):
Spark SQL Configuration: spark.sql.shuffle.partitions --- controls the number of shuffle partitions.
Databricks Exam Guide (June 2025): Section "Troubleshooting and Tuning Apache Spark DataFrame API Applications" --- tuning strategies, partitioning, and optimizing cluster utilization.