Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 34 Discussion

Actual exam question for Databricks's Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 34
Topic #: 2

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

Suggested Answer: A

transactionsDf.select('storeId').dropDuplicates().count()

Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, but it compares full rows rather than column storeId alone, so rows that share a storeId and differ in any other column all remain.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies unique rows across all columns, not unique values of column storeId alone. This may leave duplicate values in the column, so the count does not represent the number of unique values in that column.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct function in pyspark.sql.functions.
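
The difference between the options above comes down to whether duplicates are removed per column or per full row, and whether null values are counted. The following is a minimal plain-Python sketch of that difference, not PySpark itself: the sample rows and the variable names are hypothetical, and a list of tuples stands in for transactionsDf.

```python
# Hypothetical sample data standing in for transactionsDf:
# each tuple is (transactionId, storeId).
rows = [
    (1, 25),
    (2, 25),
    (3, 3),
    (3, 3),     # exact duplicate of the previous row
    (4, None),  # null storeId
]

store_ids = [store_id for _, store_id in rows]

# Models select('storeId').dropDuplicates().count():
# one row per distinct storeId value (null counts as a value here).
unique_store_ids = len(set(store_ids))

# Models select(count('storeId')):
# number of non-null storeId values, duplicates included.
non_null_count = sum(1 for s in store_ids if s is not None)

# Models distinct().select('storeId').count():
# full-row duplicates are removed first, then the remaining rows are counted.
rows_after_row_dedup = len(set(rows))

# Models dropDuplicates().agg(count('storeId')):
# full-row duplicates are removed first, then non-null storeId values are counted.
non_null_after_row_dedup = sum(1 for _, s in set(rows) if s is not None)

print(unique_store_ids, non_null_count, rows_after_row_dedup, non_null_after_row_dedup)
```

Only the first expression yields the number of unique storeId values; the full-row variants still count storeId 25 twice because the two rows differ in transactionId.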


Contribute your Thoughts:

Rebbecca
2 days ago
Well, well, look at that! The code is as clear as mud. At least the correct answer is here to save the day. Option B, my friends, is the way to go.
upvoted 0 times
...
Ahmad
3 days ago
Ah, I see the problem! The 'coalesce' operator needs to be replaced with 'repartition', and the parentheses around the column names should be removed. Looks like option B is the correct answer.
upvoted 0 times
...
Ernestine
7 days ago
The code block has a few issues. The operator 'coalesce' is not the correct one to use for repartitioning. Also, the parentheses around the column names need to be removed.
upvoted 0 times
...
Demetra
16 days ago
Yes, and we should append .select() to the code block as well.
upvoted 0 times
...
Inocencia
17 days ago
I agree with Demetra. Also, the parentheses around the column names need to be removed.
upvoted 0 times
...
Demetra
24 days ago
I think the error is that the operator coalesce needs to be replaced by repartition.
upvoted 0 times
...
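
The fix the comments above converge on is to use repartition instead of coalesce and to pass the column names as separate arguments rather than as a tuple: transactionsDf.repartition(14, "storeId", "transactionDate"). In PySpark, coalesce accepts only a target number of partitions, while repartition also accepts columns to partition by. The following is a rough plain-Python sketch of what partitioning by columns means, not Spark's actual implementation: the sample rows and the partition_index helper are hypothetical, and Python's built-in hash stands in for Spark's own hash function.

```python
from collections import defaultdict

NUM_PARTITIONS = 14

# Hypothetical sample rows: (transactionId, storeId, transactionDate).
rows = [
    (1, 25, "2020-04-26"),
    (2, 25, "2020-04-26"),
    (3, 3, "2020-04-13"),
    (4, 25, "2020-04-13"),
]


def partition_index(store_id, transaction_date, num_partitions=NUM_PARTITIONS):
    """Map a (storeId, transactionDate) pair to a partition number.

    Assumption: Python's hash() is only a stand-in for Spark's hash function.
    """
    return hash((store_id, transaction_date)) % num_partitions


# Group rows by the partition their key columns map to.
partitions = defaultdict(list)
for row in rows:
    _, store_id, transaction_date = row
    partitions[partition_index(store_id, transaction_date)].append(row)

for index in sorted(partitions):
    print(index, partitions[index])
```

Rows that agree on both storeId and transactionDate always land in the same partition, and no row is assigned outside the 14 available partitions. Some partitions can stay empty when there are fewer distinct key combinations than partitions.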
