Databricks Exam Databricks Certified Associate Developer for Apache Spark 3.0 Topic 2 Question 34 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.0 exam

Question #: 34
Topic #: 2


The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

A. The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

B. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.
(Correct)

C. Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

D. Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

E. Operator coalesce needs to be replaced by repartition.


Suggested Answer: B

transactionsDf.select('storeId').dropDuplicates().count()

Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. While transactionsDf.dropDuplicates() removes duplicate rows from transactionsDf, it does not do so taking only column storeId into consideration, but eliminates full row duplicates instead.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count not represent the number of unique values in that column.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct method in pyspark.sql.functions.

by Salena at Mar 02, 2023, 06:56 PM


Contribute your Thoughts:


2 months ago

I'm pretty sure the code block is as lost as a goose in a snowstorm. But hey, at least we've got the options to choose from. Time to put on our thinking caps and find the right answer!

upvoted 0 times

Karan

1 month ago

Exactly! And we should append .select() to the code block as well.

upvoted 0 times


Marjory

1 month ago

You're right! And we also need to remove the parentheses around the column names.

upvoted 0 times


Teresita

1 month ago

I think the error is that the operator coalesce needs to be replaced by repartition.

upvoted 0 times


Rebbecca

2 months ago

Well, well, look at that! The code is as clear as mud. At least the correct answer is here to save the day. Option B, my friends, is the way to go.

upvoted 0 times

Chan

10 hours ago

Oh, I see. So, the correct option is to replace operator coalesce with repartition, remove the parentheses around the column names, and append .count() to the code block.

upvoted 0 times

2 months ago

Ah, I see the problem! The 'coalesce' operator needs to be replaced with 'repartition', and the parentheses around the column names should be removed. Looks like option B is the correct answer.

upvoted 0 times


3 months ago

I think the error is that the operator coalesce needs to be replaced by repartition.

upvoted 0 times
