Databricks Exam Databricks Certified Associate Developer for Apache Spark 3.0 Topic 2 Question 57 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.0 exam

Question #: 57
Topic #: 2


Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?

A. transactionsDf.select('storeId').dropDuplicates().count()

B. transactionsDf.select(count('storeId')).dropDuplicates()

C. transactionsDf.select(distinct('storeId')).count()

D. transactionsDf.dropDuplicates().agg(count('storeId'))

E. transactionsDf.distinct().select('storeId').count()


Suggested Answer: A

transactionsDf.select('storeId').dropDuplicates().count()

Correct! select('storeId') keeps only that column, dropDuplicates() then removes repeated values, and count() returns the number of remaining rows, which is the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame holding the number of non-null values in storeId. dropDuplicates() has no effect on a single row, and the result is a DataFrame rather than a number.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. transactionsDf.dropDuplicates() removes only rows that are duplicated across all columns; it does not deduplicate on column storeId alone. The following agg(count('storeId')) therefore counts non-null storeId values that may still repeat.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies unique rows across all columns, but not only unique rows with respect to column storeId. This may leave duplicate values in the column, making the count not represent the number of unique values in that column.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct function in pyspark.sql.functions; distinct() is a method on DataFrame.
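To make the difference between options A, D and E concrete, here is a minimal plain-Python sketch that models the row semantics described above. It is not Spark code: the sample rows, the transactionId column, and the helper names select_store_id and drop_duplicates are all hypothetical and exist only for illustration.

```python
# Plain-Python model of the row semantics; this is NOT Spark code.
# Each tuple is a hypothetical (transactionId, storeId) row.
rows = [
    (1, 25),
    (2, 25),  # same storeId as above, different transactionId
    (3, 3),
    (3, 3),   # exact duplicate of the previous row
]


def select_store_id(rs):
    """Model of select('storeId'): keep only the storeId column."""
    return [(r[1],) for r in rs]


def drop_duplicates(rs):
    """Model of dropDuplicates() / distinct(): drop rows equal on every column."""
    return list(dict.fromkeys(rs))


# Option A: project to storeId first, then deduplicate, then count rows.
option_a = len(drop_duplicates(select_store_id(rows)))

# Option E: deduplicate full rows first, then project, then count rows.
option_e = len(select_store_id(drop_duplicates(rows)))

# Option D: deduplicate full rows, then count non-null storeId values.
option_d = sum(1 for r in drop_duplicates(rows) if r[1] is not None)

print(option_a, option_e, option_d)
```

In this toy data only option A yields 2, the number of distinct storeId values (25 and 3). Options D and E both yield 3, because rows (1, 25) and (2, 25) differ in transactionId and so survive full-row deduplication.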

by Noel at May 31, 2024, 09:36 AM


Danica

12 months ago

Option B? Really? That's like trying to count the number of unique snowflakes by first counting all the snowflakes and then dropping the duplicates. Definitely not the way to go here.

upvoted 0 times

...

Melissa

1 year ago

D is the way to go, my friends. It's the most comprehensive solution, and it's got that fancy `agg()` function. Gotta love that data aggregation magic!

upvoted 0 times

...

Janet

1 year ago

Hmm, I'm torn between A and E. They both seem to be doing the same thing, but E might be a bit more concise. What do you guys think?

upvoted 0 times

Antonio

12 months ago

I agree, A looks like the right choice.

upvoted 0 times

...

Melita

12 months ago

I think A is the correct one.

upvoted 0 times

...

I think option A is the correct one.

upvoted 0 times

...