Databricks Certified Associate Developer for Apache Spark 3.5 Exam - Topic 6 Question 1 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.5 exam

Question #: 1
Topic #: 6

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the transactions using PySpark?

A. df = df.dropDuplicates()

B. df = df.groupBy('transaction_id').agg(F.first('account_number'), F.first('transaction_amount'), F.first('timestamp'))

C. df = df.filter(F.col('transaction_id').isNotNull())

D. df = df.dropDuplicates(['transaction_amount'])

Suggested Answer: A

dropDuplicates() with no column list removes duplicates based on all columns.

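It's the simplest and semantically correct way to deduplicate records that are completely identical across all fields. Of the other options, B collapses rows by transaction_id alone, C only filters out null transaction IDs, and D deduplicates on transaction_amount alone.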

From the PySpark documentation:

dropDuplicates(): Return a new DataFrame with duplicate rows removed, considering all columns if none are specified.

--- Source: PySpark DataFrame.dropDuplicates() API

by Lura at Nov 25, 2025, 09:31 AM

Contribute your Thoughts:

Shayne

3 months ago

A is definitely the way to go for accurate reporting. Simple and effective!

Jackie

3 months ago

D only removes duplicates based on transaction_amount, not all fields.

Rose

3 months ago

I feel like C doesn’t really address duplicates. It's not the right choice.

Geoffrey

3 months ago

Option B could work, but it’s more complex than needed.

Cecil

3 months ago

Agreed! A is straightforward and efficient for this task.

Kristine

4 months ago

I think option A is the best choice. It removes all duplicates easily.

Martha

4 months ago

C) doesn't really help with duplicates, right?

Olive

4 months ago

B) is also a good option if you want to keep specific fields.

Tresa

5 months ago

Wait, can you really just drop duplicates like that?

Elenore

5 months ago

Option A is the way to go. Who needs all those fancy Spark functions when you can just drop the duplicates? Easy peasy!

Earlean

5 months ago

Option B seems like the best choice. It will ensure we keep the first instance of each unique transaction while preserving the other relevant information.

Oneida

5 months ago

Option D is a bit too specific. We need to remove duplicates across all fields, not just the transaction_amount.

Alayna

5 months ago

I think Option A is the way to go. It's the simplest and most straightforward approach to deduplicating the entire DataFrame.

Lizette

5 months ago

Option D seems off because it only drops duplicates based on transaction_amount, not all fields. I think we need to consider all columns for deduplication.

King

6 months ago

I feel like option C is not relevant since it only filters out null transaction IDs and doesn't address duplicates at all.

Kasandra

6 months ago

I'm not entirely sure, but I remember practicing a question where we used groupBy to aggregate data. Maybe option B could work, but it seems more complex than necessary.

Zack

6 months ago

I think option A is the right choice since it directly removes duplicates across all fields, which is what we need here.

Mitsue

6 months ago

Option D seems like it would only remove duplicates based on the transaction_amount column, which isn't what we're looking for. I think B or A are the best choices here, but I'd want to test them out on a sample of the data to see which one works better.

Anisha

6 months ago

I like the idea behind option B, but I'm worried it might not catch all the duplicates if there are any issues with the data, like null values or inconsistencies in the timestamp format. Maybe A is the safest bet, even if it's a bit more straightforward.

Pamella

6 months ago

Option C doesn't seem relevant here. We want to remove duplicates, not filter out null values. I'm leaning towards B or A, but I'll need to think through the implications of each approach.

Robt

7 months ago

Totally agree, that's the simplest method!

Fernanda

7 months ago

Option B looks good, but I'm not sure if it will handle cases where the transaction_id is null.

Nichelle

7 months ago

A) df = df.dropDuplicates() is the way to go!

Loreen

8 months ago

Hmm, I'm a bit confused. Wouldn't option A just be the easiest way to remove all duplicates across all columns? I'm not sure if the other options are necessary.

Shanda

8 months ago

I think I'd go with option B. Grouping by the transaction_id and taking the first value for the other columns seems like the best way to deduplicate while preserving the original data.