A data engineer is configuring a pipeline that may receive late-arriving, duplicate records.
In addition to deduplicating records within each batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
To deduplicate data against previously processed records as it is inserted into a Delta table, you can use the MERGE operation with an insert-only clause. This inserts only new records that do not match any existing records on a unique key, while ignoring duplicate records that do match. For example, you can use the following syntax:
MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
This inserts only the source records whose unique key is not already present in the target table and skips the records with a matching key, so duplicate records never reach the Delta table.
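For context, here is a minimal sketch that also deduplicates within the incoming batch before the merge runs. The table names (bronze_updates, silver_orders), the temporary view name, and the key column (order_id) are illustrative assumptions, not part of the question:

-- Drop exact duplicates within the incoming batch first (illustrative names)
CREATE OR REPLACE TEMP VIEW deduped_updates AS
SELECT DISTINCT * FROM bronze_updates;

-- Insert-only merge: source rows whose key already exists in silver_orders are skipped
MERGE INTO silver_orders AS t
USING deduped_updates AS s
ON t.order_id = s.order_id
WHEN NOT MATCHED THEN INSERT *;

Because the merge contains only a WHEN NOT MATCHED clause, matched (duplicate) source rows are simply ignored rather than updating the target, which is what makes this pattern safe for repeated, late-arriving deliveries.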
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge