Databricks Exam Databricks-Certified-Professional-Data-Engineer Topic 2 Question 34 Discussion

Actual exam question for the Databricks Databricks-Certified-Professional-Data-Engineer exam
Question #: 34
Topic #: 2

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

Suggested Answer: C

To deduplicate data against previously processed records as it is inserted into a Delta table, you can use a merge operation with an insert-only clause (WHEN NOT MATCHED THEN INSERT). This inserts source records whose unique key does not match any existing record in the target, while ignoring source records that do match, so previously processed duplicates are never written. For example, you can use the following syntax:

MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN
  INSERT *

This inserts only the source records whose unique key is not present in the target table and skips any record with a matching key, so duplicate records never reach the Delta table.
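The same insert-only merge can also be written with the Delta Lake Python API, combined with dropDuplicates to handle duplicates within the incoming batch itself. The sketch below is illustrative only: the table name target_table, the key column unique_key, and the DataFrame updates_df are placeholder names, not values from the question.

from delta.tables import DeltaTable

# Assumes an active SparkSession `spark` and an incoming DataFrame
# `updates_df` with a `unique_key` column (placeholder names).

# 1. De-duplicate within the incoming batch itself.
deduped_updates = updates_df.dropDuplicates(["unique_key"])

# 2. Insert-only merge: only keys not already present in the target are
#    inserted, so records processed in earlier batches are skipped.
target = DeltaTable.forName(spark, "target_table")

(target.alias("t")
    .merge(deduped_updates.alias("s"), "t.unique_key = s.unique_key")
    .whenNotMatchedInsertAll()   # no WHEN MATCHED clause, so insert-only
    .execute())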


https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge

https://docs.databricks.com/delta/delta-update.html#insert-only-merge
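In a streaming pipeline, a common way to apply this pattern is inside foreachBatch, so each micro-batch is deduplicated against everything processed before it. A minimal sketch, again using placeholder names (source_stream, target_table, unique_key, and the checkpoint path), assuming an active SparkSession `spark`:

from delta.tables import DeltaTable

def dedupe_and_insert(batch_df, batch_id):
    # De-duplicate within the micro-batch, then run an insert-only merge
    # against records written by earlier micro-batches.
    deduped = batch_df.dropDuplicates(["unique_key"])
    (DeltaTable.forName(spark, "target_table").alias("t")
        .merge(deduped.alias("s"), "t.unique_key = s.unique_key")
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("source_stream")
    .writeStream
    .foreachBatch(dedupe_and_insert)
    .option("checkpointLocation", "/tmp/checkpoints/dedupe_demo")  # placeholder path
    .start())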

Contribute your Thoughts:

Maryann
4 days ago
Hmm, I'm not sure about Option E. Relying on Delta Lake schema enforcement alone seems a bit risky for deduplicating data. I'd go with Option C or D.
upvoted 0 times
Cristy
6 days ago
I think D) Perform a full outer join on a unique key and overwrite existing data could also be a valid approach.
upvoted 0 times
Solange
8 days ago
But wouldn't setting the configuration delta.deduplicate = true also help in deduplicating data?
upvoted 0 times
Paris
9 days ago
I think Option D is the way to go. A full outer join on a unique key and overwriting the existing data is a robust way to handle late-arriving, duplicate records.
upvoted 0 times
Macy
15 days ago
Option C looks like the best approach to me. Performing an insert-only merge with a unique key condition will allow the data engineer to deduplicate the data as it's inserted into the Delta table.
upvoted 0 times
Dominga
1 day ago
Yeah, using a unique key condition in an insert-only merge is a smart approach.
upvoted 0 times
Willodean
7 days ago
I agree, option C seems like the most efficient way to deduplicate the data.
upvoted 0 times
Lonny
20 days ago
I disagree, I believe the correct answer is E) Rely on Delta Lake schema enforcement to prevent duplicate records.
upvoted 0 times
Solange
1 month ago
I think the answer is C) Perform an insert-only merge with a matching condition on a unique key.
upvoted 0 times
