A data engineer is configuring a pipeline that may receive late-arriving, duplicate records.
In addition to deduplicating records within each batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
To deduplicate data against previously processed records as it is inserted into a Delta table, you can use the MERGE operation with an insert-only clause. This inserts only new records that do not match any existing records on a unique key, while ignoring duplicate records that do match. For example, you can use the following syntax:
MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
This inserts only the source records whose unique key is not already present in the target table and skips the records with a matching key, so duplicate records never reach the Delta table.
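For context, here is a minimal sketch that also deduplicates within the incoming batch before the merge runs. The table names (bronze_updates, silver_orders), the temporary view name, and the key column (order_id) are illustrative assumptions, not part of the question:

-- Drop exact duplicates within the incoming batch first (illustrative names)
CREATE OR REPLACE TEMP VIEW deduped_updates AS
SELECT DISTINCT * FROM bronze_updates;

-- Insert-only merge: source rows whose key already exists in silver_orders are skipped
MERGE INTO silver_orders AS t
USING deduped_updates AS s
ON t.order_id = s.order_id
WHEN NOT MATCHED THEN INSERT *;

Because the merge contains only a WHEN NOT MATCHED clause, matched (duplicate) source rows are simply ignored rather than updating the target, which is what makes this pattern safe for repeated, late-arriving deliveries.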
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge