A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
To deduplicate against previously processed records at insert time, use an insert-only merge: a MERGE statement whose only clause is WHEN NOT MATCHED THEN INSERT. Source records whose unique key is absent from the target table are inserted, and source records whose key already exists in the target are ignored. For example:
MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
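This inserts only the source records whose unique key is not already present in the target table and skips those with a matching key, so records processed in earlier batches are not inserted again. The merge does not remove duplicates inside the source itself: if two source rows share a key that is new to the target, both are inserted, which is why the batch must also be deduplicated before the merge.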
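The two steps (deduplicate within the batch, then insert only unmatched keys) can be modeled in plain Python. This is a minimal sketch of the semantics only, not Delta Lake code: the helper names dedupe_batch and insert_only_merge, the dict standing in for the target table, and the sample records are all hypothetical.

```python
from typing import Dict, Iterable, List

Record = Dict[str, object]


def dedupe_batch(batch: Iterable[Record], key: str = "unique_key") -> List[Record]:
    """Keep the first record seen for each key within a single batch."""
    seen = set()
    unique = []
    for rec in batch:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique


def insert_only_merge(
    target: Dict[object, Record],
    batch: Iterable[Record],
    key: str = "unique_key",
) -> int:
    """Model of MERGE ... WHEN NOT MATCHED THEN INSERT *.

    `target` is a dict keyed by unique key (a stand-in for the Delta table).
    Returns the number of rows inserted.
    """
    inserted = 0
    for rec in dedupe_batch(batch, key):
        if rec[key] not in target:  # NOT MATCHED: key is new to the target
            target[rec[key]] = rec
            inserted += 1
        # MATCHED: no clause applies, so the existing row is left untouched
    return inserted


# Hypothetical data: key 1 was processed earlier, key 2 arrives twice in the batch.
target = {1: {"unique_key": 1, "value": "a"}}
batch = [
    {"unique_key": 1, "value": "a-late"},
    {"unique_key": 2, "value": "b"},
    {"unique_key": 2, "value": "b-dup"},
]
inserted = insert_only_merge(target, batch)
```

In this model the late duplicate of key 1 is skipped because the key already exists in the target, and only one of the two key-2 rows is inserted because the batch was deduplicated first.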
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge