A data engineer is configuring a pipeline that may receive late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
To deduplicate data against previously processed records as it is inserted into a Delta table, use the MERGE operation with an insert-only (WHEN NOT MATCHED) clause. This inserts new records whose unique key does not match any existing record, while skipping records that do match. For example, you can use the following syntax:
MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
This inserts only the source records whose unique key is not already present in the target table and skips records with a matching key, so duplicate records never enter the Delta table.
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge
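The insert-only merge semantics above can be sketched in plain Python (no Spark required). This is a toy illustration of the dedup logic, not the Delta implementation; the table and column names are made up for the example:

```python
def insert_only_merge(target, source, key):
    """Append source rows to target unless the key already exists.

    Mirrors MERGE ... WHEN NOT MATCHED THEN INSERT *: rows whose key
    matches an existing target row are skipped.
    """
    existing = {row[key] for row in target}
    for row in source:
        if row[key] not in existing:
            target.append(row)
            existing.add(row[key])  # also dedupes within the incoming batch
    return target


# Hypothetical batch: key 1 already processed, key 2 is new (and duplicated).
target = [{"unique_key": 1, "val": "a"}]
source = [
    {"unique_key": 1, "val": "a-dup"},  # matches target: skipped
    {"unique_key": 2, "val": "b"},      # new key: inserted
    {"unique_key": 2, "val": "b-dup"},  # in-batch duplicate: skipped
]

insert_only_merge(target, source, "unique_key")
# target now holds exactly one row per unique_key (1 and 2)
```

In real pipelines the same pattern is expressed with the Delta Lake MERGE statement shown above, which pushes the key-matching work down to the engine instead of doing it row by row.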