A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .toTable("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
Comprehensive and Detailed Explanation From Exact Extract:
Exact extract: "dropDuplicates with watermark performs stateful deduplication on the keys within the watermark delay."
Exact extract: "Records older than the event-time watermark are considered late and may be dropped."
Exact extract: "trigger(once) processes all available data once and then stops."
The 2-hour watermark bounds the deduplication state: each (customer_id, order_id) key is kept in state only until the watermark passes its event time. Duplicates whose time values fall within 2 hours of the original are therefore removed. A duplicate enqueued more than 2 hours after the original, however, carries a fresh time value, is not late, and arrives after its key has been evicted from state, so it is written to the orders table as a second copy. Separately, any record whose time is more than 2 hours behind the maximum event time already seen is treated as late and dropped.
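The state logic described above can be modelled in a few lines of plain Python. This is an illustrative, single-partition sketch of the deduplication semantics, not Spark's implementation; the function name and the eviction rule are our own simplifications:

```python
def dedup_with_watermark(records, delay=2.0):
    """Toy model of dropDuplicates + withWatermark semantics.

    records: iterable of (customer_id, order_id, event_time) tuples in
    arrival order, with event_time in hours. Illustrative only.
    """
    max_event_time = float("-inf")
    state = {}   # (customer_id, order_id) -> event time of first sighting
    emitted = []
    for customer_id, order_id, t in records:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        # State eviction: keys older than the watermark are forgotten,
        # which is what keeps the deduplication state bounded.
        state = {k: ts for k, ts in state.items() if ts >= watermark}
        if t < watermark:
            continue          # late record: dropped entirely
        key = (customer_id, order_id)
        if key in state:
            continue          # duplicate caught inside the 2-hour window
        state[key] = t
        emitted.append((customer_id, order_id, t))
    return emitted

# A duplicate re-enqueued 3.5 hours later survives deduplication,
# while a record arriving behind the watermark is silently dropped.
rows = [("c1", "o1", 0.0),   # original order
        ("c1", "o1", 1.0),   # duplicate within the window: removed
        ("c1", "o1", 3.5),   # duplicate after state eviction: kept
        ("c2", "o2", 1.0)]   # late record (watermark is now 1.5): dropped
print(dedup_with_watermark(rows))
# → [('c1', 'o1', 0.0), ('c1', 'o1', 3.5)]
```

Running the example shows why the exam's key scenario matters: the duplicate that arrives 3.5 hours apart is emitted a second time, exactly the behaviour the bounded watermark state permits.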