3 of 55. A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.
To remove the duplicates, the engineer adds the code:
df = df.withWatermark("event_timestamp", "30 minutes")
What is the result?
In Structured Streaming, a watermark tells the engine how late event-time data may arrive and still be processed by stateful operations such as deduplication or windowed aggregations; state older than the watermark can be discarded.
Behavior:
df = df.withWatermark("event_timestamp", "30 minutes")
This sets a 30-minute watermark: Spark tracks the maximum event time it has seen so far and keeps deduplication state only for events whose event_timestamp is within 30 minutes of that maximum; older state is evicted. When used with:
df.dropDuplicates(['event_id', 'event_timestamp'])
Spark is guaranteed to drop duplicates whose event times fall within the watermark threshold (here, 30 minutes). Duplicates arriving after the watermark has passed their event time may no longer be matched, because the corresponding state has been evicted; such late records are dropped from stateful processing rather than re-emitted.
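To make the state-eviction behavior concrete, here is a minimal pure-Python simulation of watermarked deduplication. This is illustrative only, not Spark's actual implementation; the helper name `dedup_with_watermark` and the event data are hypothetical.

```python
from datetime import datetime, timedelta

def dedup_with_watermark(events, watermark=timedelta(minutes=30)):
    """Simulate Structured Streaming's watermarked dropDuplicates.

    `events` is an ordered stream of (event_id, event_timestamp) pairs,
    mirroring dropDuplicates(['event_id', 'event_timestamp']).
    Dedup state is kept only while a key's timestamp is newer than
    max_event_time - watermark; older state is evicted, and records
    arriving behind the watermark are dropped as late data.
    """
    seen = {}              # dedup state store: key -> event_timestamp
    max_event_time = None  # highest event time observed so far
    emitted = []
    for event_id, ts in events:
        # Advance the watermark using the newest event time seen so far.
        if max_event_time is None or ts > max_event_time:
            max_event_time = ts
        threshold = max_event_time - watermark
        # Evict state older than the watermark, as Spark does.
        seen = {k: t for k, t in seen.items() if t >= threshold}
        key = (event_id, ts)
        if key in seen:
            continue                     # duplicate within the watermark: dropped
        if ts < threshold:
            continue                     # late record behind the watermark: dropped
        seen[key] = ts
        emitted.append((event_id, ts))
    return emitted

base = datetime(2024, 1, 1)
stream = [
    ("a", base),                            # emitted
    ("a", base),                            # duplicate within watermark: dropped
    ("b", base + timedelta(minutes=40)),    # emitted; watermark moves to base+10min
    ("c", base + timedelta(minutes=15)),    # still ahead of watermark: emitted
    ("a", base),                            # behind watermark, state evicted: dropped
]
print(dedup_with_watermark(stream))
```

Running the example emits only the three distinct, on-time records; note how the final duplicate of `("a", base)` is dropped not because its state survived, but because it arrived behind the watermark.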
Why other options are incorrect:
A: Watermarks do not remove all duplicates; they only manage those within the defined event-time window.
B: Watermark durations can be expressed as strings like '30 minutes', '10 seconds', etc., not only seconds.
D: Structured Streaming supports deduplication using withWatermark() and dropDuplicates().
Reference (Databricks Apache Spark 3.5, Python / Study Guide):
PySpark Structured Streaming Guide: withWatermark() and dropDuplicates() for event-time deduplication.
Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025), Section "Structured Streaming", Topic: Streaming Deduplication with and without watermark usage.