
Databricks Certified Data Engineer Professional Exam - Topic 3 Question 42 Discussion

Actual exam question for Databricks's Databricks Certified Data Engineer Professional exam
Question #: 42
Topic #: 3

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Suggested Answer: A

Explanation:

From the documentation extracts:

- "dropDuplicates with watermark performs stateful deduplication on the keys within the watermark delay."
- "Records older than the event-time watermark are considered late and may be dropped."
- "trigger(once) processes all available data once and then stops."

The 2-hour watermark bounds the deduplication state. Duplicate orders whose event times fall within the 2-hour window are removed by dropDuplicates. Duplicates arriving with event times more than 2 hours behind the watermark are treated as late records and dropped, so they never reach the orders table either. The trade-off is that any legitimate order whose event time falls behind the watermark is also dropped, and those orders will be missing from the table.
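The interaction between the watermark and deduplication can be sketched as a toy model in plain Python. This is a simplified simulation of the semantics, not Spark itself: `dedup_with_watermark` is a hypothetical helper, the watermark is modeled as the maximum event time seen so far minus the delay, and records behind the watermark are dropped before deduplication, as the extracts above describe.

```python
from datetime import datetime, timedelta

def dedup_with_watermark(events, delay=timedelta(hours=2)):
    """Toy model of watermark-bounded stateful deduplication.

    events: list of (customer_id, order_id, event_time) in arrival order.
    Records whose event time falls behind the watermark (max event time
    seen so far, minus the delay) are dropped as late; everything else is
    deduplicated on the (customer_id, order_id) composite key.
    """
    max_event_time = None
    seen = {}      # (customer_id, order_id) -> event time of first occurrence
    output = []
    for cust, order, t in events:
        # Late record: dropped entirely, whether it is a duplicate or not.
        if max_event_time is not None and t < max_event_time - delay:
            continue
        key = (cust, order)
        if key not in seen:
            seen[key] = t
            output.append((cust, order, t))
        if max_event_time is None or t > max_event_time:
            max_event_time = t
        # Evict dedup state older than the watermark, bounding memory.
        cutoff = max_event_time - delay
        seen = {k: v for k, v in seen.items() if v >= cutoff}
    return output

base = datetime(2024, 1, 1)
events = [
    ("c1", "o1", base),                                   # first copy: kept
    ("c1", "o1", base + timedelta(hours=1)),              # duplicate within 2h: deduped
    ("c2", "o2", base + timedelta(hours=4)),              # advances watermark to 02:00
    ("c1", "o1", base + timedelta(hours=1, minutes=30)),  # behind watermark: dropped as late
]
result = dedup_with_watermark(events)
# result keeps only the first "o1" record and the "o2" record
```

The fourth event illustrates the exam's point: once the watermark has advanced past a record's event time, that record is discarded before deduplication is even consulted, so late duplicates cannot appear in the table, but a late first-time order would be silently missing.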


===========

Contribute your Thoughts:

Markus
9 hours ago
C) sounds plausible, but I doubt all records are held for 2 hours.
upvoted 0 times
...
Sage
6 days ago
Wait, how can duplicates be ignored if they come in late?
upvoted 0 times
...
Hildegarde
11 days ago
Totally agree, the watermark will drop those late entries.
upvoted 0 times
...
Lorean
16 days ago
I'm pretty sure the correct answer is D. The question specifically states that the upstream system can enqueue duplicate entries hours apart, so the 2-hour watermark won't prevent those from being retained.
upvoted 0 times
...
Linn
21 days ago
Haha, this question is like a riddle wrapped in an enigma. I'll just guess and hope for the best!
upvoted 0 times
...
Genevieve
26 days ago
Option D makes sense to me. The question mentions that the upstream system can enqueue duplicate entries hours apart, so the 2-hour watermark won't catch those.
upvoted 0 times
...
Shawna
1 month ago
I'm not sure about this one. The code looks complicated, and I'm not familiar with Parquet data and Databricks jobs.
upvoted 0 times
...
Malcolm
1 month ago
Option D seems correct. The code uses a 2-hour watermark, so records arriving more than 2 hours late will be ignored.
upvoted 0 times
...
Kristeen
1 month ago
I thought the watermark would keep all records for 2 hours before deduplication, so option C seems plausible, but I need to double-check that.
upvoted 0 times
...
Stephen
2 months ago
Okay, let me walk through this step-by-step. The watermark is 2 hours, so records arriving more than 2 hours late will be ignored. But the deduplication is based on customer_id and order_id, not the time field. So I think option D is the right answer here.
upvoted 0 times
...
Garry
2 months ago
I'm pretty confident on this one. The watermark is just used to handle late-arriving records, but the deduplication is the key here. Since it's based on customer_id and order_id, option D is correct - the orders table may contain duplicate records if they are enqueued more than 2 hours apart.
upvoted 0 times
...
Sean
2 months ago
Okay, let me think this through. The watermark is set to 2 hours, so any records arriving more than 2 hours late will be ignored. But the deduplication is based on customer_id and order_id, so I think option D is correct - duplicates more than 2 hours apart could be retained.
upvoted 0 times
...
Lindy
2 months ago
A) is correct, late records will be ignored.
upvoted 0 times
...
Blair
2 months ago
I feel like option D makes sense because if duplicates come in after 2 hours, they might still be processed, right? But I'm not entirely sure.
upvoted 0 times
...
Charlene
2 months ago
I remember something about watermarking, but I'm not sure how it interacts with late records. Does it really ignore them after 2 hours?
upvoted 0 times
...
Britt
3 months ago
I think I practiced a similar question where the watermark was set to 1 hour. If I recall correctly, it did drop late records, but I can't remember if it affected duplicates.
upvoted 0 times
...
Magdalene
3 months ago
I think D) is misleading, duplicates shouldn't be retained.
upvoted 0 times
...
Xuan
3 months ago
Hmm, I'm a bit confused on the watermark and deduplication. Does the 2 hour window apply to the deduplication or just the watermark? I'm not sure if duplicate records more than 2 hours apart would be retained or not.
upvoted 0 times
...
Merilyn
3 months ago
I think I've got a good handle on this. The key is understanding how the watermark and deduplication work together. I'd go with option B - the orders table will only contain the most recent 2 hours of records and no duplicates.
upvoted 0 times
...
