Databricks Certified Data Engineer Professional Exam - Topic 3 Question 42 Discussion

Actual exam question from the Databricks Certified Data Engineer Professional exam
Question #: 42
Topic #: 3

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Suggested Answer: A

Comprehensive and Detailed Explanation From Exact Extract:

Exact extract: "dropDuplicates with watermark performs stateful deduplication on the keys within the watermark delay."

Exact extract: "Records older than the event-time watermark are considered late and may be dropped."

Exact extract: "trigger(once) processes all available data once and then stops."

The 2-hour watermark bounds the deduplication state. Duplicate orders that arrive within the 2-hour window are removed by dropDuplicates. Duplicates whose time value is older than the watermark (the latest event time seen minus 2 hours) are treated as late and ignored, so they do not appear in the table either. The same rule applies to any order that itself arrives later than the watermark allows: it is dropped and will be missing from the orders table.
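To make the interaction between the watermark and the deduplication keys more concrete, here is a minimal pure-Python sketch of the behaviour this explanation describes. It is a simplified model, not Spark code: the names run_batches and rec, the sample records, and the rule that deduplication keys are never evicted from state are all assumptions made for illustration. Each inner list stands in for one hourly trigger(once) run.

```python
from datetime import datetime, timedelta


def rec(customer_id, order_id, hour, minute):
    """Build one hypothetical order record queued on 2024-01-01 at hour:minute."""
    return {
        "customer_id": customer_id,
        "order_id": order_id,
        "time": datetime(2024, 1, 1, hour, minute),
    }


def run_batches(batches, delay=timedelta(hours=2)):
    """Simplified model of withWatermark + dropDuplicates over several runs."""
    watermark = None      # no watermark exists until the first batch completes
    seen_keys = set()     # dedup state; this sketch never evicts keys (assumption)
    table, dropped_late = [], []
    for batch in batches:
        max_time = None
        for r in batch:
            t = r["time"]
            max_time = t if max_time is None else max(max_time, t)
            # Records older than the current watermark are treated as late.
            if watermark is not None and t < watermark:
                dropped_late.append(r)
                continue
            key = (r["customer_id"], r["order_id"])
            if key in seen_keys:
                continue  # duplicate of an order that was already written
            seen_keys.add(key)
            table.append(r)
        # The watermark only moves forward, after the batch has been processed.
        if max_time is not None:
            candidate = max_time - delay
            watermark = candidate if watermark is None else max(watermark, candidate)
    return table, dropped_late


batches = [
    [rec("c1", "o1", 0, 10), rec("c2", "o2", 0, 50)],
    [rec("c1", "o1", 1, 30), rec("c3", "o3", 1, 40)],  # c1/o1 duplicate, not late
    [rec("c4", "o4", 5, 0)],                           # watermark advances to 03:00
    [rec("c5", "o5", 2, 0), rec("c2", "o2", 5, 30)],   # late order; duplicate hours apart
]
table, dropped_late = run_batches(batches)
print([(r["customer_id"], r["order_id"]) for r in table])
print([(r["customer_id"], r["order_id"]) for r in dropped_late])
```

Under these assumptions the modelled table holds each order exactly once, including c2/o2 whose duplicate was enqueued hours later, while c5/o5 is missing because its time value was already older than the watermark when it arrived. How long a real Spark query keeps keys in state depends on the query and the runtime version, so treat the never-evict rule as a modelling choice rather than documented behaviour.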


===========

Contribute your Thoughts:

Yvette
2 months ago
I feel B is misleading. It implies a strict 2-hour window, which isn't realistic.
upvoted 0 times
...
Edgar
2 months ago
A is definitely right. Late records shouldn't clutter the table.
upvoted 0 times
...
Bethanie
2 months ago
I lean towards D. If duplicates are enqueued hours apart, they might still show up.
upvoted 0 times
...
Deeann
2 months ago
C sounds plausible. Holding records for 2 hours before deduplication could work.
upvoted 0 times
...
Leonor
2 months ago
I disagree, A seems too limiting. B makes more sense. Only recent records, no duplicates.
upvoted 0 times
...
Lashanda
2 months ago
I think option A is correct. No duplicates, but late records are ignored.
upvoted 0 times
...
Markus
3 months ago
C) sounds plausible, but I doubt all records are held for 2 hours.
upvoted 0 times
...
Sage
3 months ago
Wait, how can duplicates be ignored if they come in late?
upvoted 0 times
...
Hildegarde
3 months ago
Totally agree, the watermark will drop those late entries.
upvoted 0 times
...
Lorean
4 months ago
I'm pretty sure the correct answer is D. The question specifically states that the upstream system can enqueue duplicate entries hours apart, so the 2-hour watermark won't prevent those from being retained.
upvoted 0 times
...
Linn
4 months ago
Haha, this question is like a riddle wrapped in an enigma. I'll just guess and hope for the best!
upvoted 0 times
...
Genevieve
4 months ago
Option D makes sense to me. The question mentions that the upstream system can enqueue duplicate entries hours apart, so the 2-hour watermark won't catch those.
upvoted 0 times
...
Shawna
4 months ago
I'm not sure about this one. The code looks complicated, and I'm not familiar with Parquet data and Databricks jobs.
upvoted 0 times
...
Malcolm
4 months ago
Option D seems correct. The code uses a 2-hour watermark, so records arriving more than 2 hours late will be ignored.
upvoted 0 times
...
Kristeen
4 months ago
I thought the watermark would keep all records for 2 hours before deduplication, so option C seems plausible, but I need to double-check that.
upvoted 0 times
...
Stephen
5 months ago
Okay, let me walk through this step-by-step. The watermark is 2 hours, so records arriving more than 2 hours late will be ignored. But the deduplication is based on customer_id and order_id, not the time field. So I think option D is the right answer here.
upvoted 0 times
...
Garry
5 months ago
I'm pretty confident on this one. The watermark is just used to handle late-arriving records, but the deduplication is the key here. Since it's based on customer_id and order_id, option D is correct - the orders table may contain duplicate records if they are enqueued more than 2 hours apart.
upvoted 0 times
...
Sean
5 months ago
Okay, let me think this through. The watermark is set to 2 hours, so any records arriving more than 2 hours late will be ignored. But the deduplication is based on customer_id and order_id, so I think option D is correct - duplicates more than 2 hours apart could be retained.
upvoted 0 times
...
Lindy
5 months ago
A) is correct, late records will be ignored.
upvoted 0 times
...
Blair
5 months ago
I feel like option D makes sense because if duplicates come in after 2 hours, they might still be processed, right? But I'm not entirely sure.
upvoted 0 times
...
Charlene
5 months ago
I remember something about watermarking, but I'm not sure how it interacts with late records. Does it really ignore them after 2 hours?
upvoted 0 times
...
Britt
6 months ago
I think I practiced a similar question where the watermark was set to 1 hour. If I recall correctly, it did drop late records, but I can't remember if it affected duplicates.
upvoted 0 times
...
Magdalene
6 months ago
I think D) is misleading, duplicates shouldn't be retained.
upvoted 0 times
...
Xuan
6 months ago
Hmm, I'm a bit confused on the watermark and deduplication. Does the 2 hour window apply to the deduplication or just the watermark? I'm not sure if duplicate records more than 2 hours apart would be retained or not.
upvoted 0 times
...
Merilyn
6 months ago
I think I've got a good handle on this. The key is understanding how the watermark and deduplication work together. I'd go with option B - the orders table will only contain the most recent 2 hours of records and no duplicates.
upvoted 0 times
Louann
20 days ago
I still lean towards B. No duplicates in the last 2 hours sounds solid.
upvoted 0 times
...
Joaquin
26 days ago
Yeah, but if they’re more than 2 hours apart, they might not be an issue.
upvoted 0 times
...
Novella
1 month ago
I think option D makes sense too. Duplicates can still slip through.
upvoted 0 times
...
Idella
1 month ago
True, but the watermark should handle that.
upvoted 0 times
...
Gracie
1 month ago
I see your point, but what about late records?
upvoted 0 times
...
...
