A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:
(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .toTable("orders")
)
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?
Comprehensive and Detailed Explanation From Exact Extract:
Exact extract: "dropDuplicates with watermark performs stateful deduplication on the keys within the watermark delay."
Exact extract: "Records older than the event-time watermark are considered late and may be dropped."
Exact extract: "trigger(once) processes all available data once and then stops."
The 2-hour watermark bounds the deduplication state: each (customer_id, order_id) key is kept in state only until the watermark passes its event time. Duplicates whose time values fall within 2 hours of the original are therefore removed. A duplicate enqueued more than 2 hours after the original, however, carries a fresh time value, is not late, and arrives after its key has been evicted from state, so it is written to the orders table as a second copy. Separately, any record whose time is more than 2 hours behind the maximum event time already seen is treated as late and dropped.
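The state logic described above can be modelled in a few lines of plain Python. This is an illustrative, single-partition sketch of the deduplication semantics, not Spark's implementation; the function name and the eviction rule are our own simplifications:

```python
def dedup_with_watermark(records, delay=2.0):
    """Toy model of dropDuplicates + withWatermark semantics.

    records: iterable of (customer_id, order_id, event_time) tuples in
    arrival order, with event_time in hours. Illustrative only.
    """
    max_event_time = float("-inf")
    state = {}   # (customer_id, order_id) -> event time of first sighting
    emitted = []
    for customer_id, order_id, t in records:
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - delay
        # State eviction: keys older than the watermark are forgotten,
        # which is what keeps the deduplication state bounded.
        state = {k: ts for k, ts in state.items() if ts >= watermark}
        if t < watermark:
            continue          # late record: dropped entirely
        key = (customer_id, order_id)
        if key in state:
            continue          # duplicate caught inside the 2-hour window
        state[key] = t
        emitted.append((customer_id, order_id, t))
    return emitted

# A duplicate re-enqueued 3.5 hours later survives deduplication,
# while a record arriving behind the watermark is silently dropped.
rows = [("c1", "o1", 0.0),   # original order
        ("c1", "o1", 1.0),   # duplicate within the window: removed
        ("c1", "o1", 3.5),   # duplicate after state eviction: kept
        ("c2", "o2", 1.0)]   # late record (watermark is now 1.5): dropped
print(dedup_with_watermark(rows))
# → [('c1', 'o1', 0.0), ('c1', 'o1', 3.5)]
```

Running the example shows why the exam's key scenario matters: the duplicate that arrives 3.5 hours apart is emitted a second time, exactly the behaviour the bounded watermark state permits.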