Databricks Exam Databricks Certified Data Engineer Professional Topic 3 Question 31 Discussion

Actual exam question for Databricks's Databricks Certified Data Engineer Professional exam

Question #: 31
Topic #: 3

[All Databricks Certified Data Engineer Professional Questions]

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

AApply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

BConvert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

CCalculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

DModify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

EReplace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Show Suggested Answer

Suggested Answer: D

Delta Lake, built on top of Parquet, enhances query performance through data skipping, which is based on the statistics collected for each file in a table. For tables with a large number of columns, Delta Lake by default collects and stores statistics only for the first 32 columns. These statistics include min/max values and null counts, which are used to optimize query execution by skipping irrelevant data files. When dealing with highly nested JSON structures, understanding this behavior is crucial for schema design, especially when determining which fields should be flattened or prioritized in the table structure to leverage data skipping efficiently for performance optimization. Reference: Databricks documentation on Delta Lake optimization techniques, including data skipping and statistics collection (https://docs.databricks.com/delta/optimizations/index.html).

by Alyce at Mar 06, 2025, 10:17 AM

Limited Time Offer

25%

Off

Get Premium Databricks Certified Data Engineer Professional Questions as Interactive Web-Based Practice Test or PDF

Contribute your Thoughts:

Submit Cancel

2 months ago

Option A is just plain lazy. Why would you want to run the model on every single row when you can be more efficient? That's just asking for trouble.

upvoted 0 times

2 months ago

Option E is my pick. Merging the changed records and making predictions on those is a clean and maintainable solution.

upvoted 0 times

Val

2 months ago

I think option C could also work well by calculating the difference before making new predictions.

upvoted 0 times

...

Stevie

2 months ago

I agree, option E seems like the most efficient way to handle the changed records.

upvoted 0 times

...

Ardella

2 months ago

I think Option B is the way to go. Using Structured Streaming to incrementally process the data and make predictions is a more scalable and robust approach.

upvoted 0 times

Jarvis

21 days ago

upvoted 0 times

...

Rikki

22 days ago

upvoted 0 times

...

3 months ago

I bet the data engineer is wishing they had a magic wand to deal with all those nested fields. *waves wand*

upvoted 0 times

Lottie

2 months ago

upvoted 0 times

...

Barbra

2 months ago

upvoted 0 times

...

Terry

2 months ago

A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

upvoted 0 times

...