
Databricks Certified Data Engineer Professional Exam: Topic 3, Question 31 Discussion

Actual exam question from the Databricks Certified Data Engineer Professional exam
Question #: 31
Topic #: 3

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?
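For context, the current nightly job described above boils down to a full table overwrite. Below is a minimal PySpark sketch of that setup; the upstream view name upstream_current_values is a placeholder, only the nightly overwrite pattern is taken from the question.

    # Minimal sketch of the existing nightly job: derive the current valid
    # values from upstream sources and overwrite the whole table each run.
    # "upstream_current_values" stands in for the team's real upstream logic.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    current_values = spark.table("upstream_current_values")

    (current_values.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("customer_churn_params"))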

Suggested Answer: D

The suggested answer follows the timestamp approach several commenters discuss below: as the nightly job writes customer_churn_params, it also populates a timestamp field (for example with current_timestamp()) recording when each record was written. Downstream, the ML team can identify the records it needs to score by filtering that field for values within the past 24 hours, rather than diffing the table against previous predictions or maintaining separate change-tracking logic. The only change to the existing pipeline is one additional column at write time.
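A minimal sketch of that pattern follows. The column name last_updated and the source view upstream_current_values are assumptions for illustration; only the use of current_timestamp() at write time and the 24-hour filter come from the answer above.

    # Nightly write: stamp every record with the time it was written.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    current_values = spark.table("upstream_current_values")  # hypothetical source

    (current_values
        .withColumn("last_updated", F.current_timestamp())   # assumed column name
        .write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("customer_churn_params"))

    # ML side: select only records written in the past 24 hours.
    recent = (spark.table("customer_churn_params")
        .filter(F.expr("last_updated >= current_timestamp() - INTERVAL 24 HOURS")))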


Contribute your Thoughts:

Amos
2 months ago
Option B sounds like the data engineering team's dream come true. Structured Streaming? Sounds like they just want to play with the latest and greatest tech.
upvoted 0 times
Lakeesha
6 days ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
Lettie
15 days ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
Celeste
17 days ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
Kristian
2 months ago
Option A is just plain lazy. Why would you want to run the model on every single row when you can be more efficient? That's just asking for trouble.
upvoted 0 times
Caprice
7 days ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
Arlene
18 days ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
Alesia
19 days ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
Renea
2 months ago
I'm torn between Options C and D. Either one could work, but D seems a bit more straightforward with the timestamp field.
upvoted 0 times
Jin
15 days ago
Yes, using the current timestamp field for identification could simplify the process of making predictions on updated records.
upvoted 0 times
Tammy
16 days ago
Option D with the timestamp field does seem straightforward and reliable for identifying changed records.
upvoted 0 times
Laquita
17 days ago
I agree, calculating the difference between previous predictions and current data could be effective.
upvoted 0 times
Johna
1 month ago
Option C seems like a good approach to identify changed records efficiently.
upvoted 0 times
Rory
2 months ago
Option E is my pick. Merging the changed records and making predictions on those is a clean and maintainable solution.
upvoted 0 times
Val
2 months ago
I think option C could also work well by calculating the difference before making new predictions.
upvoted 0 times
Stevie
2 months ago
I agree, option E seems like the most efficient way to handle the changed records.
upvoted 0 times
Ardella
2 months ago
I think Option B is the way to go. Using Structured Streaming to incrementally process the data and make predictions is a more scalable and robust approach.
upvoted 0 times
Jarvis
17 days ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
Rikki
18 days ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
Wilbert
1 month ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
Florinda
2 months ago
I see both points, but I think option B could also be a good approach to consider.
upvoted 0 times
Gerald
2 months ago
Option C makes the most sense to me. Calculating the difference between previous and current predictions is the most efficient way to identify the changes and only run the model on those.
upvoted 0 times
Dwight
2 months ago
I disagree, I believe option E would be more efficient in this scenario.
upvoted 0 times
Janine
3 months ago
I think option C would simplify the identification of changed records.
upvoted 0 times
Ivory
3 months ago
I bet the data engineer is wishing they had a magic wand to deal with all those nested fields. *waves wand*
upvoted 0 times
Lottie
2 months ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
Barbra
2 months ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
Terry
2 months ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
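Several commenters above lean toward option E, replacing the overwrite with a MERGE and reading the change data feed. As a rough sketch of what that alternative could look like; the staging view customer_churn_params_updates, the customer_id key, and the 24-hour window are assumptions for illustration, not part of the question.

    # Option E sketch: MERGE only changed records, then read the change data feed.
    from datetime import datetime, timedelta
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One-time setup: enable the change data feed on the target table.
    spark.sql("""
        ALTER TABLE customer_churn_params
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Nightly: merge the latest upstream values instead of overwriting.
    # "customer_churn_params_updates" and "customer_id" are hypothetical; a real
    # job would also add a MATCHED condition so unchanged rows are not rewritten.
    spark.sql("""
        MERGE INTO customer_churn_params AS t
        USING customer_churn_params_updates AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # ML side: read rows inserted or updated in the past 24 hours from the feed.
    since = (datetime.now() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")
    changes = (spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", since)
        .table("customer_churn_params")
        .filter("_change_type IN ('insert', 'update_postimage')"))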
