
Databricks Certified Data Engineer Professional Exam - Topic 3 Question 31 Discussion

Actual exam question from the Databricks Certified Data Engineer Professional exam
Question #: 31
Topic #: 3

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Suggested Answer: D

The nightly overwrite destroys any record of which rows actually changed, so the table offers no built-in way to isolate the last 24 hours of updates. The suggested answer (option D, as described in the discussion below: add a field populated by current_timestamp() as the data are written) lets downstream jobs filter on that write timestamp to find recently written records without restructuring the existing batch job. Note, however, that a blind full overwrite stamps every row with the new timestamp; several commenters therefore prefer replacing the overwrite with a MERGE statement and enabling Delta Lake's Change Data Feed, which records row-level inserts, updates, and deletes so consumers can read only the rows that genuinely changed. Reference: Databricks documentation on Delta Lake (https://docs.databricks.com/delta/index.html).
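A minimal sketch of the timestamp approach the suggested answer describes, assuming a hypothetical staging view customer_churn_params_staging and a column name updated_ts (neither appears in the question):

```sql
-- Nightly write: stamp rows as they are written.
-- Caveat: if the whole table is still overwritten, every row receives the
-- new timestamp; the stamp only isolates changes if unchanged rows keep
-- their previous value (e.g., via a merge rather than a blind overwrite).
INSERT OVERWRITE customer_churn_params
SELECT s.*, current_timestamp() AS updated_ts
FROM customer_churn_params_staging s;  -- hypothetical upstream-derived view

-- ML job: score only rows written in the past 24 hours.
SELECT *
FROM customer_churn_params
WHERE updated_ts >= current_timestamp() - INTERVAL 24 HOURS;
```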


Contribute your Thoughts:

Ashlyn
3 months ago
B could be interesting, but is it really necessary to switch to streaming?
upvoted 0 times
...
Sue
4 months ago
Wait, does E really handle all the edge cases?
upvoted 0 times
...
Izetta
4 months ago
Not sure about D, seems like it could complicate things.
upvoted 0 times
...
Lashawna
4 months ago
I agree, C is efficient for only predicting on changed records.
upvoted 0 times
...
Lezlie
4 months ago
Option C seems like a solid approach for tracking changes.
upvoted 0 times
...
Luke
4 months ago
I feel like option C makes sense because it focuses on comparing previous predictions, but I’m not sure if that’s the simplest approach compared to others.
upvoted 0 times
...
Amber
5 months ago
I practiced a similar question where we had to decide between batch and streaming processing; I wonder if option B could be a good fit since it talks about incrementally predicting.
upvoted 0 times
...
Micheal
5 months ago
I think option D sounds familiar because it involves adding a timestamp, which could help track changes, but I’m not entirely confident if it’s the most efficient way.
upvoted 0 times
...
Cory
5 months ago
I remember discussing how using a change data capture approach could help identify only the records that have changed, but I'm not sure which option aligns best with that.
upvoted 0 times
...
Ricki
5 months ago
Option D seems like a straightforward solution. Adding a timestamp field and using that to filter the data could be a simple and effective way to solve this problem. I'll need to consider the pros and cons, but it's definitely an option worth exploring.
upvoted 0 times
...
Cruz
5 months ago
I'm leaning towards option E. Replacing the overwrite logic with a merge statement and using a change data feed seems like a good way to identify the changed records. That way, we can focus the predictions on just the updated data.
upvoted 0 times
...
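The merge-plus-Change-Data-Feed pattern described in the comment above can be sketched as follows (hedged; the staging source, join key, and starting table version are assumptions, not part of the question):

```sql
-- One-time setup: enable Delta Lake's Change Data Feed on the table.
ALTER TABLE customer_churn_params
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Nightly job: merge instead of overwrite, so only changed rows are touched.
MERGE INTO customer_churn_params AS t
USING customer_churn_params_staging AS s  -- hypothetical staging source
ON t.customer_id = s.customer_id          -- assumed unique customer key
WHEN MATCHED THEN UPDATE SET *            -- in practice, add a condition so
                                          -- only genuinely changed rows update
WHEN NOT MATCHED THEN INSERT *;

-- ML job: read only the rows the feed reports as inserted or updated.
SELECT *
FROM table_changes('customer_churn_params', 10)  -- 10 = example start version
WHERE _change_type IN ('insert', 'update_postimage');
```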
Ling
5 months ago
Option B sounds interesting, using Structured Streaming to incrementally process the data. That could be a more efficient approach than re-processing everything each time. I'll need to think through the details, but it seems like a promising solution.
upvoted 0 times
...
Katlyn
5 months ago
I'm a bit confused by the question. It's not clear to me how the different options would work in practice. I might try to get some clarification from the instructor before attempting to answer.
upvoted 0 times
...
Lilli
5 months ago
This seems like a tricky question. I'm not sure if I fully understand the requirements, but I think option C might be a good approach. Calculating the difference between previous predictions and the current data could help identify the changed records.
upvoted 0 times
...
Bobbye
5 months ago
Option D seems like a simple solution, but I'm not sure if it's the most efficient. Adding a timestamp field and then using that to identify the records written on a particular date could work, but it might not be as scalable as some of the other options.
upvoted 0 times
...
Cyril
5 months ago
I'm leaning towards option E. Replacing the overwrite logic with a merge statement and then making predictions on the changed records seems like a straightforward way to handle this. It's a common pattern, so I feel pretty confident about that approach.
upvoted 0 times
...
Kanisha
5 months ago
Option B with Structured Streaming sounds like an interesting approach. It could help us process the data more efficiently and only make predictions on the records that have changed. I'll have to look into that a bit more.
upvoted 0 times
...
Huey
5 months ago
I'm a bit confused by this question. It's not entirely clear to me how the different options would work in practice. I might need to think through the pros and cons of each approach a bit more.
upvoted 0 times
...
Crissy
6 months ago
This seems like a tricky question, but I think option C might be the way to go. Calculating the difference between the previous predictions and the current data could help us focus on just the customers that have changed.
upvoted 0 times
...
Amos
11 months ago
Option B sounds like the data engineering team's dream come true. Structured Streaming? Sounds like they just want to play with the latest and greatest tech.
upvoted 0 times
Lakeesha
9 months ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
...
Lettie
10 months ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
...
Celeste
10 months ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
...
...
Kristian
11 months ago
Option A is just plain lazy. Why would you want to run the model on every single row when you can be more efficient? That's just asking for trouble.
upvoted 0 times
Caprice
9 months ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
...
Arlene
10 months ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
...
Alesia
10 months ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
...
...
Renea
11 months ago
I'm torn between Options C and D. Either one could work, but D seems a bit more straightforward with the timestamp field.
upvoted 0 times
Jin
10 months ago
Yes, using the current timestamp field for identification could simplify the process of making predictions on updated records.
upvoted 0 times
...
Tammy
10 months ago
Option D with the timestamp field does seem straightforward and reliable for identifying changed records.
upvoted 0 times
...
Laquita
10 months ago
I agree, calculating the difference between previous predictions and current data could be effective.
upvoted 0 times
...
Johna
10 months ago
Option C seems like a good approach to identify changed records efficiently.
upvoted 0 times
...
...
Rory
11 months ago
Option E is my pick. Merging the changed records and making predictions on those is a clean and maintainable solution.
upvoted 0 times
Val
11 months ago
I think option C could also work well by calculating the difference before making new predictions.
upvoted 0 times
...
Stevie
11 months ago
I agree, option E seems like the most efficient way to handle the changed records.
upvoted 0 times
...
...
Ardella
11 months ago
I think Option B is the way to go. Using Structured Streaming to incrementally process the data and make predictions is a more scalable and robust approach.
upvoted 0 times
Jarvis
10 months ago
E) Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
upvoted 0 times
...
Rikki
10 months ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
...
Wilbert
10 months ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
...
...
Florinda
11 months ago
I see both points, but I think option B could also be a good approach to consider.
upvoted 0 times
...
Gerald
11 months ago
Option C makes the most sense to me. Calculating the difference between previous and current predictions is the most efficient way to identify the changes and only run the model on those.
upvoted 0 times
...
Dwight
11 months ago
I disagree, I believe option E would be more efficient in this scenario.
upvoted 0 times
...
Janine
12 months ago
I think option C would simplify the identification of changed records.
upvoted 0 times
...
Ivory
12 months ago
I bet the data engineer is wishing they had a magic wand to deal with all those nested fields. *waves wand*
upvoted 0 times
Lottie
11 months ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
...
Barbra
11 months ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
...
Terry
11 months ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
...
...
