
Databricks Exam Databricks-Certified-Professional-Data-Engineer Topic 3 Question 31 Discussion

Actual exam question from the Databricks Databricks-Certified-Professional-Data-Engineer exam
Question #: 31
Topic #: 3
[All Databricks-Certified-Professional-Data-Engineer Questions]

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?
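For readers working through the scenario, a minimal sketch of the mechanism under discussion: Delta Lake's Change Data Feed can surface rows changed within a time window. The table name comes from the question; enabling CDF, writing with MERGE/UPDATE instead of a full nightly overwrite (an overwrite marks every row as changed), and the 24-hour lookback window are illustrative assumptions, not part of the question.

    from datetime import datetime, timedelta

    # `spark` is the SparkSession provided by the Databricks runtime.
    # Assumption: Change Data Feed is enabled on the table, e.g.
    #   ALTER TABLE customer_churn_params
    #   SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

    # Start of the 24-hour lookback window, formatted for startingTimestamp.
    start_ts = (datetime.utcnow() - timedelta(hours=24)).strftime("%Y-%m-%d %H:%M:%S")

    changed_rows = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingTimestamp", start_ts)
        .table("customer_churn_params")
        # Keep current row versions only; drop pre-images and deletes.
        .filter("_change_type IN ('insert', 'update_postimage')")
    )

    changed_rows.show()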

Suggested Answer: D

Delta Lake, built on top of Parquet, enhances query performance through data skipping, which is based on the statistics collected for each file in a table. For tables with a large number of columns, Delta Lake by default collects and stores statistics only for the first 32 columns. These statistics include min/max values and null counts, which are used to skip irrelevant data files during query execution. When dealing with highly nested JSON structures, understanding this behavior is crucial for schema design, especially when deciding which fields should be flattened or placed early in the table structure so that data skipping can be leveraged effectively.

Reference: Databricks documentation on Delta Lake optimization techniques, including data skipping and statistics collection (https://docs.databricks.com/delta/optimizations/index.html).
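As a rough illustration of the statistics behavior described above, the number of leading columns for which Delta collects file-level statistics can be adjusted via the delta.dataSkippingNumIndexedCols table property. The table name reuses the one from the question and the value 40 is arbitrary; this is a sketch, not a recommendation.

    # `spark` is the SparkSession provided by the Databricks runtime.
    # Collect min/max/null-count statistics for the first 40 columns instead of
    # the default 32; frequently filtered fields should sit within this prefix.
    spark.sql("""
        ALTER TABLE customer_churn_params
        SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40')
    """)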


Contribute your Thoughts:

Rory
4 days ago
Option E is my pick. Merging the changed records and making predictions on those is a clean and maintainable solution.
upvoted 0 times
...
Ardella
8 days ago
I think Option B is the way to go. Using Structured Streaming to incrementally process the data and make predictions is a more scalable and robust approach.
upvoted 0 times
...
Florinda
9 days ago
I see both points, but I think option B could also be a good approach to consider.
upvoted 0 times
...
Gerald
9 days ago
Option C makes the most sense to me. Calculating the difference between previous and current predictions is the most efficient way to identify the changes and only run the model on those.
upvoted 0 times
...
Dwight
12 days ago
I disagree, I believe option E would be more efficient in this scenario.
upvoted 0 times
...
Janine
16 days ago
I think option C would simplify the identification of changed records.
upvoted 0 times
...
Ivory
17 days ago
I bet the data engineer is wishing they had a magic wand to deal with all those nested fields. *waves wand*
upvoted 0 times
Lottie
6 days ago
C) Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.
upvoted 0 times
...
Barbra
8 days ago
B) Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.
upvoted 0 times
...
Terry
9 days ago
A) Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.
upvoted 0 times
...
...
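Several commenters above lean toward the merge-based option. As a rough sketch only (the customer_id key, the row_hash change indicator, and the current_upstream_values staging view are hypothetical names), replacing the nightly overwrite with a MERGE rewrites only rows that actually changed, which is what makes a change-based read such as the one sketched under the question practical.

    from delta.tables import DeltaTable

    # `spark` is the SparkSession provided by the Databricks runtime.
    # Hypothetical staging view holding today's valid values from upstream sources.
    updates = spark.table("current_upstream_values")

    target = DeltaTable.forName(spark, "customer_churn_params")

    (
        target.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        # Hypothetical row_hash column lets unchanged rows be skipped entirely.
        .whenMatchedUpdateAll(condition="t.row_hash <> s.row_hash")
        .whenNotMatchedInsertAll()
        .execute()
    )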
