
Databricks Exam Databricks-Machine-Learning-Associate Topic 3 Question 16 Discussion

Actual exam question for Databricks's Databricks-Machine-Learning-Associate exam
Question #: 16
Topic #: 3
[All Databricks-Machine-Learning-Associate Questions]

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

Suggested Answer: B

To ensure reproducible training and test sets, writing the split data sets to persistent storage is a reliable approach. This allows you to consistently load the same training and test data for each model run, regardless of cluster reconfiguration or other changes in the environment.
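
For context, here is a minimal sketch (assuming only PySpark 2.4+ and a toy DataFrame; the variable names are illustrative, not taken from the question) of why the split changed in the first place: randomSplit samples within each partition, so when a cluster reconfiguration changes how the input rows are partitioned, the same seed can still produce different training and test sets.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100000)  # stand-in for the static input data set

# Same logical data and the same seed, but a different physical partitioning:
train_a, test_a = df.randomSplit([0.8, 0.2], seed=42)
train_b, test_b = df.repartition(16).randomSplit([0.8, 0.2], seed=42)

# Because randomSplit samples per partition, the two training sets are
# typically not identical; this count is usually greater than zero.
print(train_a.exceptAll(train_b).count())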

Correct approach:

1. Split the data.
2. Write the split data to persistent storage (e.g., HDFS, S3).
3. Load the data from storage for each model training session.

# Split once, then persist both sets so every run trains and tests on the same rows
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)
train_df.write.parquet('path/to/train_df.parquet')
test_df.write.parquet('path/to/test_df.parquet')

# Later, load the data
train_df = spark.read.parquet('path/to/train_df.parquet')
test_df = spark.read.parquet('path/to/test_df.parquet')
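
The design choice here is that the Parquet files, not the seed, become the source of truth: once the split has been written out, every later training session reads back exactly the same rows, regardless of how many workers the cluster has or how the input happens to be partitioned.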


Spark DataFrameWriter Documentation

Contribute your Thoughts:

Sarah
6 months ago
Option B all the way, baby! Writing out the split data sets is like putting your training and test sets in a little data time capsule. Gotta keep that good stuff preserved, you know?
upvoted 0 times
...
Nieves
6 months ago
Hmm, this is a tricky one. I'd say B is the safest bet, but D could work too if you really want to micromanage the process. Either way, you gotta keep that data consistent!
upvoted 0 times
...
Kendra
6 months ago
I'm going to have to go with B. You can't trust the cluster to handle the data splitting consistently, so it's best to save the split data sets for later use.
upvoted 0 times
...
Amira
6 months ago
I think D is the best option. Manually partitioning the input data gives you more control over the data split, which is crucial for reproducibility.
upvoted 0 times
Cathrine
5 months ago
D: Manually configuring the cluster might be time-consuming, but it does guarantee reproducibility.
upvoted 0 times
...
Cecil
5 months ago
C: Writing out the split data sets to persistent storage could also help ensure reproducibility.
upvoted 0 times
...
Karan
5 months ago
B: I agree. It's important to have a consistent training and test set for each model.
upvoted 0 times
...
Leota
5 months ago
A: I think D is the best option. Manually partitioning the input data gives you more control over the data split, which is crucial for reproducibility.
upvoted 0 times
...
...
Serina
6 months ago
Option B seems like the way to go. Writing out the split data sets to persistent storage ensures that the training and test sets are reproducible, no matter how the cluster is configured.
upvoted 0 times
...
Tammy
6 months ago
I'm just impressed the data scientist even noticed the row count change. Most people would have just assumed the pipeline was working fine. Option B is the way to go, no doubt.
upvoted 0 times
Janae
5 months ago
Manually configuring the cluster might not guarantee reproducibility like writing out the split data sets.
upvoted 0 times
...
Merlyn
5 months ago
It's important to have a consistent training and test set for each model.
upvoted 0 times
...
Daniela
6 months ago
Yeah, writing out the split data sets to persistent storage will ensure reproducibility.
upvoted 0 times
...
Annice
6 months ago
I agree, option B is definitely the best choice here.
upvoted 0 times
...
...
Sabrina
6 months ago
Hmm, I'm not sure any of these options are truly 'guaranteed' to work. There's always a chance of some unexpected edge case. But B does seem like the safest bet.
upvoted 0 times
...
Willard
6 months ago
I agree with Santos. Persistent storage is the key to reproducibility. The other options might work in some cases, but B is the most reliable approach.
upvoted 0 times
Angelo
5 months ago
Manually configuring the cluster might not guarantee consistency in the training and test sets.
upvoted 0 times
...
Elden
5 months ago
I agree, that way we can ensure reproducibility for each model.
upvoted 0 times
...
Francene
5 months ago
I think writing out the split data sets to persistent storage is the best approach.
upvoted 0 times
...
...
Santos
6 months ago
Option B is the way to go. Writing out the split data sets to persistent storage ensures that the training and test sets remain consistent across different runs of the pipeline.
upvoted 0 times
Ammie
5 months ago
A: Absolutely. It's crucial for the reliability and accuracy of the model.
upvoted 0 times
...
Therese
6 months ago
B: I agree. It's important to have reproducible results when working with machine learning models.
upvoted 0 times
...
Telma
6 months ago
A: Option B is definitely the best choice. It will ensure consistency in the training and test sets.
upvoted 0 times
...
...
Vilma
7 months ago
But wouldn't manually partitioning the input data also guarantee reproducibility?
upvoted 0 times
...
Evan
7 months ago
I agree with Mauricio, that way we can ensure reproducibility for each model.
upvoted 0 times
...
Mauricio
7 months ago
I think we should write out the split data sets to persistent storage.
upvoted 0 times
...
