
Databricks Machine Learning Associate Exam - Topic 1 Question 37 Discussion

Actual exam question from Databricks's Machine Learning Associate exam
Question #: 37
Topic #: 1

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the training set contains a different number of rows after reconfiguring the cluster than it did before.

Which of the following approaches will guarantee a reproducible training and test set for each model?

Suggested Answer: B

To guarantee reproducible training and test sets, split the data once and write the resulting data sets to persistent storage. Spark's randomSplit samples each partition independently, so even with a fixed seed the split can change when the cluster is reconfigured and the data is partitioned differently. Loading the persisted splits for each model run yields identical training and test data regardless of cluster reconfiguration or other changes in the environment.

Correct approach:

Split the data.

Write the split data to persistent storage (e.g., HDFS, S3).

Load the data from storage for each model training session.

# Split once, with a seed, then persist both sets
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)
train_df.write.parquet('path/to/train_df.parquet')
test_df.write.parquet('path/to/test_df.parquet')

# Later, load the persisted splits for every training run
train_df = spark.read.parquet('path/to/train_df.parquet')
test_df = spark.read.parquet('path/to/test_df.parquet')
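The split-once, persist, reload pattern can also be sketched without a Spark cluster, using only the Python standard library. This is an illustration of the idea, not the Spark API; the helper names and file paths are made up for the example:

```python
import csv
import random
import tempfile
from pathlib import Path

def split_and_persist(rows, train_path, test_path, frac=0.8, seed=42):
    """Shuffle deterministically, split, and write both sets to disk."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    train, test = shuffled[:cut], shuffled[cut:]
    for path, part in [(train_path, train), (test_path, test)]:
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(part)
    return train, test

def load(path):
    """Read a persisted split back as a list of tuples of strings."""
    with open(path, newline="") as f:
        return [tuple(r) for r in csv.reader(f)]

# Usage: later runs read the persisted files instead of re-splitting,
# so the training and test sets are identical on every run.
tmp = Path(tempfile.mkdtemp())
rows = [(str(i), f"feature_{i}") for i in range(100)]
train, test = split_and_persist(rows, tmp / "train.csv", tmp / "test.csv")
assert load(tmp / "train.csv") == train and load(tmp / "test.csv") == test
```

The key difference from the Spark snippet above is that here the shuffle itself is deterministic for a given seed; with Spark's per-partition randomSplit, persisting the output is what removes the dependence on cluster layout.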


Spark DataFrameWriter Documentation

Contribute your Thoughts:

Robt
9 hours ago
Totally agree with B, reproducibility is crucial!
upvoted 0 times
...
Herman
6 days ago
Wait, how can the row count change just by adding workers?
upvoted 0 times
...
Catherin
11 days ago
I disagree, D might be better for control.
upvoted 0 times
...
Frederic
16 days ago
B is the way to go! Persistent storage is key.
upvoted 0 times
...
Cheryl
21 days ago
Option B is the clear winner here. Persistent storage is the way to ensure your model training is consistent and reproducible.
upvoted 0 times
...
Catarina
26 days ago
Haha, I bet the data scientist is wishing they had a time machine to go back and split the data properly in the first place.
upvoted 0 times
...
Kenny
1 month ago
Option B is the most reliable approach. Reproducibility is key in machine learning, and writing the data to storage is the way to achieve that.
upvoted 0 times
...
Charlie
1 month ago
Manually partitioning the input data sounds familiar, but I wonder if that really ensures the same training and test sets every time.
upvoted 0 times
...
Shelton
1 month ago
I think writing out the split data sets to persistent storage makes sense, especially since it would keep the same data for each run.
upvoted 0 times
...
Kiera
2 months ago
I remember something about ensuring reproducibility in data splits, but I'm not sure if just manually configuring the cluster would help.
upvoted 0 times
...
Kattie
2 months ago
Hmm, this is an interesting one. I'm a bit stumped on why the row count would change like that. I think option B, writing out the split data sets, is probably the most reliable approach to ensure I can reproduce the results. I'll make sure to do that in the exam.
upvoted 0 times
...
Margurite
2 months ago
Ah, I see what's going on here. The non-deterministic data splitting is likely due to the way Spark ML handles partitioning when you scale up the cluster. I'd go with option B - writing out the split data sets to persistent storage. That way I can be sure I'm working with the same data every time.
upvoted 0 times
...
Mozell
2 months ago
I agree, B is the best choice. Manually configuring the cluster or partitioning the data could lead to inconsistencies.
upvoted 0 times
...
Maddie
2 months ago
Okay, let me think this through. If the number of rows is changing, that suggests the data splitting operation might be non-deterministic. I think option B, writing out the split data sets, is the safest bet to ensure I can reproduce the results.
upvoted 0 times
...
Audria
3 months ago
Option B is the way to go. Writing out the split data sets to persistent storage ensures that the training and test sets are consistent across runs.
upvoted 0 times
...
Lynelle
3 months ago
I feel like setting a seed in the data splitting operation could be relevant, but I can't recall if that's what guarantees reproducibility.
upvoted 0 times
...
Michel
3 months ago
Hmm, this is a tricky one. I'm a bit confused about why the number of rows would change like that. I'm leaning towards option D, manually partitioning the input data, since that would give me more control over the data splitting process.
upvoted 0 times
...
Raylene
3 months ago
I'm not sure why the number of rows would change after reconfiguring the cluster. That seems really strange to me. I think option B, writing out the split data sets to persistent storage, would be the best approach to ensure reproducibility.
upvoted 0 times
Stephaine
2 months ago
I agree, writing out the split data sets makes sense.
upvoted 0 times
...
...
