A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.
Which of the following approaches will guarantee a reproducible training and test set for each model?
To ensure reproducible training and test sets, writing the split data sets to persistent storage is a reliable approach. Setting a seed in randomSplit() is not sufficient on its own, because the result of the split depends on how the data is partitioned; changing the cluster configuration (for example, adding workers) can change the partitioning and therefore which rows land in each split, even with the same seed. Persisting the splits lets you load exactly the same training and test data for every model run, regardless of cluster reconfiguration or other changes in the environment.
Correct approach:
Split the data.
Write the split data to persistent storage (e.g., HDFS, S3).
Load the data from storage for each model training session.
train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)

# Persist the splits so every later run reads exactly the same rows
train_df.write.parquet('path/to/train_df.parquet')
test_df.write.parquet('path/to/test_df.parquet')

# Later, load the data from storage instead of re-splitting
train_df = spark.read.parquet('path/to/train_df.parquet')
test_df = spark.read.parquet('path/to/test_df.parquet')
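As a minimal sketch of how the reloaded splits might then be used, the snippet below fits a Spark ML pipeline on the persisted training split and evaluates it on the persisted test split. The feature columns ("f1", "f2", "f3"), the "label" column, and the choice of LogisticRegression are assumptions for illustration only and are not part of the original question.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Assumed schema: feature columns f1, f2, f3 and a binary "label" column
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Fit on the persisted training split and evaluate on the persisted test split;
# both splits stay identical across cluster configurations because they are
# loaded from storage rather than re-split.
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("Test AUC:", evaluator.evaluate(predictions))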