Amazon MLS-C01 Exam - Topic 3 Question 104 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 104
Topic #: 3

A finance company needs to forecast the price of a commodity. The company has compiled a dataset of historical daily prices. A data scientist must train various forecasting models on 80% of the dataset and must validate the efficacy of those models on the remaining 20% of the dataset.

How should the data scientist split the dataset into a training dataset and a validation dataset to compare model performance?

Suggested Answer: A

Comprehensive Explanation: The best way to split the dataset is to pick a date so that 80% of the data points precede the date, assign that group as the training dataset, and assign the remaining 20% as the validation dataset. This preserves the temporal order of the data, so every model is validated on the most recent prices, which it never saw during training. That matters for forecasting models, which depend on sequential data. The other methods either ignore the temporal structure (for example, random sampling lets future prices leak into training) or reverse it, which biases the comparison.
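
To make the date-based split concrete, here is a minimal Python sketch. It is only an illustration, not part of the exam material: the function name `chronological_split` and the ten sample prices are hypothetical, and it assumes each row is a `(date, price)` pair.

```python
from datetime import date, timedelta


def chronological_split(rows, train_fraction=0.8):
    """Split (date, price) rows so the earliest train_fraction becomes training data."""
    ordered = sorted(rows, key=lambda row: row[0])  # oldest first
    cutoff = int(len(ordered) * train_fraction)     # index of the first validation row
    return ordered[:cutoff], ordered[cutoff:]


# Hypothetical data: 10 daily prices, invented purely for illustration.
start = date(2024, 1, 1)
prices = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]

train, validation = chronological_split(prices)
split_date = validation[0][0]  # every training row precedes this date

print(len(train), len(validation), split_date)
```

Because the rows are sorted before slicing, no validation price is older than any training price, which is the property that random sampling would break.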



Contribute your Thoughts:

Herminia
3 months ago
C sounds too complicated for this kind of task.
upvoted 0 times
...
Shoshana
3 months ago
A is definitely the way to go!
upvoted 0 times
...
Odette
3 months ago
Wait, can you really just pick a random date?
upvoted 0 times
...
Omega
4 months ago
I disagree, B seems more logical to me.
upvoted 0 times
...
Yvette
4 months ago
Option A makes the most sense for time series data.
upvoted 0 times
...
Shawn
4 months ago
Random sampling sounds tempting, but I recall that for time series forecasting, we shouldn't randomize the order, so I think option D is not appropriate.
upvoted 0 times
...
Berry
4 months ago
I practiced a similar question where we had to split data for model validation, and I feel like option C is too small a sample size for training.
upvoted 0 times
...
Magdalene
4 months ago
I'm not entirely sure, but I think picking a date for the split is crucial, and option B seems wrong because it suggests using later data for training.
upvoted 0 times
...
Mi
5 months ago
I remember we discussed that for time series data, it's important to maintain the chronological order, so I think option A makes the most sense.
upvoted 0 times
...
Veronica
5 months ago
Hmm, this is a tricky one. I'm leaning towards option B, where the training set comes after the validation set. That way, the model can learn from the future and be tested on the past, which could potentially work better for forecasting. But I'm not 100% sure, so I might need to do some research to decide.
upvoted 0 times
...
Sol
5 months ago
For time series data, I think option A is the way to go. Splitting by date ensures that the training set comes before the validation set, which mimics the real-world scenario where you'd use historical data to forecast the future. The other options don't seem as appropriate for this type of problem.
upvoted 0 times
...
Angella
5 months ago
This seems like a straightforward time series forecasting problem. I'd go with option A - split the dataset by date so that the training set comes before the validation set. That way, the model can learn from the past and be tested on future data, which is the real-world scenario.
upvoted 0 times
...
Jolene
5 months ago
I'm a bit confused on the best approach here. Should I really just split by date, or is there a more sophisticated way to do the train-test split? Option D sounds interesting, but I'm not sure if random sampling is the right way to handle time series data.
upvoted 0 times
...
Shasta
5 months ago
Okay, let's see. Aspirin sensitivity is the key here. I think I know the right answer, but I'll double-check my reasoning.
upvoted 0 times
...
Johnna
1 year ago
Wait, are we sure the answer isn't B? Because if it's not, I'm going to be kicking myself for the rest of the day. Option B all the way!
upvoted 0 times
...
Daniel
1 year ago
Option D might sound tempting, but that would just be a random mess. We need to split the data in a way that mimics the real-world scenario the model will be used in.
upvoted 0 times
Jesusa
1 year ago
C: Definitely. Option A ensures that the model is trained on past data and validated on future data, just like in real life.
upvoted 0 times
...
Denise
1 year ago
B: I agree. Option D would not provide a realistic representation of the data. We need to split it properly.
upvoted 0 times
...
Donte
1 year ago
A: Option A seems like the best choice. We need to maintain the chronological order of the data for accurate forecasting.
upvoted 0 times
...
...
Catarina
1 year ago
Haha, I'm just picturing the data scientist flipping a coin to decide which data points go where. But in all seriousness, Option B is the clear winner here.
upvoted 0 times
Cherrie
1 year ago
Definitely, random sampling wouldn't be as effective as choosing a date for the split.
upvoted 0 times
...
Jovita
1 year ago
Yeah, it makes sense to use a specific date to divide the data points.
upvoted 0 times
...
Erick
1 year ago
I agree, Option B is the most logical choice for splitting the dataset.
upvoted 0 times
...
...
Lyndia
1 year ago
I think randomly sampling data points for the training dataset is also a valid approach. As long as it's done without replacement, it should provide a good representation of the dataset.
upvoted 0 times
...
James
1 year ago
I agree with Kimberely. It makes sense to split the dataset based on a specific date to ensure a fair comparison of model performance.
upvoted 0 times
...
Kimberely
1 year ago
I think the data scientist should pick a date so that 80% of the data points precede the date and assign them as the training dataset.
upvoted 0 times
...
Destiny
1 year ago
I agree with Stefany. Option B is the way to go. Forecasting models need to be trained on historical data and then tested on future data to see how well they perform.
upvoted 0 times
...
Stefany
1 year ago
Option B makes the most sense. We want the training data to come first in time, so the model can learn from the past and then be validated on the future data.
upvoted 0 times
Carissa
1 year ago
Stratified sampling could introduce bias and not represent the dataset accurately.
upvoted 0 times
...
Fannie
1 year ago
Randomly sampling data points might not capture the time sequence needed for accurate forecasting.
upvoted 0 times
...
Muriel
1 year ago
It's important for the model to learn from past data first before being validated on future data.
upvoted 0 times
...
Nydia
1 year ago
I agree, option B is the best choice for splitting the dataset.
upvoted 0 times
...
...
