Welcome to Pass4Success

Amazon MLS-C01 Exam - Topic 6 Question 87 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 87
Topic #: 6

A machine learning engineer is building a bird classification model. The engineer randomly separates a dataset into a training dataset and a validation dataset. During the training phase, the model achieves very high accuracy. However, the model does not generalize well when evaluated on the validation dataset. The engineer realizes that the original dataset was imbalanced.

What should the engineer do to improve the validation accuracy of the model?

A) Perform stratified sampling on the original dataset.
B) Acquire additional data about the majority classes in the original dataset.
C) Use a smaller, randomly sampled version of the training dataset.
D) Perform systematic sampling on the original dataset.

Suggested Answer: A

Stratified sampling is a technique that preserves the class distribution of the original dataset when the dataset is split or subsampled: each class keeps the same proportion in every split as it has in the original data. Applied here, it ensures that both the training dataset and the validation dataset are representative of the original dataset, so a minority bird class cannot end up missing or under-represented in one of them by chance. This makes the validation estimate less noisy and ensures the model sees every class during training, which helps it generalize. Stratification by itself does not rebalance the classes; if rebalancing is also needed, oversampling or undersampling can be applied to the training split after the stratified split.
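
As a rough illustration of the idea, here is a minimal standard-library sketch of a stratified train/validation split. The function name `stratified_split`, the 90/10 "sparrow"/"owl" labels, and the 20% validation fraction are all made up for this example and are not part of the exam question. In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with each class split in the same proportion.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Made-up imbalanced dataset: 90% sparrows, 10% owls.
labels = ["sparrow"] * 90 + ["owl"] * 10
train_idx, val_idx = stratified_split(labels, val_fraction=0.2, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print("train:", dict(train_counts))
print("validation:", dict(val_counts))
```

Both splits keep the original 9:1 ratio, whereas a plain random split of the same data can leave the validation set with very few owls or none at all.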

The other options are not effective ways to improve the validation accuracy of the model. Acquiring additional data about the majority classes in the original dataset will only increase the imbalance and make the model more biased towards the majority classes. Using a smaller, randomly sampled version of the training dataset will not guarantee that the class distribution is preserved and may result in losing important information from the minority classes. Performing systematic sampling on the original dataset will also not ensure that the class distribution is preserved and may introduce sampling bias if the original dataset is ordered or grouped by class.
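
The point about systematic sampling can be seen in a small sketch. The data layout below (every tenth record is an owl) is an assumption chosen to make the failure mode visible, and `systematic_sample` is a hypothetical helper, not a library function. It shows how taking every k-th record can interact with the ordering of the data.

```python
from collections import Counter


def systematic_sample(n_items, step, start=0):
    """Pick every `step`-th index, beginning at `start` (illustrative helper)."""
    return list(range(start, n_items, step))


# Made-up dataset with a periodic layout: 9 sparrows, then 1 owl, repeated.
labels = (["sparrow"] * 9 + ["owl"]) * 10

# Same step, two different starting offsets.
sample_a = systematic_sample(len(labels), step=5, start=0)
sample_b = systematic_sample(len(labels), step=5, start=4)

counts_a = Counter(labels[i] for i in sample_a)
counts_b = Counter(labels[i] for i in sample_b)
print("start=0:", dict(counts_a))
print("start=4:", dict(counts_b))
```

With this layout, one offset yields a sample with no owls and the other yields a sample that is half owls, although owls make up 10% of the data. Nothing in systematic sampling ties the sample to the class proportions, which is what stratification guarantees.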

Contribute your Thoughts:

Lai
3 months ago
Agreed, stratified sampling seems like the best option!
upvoted 0 times
...
Geoffrey
3 months ago
Wait, systematic sampling? Isn't that less effective for imbalanced data?
upvoted 0 times
...
Marlon
3 months ago
Not sure if a smaller dataset is the answer here...
upvoted 0 times
...
Laurel
4 months ago
I think acquiring more data for the majority classes could help too.
upvoted 0 times
...
Mila
4 months ago
Stratified sampling is definitely the way to go!
upvoted 0 times
...
Cassi
4 months ago
Systematic sampling doesn't seem like it would address the imbalance issue effectively. I think option D might not be the right choice, but I can't remember the details of why we ruled it out in our studies.
upvoted 0 times
...
Iluminada
4 months ago
I feel like using a smaller, randomly sampled version of the training dataset might not actually help with the validation accuracy. Option C seems off to me, but I can't recall the exact reasoning.
upvoted 0 times
...
Sheron
4 months ago
I'm not entirely sure, but I think acquiring more data for the minority classes could also be a solution. Option B sounds familiar from our practice questions, but I wonder if it's the best approach here.
upvoted 0 times
...
Ailene
5 months ago
I remember we discussed the importance of stratified sampling in class, especially for imbalanced datasets. It seems like option A could help ensure that all classes are represented in both training and validation sets.
upvoted 0 times
...
Anjelica
5 months ago
I'm leaning towards option A - performing stratified sampling on the original dataset. That way, the training and validation sets will have a balanced representation of the classes, which should help the model generalize better.
upvoted 0 times
...
Glenn
5 months ago
Okay, I'm a bit confused here. If the model is already performing well on the training set, then the issue seems to be with the validation set. Acquiring more data for the majority classes might help, but I'm not sure if that's the best approach.
upvoted 0 times
...
Remedios
5 months ago
Hmm, this seems like a tricky one. I think the key is to address the imbalance in the original dataset. Stratified sampling could be a good approach to ensure the training and validation sets have a representative distribution of the classes.
upvoted 0 times
...
Karan
5 months ago
Using a smaller, randomly sampled version of the training dataset doesn't seem like the right solution here. That could actually make the imbalance problem worse. I think focusing on the sampling approach is the way to go.
upvoted 0 times
...
Sang
5 months ago
Okay, let me see here. I'm a bit unsure about this one, but I'll give it my best shot. Gotta stay focused and not overthink it.
upvoted 0 times
...
Ahmed
2 years ago
That's a good point, Fletcher. Having more data for the majority classes can definitely make the model more robust.
upvoted 0 times
...
Fletcher
2 years ago
I think acquiring additional data about the majority classes could also help improve validation accuracy.
upvoted 0 times
...
Alayna
2 years ago
I agree with Ahmed. Stratified sampling can help balance out the classes in the dataset.
upvoted 0 times
...
Ahmed
2 years ago
I think the engineer should perform stratified sampling on the original dataset.
upvoted 0 times
...
Isabelle
2 years ago
That's a good point, King. Having more data for the majority classes can definitely make the model more robust.
upvoted 0 times
...
King
2 years ago
I think acquiring additional data about the majority classes could also help improve validation accuracy.
upvoted 0 times
...
Fausto
2 years ago
I agree with Isabelle. Stratified sampling can help balance out the classes in the dataset.
upvoted 0 times
...
Isabelle
2 years ago
I think the engineer should perform stratified sampling on the original dataset.
upvoted 0 times
...
Kaycee
2 years ago
Haha, using a smaller, randomly sampled version of the training data? That's like trying to lose weight by cutting off your foot - not gonna work, my dude. Definitely don't go with option C.
upvoted 0 times
...
Isaiah
2 years ago
Acquiring more data for the majority classes, as option B suggests, could also be a good approach. But that might take a lot of time and effort. Stratified sampling seems like the more efficient solution here.
upvoted 0 times
...
Julianna
2 years ago
I'd go with option A, stratified sampling. That way, you can ensure that the training and validation sets have the same distribution of classes, which should help the model generalize better. Randomized sampling can sometimes lead to skewed class distributions in the splits.
upvoted 0 times
...
Maira
2 years ago
Oh man, this question is a tricky one. We've all been there, building a model that does great on the training data but tanks on the validation set. Sounds like the engineer has an imbalanced dataset on their hands, which is a common problem in ML.
upvoted 0 times
Arminda
2 years ago
Yes, adding more data for minority classes can definitely help improve model performance.
upvoted 0 times
...
Stephaine
2 years ago
B) Acquire additional data about the majority classes in the original dataset.
upvoted 0 times
...
Gail
2 years ago
That's a good suggestion. It helps in maintaining the class distribution.
upvoted 0 times
...
Jesusa
2 years ago
A) Perform stratified sampling on the original dataset.
upvoted 0 times
...
...
