Amazon MLA-C01 Exam - Topic 4 Question 18 Discussion

An ML engineer is training an ML model to identify medical patients for disease screening. The tabular dataset for training contains 50,000 patient records: 1,000 with the disease and 49,000 without the disease. The ML engineer splits the dataset into a training dataset, a validation dataset, and a test dataset. What should the ML engineer do to transform the data and make the data suitable for training?
A) Apply principal component analysis (PCA) to oversample the minority class in the training dataset.
B) Apply Synthetic Minority Oversampling Technique (SMOTE) to generate new synthetic samples of the minority class in the training dataset.
C) Randomly oversample the majority class in the validation dataset.
D) Apply k-means clustering to undersample the minority class in the test dataset.

Actual exam question for Amazon's MLA-C01 exam
Question #: 18
Topic #: 4

Suggested Answer: B

This dataset shows severe class imbalance, with only 2% of records representing patients with the disease. AWS ML best practices recommend correcting imbalance only in the training dataset, while keeping validation and test sets representative of real-world distributions.
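
To illustrate keeping the validation and test sets representative, here is a minimal sketch of a stratified split in plain Python. The 70/15/15 fractions and the `stratified_split` helper are assumptions made for illustration; the question does not state how the data is split.

```python
import random

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split row indices into train/validation/test, preserving class ratios.

    The default 70/15/15 fractions are an illustrative assumption.
    """
    rng = random.Random(seed)

    # Group row indices by class label so each class is split separately.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test

# Labels matching the question: 1,000 with the disease (1), 49,000 without (0).
labels = [1] * 1000 + [0] * 49000
train_idx, val_idx, test_idx = stratified_split(labels)

def positive_rate(indices):
    """Fraction of the given rows that belong to the disease class."""
    return sum(labels[i] for i in indices) / len(indices)

for name, idx in (("train", train_idx), ("validation", val_idx), ("test", test_idx)):
    print(f"{name}: {len(idx)} records, {positive_rate(idx):.1%} positive")
```

Each split keeps roughly the original 2% disease rate. Any rebalancing would then be applied to the rows in `train_idx` only.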

Synthetic Minority Oversampling Technique (SMOTE) generates synthetic samples of the minority class by interpolating between existing minority examples. This improves the model's ability to learn disease-related patterns without discarding data.
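
The interpolation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a reference implementation: `smote_sketch`, the toy feature values, and the choice of `k` are all hypothetical. In practice a maintained library implementation (for example the `SMOTE` class in imbalanced-learn) would normally be used instead.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority rows by interpolating toward nearest neighbors.

    minority: list of numeric feature rows (minority class only, at least 2 rows).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the base row, excluding itself.
        neighbors = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        # New point on the line segment between the base row and its neighbor.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority-class rows taken from the TRAINING split only (made-up values).
train_minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote_sketch(train_minority, n_synthetic=6, k=2)
print(len(new_rows), "synthetic rows generated")
```

Because every synthetic row lies between two real minority rows, the new points stay inside the region the minority class already occupies.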

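PCA is a dimensionality reduction method, not an oversampling technique, which rules out option A. Oversampling the majority class worsens the imbalance, and resampling the validation dataset has no effect on what the model learns (option C). k-means is a clustering algorithm, undersampling the minority class would discard the few disease examples available, and altering the test dataset would invalidate evaluation results (option D).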
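
The effect of each resampling direction on the class ratio can be checked with simple arithmetic. The sketch below assumes a hypothetical training split of 700 disease records and 34,300 non-disease records (70% of the data); the question does not give the actual split sizes.

```python
def minority_share(n_minority, n_majority):
    """Fraction of all records that belong to the minority class."""
    return n_minority / (n_minority + n_majority)

# Hypothetical training split: 70% of 1,000 positives and 49,000 negatives.
n_pos, n_neg = 700, 34_300

original = minority_share(n_pos, n_neg)
# Oversampling the MAJORITY class (here, doubling it) shrinks the minority share.
majority_doubled = minority_share(n_pos, 2 * n_neg)
# Oversampling the MINORITY class up to parity yields a balanced training set.
minority_balanced = minority_share(n_neg, n_neg)

print(f"original:          {original:.2%}")
print(f"majority doubled:  {majority_doubled:.2%}")
print(f"minority balanced: {minority_balanced:.2%}")
```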

Therefore, applying SMOTE to the training dataset is the correct approach.


Contribute your Thoughts:

Harrison
1 month ago
I feel like randomly oversampling the majority class in the validation dataset doesn't really help with training, but I can't recall the exact reason.
upvoted 0 times
Adaline
1 month ago
I think SMOTE is often recommended for handling imbalanced datasets, so I might lean towards option B.
upvoted 0 times
Asuncion
1 month ago
I remember we discussed the importance of balancing the dataset, but I'm not sure if PCA is the right choice for oversampling.
upvoted 0 times