Amazon MLS-C01 Exam - Topic 1 Question 90 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 90
Topic #: 1

A company wants to segment a large group of customers into subgroups based on shared characteristics. The company's data scientist is planning to use the Amazon SageMaker built-in k-means clustering algorithm for this task. The data scientist needs to determine the optimal number of subgroups (k) to use.

Which data visualization approach will MOST accurately determine the optimal value of k?

Suggested Answer: A

SageMaker Data Wrangler is a feature of SageMaker Studio that provides an end-to-end solution for importing, preparing, transforming, featurizing, and analyzing data. It includes built-in analyses that generate visualizations and data insights in a few clicks. One of these is the Quick Model visualization, which quickly evaluates the data and produces an importance score for each feature.

A feature importance score indicates how useful a feature is for predicting a target label. The score lies in the range [0, 1], and a higher number indicates that the feature is more important to the whole dataset. Quick Model trains a random forest model and calculates each feature's importance using the Gini importance method, which measures the total reduction in node impurity (a measure of how mixed the classes are within a node) attributed to splits on that feature.

The ML developer can use the Quick Model visualization to obtain importance scores for every feature in the dataset and use them to guide feature engineering. This solution requires the least development effort compared to the other options.
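The impurity reduction behind the Gini importance method can be worked through by hand for a single split. The sketch below is not Data Wrangler's code: the function names and the toy labels are made up for illustration, and the final normalization (dividing by the total so the scores fall in [0, 1]) is an assumption about how such scores are commonly scaled. In a real random forest the reductions would be accumulated over every node and every tree.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 0 when pure, larger when classes are mixed."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_reduction(parent, left, right):
    """Drop in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy node with two balanced classes (hypothetical data).
parent = ["a"] * 4 + ["b"] * 4

# A split on a hypothetical "strong" feature separates the classes perfectly.
strong = impurity_reduction(parent, ["a"] * 4, ["b"] * 4)

# A split on a hypothetical "weak" feature leaves both children mixed.
weak = impurity_reduction(parent, ["a", "a", "a", "b"], ["a", "b", "b", "b"])

# Assumed normalization: scale the reductions so they sum to 1.
total = strong + weak
importance = {"strong": strong / total, "weak": weak / total}
print(importance)
```

The feature whose split removes more impurity ends up with the higher score, which is the ordering the importance chart is meant to convey.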

References:

* Analyze and Visualize

* Detect multicollinearity, target leakage, and feature correlation with Amazon SageMaker Data Wrangler


Contribute your Thoughts:

Merilyn
3 months ago
B seems interesting, but I’m not sure it’s the best for k-means.
upvoted 0 times
...
Rusty
3 months ago
Totally agree with D, it’s the most straightforward approach!
upvoted 0 times
...
Justine
3 months ago
Wait, can PCA really help with k-means? Seems a bit off.
upvoted 0 times
...
Claribel
4 months ago
I think A sounds more intuitive with the scatter plot!
upvoted 0 times
...
Ashleigh
4 months ago
Option D is the classic elbow method, right?
upvoted 0 times
...
Celestina
4 months ago
I recall that the elbow method is a common technique for this kind of problem, so option D makes the most sense to me. It aligns with what we practiced in class.
upvoted 0 times
...
Cherelle
4 months ago
I practiced a question about t-SNE before, but I don't think it's the right approach here. Option C feels off since it focuses on perplexity rather than directly on k.
upvoted 0 times
...
Colette
4 months ago
I'm not entirely sure, but I think PCA is useful for dimensionality reduction. Option A seems like it could help visualize the clusters, but I'm not confident it's the best way to determine k.
upvoted 0 times
...
Samira
5 months ago
I remember we discussed using the elbow method to find the optimal k, which sounds similar to option D with the SSE plot.
upvoted 0 times
...
Carolynn
5 months ago
I'm leaning towards the SSE plot as well. It's a classic technique that I'm comfortable with, and it should give me a clear indication of the optimal number of clusters. The other options seem a bit more complex and risky for an exam setting.
upvoted 0 times
...
Francis
5 months ago
The t-SNE plot seems interesting, but I'm not as familiar with it. I'd have to do some research to make sure I'm using it correctly. The SSE plot feels like the safer bet for this exam question.
upvoted 0 times
...
Fernanda
5 months ago
I'm a bit torn between the SSE plot and the PCA scatter plot. The PCA approach might give me a more visual sense of how the clusters are shaping up, but the SSE plot is probably more objective and quantitative.
upvoted 0 times
...
Ernie
5 months ago
I think the sum of squared errors (SSE) plot is the way to go here. It's a classic technique for determining the optimal number of clusters, and it's pretty straightforward to implement.
upvoted 0 times
...
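For anyone who wants to see the elbow method from the replies above in concrete form, here is a minimal sketch. It is a plain-Python k-means written for illustration, not the SageMaker built-in algorithm; the synthetic three-blob data, the farthest-first initialization, and the second-difference elbow heuristic are all assumptions chosen to keep the example small and deterministic.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, iters=50, seed=0):
    """Run a basic Lloyd's k-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    # Farthest-first initialization (an illustrative choice): start from one
    # random point, then repeatedly add the point farthest from all centroids.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # leave an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def make_blobs(centers, n_per_center=50, spread=0.5, seed=42):
    """Synthetic 2-D data: Gaussian blobs around the given centers."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread))
            for cx, cy in centers for _ in range(n_per_center)]

def elbow(sse_by_k):
    """Pick the interior k where the SSE curve bends most (largest second difference)."""
    ks = sorted(sse_by_k)
    bend = {k: (sse_by_k[k - 1] - sse_by_k[k]) - (sse_by_k[k] - sse_by_k[k + 1])
            for k in ks[1:-1]}
    return max(bend, key=bend.get)

# Three well-separated blobs, so the curve should bend at k = 3.
points = make_blobs([(0, 0), (10, 0), (5, 9)])
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(f"k={k}  SSE={sse:.1f}")
print("elbow at k =", elbow(sse_by_k))
```

Plotting the printed SSE values against k gives the curve people are describing: it falls steeply until k reaches the number of real groups and flattens afterwards. On real customer data the bend is usually less sharp than in this toy case, so the plot is read by eye rather than by a fixed rule.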
Lavera
10 months ago
Clustering customers? Sounds like a job for the k-means algorithm! Though I'd be tempted to just group them by their favorite ice cream flavor. Chocolate chip or bust!
upvoted 0 times
Arthur
8 months ago
I prefer option B. Creating a line plot of the explained variance will give a clear indication of when to stop adding clusters.
upvoted 0 times
...
Eileen
9 months ago
I agree. Option D might also work, but I think visually seeing the clusters on a scatter plot is more intuitive.
upvoted 0 times
...
Krystina
9 months ago
I think option A is the way to go. Creating scatter plots with different colors for each cluster will make it easier to see when they start to separate.
upvoted 0 times
...
...
Cammy
10 months ago
I'm with Ryan on this one. The SSE plot in option D is a simple and effective way to identify the elbow point and determine the optimal k.
upvoted 0 times
Ernest
9 months ago
I prefer the t-SNE plot in option C as it can provide a different perspective on the clustering structure.
upvoted 0 times
...
Lashawnda
9 months ago
I think using PCA components and creating scatter plots in option A could also be helpful in visualizing the clusters.
upvoted 0 times
...
Afton
10 months ago
I think both options have their merits, it depends on the specific dataset and goals of the analysis.
upvoted 0 times
...
Truman
10 months ago
I agree, the SSE plot in option D is a straightforward method to find the optimal k.
upvoted 0 times
...
Herman
10 months ago
True, the PCA components can help with dimensionality reduction and clustering.
upvoted 0 times
...
Josphine
10 months ago
But using PCA components in option A can also give a clear visualization of the clusters.
upvoted 0 times
...
Deandrea
10 months ago
I agree, the SSE plot in option D is a straightforward method to find the optimal k.
upvoted 0 times
...
...
Sherill
10 months ago
Ha! The t-SNE option (C) is a bit of a wild card. You'd have to play around with the perplexity to get a feel for the clusters, which sounds like a lot of work.
upvoted 0 times
...
Erasmo
10 months ago
B doesn't seem quite right to me. The explained variance curve may not always decrease linearly, so I'm not sure that's the best way to find k.
upvoted 0 times
...
Ryan
10 months ago
I think option D is the way to go. The elbow method using the SSE plot is a classic approach to determining the optimal number of clusters.
upvoted 0 times
Filiberto
10 months ago
I think option A might also work well, visually seeing the clusters can give a good indication of the optimal k value.
upvoted 0 times
...
Nelida
10 months ago
I agree, the elbow method is a reliable way to find the optimal number of clusters.
upvoted 0 times
...
...
Penney
11 months ago
That's a valid point, Whitley. Calculating SSE can indeed provide a more quantitative measure of the optimal number of subgroups.
upvoted 0 times
...
Whitley
11 months ago
I disagree, I believe option D is more accurate. Calculating SSE and plotting the curve will give a clear indication of the optimal value of k.
upvoted 0 times
...
Penney
11 months ago
I think option A is the best approach. Using PCA components and creating scatter plots will visually show the separation of clusters.
upvoted 0 times
...
