Welcome to Pass4Success


Amazon MLS-C01 Exam - Topic 1 Question 91 Discussion

Actual exam question for Amazon's MLS-C01 exam
Question #: 91
Topic #: 1
[All MLS-C01 Questions]

A Data Scientist wants to gain real-time insights into a data stream of GZIP files. Which solution would allow the use of SQL to query the stream with the LEAST latency?

A) Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
B) AWS Glue with a custom ETL script to transform the data.
C) An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
D) Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.

Suggested Answer: A

The solution with the least latency is Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data. Kinesis Data Analytics runs standard SQL queries directly against streaming data, producing results continuously as records arrive. Because Kinesis Data Analytics cannot read GZIP-compressed records natively, a Lambda preprocessing function is attached to the application to decompress each record before the SQL statements are applied. This preprocessing adds only per-record, sub-second overhead, so the end-to-end path remains real time.

The other options either add significant latency or do not provide SQL over the stream:

* AWS Glue with a custom ETL script is a batch-oriented service. Glue jobs run on a schedule or trigger rather than continuously, so they cannot deliver real-time insights into a stream.

* The Amazon Kinesis Client Library requires writing and operating a custom consumer application, and once the data lands in an Amazon ES cluster it is queried with the Elasticsearch query DSL rather than SQL, after an indexing delay before records become searchable.

* Amazon Kinesis Data Firehose buffers records (at minimum 60 seconds or 1 MB, whichever comes first) before delivering them to Amazon S3, and the data in S3 still needs a separate query step, so the buffering alone rules out least-latency SQL on the stream.

References:

* Amazon Kinesis Data Analytics for SQL Applications Developer Guide

* Preprocessing Data Using a Lambda Function (Kinesis Data Analytics)
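As a rough sketch of the preprocessing step described above, a Lambda function attached to a Kinesis Data Analytics application receives base64-encoded records and must return each one with a `recordId`, a `result` status, and re-encoded `data`. The handler below assumes the incoming records are GZIP-compressed payloads; the event shape follows the Kinesis Data Analytics preprocessing record contract, but treat the details as illustrative rather than authoritative.

```python
import base64
import gzip


def lambda_handler(event, context):
    """Decompress GZIP-encoded Kinesis records so the downstream
    Kinesis Data Analytics SQL application can read them as plain text."""
    output = []
    for record in event["records"]:
        try:
            compressed = base64.b64decode(record["data"])
            decompressed = gzip.decompress(compressed)
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                # Records returned to Kinesis Data Analytics must be
                # base64-encoded again.
                "data": base64.b64encode(decompressed).decode("utf-8"),
            })
        except (OSError, ValueError):
            # Payload was not valid GZIP/base64: mark it failed so one
            # bad record does not break the whole stream.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Because the function only decompresses and re-encodes each record in memory, it adds minimal per-record latency, which is why this pattern keeps the overall solution real time.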


Contribute your Thoughts:

Rene
3 months ago
I’m surprised Kinesis is the go-to here!
upvoted 0 times
...
Shaquana
3 months ago
D seems like a slower option, right?
upvoted 0 times
...
Audrie
3 months ago
C sounds interesting, but is it really the fastest?
upvoted 0 times
...
Shonda
4 months ago
I think B could work too, but not as fast.
upvoted 0 times
...
Renea
4 months ago
A is the best choice for low latency!
upvoted 0 times
...
Antonette
4 months ago
I feel like Kinesis Data Firehose is more about batch processing into S3, which might not be ideal for real-time insights. I think I’d lean towards Kinesis Data Analytics.
upvoted 0 times
...
Patrick
4 months ago
I’m a bit confused about the Kinesis Client Library option; I thought it was more for processing than querying. Would it really provide the least latency?
upvoted 0 times
...
Jules
4 months ago
I remember practicing a similar question, and I feel like AWS Glue might take longer due to the ETL process, so it probably isn't the best option for low latency.
upvoted 0 times
...
Cristal
5 months ago
I think Kinesis Data Analytics is the right choice here since it allows SQL querying directly on streaming data, but I'm not entirely sure about the latency aspect.
upvoted 0 times
...
Tatum
5 months ago
I'm feeling pretty confident about this one. Based on my understanding, option A with Kinesis Data Analytics and a Lambda function is the best choice to get the lowest latency for SQL querying of the data stream.
upvoted 0 times
...
Julie
5 months ago
Option D with Kinesis Data Firehose seems promising, as it can handle the GZIP files and load the data directly into S3. I'll need to double-check how the latency compares to the other options.
upvoted 0 times
...
Dorothy
5 months ago
I think the key here is minimizing latency, so I'm leaning towards option A with Kinesis Data Analytics and a Lambda function. That should allow for real-time processing and querying.
upvoted 0 times
...
Devora
5 months ago
Hmm, I'm a bit confused by the different AWS services mentioned. I'll need to review my notes on real-time data processing to make sure I understand the capabilities of each one.
upvoted 0 times
...
Demetra
5 months ago
This seems like a tricky question. I'll need to think through the different options carefully to determine which one has the least latency for querying the data stream with SQL.
upvoted 0 times
...
Cecily
9 months ago
I'm just waiting for the day when they release 'Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data, and then put it into an Amazon ES cluster and an Amazon S3 bucket, all while playing smooth jazz in the background.'
upvoted 0 times
Dulce
8 months ago
C) An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
upvoted 0 times
...
Kerrie
9 months ago
B) AWS Glue with a custom ETL script to transform the data.
upvoted 0 times
...
Britt
9 months ago
A) Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
upvoted 0 times
...
...
Kristel
10 months ago
B) AWS Glue with a custom ETL script is overkill for a real-time data stream. That's more suitable for batch processing jobs.
upvoted 0 times
Lyndia
10 months ago
C) An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
upvoted 0 times
...
Stephaine
10 months ago
A) Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
upvoted 0 times
...
...
Jeanice
10 months ago
Interesting point, Hortencia. Can you explain why you think option C) is better than option A)?
upvoted 0 times
...
Stacey
10 months ago
C) The Amazon Kinesis Client Library approach seems a bit more complex than the other options. Saving to Elasticsearch might not be the primary concern here.
upvoted 0 times
Lera
9 months ago
C) An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
upvoted 0 times
...
Davida
9 months ago
A) Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
upvoted 0 times
...
...
Bea
10 months ago
D) Amazon Kinesis Data Firehose seems like a good option too. It can transform the data and load it into S3, which could be useful for further analysis.
upvoted 0 times
...
Hortencia
10 months ago
I disagree, I believe option C) with Amazon Kinesis Client Library and Amazon ES cluster would provide the least latency for querying the data stream.
upvoted 0 times
...
Jeanice
10 months ago
I think option A) with Amazon Kinesis Data Analytics and AWS Lambda would be the best choice for real-time insights.
upvoted 0 times
...
Buddy
11 months ago
A) Amazon Kinesis Data Analytics looks like the best solution here. It can process the GZIP data stream in real-time using SQL, which should minimize latency.
upvoted 0 times
Tawna
9 months ago
D) Amazon Kinesis Data Firehose to transform the data and put it into an Amazon S3 bucket.
upvoted 0 times
...
Herman
9 months ago
C) An Amazon Kinesis Client Library to transform the data and save it to an Amazon ES cluster.
upvoted 0 times
...
Anna
10 months ago
B) AWS Glue with a custom ETL script to transform the data.
upvoted 0 times
...
Catalina
10 months ago
A) Amazon Kinesis Data Analytics with an AWS Lambda function to transform the data.
upvoted 0 times
...
...
Jacqueline
11 months ago
I'm not sure, but I think option D could also work well for real-time insights with low latency.
upvoted 0 times
...
Sheron
11 months ago
I disagree, I believe option C is better as saving data to Amazon ES cluster can provide faster querying.
upvoted 0 times
...
Merlyn
11 months ago
I think option A is the best choice because using AWS Lambda can help reduce latency.
upvoted 0 times
...
