

Google Professional Machine Learning Engineer Exam - Topic 1 Question 105 Discussion

Actual exam question for Google's Professional Machine Learning Engineer exam
Question #: 105
Topic #: 1
[All Professional Machine Learning Engineer Questions]

You have built a model that is trained on data stored in Parquet files. You access the data through a Hive table hosted on Google Cloud. You preprocessed the data with PySpark and exported it as a CSV file to Cloud Storage. After preprocessing, you perform additional steps to train and evaluate your model. You want to parametrize this model training in Kubeflow Pipelines. What should you do?

Suggested Answer: C

The best option for parametrizing the model training in Kubeflow Pipelines is to add a ContainerOp to the pipeline that spins up a Dataproc cluster, runs the transformation, and then saves the transformed data in Cloud Storage. This option has the following advantages:

It performs the data transformation as part of the Kubeflow Pipeline, which keeps the data processing and the model training consistent and reproducible. By adding a ContainerOp to the pipeline, you define the parameters and the logic of the transformation step and integrate it with the other steps of the pipeline, such as model training and evaluation.

It leverages the scalability and performance of Dataproc, a fully managed service that runs Apache Spark and Apache Hadoop clusters on Google Cloud. By spinning up a Dataproc cluster, you can run the PySpark transformation on the Parquet files behind the Hive table and take advantage of Spark's parallelism and speed. Dataproc also supports features and integrations, such as autoscaling, preemptible VMs, and connectors to other Google Cloud services, that can optimize the data processing and reduce cost.

It simplifies the data storage and access, as the transformed data is saved in Cloud Storage, which is a scalable, durable, and secure object storage service. By saving the transformed data in Cloud Storage, you can avoid the overhead and complexity of managing the data in the Hive table or the Parquet files. Moreover, you can easily access the transformed data from Cloud Storage, using various tools and frameworks, such as TensorFlow, BigQuery, or Vertex AI.
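To make the parametrization concrete, the library-free sketch below assembles the `gcloud dataproc` commands that a ContainerOp's container could run: create a cluster, submit the PySpark transformation, and delete the cluster. The project, region, cluster, bucket, and script names are hypothetical placeholders; in a real pipeline these would arrive as Kubeflow Pipelines parameters.

```python
def build_transform_commands(project: str, region: str, cluster: str,
                             pyspark_script: str, input_table: str,
                             output_path: str) -> list[list[str]]:
    """Return the shell commands a pipeline step could run to spin up a
    Dataproc cluster, run a PySpark transformation, and tear the cluster
    down. All names are illustrative; the real step would receive them
    as pipeline parameters."""
    base = ["gcloud", "dataproc"]
    return [
        # 1. Spin up an ephemeral Dataproc cluster.
        base + ["clusters", "create", cluster,
                "--project", project, "--region", region],
        # 2. Submit the PySpark job; args after "--" go to the script.
        base + ["jobs", "submit", "pyspark", pyspark_script,
                "--cluster", cluster, "--region", region,
                "--", "--input", input_table, "--output", output_path],
        # 3. Tear the cluster down so you only pay while the job runs.
        base + ["clusters", "delete", cluster,
                "--region", region, "--quiet"],
    ]

# Example: values that Kubeflow Pipelines would pass into the ContainerOp.
cmds = build_transform_commands(
    project="my-project",                           # hypothetical project
    region="us-central1",
    cluster="transform-cluster",                    # hypothetical cluster
    pyspark_script="gs://my-bucket/transform.py",   # hypothetical script
    input_table="hive_db.features",                 # hypothetical table
    output_path="gs://my-bucket/preprocessed.csv",  # hypothetical output
)
for cmd in cmds:
    print(" ".join(cmd))
```

Because every value is a pipeline parameter rather than a hard-coded string, the same step can be re-run against different tables or output locations without rebuilding anything, which is exactly what parametrizing the training in Kubeflow Pipelines buys you.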

The other options are less optimal for the following reasons:

Option A: Removing the data transformation step from the pipeline defeats the parametrization of the model training, because it decouples data processing from the pipeline. The PySpark transformation would have to run separately from the Kubeflow Pipeline, which can make the data processing and the model training inconsistent and hard to reproduce. Moreover, this option leaves you managing the data in the Hive table or the Parquet files outside the pipeline, which can be cumbersome and inefficient.

Option B: Containerizing the PySpark transformation step and adding it to the pipeline introduces additional complexity and overhead. This option requires building and maintaining a Docker image that can run the PySpark transformation, which can be challenging and time-consuming. Moreover, the transformation would run in a single container, which can be slow and inefficient because it does not leverage Spark's distributed parallelism.

Option D: Deploying Apache Spark on a separate node pool in a Google Kubernetes Engine cluster, and adding a ContainerOp to the pipeline that invokes a transformation job on that Spark instance, introduces additional complexity and cost. This option requires creating and managing a separate node pool in GKE, then deploying and operating Apache Spark on it yourself, which can be tedious and costly: you must configure and maintain the Spark cluster, and you pay for the node pool whether or not jobs are running.


Contribute your Thoughts:

Lettie
2 months ago
A is definitely not the way to go.
upvoted 0 times
...
Keshia
2 months ago
Wait, can you really run PySpark in a ContainerOp?
upvoted 0 times
...
Tenesha
2 months ago
I think C is better for scalability.
upvoted 0 times
...
Elliot
3 months ago
D sounds complicated, not sure it's worth it.
upvoted 0 times
...
Nenita
3 months ago
B seems like the most straightforward option.
upvoted 0 times
...
Theodora
3 months ago
Deploying Spark in a separate node pool sounds complicated. I wonder if that’s really needed for this scenario or if there’s a simpler way.
upvoted 0 times
...
Rossana
4 months ago
I feel like option C sounds familiar. We practiced something similar where we had to set up a Dataproc cluster for transformations.
upvoted 0 times
...
Catalina
4 months ago
I’m a bit unsure about containerizing the PySpark step. It seems like a lot of extra work, but maybe it’s necessary for consistency?
upvoted 0 times
...
Irma
4 months ago
I remember we discussed how important it is to keep the data transformation step in the pipeline, so I think removing it might not be the best choice.
upvoted 0 times
...
Felicitas
4 months ago
I'm feeling pretty confident about this one. The solution seems to be to add a ContainerOp to the pipeline that spins up a Dataproc cluster, runs the transformation, and saves the data in Cloud Storage. That way, we can keep the data transformation step separate from the model training.
upvoted 0 times
...
Inocencia
4 months ago
Okay, let's see. I think the key here is to containerize the PySpark transformation step and add it to the Kubeflow pipeline. That way, we can parameterize the model training and make it more reusable.
upvoted 0 times
...
Carey
5 months ago
Hmm, I'm a bit confused by the different cloud services and technologies mentioned. I'll need to think through the steps carefully to determine the best approach.
upvoted 0 times
...
Inocencia
5 months ago
This looks like a tricky question. I'm not sure if I fully understand the requirements, but I think the key is to figure out how to integrate the PySpark transformation step into the Kubeflow pipeline.
upvoted 0 times
...
Alaine
5 months ago
C) Add a ContainerOp to your pipeline that spins a Dataproc cluster, runs a transformation, and then saves the transformed data in Cloud Storage. I like the idea of using a managed service like Dataproc for the heavy lifting.
upvoted 0 times
...
Bronwyn
5 months ago
D) Deploy Apache Spark at a separate node pool in a Google Kubernetes Engine cluster. This way, we can leverage the scalability and flexibility of GKE for our Spark workloads. Plus, it keeps our pipeline clean and modular.
upvoted 0 times
Bettina
2 months ago
Agreed! Modular pipelines are the way to go.
upvoted 0 times
...
Tarra
2 months ago
Plus, running Spark separately is more efficient.
upvoted 0 times
...
William
2 months ago
Exactly! It keeps everything organized.
upvoted 0 times
...
Ashleigh
3 months ago
I like option D too! GKE is great for scaling.
upvoted 0 times
...
...
Stefania
5 months ago
But wouldn't it be better to spin a Dataproc cluster and run the transformation there?
upvoted 0 times
...
Nan
6 months ago
I agree with Jolene. Containerizing the transformation step would make it easier to parametrize the model training.
upvoted 0 times
...
Jolene
6 months ago
I think we should containerize the PySpark transformation step and add it to the pipeline.
upvoted 0 times
...
Jin
6 months ago
B) Containerize the PySpark transformation step, and add it to your pipeline. Seems like the most straightforward approach to me. I don't want to introduce unnecessary complexity by spinning up a Dataproc cluster or managing a separate Spark instance.
upvoted 0 times
...
