Welcome to Pass4Success


Google Professional Data Engineer Exam - Topic 3 Question 119 Discussion

Actual exam question for Google's Professional Data Engineer exam
Question #: 119
Topic #: 3

You have thousands of Apache Spark jobs running in your on-premises Apache Hadoop cluster. You want to migrate the jobs to Google Cloud. You want to use managed services to run your jobs instead of maintaining a long-lived Hadoop cluster yourself. You have a tight timeline and want to keep code changes to a minimum. What should you do?

Suggested Answer: B

Dataproc's Compatibility with Apache Spark: Dataproc is a managed service for running Hadoop and Spark workloads on Google Cloud, and it runs Apache Spark jobs natively. Your existing Spark jobs should run on Dataproc with little to no modification.
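As an illustration of how little changes, an existing Spark job jar can typically be submitted to a Dataproc cluster unchanged. A minimal sketch, assuming a cluster and bucket already exist (the cluster name, region, jar path, and main class below are placeholders, not values from the question):

```shell
# Sketch: submit an existing Spark jar to Dataproc as-is.
# my-cluster, us-central1, the jar path, and the class are placeholders.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.WordCount \
  --jars=gs://my-bucket/jars/wordcount.jar \
  -- gs://my-bucket/input/ gs://my-bucket/output/
```

Arguments after `--` are passed straight through to the job's main class, so the job's own interface stays the same as on-premises.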

Cloud Storage as a Scalable Data Lake: Cloud Storage provides a highly scalable and durable storage solution for your data. It's designed to handle the large volumes of data that Spark jobs typically process.
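Because Dataproc ships with the Cloud Storage connector, `gs://` paths work wherever `hdfs://` paths did, so the main code change is often just swapping path schemes. A minimal sketch of that rewrite as a hypothetical helper (the function name, bucket, and example paths are illustrative, not from the question):

```python
from urllib.parse import urlparse

def to_gcs_path(hdfs_path: str, bucket: str) -> str:
    """Rewrite an hdfs:// URI to the equivalent gs:// URI.

    Hypothetical helper: shows the typical one-line change a Spark job
    needs after its data moves from HDFS to a Cloud Storage bucket.
    """
    parsed = urlparse(hdfs_path)
    if parsed.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URI, got: {hdfs_path}")
    # Drop the HDFS namenode authority and prepend the GCS bucket.
    return f"gs://{bucket}{parsed.path}"

# e.g. spark.read.parquet(to_gcs_path("hdfs://namenode/data/events", "my-bucket"))
```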

Minimizing Operational Overhead: By using Dataproc, you eliminate the need to manage and maintain a Hadoop cluster yourself. Google Cloud handles the infrastructure, allowing you to focus on your data processing tasks.
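One way to avoid keeping a long-lived cluster is Dataproc's scheduled-deletion feature. A sketch (cluster name and region are placeholders) of creating a cluster that deletes itself after sitting idle:

```shell
# Sketch: create a Dataproc cluster that auto-deletes after 30 minutes
# of inactivity, so no long-lived Hadoop cluster has to be maintained.
# my-cluster and us-central1 are placeholder values.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --max-idle=30m
```

With data in Cloud Storage rather than HDFS, the cluster holds no state, so clusters can be created per batch of jobs and discarded afterward.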

Tight Timeline and Minimal Code Changes: This option directly addresses the requirements of the question. It offers a quick way to migrate your Spark jobs to Google Cloud with minimal disruption to your existing codebase.

Why other options are not suitable:

A. Copy your data to Compute Engine disks. Manage and run your jobs directly on those instances: This option requires you to manage the underlying infrastructure yourself, which contradicts the requirement of using managed services.

C. Move your data to BigQuery. Convert your Spark scripts to a SQL-based processing approach: While BigQuery is a powerful data warehouse, converting Spark scripts to SQL would require substantial code changes and might not be feasible within a tight timeline.

D. Rewrite your jobs in Apache Beam. Run your jobs in Dataflow: Rewriting jobs in Apache Beam would be a significant undertaking and not suitable for a quick migration with minimal code changes.


Contribute your Thoughts:

Carmela
7 hours ago
I remember we discussed how Dataproc is designed for running Spark jobs, so option B seems like a good fit. But I'm not entirely sure about the data transfer process.
