Google Professional Data Engineer Exam - Topic 3 Question 98 Discussion

Actual exam question for Google's Professional Data Engineer exam
Question #: 98
Topic #: 3

You are migrating your on-premises data warehouse to BigQuery. As part of the migration, you want to facilitate cross-team collaboration to get the most value out of the organization's data. You need to design an architecture that would allow teams within the organization to securely publish, discover, and subscribe to read-only data in a self-service manner. You need to minimize costs while also maximizing data freshness. What should you do?

Suggested Answer: C

To provide a cost-effective storage and processing solution that allows data scientists to explore data similarly to using the on-premises HDFS cluster with SQL on the Hive query engine, deploying a Dataproc cluster is the best choice. Here's why:

Compatibility with Hive:

Dataproc is a fully managed Apache Spark and Hadoop service that provides native support for Hive, making it easy for data scientists to run SQL queries on the data as they would in an on-premises Hadoop environment.

This ensures that the transition to Google Cloud is smooth, with minimal changes required in the workflow.

Cost-Effective Storage:

Storing the ORC files in Cloud Storage is cost-effective and scalable, providing a reliable and durable storage solution that integrates seamlessly with Dataproc.

Cloud Storage allows you to store large datasets at a lower cost compared to other storage options.

Hive Integration:

Dataproc supports running Hive directly, which is essential for data scientists familiar with SQL on the Hive query engine.

This setup enables the use of existing Hive queries and scripts without significant modifications.

Steps to Implement:

Copy ORC Files to Cloud Storage:

Transfer the ORC files from the on-premises HDFS cluster to Cloud Storage, ensuring they are organized in a similar directory structure.

Deploy Dataproc Cluster:

Set up a Dataproc cluster configured to run Hive. Ensure that the cluster has access to the ORC files stored in Cloud Storage.

Configure Hive:

Configure Hive on Dataproc to read from the ORC files in Cloud Storage. This can be done by setting up external tables in Hive that point to the Cloud Storage location.

Provide Access to Data Scientists:

Grant the data scientist team access to the Dataproc cluster and the necessary permissions to interact with the Hive tables.
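
The copy and Hive configuration steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the helper names `hdfs_to_gcs_uri` and `external_orc_table_ddl`, the bucket `example-bucket`, and the `orders` table with its columns are all hypothetical. The first helper maps an HDFS path to a Cloud Storage URI with the same directory layout, and the second builds the HiveQL for an external table over the ORC files at that location.

```python
from typing import Mapping
from urllib.parse import urlparse


def hdfs_to_gcs_uri(hdfs_path: str, bucket: str) -> str:
    """Map an HDFS path to a Cloud Storage URI, keeping the directory layout.

    Accepts both "hdfs://namenode:8020/warehouse/x" and "/warehouse/x".
    """
    path = urlparse(hdfs_path).path
    return f"gs://{bucket}/{path.strip('/')}"


def external_orc_table_ddl(table: str, columns: Mapping[str, str], location: str) -> str:
    """Build HiveQL for an external table over ORC files stored at `location`."""
    column_list = ",\n  ".join(
        f"`{name}` {hive_type}" for name, hive_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {column_list}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Hypothetical source path, bucket, and schema, for illustration only.
    location = hdfs_to_gcs_uri(
        "hdfs://namenode:8020/warehouse/sales/orders", "example-bucket"
    )
    print(external_orc_table_ddl(
        "orders", {"order_id": "BIGINT", "amount": "DOUBLE"}, location
    ))
```

The generated statement would then be run through Hive on the Dataproc cluster. Because the table is external, dropping it in Hive leaves the ORC files in Cloud Storage untouched.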



Contribute your Thoughts:

Merilyn
3 months ago
A and D both sound good, but D might be easier for collaboration.
upvoted 0 times
...
Eladia
4 months ago
Wait, is Analytics Hub really that effective for this?
upvoted 0 times
...
Graham
4 months ago
C could lead to higher costs with all that copying.
upvoted 0 times
...
Emily
4 months ago
I disagree, B seems more straightforward for individual teams.
upvoted 0 times
...
Jeff
4 months ago
I think option A is the best for secure sharing.
upvoted 0 times
...
Desmond
5 months ago
I recall that using the Data Transfer Service might not be the most cost-effective solution. I wonder if it really maximizes data freshness like the other options.
upvoted 0 times
...
Karina
5 months ago
I practiced a similar question about data sharing, and I think using Analytics Hub could be the best option for collaboration. It sounds efficient.
upvoted 0 times
...
Barrie
5 months ago
I'm not entirely sure, but I think creating a new dataset for each team might lead to more overhead. It feels like it could complicate things.
upvoted 0 times
...
Talia
5 months ago
I remember we discussed using authorized datasets in class. It seems like a good way to manage access while keeping costs down.
upvoted 0 times
...
Elmer
5 months ago
I'm pretty confident that option D, using Analytics Hub, is the best solution here. It allows for secure, self-service data sharing across the organization.
upvoted 0 times
...
Lauran
5 months ago
Okay, let me think this through. We need to minimize costs and maximize data freshness, so creating separate datasets in each team's project (option B) doesn't sound ideal. I'm leaning towards option A or D.
upvoted 0 times
...
Gail
5 months ago
Hmm, the key here is to design an architecture that allows for secure, self-service data sharing. I think option D with Analytics Hub might be the way to go.
upvoted 0 times
...
Anthony
5 months ago
This question seems straightforward, but I want to make sure I understand the requirements fully before answering.
upvoted 0 times
...
Oliva
1 year ago
Option D is the way to go. Analytics Hub is the data-sharing equivalent of a one-stop-shop. It's like having a personal shopper for your data needs!
upvoted 0 times
...
Vallie
1 year ago
Option D all the way! Analytics Hub is the way to go. It's like having a big data party where everyone's invited, but the bouncers (security policies) make sure only the right people get in.
upvoted 0 times
Kiley
1 year ago
That's true, but having separate datasets in each team's project might give them more control over their own data. It's a tough choice.
upvoted 0 times
...
Paola
1 year ago
But with Analytics Hub, you can have a centralized platform for data sharing, making it easier for teams to collaborate.
upvoted 0 times
...
Samira
1 year ago
I think creating authorized datasets for sharing in each team's project would be a better option to ensure data security.
upvoted 0 times
...
Anglea
1 year ago
Option D all the way! Analytics Hub is the way to go. It's like having a big data party where everyone's invited, but the bouncers (security policies) make sure only the right people get in.
upvoted 0 times
...
...
Adria
1 year ago
I'd go with Option D as well. It seems like the most efficient and cost-effective solution to enable cross-team collaboration while ensuring data freshness.
upvoted 0 times
Ulysses
1 year ago
B) Create a new dataset for sharing in each individual team's project. Grant the subscribing team the bigquery.dataViewer role on the dataset.
upvoted 0 times
...
Laura
1 year ago
That sounds like a good idea. Option D seems like the way to go.
upvoted 0 times
...
Rickie
1 year ago
D) Use Analytics Hub to facilitate data sharing.
upvoted 0 times
...
Jacquelyne
1 year ago
A) Create authorized datasets to publish shared data in the subscribing team's project.
upvoted 0 times
...
...
Moon
1 year ago
I disagree, I believe option B is more efficient. It gives each team control over their own data sharing.
upvoted 0 times
...
Luisa
1 year ago
I think option A is the best choice. It allows for secure data sharing and minimizes costs.
upvoted 0 times
...
Tennie
1 year ago
Option D seems like the best choice here. Analytics Hub is designed specifically for secure data sharing, and it allows teams to discover and subscribe to data in a self-service way.
upvoted 0 times
Lilli
1 year ago
I think using Analytics Hub would definitely streamline the process and make it easier for teams to work together.
upvoted 0 times
...
Melissa
1 year ago
Creating authorized datasets in each team's project could get messy. Analytics Hub seems like a cleaner option.
upvoted 0 times
...
Micheal
1 year ago
Agreed, Analytics Hub sounds like the most efficient solution for facilitating cross-team collaboration in data sharing.
upvoted 0 times
...
Lashon
1 year ago
I agree, Analytics Hub sounds like the most efficient solution for cross-team collaboration.
upvoted 0 times
...
Joanna
1 year ago
I think using Analytics Hub would definitely streamline the process of sharing data across teams.
upvoted 0 times
...
Adaline
1 year ago
Option D seems like the best choice here. Analytics Hub is designed specifically for secure data sharing, and it allows teams to discover and subscribe to data in a self-service way.
upvoted 0 times
...
Paris
1 year ago
Option D seems like the best choice here. Analytics Hub is designed specifically for secure data sharing, and it allows teams to discover and subscribe to data in a self-service way.
upvoted 0 times
...
...
