
Google Professional Data Engineer Exam - Topic 3 Question 45 Discussion

Actual exam question for Google's Professional Data Engineer exam
Question #: 45
Topic #: 3

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose 2 answers.)
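The answer choices themselves are not reproduced on this page, but the commenters below converge on two recurring BigQuery patterns for this scenario: appending status changes as new rows rather than updating in place, and denormalizing for read performance. A minimal sketch of the append-only pattern, simulated in plain Python (the `transaction_id`, `ts`, and `status` columns are hypothetical, not from the question):

```python
# Append-only status log: every status change is inserted as a new row,
# never applied as an in-place UPDATE. Rows are (transaction_id, ts, status).
rows = [
    ("tx1", 1, "PENDING"),
    ("tx2", 1, "PENDING"),
    ("tx1", 2, "SETTLED"),
    ("tx2", 3, "FAILED"),
    ("tx1", 3, "REFUNDED"),
]

def latest_status(rows):
    """Return the most recent status per transaction -- the same result a
    BigQuery window query would give, e.g.
      SELECT * FROM log
      QUALIFY ROW_NUMBER() OVER (
        PARTITION BY transaction_id ORDER BY ts DESC) = 1
    """
    best = {}
    for tx, ts, status in rows:
        if tx not in best or ts > best[tx][0]:
            best[tx] = (ts, status)
    return {tx: status for tx, (ts, status) in best.items()}

print(latest_status(rows))  # {'tx1': 'REFUNDED', 'tx2': 'FAILED'}
```

Appending keeps the full status history available for model training while still letting analysts recover the current state with a cheap window function.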

Suggested Answer: A, E

Contribute your Thoughts:

Chauncey
4 months ago
Daily snapshots to Cloud Storage? Seems a bit excessive.
Ty
4 months ago
Appending updates instead of overwriting is smart!
Justine
4 months ago
Wait, can BigQuery handle that much data efficiently?
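Justine's worry is reasonable, and the usual answer is that BigQuery stays fast at this scale largely through time partitioning: a query that filters on the partitioning column scans only the partitions it touches. A rough illustration of partition pruning (the dates and the 3 TB/day figure follow the question; everything else is invented):

```python
import datetime

# Invented layout: bytes stored in each daily partition of a
# time-partitioned table, at roughly 3 TB/day as stated in the question.
partitions = {
    datetime.date(2024, 1, d): 3 * 10**12
    for d in range(1, 31)
}

def bytes_scanned(partitions, start, end):
    """With a filter on the partitioning column, only partitions whose
    date falls in [start, end] are read; the rest are pruned."""
    return sum(size for day, size in partitions.items()
               if start <= day <= end)

week = bytes_scanned(partitions,
                     datetime.date(2024, 1, 1),
                     datetime.date(2024, 1, 7))
print(week / 10**12, "TB scanned")  # 21.0 TB instead of the full ~90 TB
```

The same idea is why a 1.5 PB table is workable: a dashboard query over last week's transactions never pays for the other years of history.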
Stanford
4 months ago
I think preserving structure is key for analysis.
Francesco
5 months ago
Denormalizing sounds like a good idea for performance!
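Francesco's instinct matches the common BigQuery guidance: since storage is cheap and large-scale joins are expensive, related tables are often pre-joined into one wide (optionally nested) table that analysts can query directly. A toy sketch of that flattening step, using hypothetical `customers` and `transactions` tables not taken from the question:

```python
# Hypothetical normalized source data.
customers = {"c1": {"name": "Ada", "country": "UK"}}
transactions = [
    {"tx_id": "tx1", "customer_id": "c1", "amount": 9.99},
    {"tx_id": "tx2", "customer_id": "c1", "amount": 5.00},
]

def denormalize(transactions, customers):
    """Copy customer attributes onto each transaction row so that
    analysts query one wide table instead of joining at read time."""
    return [{**tx, **customers[tx["customer_id"]]} for tx in transactions]

wide = denormalize(transactions, customers)
print(wide[0]["name"], wide[0]["amount"])  # Ada 9.99
```

The trade-off is the usual one: repeated customer fields cost extra storage, in exchange for simpler and faster analytical queries.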
Phuong
5 months ago
I feel like appending status updates instead of updating might make querying easier, but I can't quite remember the trade-offs.
Linwood
5 months ago
I think preserving the structure is important for analysis, but I also recall a practice question where denormalization helped with performance.
Dyan
5 months ago
I remember we discussed denormalization in class, but I'm not sure if it's the best approach for time-series data.
Nydia
5 months ago
The idea of using external data sources like Avro files sounds familiar, but I'm not clear on how that impacts performance in BigQuery.
