New Year Sale 2026! Hurry Up, Grab the Special Discount - Save 25% - Ends In 00:00:00 Coupon code: SAVE25
Welcome to Pass4Success

- Free Preparation Discussions

Google Professional Data Engineer Exam - Topic 1 Question 82 Discussion

Actual exam question for Google's Professional Data Engineer exam
Question #: 82
Topic #: 1
[All Professional Data Engineer Questions]

You are creating a data model in BigQuery that will hold retail transaction dat

a. Your two largest tables, sales_transation_header and sales_transation_line. have a tightly coupled immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transation_header and sales_transation_line tables to improve the performance of data analytics queries. What should you do?

Show Suggested Answer Hide Answer
Suggested Answer: B

BigQuery supports nested and repeated fields, which are complex data types that can represent hierarchical and one-to-many relationships within a single table. By using nested and repeated fields, you can denormalize your data model and reduce the number of joins required for your queries. This can improve the performance and efficiency of your data analytics queries, as joins can be expensive and require shuffling data across nodes. Nested and repeated fields also preserve the data integrity and avoid data duplication. In this scenario, the sales_transaction_header and sales_transaction_line tables have a tightly coupled immutable relationship, meaning that each header row corresponds to one or more line rows, and the data is rarely modified after load. Therefore, it makes sense to create a single sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. This way, you can query the sales transaction data without joining two tables, and use dot notation or array functions to access the nested and repeated fields. For example, the sales_transaction table could have the following schema:

Table

Field name

Type

Mode

id

INTEGER

NULLABLE

order_time

TIMESTAMP

NULLABLE

customer_id

INTEGER

NULLABLE

line_items

RECORD

REPEATED

line_items.sku

STRING

NULLABLE

line_items.quantity

INTEGER

NULLABLE

line_items.price

FLOAT

NULLABLE

To query the total amount of each order, you could use the following SQL statement:

SQL

SELECT id, SUM(line_items.quantity * line_items.price) AS total_amount

FROM sales_transaction

GROUP BY id;

AI-generated code. Review and use carefully.More info on FAQ.


Use nested and repeated fields

BigQuery explained: Working with joins, nested & repeated data

Arrays in BigQuery --- How to improve query performance and optimise storage

Contribute your Thoughts:

0/2000 characters
Shawna
3 months ago
Wait, duplicating data in C? That sounds risky!
upvoted 0 times
...
Portia
3 months ago
B is definitely the way to go, nested fields are powerful!
upvoted 0 times
...
Rodolfo
3 months ago
A JSON approach? Not sure that’s efficient for analytics.
upvoted 0 times
...
India
4 months ago
I disagree, C could work too if you need faster access.
upvoted 0 times
...
Jerlene
4 months ago
Option B seems like the best choice for performance.
upvoted 0 times
...
Eveline
4 months ago
I think option D is too basic for this scenario. We learned that specifying the order of tables in the WHERE clause doesn't really optimize performance like the other options could.
upvoted 0 times
...
Beatriz
4 months ago
I feel like option C might lead to data redundancy, which isn't ideal. But I can't recall if duplicating data is ever a good practice in this context.
upvoted 0 times
...
Cyndy
4 months ago
I'm not entirely sure, but I think creating a JSON data type like in option A might complicate querying. We practiced a similar question, and I remember that flattening data can help.
upvoted 0 times
...
Gracia
5 months ago
I remember we discussed the benefits of using nested and repeated fields in BigQuery for tightly coupled tables. It seems like option B could be the best choice for performance.
upvoted 0 times
...
Yuette
5 months ago
I'm feeling confident about this one. Option B is the way to go - it allows us to maintain the tight coupling between the header and line data while optimizing the data model.
upvoted 0 times
...
Jamal
5 months ago
I'm leaning towards Option D. Keeping the tables separate and optimizing the query order seems like a straightforward way to address the performance concerns.
upvoted 0 times
...
Fausto
5 months ago
Option B seems like the most logical choice to me. Storing the sales transaction line data as nested and repeated fields should improve query performance.
upvoted 0 times
...
Mitsue
5 months ago
Hmm, I'm a bit confused by the different options. I'll need to review the details of each one to understand the pros and cons.
upvoted 0 times
...
Reid
5 months ago
This looks like a tricky question. I'll need to think through the different options carefully to determine the best approach.
upvoted 0 times
...
Steffanie
5 months ago
Okay, I think I've got it. Option D is the winner here. It checks for traffic from both the Internet and Suspicious IPs, and it's specifically looking for AWS databases as the destination.
upvoted 0 times
...
Caprice
5 months ago
Okay, I've got this. The key is to identify the question that is not on the "do not ask" list. I'll carefully evaluate each option and select the one that stands out as appropriate.
upvoted 0 times
...
Raymon
2 years ago
I concur. Nested fields in BigQuery are known to enhance performance.
upvoted 0 times
...
Kip
2 years ago
I'd argue for B. Nesting repeated fields is efficient with tightly coupled data.
upvoted 0 times
...
Laurel
2 years ago
Agreed. Combining the data could be better.
upvoted 0 times
...
Maryann
2 years ago
Option D seems irrelevant. Starting with WHERE clause optimization isn't the right approach here.
upvoted 0 times
...
Raymon
2 years ago
True. Immutable relationship, rarely modified after load...sounds like optimizing for joins is key.
upvoted 0 times
...
Laurel
2 years ago
I think the question is trying to optimize query performance in BigQuery.
upvoted 0 times
...

Save Cancel