
Databricks Exam: Databricks Certified Associate Developer for Apache Spark 3.0, Topic 3, Question 52 Discussion

Actual exam question from the Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 52
Topic #: 3

The code block displayed below contains an error. The code block should combine data from DataFrames itemsDf and transactionsDf, showing all rows of DataFrame itemsDf that have a matching value in column itemId with a value in column transactionId of DataFrame transactionsDf. Find the error.

Code block:

itemsDf.join(itemsDf.itemId==transactionsDf.transactionId)

Suggested Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and by far exceeds the difficulty of questions in the exam.

A first indication of what is asked of you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf, the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.
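As a rough, illustrative sketch of that broadcast hint (assuming an active SparkSession named spark; the sample rows and the itemName column are made up, and the shared transactionId key simply follows the suggested code block above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# 'Very small' DataFrame that is cheap to ship to every executor.
itemsDf = spark.createDataFrame(
    [(1, "shirt"), (2, "hat")], ["transactionId", "itemName"]
)

# Larger DataFrame that stays partitioned across the cluster.
transactionsDf = spark.createDataFrame(
    [(1, 10.0), (1, 12.5), (3, 7.0)], ["transactionId", "value"]
)

# Wrapping itemsDf in broadcast() hints Spark to replicate it to all
# executors and use a broadcast join, avoiding a shuffle of transactionsDf.
transactionsDf.join(broadcast(itemsDf), "transactionId").show()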

When thinking about the broadcast() operator, you may also remember that it is a function in pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
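Reusing the toy DataFrames from the sketch above, the difference between the two join types is easy to check (again only an illustration, not the exam's exact answer options):

# A left semi join keeps only the rows of transactionsDf that have a match
# in itemsDf and returns only transactionsDf's columns.
semi = transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
print(semi.columns)   # ['transactionId', 'value']

# An outer join keeps unmatched rows from both sides and also pulls in
# itemsDf's columns.
outer = transactionsDf.join(broadcast(itemsDf), "transactionId", "outer")
print(outer.columns)  # ['transactionId', 'value', 'itemName']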


Contribute your Thoughts:

Arlen
5 days ago
The project planning phase is the best time to identify risks and develop a mitigation plan. That's when you have all the details and can really dive into potential issues. Waiting until execution would be too late.
upvoted 0 times
...
Cheryll
6 days ago
This looks like a straightforward question about the characteristics of an authoritative record. I'll need to carefully review the options and think about which ones best fit that definition.
upvoted 0 times
...
Kati
12 days ago
Okay, I see the code snippet. I'll go through it line by line and keep a mental count of the statements. This should be straightforward.
upvoted 0 times
...
Gerry
13 days ago
Alright, I think I've got it. Decreasing the session timeout and increasing the heartbeat interval should help reduce the rebalance time. Time to mark those two options.
upvoted 0 times
...
Hannah
15 days ago
Okay, I think I've got a handle on this. The key is to focus on the activation process-step and how it applies to the different types of records, including those with errors. I'll carefully consider each option and try to identify the two correct answers.
upvoted 0 times
...
Kris
5 months ago
The join expression is malformed for sure. It's missing the key parameter to specify the columns to join on. Option E is the way to go.
upvoted 0 times
Merilyn
4 months ago
Yes, option E is the way to go.
upvoted 0 times
...
Billi
4 months ago
Actually, the join expression is malformed.
upvoted 0 times
...
Thaddeus
5 months ago
No, the join method is inappropriate.
upvoted 0 times
...
Berry
5 months ago
I think the join statement is incomplete.
upvoted 0 times
...
...
Tish
6 months ago
Haha, good one! I guess the developers were trying to join the DataFrames with a one-liner, but that's not how it works. They need to use the proper join or merge methods.
upvoted 0 times
...
Jacob
6 months ago
I think option D is the correct answer. The merge method would be more appropriate here since we're trying to combine data from two DataFrames based on a common column.
upvoted 0 times
...
Elly
6 months ago
The join statement is definitely incomplete. We need to specify the keys for the join, like itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId).
upvoted 0 times
Gearldine
5 months ago
The join expression is malformed.
upvoted 0 times
...
Agustin
5 months ago
A) The join statement is incomplete.
upvoted 0 times
...
...
Rosann
6 months ago
I believe the correct answer is D) The merge method should be used instead of join.
upvoted 0 times
...
Nicolette
6 months ago
I agree with Nina, the join method is inappropriate.
upvoted 0 times
...
Nina
6 months ago
I think the error is that the join statement is incomplete.
upvoted 0 times
...
