
Databricks Exam: Databricks Certified Associate Developer for Apache Spark 3.0, Topic 2, Question 61 Discussion

Actual exam question from the Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 61
Topic #: 2

The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.

Code block:

transactionsDf.join(itemsDf, "itemId", how="broadcast")

Suggested Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and exceeds the difficulty of the questions in the actual exam by far.

A first indication of what is asked of you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in the broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf, the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a function in pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
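
For readers who want to try this locally, here is a minimal, self-contained sketch of how a broadcast join is expressed in PySpark. The sample data, column values, and app name are made up for illustration; only broadcast() from pyspark.sql.functions and DataFrame.join() are standard API, and the join column follows the question's stated intent of joining on itemId.

Code block (illustrative sketch):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical sample data: transactionsDf is the larger table, itemsDf the small one.
transactionsDf = spark.createDataFrame(
    [(1, 3, 25.0), (2, 3, 10.0), (3, 7, 99.0)],
    ["transactionId", "itemId", "value"],
)
itemsDf = spark.createDataFrame(
    [(3, "laptop"), (7, "phone")],
    ["itemId", "itemName"],
)

# "broadcast" is not a join type, so how="broadcast" fails. Instead, wrap the
# small DataFrame in broadcast() and pass a valid join type (the default is inner).
innerJoined = transactionsDf.join(broadcast(itemsDf), "itemId")
semiJoined = transactionsDf.join(broadcast(itemsDf), "itemId", "left_semi")

innerJoined.show()  # columns from both DataFrames
semiJoined.show()   # only transactionsDf's columns, as with any left semi join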


Contribute your Thoughts:

Dulce
2 months ago
Ah, the old 'broadcast' join trick. Classic Spark trolling right there. Someone needs to read the docs a little closer.
upvoted 0 times
Merlyn
5 days ago
User1: The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
upvoted 0 times
Truman
16 days ago
User2: Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
upvoted 0 times
Reita
1 month ago
User1: The syntax is wrong, how= should be removed from the code block.
upvoted 0 times
Julio
2 months ago
Haha, good one! Trying to broadcast the larger DataFrame, that'll really slow things down. Rookie mistake.
upvoted 0 times
Tabetha
26 days ago
User 3: The larger DataFrame transactionsDf is being broadcasted, rather than the smaller DataFrame itemsDf.
upvoted 0 times
Regenia
1 month ago
User 2: Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
upvoted 0 times
Dominga
1 month ago
User 1: The syntax is wrong, how= should be removed from the code block.
upvoted 0 times
Lashawnda
2 months ago
But isn't the issue that the larger DataFrame is being broadcasted instead of the smaller one?
upvoted 0 times
Veronika
2 months ago
Broadcast is a Spark optimization, not a join type. This code is just trying to do a regular join, not a broadcast join.
upvoted 0 times
Daisy
25 days ago
C) Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
upvoted 0 times
Shawna
1 month ago
A) The syntax is wrong, how= should be removed from the code block.
upvoted 0 times
Mollie
2 months ago
I agree with Allene, the syntax needs to be fixed.
upvoted 0 times
Lizbeth
2 months ago
The syntax is definitely wrong. How is not a valid argument for the join method. Should be transactionsDf.join(itemsDf, on='itemId', how='some_join_type').
upvoted 0 times
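
A quick check on the syntax point above: how= is itself a valid keyword argument of DataFrame.join(); what Spark rejects is the value 'broadcast', which is not an accepted join type. A small sketch, assuming transactionsDf and itemsDf exist as in the question (the exact exception class can vary between Spark versions):

# Valid values for how= include "inner", "outer", "left", "right",
# "left_semi" and "left_anti"; "broadcast" is not one of them.
valid = transactionsDf.join(itemsDf, on="itemId", how="inner")

# This is what the question's code block attempts; Spark rejects the
# unsupported join type string and raises an exception.
try:
    transactionsDf.join(itemsDf, on="itemId", how="broadcast")
except Exception as error:
    print(type(error).__name__, error)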
Allene
3 months ago
I think the error is in the syntax, how= should be removed.
upvoted 0 times
