Databricks Exam Databricks Certified Associate Developer for Apache Spark 3.0 Topic 3 Question 16 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.0 exam

Question #: 16
Topic #: 3


Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

A.
    counter = 0

    for index, row in itemsDf.iterrows():
        if 'Inc.' in row['supplier']:
            counter = counter + 1

    print(counter)

B.
    counter = 0

    def count(x):
        if 'Inc.' in x['supplier']:
            counter = counter + 1

    itemsDf.foreach(count)
    print(counter)

C.
    print(itemsDf.foreach(lambda x: 'Inc.' in x))

D.
    print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

E.
    accum=sc.accumulator(0)

    def check_if_inc_in_supplier(row):
        if 'Inc.' in row['supplier']:
            accum.add(1)

    itemsDf.foreach(check_if_inc_in_supplier)
    print(accum.value)


Suggested Answer: E

Correct code block:

    accum=sc.accumulator(0)

    def check_if_inc_in_supplier(row):
        if 'Inc.' in row['supplier']:
            accum.add(1)

    itemsDf.foreach(check_if_inc_in_supplier)
    print(accum.value)

To answer this question correctly, you need to know about both the DataFrame.foreach() method and accumulators.

When Spark runs this code, the function passed to foreach() is serialized and executed on the executors. Each executor works on its own copy of any variable the function refers to, and changes to that copy are never sent back to the driver. This is why simply using a Python variable counter, as in the two examples that start with counter = 0, will not work. (As written, option B also raises an UnboundLocalError on the first matching row, because counter is assigned inside count() without a global declaration.) You need to tell Spark explicitly that the counter is a special shared variable, an Accumulator, which is managed by the driver and which all executors can add to.

If you have used pandas in the past, you might be familiar with the iterrows() method. A PySpark DataFrame (pyspark.sql.DataFrame) has no such method, so option A fails with an AttributeError.

The two examples that start with print do not work, since DataFrame.foreach() has no return value (it returns None). Option C therefore prints None, and option D fails because None has no sum() method.

More info: pyspark.sql.DataFrame.foreach (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3, question 22 (Databricks import instructions)

by Yolando at May 07, 2022, 06:02 AM
