
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 43 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 43
Topic #: 2
[All Databricks Certified Associate Developer for Apache Spark 3.0 Questions]

In which order should the code blocks below be run to create a DataFrame that shows the mean of column predError of DataFrame transactionsDf per column storeId and productId, where productId should be either 2 or 3, and where the returned DataFrame should be sorted in ascending order by column storeId, leaving out any rows in which storeId is null?

DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

1. .mean("predError")
2. .groupBy("storeId")
3. .orderBy("storeId")
4. transactionsDf.filter(transactionsDf.storeId.isNotNull())
5. .pivot("productId", [2, 3])

Suggested Answer: D

Correct code block:

transactionsDf.filter(transactionsDf.storeId.isNotNull()).groupBy('storeId').pivot('productId', [2, 3]).mean('predError').orderBy('storeId')

Output of correct code block:

+-------+----+----+
|storeId|   2|   3|
+-------+----+----+
|      2| 6.0|null|
|      3|null|null|
|     25| 3.0| 3.0|
+-------+----+----+
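As a sanity check, the values in this output can be reproduced without Spark. The following plain-Python sketch applies the same filter / group / pivot / mean / sort steps to the sample rows of transactionsDf shown above (this is an illustrative re-computation, not PySpark code):

```python
# Sample rows of transactionsDf: (transactionId, predError, value, storeId, productId, f)
rows = [
    (1, 3,    4,    25,   1, None),
    (2, 6,    7,    2,    2, None),
    (3, 3,    None, 25,   3, None),
    (4, None, None, 3,    2, None),
    (5, None, None, None, 2, None),
    (6, 3,    2,    25,   2, None),
]

# Step 4: filter(transactionsDf.storeId.isNotNull()) -- drop rows where storeId is null
filtered = [r for r in rows if r[3] is not None]

# Steps 2, 5, 1, 3: groupBy("storeId"), pivot("productId", [2, 3]),
# mean("predError"), orderBy("storeId") -- sorted() gives the ascending order
result = {}
for store_id in sorted({r[3] for r in filtered}):
    cells = {}
    for product_id in (2, 3):
        vals = [r[1] for r in filtered
                if r[3] == store_id and r[4] == product_id and r[1] is not None]
        # mean ignores nulls; a group with no non-null predError yields null (None)
        cells[product_id] = sum(vals) / len(vals) if vals else None
    result[store_id] = cells

print(result)
# {2: {2: 6.0, 3: None}, 3: {2: None, 3: None}, 25: {2: 3.0, 3: 3.0}}
```

The computed means match the output table row for row: storeId 2 has a single predError of 6 for productId 2, storeId 3 has only null predError values, and storeId 25 averages to 3.0 for both pivoted product IDs.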

This question is quite convoluted and requires you to think hard about the correct order of operations. The pivot method also makes an appearance - a method that you may not know all that much about (yet).

At the first position in all answers is code block 4, so the question is essentially just about the ordering of the remaining four code blocks.

The question states that the returned DataFrame should be sorted by column storeId. So it makes sense to place code block 3, which includes the orderBy operator, at the very end of the code block. This leaves you with only two answer options.

Now, it is useful to know more about the context of pivot in PySpark. A common pattern is groupBy, pivot, and then another aggregating function, like mean. In the documentation linked below you can see that pivot is a method of pyspark.sql.GroupedData - meaning that before pivoting, you have to call groupBy. The only answer option matching this requirement is the one in which code block 2 (which includes groupBy) is stated before code block 5 (which includes pivot).
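This type flow can be sketched with toy stand-in classes in plain Python (the class and method names mirror PySpark's API, but the bodies are illustrative only, not Spark's implementation):

```python
class GroupedData:
    """Toy stand-in for pyspark.sql.GroupedData."""
    def pivot(self, col, values=None):
        # pivot is only reachable after groupBy, because only GroupedData defines it
        return GroupedData()
    def mean(self, col):
        # the aggregation collapses the grouped data back into a DataFrame
        return DataFrame()

class DataFrame:
    """Toy stand-in for pyspark.sql.DataFrame: note there is no pivot method here."""
    def groupBy(self, *cols):
        return GroupedData()

df = DataFrame()
print(hasattr(df, "pivot"))  # False: a plain DataFrame cannot be pivoted directly
chained = df.groupBy("storeId").pivot("productId", [2, 3]).mean("predError")
print(type(chained).__name__)  # DataFrame
```

Because pivot lives on the object returned by groupBy, any answer that places code block 5 before code block 2 cannot even be chained together, which is why the ordering groupBy, then pivot, then mean is forced.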

More info: pyspark.sql.GroupedData.pivot (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3, question 43 (Databricks import instructions)

