Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Certification Provider: Databricks
Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0
Duration: 120 Minutes
Number of questions in our database: 180
Exam Version: Mar. 23, 2023
Databricks Certified Associate Developer for Apache Spark 3.0 Exam Official Topics:
  • Topic 1: Navigate the Spark UI and describe how the Catalyst optimizer, partitioning, and caching affect Spark's execution performance
  • Topic 2: Apply the Structured Streaming API to perform analytics on streaming data / Define the major components of Spark architecture and execution hierarchy
  • Topic 3: Describe how DataFrames are built, transformed, and evaluated in Spark / Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark

Free Databricks Certified Associate Developer for Apache Spark 3.0 Exam Actual Questions

The questions for Databricks Certified Associate Developer for Apache Spark 3.0 were last updated on Mar. 23, 2023

Question #1

The code block shown below should return a copy of DataFrame transactionsDf with an added column cos. This column should have the values in column value converted to degrees and having the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))

Correct Answer: C

Correct code block:

from pyspark.sql.functions import cos, degrees, round

transactionsDf.withColumn('cos', round(cos(degrees(transactionsDf.value)),2))

This question is especially confusing because col and 'cos' look so similar. Similar-looking answer options can also appear in the exam and, just like in this question, you need to pay attention to the details to identify the correct answer option.

The first answer option to throw out is the one that starts with withColumnRenamed: the question speaks specifically of adding a column. The withColumnRenamed operator only renames an existing column, so you cannot use it here.

Next, you have to decide what should be in gap 2, the first argument of transactionsDf.withColumn(). The documentation (linked below) shows that the first argument of withColumn needs to be a string with the name of the column to be added. So, any answer that includes col('cos') as the option for gap 2 can be disregarded.

This leaves you with two possible answers. The real difference between them is where the cos and degrees functions go: in gaps 3 and 4, or vice versa. From the question you can find out that the new column should have 'the values in column value converted to degrees and having the cosine of those converted values taken'. This prescribes a clear order of operations: first, you convert the values from column value to degrees, and then you take the cosine of those values. So, the inner parentheses (gap 4) should contain the degrees function and then, logically, gap 3 holds the cos function. This leaves you with just one possible correct answer.
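The order of operations can be checked outside Spark. The following is a minimal sketch that uses Python's standard math module as a stand-in for the cos and degrees functions of pyspark.sql.functions; the helper names and the sample input are illustrative assumptions and do not come from the exam question.

```python
import math


def cos_of_degrees(value):
    # Inner call first: convert the value to degrees, then take the cosine.
    return round(math.cos(math.degrees(value)), 2)


def degrees_of_cos(value):
    # Reversed order: take the cosine first, then convert the result to degrees.
    return round(math.degrees(math.cos(value)), 2)


sample = 0.0
print(cos_of_degrees(sample))  # 1.0
print(degrees_of_cos(sample))  # 57.3
```

The two nestings give different results for the same input, which is why the position of the two functions in gaps 3 and 4 decides the answer.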

More info: pyspark.sql.DataFrame.withColumn (PySpark 3.1.2 documentation)

Static notebook | Dynamic notebook: See test 3, Question 49 (Databricks import instructions)


Question #2

Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?

Correct Answer: A

transactionsDf.select('storeId').dropDuplicates().count()

Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.

transactionsDf.select(count('storeId')).dropDuplicates()

No. transactionsDf.select(count('storeId')) just returns a single-row DataFrame holding the number of non-null values in column storeId, duplicates included. dropDuplicates() has no effect on that single row.

transactionsDf.dropDuplicates().agg(count('storeId'))

Incorrect. transactionsDf.dropDuplicates() removes only rows that are duplicated across all columns. It does not deduplicate with respect to column storeId alone, so repeated storeId values can remain and get counted.

transactionsDf.distinct().select('storeId').count()

Wrong. transactionsDf.distinct() identifies rows that are unique across all columns, not rows that are unique with respect to column storeId. This may leave duplicate values in the column, so the count does not represent the number of unique values in that column.

transactionsDf.select(distinct('storeId')).count()

False. There is no distinct function in pyspark.sql.functions.
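The difference between these answer options can be modeled without Spark. The sketch below is a plain-Python analogy, not PySpark code: each tuple stands for one row, and the sample data and variable names are made up for illustration.

```python
# Each tuple models one row: (transactionId, storeId).
rows = [
    (1, 25),
    (2, 25),
    (3, 3),
    (3, 3),  # exact duplicate of the previous row
]

# Model of select('storeId').dropDuplicates().count():
# keep only the column, then count its distinct values.
unique_store_ids = len({store_id for _, store_id in rows})

# Model of distinct().select('storeId').count():
# deduplicate whole rows first, then count the rows that remain.
rows_after_distinct = len(set(rows))

# Model of select(count('storeId')):
# count the non-null values, duplicates included.
non_null_store_ids = sum(1 for _, store_id in rows if store_id is not None)

print(unique_store_ids, rows_after_distinct, non_null_store_ids)  # 2 3 4
```

Only the first computation matches the number of unique storeId values; the other two overcount because storeId 25 appears in two rows that differ in transactionId.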


Question #3

The code block displayed below contains one or more errors. The code block should load parquet files at location filePath into a DataFrame, only loading those files that have been modified before 2029-03-20 05:44:46. Spark should enforce a schema according to the schema shown below. Find the error.

Schema:

root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)

Code block:

schema = StructType([
    StructType("itemId", IntegerType(), True),
    StructType("attributes", ArrayType(StringType(), True), True),
    StructType("supplier", StringType(), True)
])

spark.read.options("modifiedBefore", "2029-03-20T05:44:46").schema(schema).load(filePath)

Correct Answer: D

Correct code block:

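from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Each column is declared as a StructField inside the enclosing StructType.
schema = StructType([
    StructField('itemId', IntegerType(), True),
    StructField('attributes', ArrayType(StringType(), True), True),
    StructField('supplier', StringType(), True)
])

# options() takes keyword arguments; parquet() names the file format explicitly.
spark.read.options(modifiedBefore='2029-03-20T05:44:46').schema(schema).parquet(filePath)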

This question is more difficult than what you would encounter in the exam. In the exam, for this question type, only one error needs to be identified and not 'one or multiple' as in the question.

Columns in the schema definition use the wrong object type, the modification date threshold is specified incorrectly, and Spark cannot identify the file format.

Correct! Columns in the schema definition should use the StructField type. Building a schema from pyspark.sql.types, as here using classes like StructType and StructField, is one of multiple ways of expressing a schema in Spark. A StructType always contains a list of StructFields (see documentation linked below). So, nesting StructType inside StructType as shown in the question is wrong.

The modification date threshold should be specified by a keyword argument, as in options(modifiedBefore='2029-03-20T05:44:46'), and not by two consecutive positional arguments as in the original code block (see documentation linked below).
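Why two positional arguments fail can be seen from the argument shapes alone. The sketch below uses a hypothetical stand-in class, ReaderStandIn, that copies only the signatures of DataFrameReader.option(key, value) and DataFrameReader.options(**options); it is not Spark's implementation and reads no files.

```python
class ReaderStandIn:
    """Hypothetical stand-in that copies only the argument shapes of DataFrameReader."""

    def __init__(self):
        self.settings = {}

    def option(self, key, value):
        # Singular form: one option name and one value, passed positionally.
        self.settings[key] = value
        return self

    def options(self, **options):
        # Plural form: any number of options, passed as keyword arguments only.
        self.settings.update(options)
        return self


reader = ReaderStandIn().options(modifiedBefore="2029-03-20T05:44:46")
print(reader.settings)

try:
    ReaderStandIn().options("modifiedBefore", "2029-03-20T05:44:46")
except TypeError as error:
    # Two positional arguments do not match a keyword-only signature.
    print("TypeError:", error)
```

With the singular option("modifiedBefore", "2029-03-20T05:44:46") the two positional arguments would have been accepted, which is what makes the plural form in the question easy to overlook.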

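Spark is not told the file format. It has to be specified through DataFrameReader.format(), as the format argument to DataFrameReader.load(), or by calling a format-specific method such as DataFrameReader.parquet(). Without any of these, load() falls back to the default data source configured in spark.sql.sources.default (parquet unless changed), so the code block depends on session configuration instead of stating the format.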
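The three ways of naming the format can be written side by side. This is a sketch under assumptions: spark is taken to be an existing SparkSession passed in by the caller, and the function names are made up for illustration.

```python
def read_with_format_method(spark, file_path):
    # Name the format with DataFrameReader.format(), then load.
    return spark.read.format("parquet").load(file_path)


def read_with_load_argument(spark, file_path):
    # Pass the format as a keyword argument to DataFrameReader.load().
    return spark.read.load(file_path, format="parquet")


def read_with_parquet_method(spark, file_path):
    # Call the format-specific reader method directly.
    return spark.read.parquet(file_path)
```

All three are meant to be interchangeable for parquet input; the corrected code block uses the third form.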

Columns in the schema are unable to handle empty values and the modification date threshold is specified incorrectly.

No. If StructField were used for the columns instead of StructType (see above), the third argument would specify whether the column is nullable. The original schema shows that the columns should be nullable, and this is specified correctly by the third argument being True in the schema in the code block.

It is correct, however, that the modification date threshold is specified incorrectly (see above).

The attributes array is specified incorrectly, Spark cannot identify the file format, and the syntax of the call to Spark's DataFrameReader is incorrect.

Wrong. The attributes array is specified correctly, following the syntax for ArrayType (see linked documentation below). It is correct that Spark cannot identify the file format (see correct answer above). However, the DataFrameReader is called correctly through the SparkSession spark.

Columns in the schema definition use the wrong object type and the syntax of the call to Spark's DataFrameReader is incorrect.

Incorrect. The columns in the schema definition do use the wrong object type (see correct answer above), but the DataFrameReader is called correctly through the SparkSession spark.

The data type of the schema is incompatible with the schema() operator and the modification date threshold is specified incorrectly.

False. The data type of the schema is StructType, which is an accepted data type for the DataFrameReader.schema() method. It is correct, however, that the modification date threshold is specified incorrectly (see correct answer above).


Question #5

Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?

Schema of first partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)

Schema of second partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)

Correct Answer: B

This is a very tricky question that involves knowledge about both merging and schemas when reading parquet files.

spark.read.option('mergeSchema', 'true').parquet(filePath)

Correct. The mergeSchema option of Spark's DataFrameReader works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
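The idea behind schema merging can be illustrated with a small plain-Python model. This is an analogy only: merge_schemas is a hypothetical helper, schemas are reduced to mappings from column name to type name, and Spark's actual merge rules and resulting column order are not reproduced here.

```python
def merge_schemas(first, second):
    """Merge two {column name: type name} mappings, keeping first-seen order."""
    merged = dict(first)
    for name, type_name in second.items():
        if name in merged and merged[name] != type_name:
            # Same column name with a different type: the merge cannot succeed.
            raise ValueError(f"conflicting types for column {name!r}")
        merged.setdefault(name, type_name)
    return merged


first_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "productId": "integer", "f": "integer",
}
second_partition = {
    "transactionId": "integer", "predError": "integer", "value": "integer",
    "storeId": "integer", "rollId": "integer", "f": "integer", "tax_id": "integer",
}

merged = merge_schemas(first_partition, second_partition)
print(list(merged))
```

In this model every column from either partition appears exactly once in the result, and a shared column name with two different types raises an error instead of being merged.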

spark.read.parquet(filePath)

Incorrect. While this would read in data from both partitions, only the schema of the parquet file that is read first would be considered, so columns that appear only in the second partition (e.g. tax_id) would be lost.

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df

Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have the exact same number of columns with identical data types.

spark.read.parquet(filePath, mergeSchema='y')

False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in the code block, it accepts the value True either as a boolean or in its string form. 'y' is not a valid value.

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how='outer')
    nx = nx+1
df

No. This performs a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated. The question says that all columns included in the partitions should appear exactly once.

More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium

Static notebook | Dynamic notebook: See test 3, Question 37 (Databricks import instructions)


