
Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Certification Provider: Databricks
Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0
Duration: 120 Minutes
Number of questions in our database: 180
Exam Version: Sep. 25, 2022
Databricks Certified Associate Developer for Apache Spark 3.0 Exam Official Topics:
  • Topic 1: Navigate the Spark UI and describe how the catalyst optimizer, partitioning, and caching affect Spark's execution performance
  • Topic 2: Apply the Structured Streaming API to perform analytics on streaming data/ Define the major components of Spark architecture and execution hierarchy
  • Topic 3: Describe how DataFrames are built, transformed, and evaluated in Spark/ Apply the DataFrame API to explore, preprocess, join, and ingest data in Spark

Free Databricks Certified Associate Developer for Apache Spark 3.0 Exam Actual Questions

The questions for Databricks Certified Associate Developer for Apache Spark 3.0 were last updated on Sep. 25, 2022.

Question #1

Which of the elements in the labeled panels represent the operation performed for broadcast variables?


Correct Answer: C

2,3

Correct! Both panels 2 and 3 represent the operation performed for broadcast variables. While a broadcast operation may look like panel 3, with the driver being the bottleneck, it most probably looks like panel 2.

This is because the torrent protocol sits behind Spark's broadcast implementation. In the torrent protocol, each executor will try to fetch missing broadcast variables from the driver or other nodes, preventing the driver from being the bottleneck.

1,2

Wrong. While panel 2 may represent broadcasting, panel 1 shows bi-directional communication, which does not occur in broadcast operations.

3

No. While broadcasting may materialize as shown in panel 3, its use of the torrent protocol also enables communication as shown in panel 2 (see first explanation).

1,3,4

No. While panel 3 may resemble broadcasting (see above), panel 1 shows bi-directional communication -- not a characteristic of broadcasting. Panel 4 shows uni-directional communication, but in the wrong direction. Panel 4 more closely resembles an accumulator variable than a broadcast variable.

2,5

Incorrect. While panel 2 shows broadcasting, panel 5 includes bi-directional communication -- not a characteristic of broadcasting.

More info: Broadcast Join with Spark -- henning.kropponline.de
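For illustration, here is a minimal sketch of creating and reading a broadcast variable in PySpark; the lookup dictionary, DataFrame, and column names below are hypothetical and not part of the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

# The driver registers the small lookup table once; executors then fetch its
# blocks from the driver or from each other (torrent-like protocol).
countryLookup = spark.sparkContext.broadcast({"DE": "Germany", "US": "United States"})

df = spark.createDataFrame([(1, "DE"), (2, "US")], ["id", "countryCode"])

# Read the broadcast value on the executors via .value inside a UDF.
resolveCountry = udf(lambda code: countryLookup.value.get(code, "unknown"), StringType())
df.withColumn("country", resolveCountry("countryCode")).show()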


Question #2

Which of the following code blocks creates a new 6-column DataFrame by appending the rows of the 6-column DataFrame yesterdayTransactionsDf to the rows of the 6-column DataFrame todayTransactionsDf, ignoring that both DataFrames have different column names?

Correct Answer: E

todayTransactionsDf.union(yesterdayTransactionsDf)

Correct. The union command appends rows of yesterdayTransactionsDf to the rows of todayTransactionsDf, ignoring that both DataFrames have different column names. The resulting DataFrame will have the column names of DataFrame todayTransactionsDf.

todayTransactionsDf.unionByName(yesterdayTransactionsDf)

No. unionByName specifically tries to match columns in the two DataFrames by name and only appends values in columns with identical names across the two DataFrames. In the form presented above, the command is a great fit for combining DataFrames that have exactly the same columns, but in a different order. In this case, though, the command will fail because the two DataFrames have different columns.

todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True)

No. The unionByName command is described in the previous explanation. However, with the allowMissingColumns argument set to True, it is no longer an issue that the two DataFrames have different column names. Any column that does not have a match in the other DataFrame will be filled with null where there is no value. In the case at hand, the resulting DataFrame will have 7 or more columns though, so this command is not the right answer.

union(todayTransactionsDf, yesterdayTransactionsDf)

No, there is no union method in pyspark.sql.functions.

todayTransactionsDf.concat(yesterdayTransactionsDf)

Wrong, the DataFrame class does not have a concat method.

More info: pyspark.sql.DataFrame.union --- PySpark 3.1.2 documentation, pyspark.sql.DataFrame.unionByName --- PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question 18 (Databricks import instructions)
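As a quick illustration, the sketch below contrasts union() and unionByName(); the schemas and rows are made up (two columns instead of six) purely so the snippet is self-contained.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-sketch").getOrCreate()

# Hypothetical, simplified stand-ins for the two 6-column DataFrames.
todayTransactionsDf = spark.createDataFrame([(1, 10.0)], ["transactionId", "amount"])
yesterdayTransactionsDf = spark.createDataFrame([(2, 20.0)], ["txId", "value"])

# union() appends rows by position and keeps the left DataFrame's column names.
todayTransactionsDf.union(yesterdayTransactionsDf).show()

# unionByName() matches columns by name and fails here because the names differ;
# with allowMissingColumns=True (Spark 3.1+) it succeeds but returns the union of
# all column names (four here), filling the gaps with null.
todayTransactionsDf.unionByName(yesterdayTransactionsDf, allowMissingColumns=True).show()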


Question #3

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

Correct Answer: A

itemsDf.write.mode('overwrite').parquet(filePath)

Correct! itemsDf.write returns a pyspark.sql.DataFrameWriter instance whose overwriting behavior can be modified via the mode() setting or by passing mode='overwrite' to the parquet() command. Although the parquet format is not prescribed for solving this question, parquet() is a valid method for having Spark write the data to disk.

itemsDf.write.mode('overwrite').path(filePath)

No. A pyspark.sql.DataFrameWriter instance does not have a path() method.

itemsDf.write.option('parquet').mode('overwrite').path(filePath)

Incorrect, see above. In addition, a file format cannot be passed via the option() method.

itemsDf.write(filePath, mode='overwrite')

Wrong. Unfortunately, this is too simple. You need to obtain a DataFrameWriter for the DataFrame by calling itemsDf.write, upon which you can apply further methods to control how Spark should write the data to disk. You cannot, however, pass arguments to itemsDf.write directly.

itemsDf.write().parquet(filePath, mode='overwrite')

Incorrect. itemsDf.write is accessed as a property, not called as a method, so itemsDf.write() fails. See above.

More info: pyspark.sql.DataFrameWriter.parquet --- PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question 56 (Databricks import instructions)
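A minimal sketch of the overwrite behavior, assuming a placeholder schema for itemsDf and a hypothetical filePath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-sketch").getOrCreate()

# Placeholder data; only the write calls mirror the answer options.
itemsDf = spark.createDataFrame([(1, "shirt"), (2, "shoes")], ["itemId", "itemName"])
filePath = "/tmp/items.parquet"  # hypothetical storage location

# Equivalent ways to replace any existing data at filePath:
itemsDf.write.mode("overwrite").parquet(filePath)
itemsDf.write.parquet(filePath, mode="overwrite")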


Question #4

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

Correct Answer: B

Correct code block:

transactionsDf.select('storeId').printSchema()

The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.

A first pattern of frequently wrong answers that you may recognize by now is column names that are not expressed in quotes. For this reason, the answer that includes storeId without quotes should be eliminated.

By now, you may have understood that DataFrame.limit() is useful for returning a specified number of rows. It has nothing to do with specific columns. For this reason, the answer that resolves to limit('storeId') can be eliminated.

Given that we are interested in information about the data type, you should question whether the answer that resolves to limit(1).columns provides you with this information. While DataFrame.columns is a valid call, it will only report back column names, not column types. So, you can eliminate this option.

The two remaining options use either the printSchema() or the print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select('storeId') part just returns the storeId column of transactionsDf - this works here, since we are only interested in that column's type anyway.

More info: pyspark.sql.DataFrame.printSchema --- PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question 57 (Databricks import instructions)
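For reference, a short self-contained sketch (with a placeholder schema for transactionsDf) of what the correct code block prints:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

# Hypothetical stand-in for transactionsDf with a storeId column.
transactionsDf = spark.createDataFrame([(1, 25)], ["transactionId", "storeId"])

# Prints the schema of the single selected column, e.g.:
# root
#  |-- storeId: long (nullable = true)
transactionsDf.select("storeId").printSchema()

# For comparison, DataFrame.columns only lists names, while dtypes also shows types.
print(transactionsDf.columns)  # ['transactionId', 'storeId']
print(transactionsDf.dtypes)   # [('transactionId', 'bigint'), ('storeId', 'bigint')]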


Question #5

The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this.

__1__.__2__(__3__, __4__, __5__)

Correct Answer: C

Correct code block:

transactionsDf.join(broadcast(itemsDf), 'transactionId', 'left_semi')

This question is extremely difficult and exceeds the difficulty of questions in the exam by far.

A first indication of what is asked from you here is the remark that 'the query should be executed in an optimized way'. You also have qualitative information about the size of itemsDf and transactionsDf. Given that itemsDf is 'very small' and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the 'very small' DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf - the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.

When thinking about the broadcast() operator, you may also remember that it is a function in pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.

Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the 'left' table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
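To make the answer concrete, here is a hedged, self-contained sketch of the broadcast left-semi join; the schemas and rows are placeholders, only the join call mirrors the correct answer.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical stand-ins: a large transactions DataFrame and a tiny items DataFrame.
transactionsDf = spark.createDataFrame(
    [(1, 10.0), (2, 20.0), (3, 5.0)], ["transactionId", "amount"])
itemsDf = spark.createDataFrame([(1,), (3,)], ["transactionId"])

# broadcast() hints Spark to ship the small DataFrame to every executor;
# 'left_semi' keeps only transactionsDf's columns, for rows with a matching transactionId.
transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi").show()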


