
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 1 Question 1 Discussion

Actual exam question for the Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 1
Topic #: 1

The code block shown below should return the number of columns in the CSV file stored at location filePath. From the CSV file, only lines that do not start with a # character should be read. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

__1__(__2__.__3__.csv(filePath, __4__).__5__)

Suggested Answer: E

Correct code block:

len(spark.read.csv(filePath, comment='#').columns)
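
To see this end to end, here is a minimal sketch you can run in a notebook. The file path and sample data are hypothetical, and it assumes a local Spark session:

Code block:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

filePath = "/tmp/example.csv"  # hypothetical location
with open(filePath, "w") as f:
    f.write("# this comment line should not be read\n")
    f.write("1,berry,0.99\n")
    f.write("2,apple,1.49\n")

# comment='#' makes the CSV reader skip lines starting with '#'
df = spark.read.csv(filePath, comment='#')
print(len(df.columns))  # 3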

This is a challenging question with difficulties in an unusual context: the boundary between the DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.

Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.
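
Note that shape is a pandas idiom: a plain PySpark DataFrame (without the pandas API on Spark) does not expose it. A small sketch, reusing the df from the correct code block above:

Code block:

df = spark.read.csv(filePath, comment='#')
len(df.columns)  # works: returns the number of columns
df.shape         # raises AttributeError: 'DataFrame' object has no attribute 'shape'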

Other answer options include size in gap 1. size() is not a built-in Python function, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() method, but this method only returns the length of an array or map stored within a column (documentation linked below). So, using a size() method is not an option here. This leaves us with two potentially valid answers.
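
To illustrate what pyspark.sql.functions.size actually does, here is a sketch with made-up data: it counts the elements inside an array column, per row, rather than counting a DataFrame's columns.

Code block:

from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ["numbers"])
df.select(size("numbers")).show()
# +-------------+
# |size(numbers)|
# +-------------+
# |            3|
# |            2|
# +-------------+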

We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), the DataFrameReader actually lives in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession), and spark.read therefore returns a DataFrameReader (also see documentation below). Finally, there is only one correct answer option remaining.
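
You can verify this in a notebook. A minimal sketch, assuming a Spark session named spark as provided on Databricks:

Code block:

from pyspark.sql import DataFrameReader

print(type(spark.read))                         # <class 'pyspark.sql.readwriter.DataFrameReader'>
print(isinstance(spark.read, DataFrameReader))  # True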

More info:

- pyspark.sql.functions.size - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession.read - PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, Question 50 (Databricks import instructions)

