
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 1 Question 42 Discussion

Actual exam question from the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 42
Topic #: 1

Which of the following is one of the big performance advantages that Spark has over Hadoop?

Suggested Answer: B

This is a very tricky question: it involves knowledge of both schema merging and of how schemas are handled when reading parquet files.

spark.read.option('mergeSchema', 'true').parquet(filePath)

Correct. Spark's DataFrameReader mergeSchema option works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
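A minimal sketch of this behavior (the paths, session setup, and column names below, such as /tmp/demo and tax_id, are hypothetical and only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mergeSchemaDemo").getOrCreate()

# Two parquet files with overlapping but non-identical schemas; the shared
# column id has the same data type in both, so mergeSchema can reconcile them.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("overwrite").parquet("/tmp/demo/one")
spark.createDataFrame([(2, "t-42")], ["id", "tax_id"]) \
    .write.mode("overwrite").parquet("/tmp/demo/two")

# The merged result has id, name, and tax_id, with nulls wherever a
# file lacks a column.
df = spark.read.option("mergeSchema", "true").parquet("/tmp/demo/one", "/tmp/demo/two")
df.printSchema()
df.show()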

spark.read.parquet(filePath)

Incorrect. While this would read in data from both partitions, only the schema of the parquet file that is read first would be considered, so columns that appear only in the second partition (e.g. tax_id) would be lost.
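Continuing the hypothetical layout from the sketch above, the loss is easy to observe:

# Without mergeSchema, Spark derives the schema from a single file's footer
# (or a summary file), so whether tax_id survives depends on which file is
# sampled -- it is not guaranteed to appear.
df_plain = spark.read.parquet("/tmp/demo/one", "/tmp/demo/two")
df_plain.printSchema()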

nx = 0
for file in dbutils.fs.ls(filePath):
    # Skip non-parquet entries such as _SUCCESS marker files.
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        # union() matches columns by position, not by name, and requires
        # both sides to have the same number of columns.
        df = df.union(df_temp)
    nx = nx + 1
df

Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have exactly the same number of columns, with identical data types.
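As an aside, and not one of the answer choices: since Spark 3.1, DataFrame.unionByName with allowMissingColumns=True can combine frames whose column sets differ, padding the gaps with nulls. A minimal sketch with hypothetical frames, reusing the spark session from the sketch above:

df_a = spark.createDataFrame([(1, "a")], ["id", "name"])
df_b = spark.createDataFrame([(2, "t-42")], ["id", "tax_id"])

# union() would fail here because the columns differ; unionByName with
# allowMissingColumns=True fills the missing columns with null instead.
merged = df_a.unionByName(df_b, allowMissingColumns=True)
merged.show()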

spark.read.parquet(filePath, mergeSchema='y')

False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in the code block, the option accepts the value true as a boolean or as a string. 'y' is not a valid value.
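For reference, these spellings are all accepted by the parquet reader (a sketch, assuming filePath points at parquet data):

spark.read.option("mergeSchema", True).parquet(filePath)    # boolean option
spark.read.option("mergeSchema", "true").parquet(filePath)  # string option
spark.read.parquet(filePath, mergeSchema=True)              # keyword argument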

nx = 0
for file in dbutils.fs.ls(filePath):
    # Skip non-parquet entries such as _SUCCESS marker files.
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        # No join key is given, so this is a full outer join without a
        # condition; shared column names end up duplicated in the result.
        df = df.join(df_temp, how='outer')
    nx = nx + 1
df

No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, and the question says all columns included in the partitions should appear exactly once.
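To see the duplication concretely, a sketch with a deliberately shared column name (both frames are hypothetical, reusing the spark session from above):

df_a = spark.createDataFrame([(1, "a")], ["id", "name"])
df_b = spark.createDataFrame([(2, "b")], ["id", "name"])

# With no join key, the full outer join keeps both copies of every shared
# column name.
dup = df_a.join(df_b, how='outer')
print(dup.columns)  # ['id', 'name', 'id', 'name']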

More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium

Static notebook | Dynamic notebook: See test 3, question 37 (Databricks import instructions)


Contribute your Thoughts:

Dean
9 hours ago
Ah, the age-old Spark vs. Hadoop debate. I wonder if the exam question author has a Spark tattoo or something.
upvoted 0 times
Sherita
3 days ago
Wait, Spark stores data in the DAG format? That's news to me. I thought it was all about those fancy DataFrames.
upvoted 0 times
Peggie
7 days ago
Deploying Spark on Kubernetes does sound interesting, but I'm not sure how that directly improves performance compared to Hadoop. Hmm, needs more clarification.
upvoted 0 times
Alita
10 days ago
Spark's in-memory processing is definitely a big advantage over Hadoop's disk-based approach. I've seen the performance difference first-hand.
upvoted 0 times
Holley
11 days ago
I'm not sure, but I think it's between C and E.
upvoted 0 times
Christene
14 days ago
I agree with Trinidad, storing data and computation in memory is a big advantage.
upvoted 0 times
Trinidad
24 days ago
I think the answer is C.
upvoted 0 times
