
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 1 Question 42 Discussion

Actual exam question from the Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam
Question #: 42
Topic #: 1

Which of the following is one of the big performance advantages that Spark has over Hadoop?

Suggested Answer: B

This is a very tricky question: it involves knowledge of both schema merging and of how schemas are handled when reading parquet files.

spark.read.option('mergeSchema', 'true').parquet(filePath)

Correct. Spark's DataFrameReader mergeSchema option works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
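A minimal sketch of this behavior (the paths, session setup, and column names below, such as /tmp/demo and tax_id, are hypothetical and only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mergeSchemaDemo").getOrCreate()

# Two parquet files with overlapping but non-identical schemas; the shared
# column id has the same data type in both, so mergeSchema can reconcile them.
spark.createDataFrame([(1, "a")], ["id", "name"]) \
    .write.mode("overwrite").parquet("/tmp/demo/one")
spark.createDataFrame([(2, "t-42")], ["id", "tax_id"]) \
    .write.mode("overwrite").parquet("/tmp/demo/two")

# The merged result has id, name, and tax_id, with nulls wherever a
# file lacks a column.
df = spark.read.option("mergeSchema", "true").parquet("/tmp/demo/one", "/tmp/demo/two")
df.printSchema()
df.show()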

spark.read.parquet(filePath)

Incorrect. While this would read in data from both partitions, only the schema of the parquet file that is read first would be considered, so columns that appear only in the second partition (e.g. tax_id) would be lost.
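Continuing the hypothetical layout from the sketch above, the loss is easy to observe:

# Without mergeSchema, Spark derives the schema from a single file's footer
# (or a summary file), so whether tax_id survives depends on which file is
# sampled -- it is not guaranteed to appear.
df_plain = spark.read.parquet("/tmp/demo/one", "/tmp/demo/two")
df_plain.printSchema()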

nx = 0
for file in dbutils.fs.ls(filePath):
    # Skip non-parquet entries such as _SUCCESS marker files.
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        # union() matches columns by position, not by name, and requires
        # both sides to have the same number of columns.
        df = df.union(df_temp)
    nx = nx + 1
df

Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all data, it requires that both partitions have exactly the same number of columns, with identical data types.
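As an aside, and not one of the answer choices: since Spark 3.1, DataFrame.unionByName with allowMissingColumns=True can combine frames whose column sets differ, padding the gaps with nulls. A minimal sketch with hypothetical frames, reusing the spark session from the sketch above:

df_a = spark.createDataFrame([(1, "a")], ["id", "name"])
df_b = spark.createDataFrame([(2, "t-42")], ["id", "tax_id"])

# union() would fail here because the columns differ; unionByName with
# allowMissingColumns=True fills the missing columns with null instead.
merged = df_a.unionByName(df_b, allowMissingColumns=True)
merged.show()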

spark.read.parquet(filePath, mergeSchema='y')

False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in the code block, the option accepts the value true as a boolean or as a string. 'y' is not a valid value.
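For reference, these spellings are all accepted by the parquet reader (a sketch, assuming filePath points at parquet data):

spark.read.option("mergeSchema", True).parquet(filePath)    # boolean option
spark.read.option("mergeSchema", "true").parquet(filePath)  # string option
spark.read.parquet(filePath, mergeSchema=True)              # keyword argument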

nx = 0
for file in dbutils.fs.ls(filePath):
    # Skip non-parquet entries such as _SUCCESS marker files.
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        # No join key is given, so this is a full outer join without a
        # condition; shared column names end up duplicated in the result.
        df = df.join(df_temp, how='outer')
    nx = nx + 1
df

No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, and the question says all columns included in the partitions should appear exactly once.
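To see the duplication concretely, a sketch with a deliberately shared column name (both frames are hypothetical, reusing the spark session from above):

df_a = spark.createDataFrame([(1, "a")], ["id", "name"])
df_b = spark.createDataFrame([(2, "b")], ["id", "name"])

# With no join key, the full outer join keeps both copies of every shared
# column name.
dup = df_a.join(df_b, how='outer')
print(dup.columns)  # ['id', 'name', 'id', 'name']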

More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium

Static notebook | Dynamic notebook: See test 3, question 37 (Databricks import instructions)


Contribute your Thoughts:

Dean
9 hours ago
Ah, the age-old Spark vs. Hadoop debate. I wonder if the exam question author has a Spark tattoo or something.
upvoted 0 times
Sherita
3 days ago
Wait, Spark stores data in the DAG format? That's news to me. I thought it was all about those fancy DataFrames.
upvoted 0 times
Peggie
7 days ago
Deploying Spark on Kubernetes does sound interesting, but I'm not sure how that directly improves performance compared to Hadoop. Hmm, needs more clarification.
upvoted 0 times
Alita
10 days ago
Spark's in-memory processing is definitely a big advantage over Hadoop's disk-based approach. I've seen the performance difference first-hand.
upvoted 0 times
Holley
11 days ago
I'm not sure, but I think it's between C and E.
upvoted 0 times
Christene
14 days ago
I agree with Trinidad, storing data and computation in memory is a big advantage.
upvoted 0 times
Trinidad
24 days ago
I think the answer is C.
upvoted 0 times
