
Databricks Exam Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Topic 2 Question 31 Discussion

An actual exam question from the Databricks Certified Associate Developer for Apache Spark 3.0 exam
Question #: 31
Topic #: 2

Which of the following code blocks reads in the two-partition parquet file stored at filePath, making sure all columns are included exactly once even though each partition has a different schema?

Schema of first partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)

Schema of second partition:

root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- rollId: integer (nullable = true)
 |-- f: integer (nullable = true)
 |-- tax_id: integer (nullable = false)

Suggested Answer: B

This is a very tricky question: it requires knowledge both of schema merging and of how schemas are handled when reading Parquet files.

spark.read.option('mergeSchema', 'true').parquet(filePath)

Correct. Spark's DataFrameReader mergeSchema option works well here, since the columns that appear in both partitions have matching data types. Note that mergeSchema would fail if one or more columns with the same name appeared in both partitions with different data types.
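
For illustration, here is a minimal, hypothetical sketch of this behavior. It assumes a SparkSession named spark, uses a made-up scratch path (demoPath, not the question's filePath), and abbreviates the column set: it writes two Parquet files with diverging schemas and reads them back with mergeSchema enabled.

demoPath = '/tmp/merge_demo'  # made-up scratch location, not the question's filePath

df1 = spark.createDataFrame([(1, 3, 25)], ['transactionId', 'predError', 'storeId'])
df2 = spark.createDataFrame([(2, 4, 7)], ['transactionId', 'storeId', 'tax_id'])

df1.write.mode('overwrite').parquet(demoPath + '/part1')
df2.write.mode('overwrite').parquet(demoPath + '/part2')

# With mergeSchema, the union of both schemas is used: all four distinct
# columns (transactionId, predError, storeId, tax_id) appear exactly once.
merged = spark.read.option('mergeSchema', 'true').parquet(demoPath + '/part1', demoPath + '/part2')
merged.printSchema()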

spark.read.parquet(filePath)

Incorrect. While this would read in data from both partitions, only the schema of the Parquet file that is read first would be considered, so columns that appear only in the second partition (e.g. tax_id) would be lost.
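
Continuing the hypothetical sketch above, the unmerged read makes the column loss visible:

# Without mergeSchema, Spark picks the schema of a single file, so a
# column that exists only in the other file (e.g. tax_id) is dropped.
plain = spark.read.parquet(demoPath + '/part1', demoPath + '/part2')
plain.printSchema()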

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.union(df_temp)
    nx = nx+1
df

Wrong. The key idea of this solution is the DataFrame.union() command. While this command merges all the data, it requires that both partitions have exactly the same number of columns with matching data types.
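
As a side note, a loop like this could be made to work in Spark 3.1 and later by replacing union() with unionByName(allowMissingColumns=True), which matches columns by name and null-fills columns missing from either side. This is not one of the answer options; it is only a sketch of why positional union() is the wrong tool here.

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        # unionByName (Spark 3.1+) aligns columns by name instead of
        # position and null-fills the ones missing from either side.
        df = df.unionByName(df_temp, allowMissingColumns=True)
    nx = nx+1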

spark.read.parquet(filePath, mergeSchema='y')

False. While using the mergeSchema option is the correct way to solve this problem, and it can even be passed to DataFrameReader.parquet() as in the code block, the option accepts the value True as a boolean or as a string. 'y' is not a valid value.
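
For reference, both of the following spellings are accepted (a small sketch; the two lines are equivalent):

spark.read.parquet(filePath, mergeSchema=True)    # boolean
spark.read.parquet(filePath, mergeSchema='true')  # string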

nx = 0
for file in dbutils.fs.ls(filePath):
    if not file.name.endswith('.parquet'):
        continue
    df_temp = spark.read.parquet(file.path)
    if nx == 0:
        df = df_temp
    else:
        df = df.join(df_temp, how='outer')
    nx = nx+1
df

No. This provokes a full outer join. While the resulting DataFrame will have all columns of both partitions, columns that appear in both partitions will be duplicated, whereas the question says all columns included in the partitions should appear exactly once.
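
A small sketch of that duplication (hypothetical column subset; assumes a SparkSession named spark):

left = spark.createDataFrame([(1, 10)], ['transactionId', 'storeId'])
right = spark.createDataFrame([(1, 99)], ['transactionId', 'tax_id'])

# With no join condition, Spark joins on an empty column list, so every
# column from both sides is kept; transactionId shows up twice.
joined = left.join(right, how='outer')
print(joined.columns)  # ['transactionId', 'storeId', 'transactionId', 'tax_id']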

More info: Merging different schemas in Apache Spark | by Thiago Cordon | Data Arena | Medium

Static notebook | Dynamic notebook: See test 3, Question 37 (Databricks import instructions)

