Welcome to Pass4Success


Databricks Certified Associate Developer for Apache Spark 3.5 Exam - Topic 7 Question 13 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.5 exam
Question #: 13
Topic #: 7

A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered nightly by an upstream team. The data is stored under the base path /path/events/data, and the upstream team drops each day's data into subdirectories following the year/month/day convention.

A few examples of the directory structure are:

Which of the following code snippets will read all the data within the directory structure?

Suggested Answer: B

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. Because the subdirectories here use plain year/month/day names rather than Hive-style key=value partition folders (e.g., year=2023/month=01), Spark's automatic partition discovery does not apply. According to the Databricks documentation, when dealing with deeply nested Parquet files in a directory tree like this one, you should set:

df = spark.read.option('recursiveFileLookup', 'true').parquet('/path/events/data/')

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because while it includes an option, inferSchema is irrelevant here and does not enable recursive file reading.

Option C is incorrect because a wildcard matches only one directory level at a time, so it does not reliably reach files nested several levels deep (year/month/day) unless every level is spelled out explicitly.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.
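The behavioral difference between a recursive lookup (option B), a single-level wildcard (option C's pitfall), and a flat read of the base path (option D's pitfall) can be illustrated with a small stdlib-only sketch, no Spark required. The directory and file names below are hypothetical stand-ins for the nightly drops:

```python
import tempfile
from pathlib import Path

# Build a miniature version of the nightly drop layout (hypothetical dates).
root = Path(tempfile.mkdtemp()) / "events" / "data"
for day in ["2023/01/01", "2023/01/02", "2023/02/01"]:
    d = root / day
    d.mkdir(parents=True)
    (d / "part-0000.parquet").touch()

# recursiveFileLookup-style: walk every subdirectory at any depth (option B).
recursive = sorted(root.rglob("*.parquet"))

# Single-level wildcard: matches only one directory deep (option C's pitfall).
one_level = sorted(root.glob("*/*.parquet"))

# Flat read: only files directly under the base path (option D's pitfall).
flat = sorted(root.glob("*.parquet"))

print(len(recursive), len(one_level), len(flat))  # → 3 0 0
```

Only the recursive walk finds all three daily files; the single-level wildcard and the flat read each find none, which mirrors why option B is the only snippet that ingests the full year/month/day tree.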

Databricks documentation reference:

'To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures.' (Databricks documentation on Parquet file ingestion and options.)


Contribute your Thoughts:

Margarita
19 hours ago
I remember practicing a similar question where we had to read nested directories, and I think option B could be useful for that, but I’m not confident about the "recursiveFileLookup" part.
upvoted 0 times
Son
6 days ago
I think option A might be the right choice since it reads from the base path directly, but I'm not entirely sure if it captures all subdirectories.
upvoted 0 times
