A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.
A few examples of the directory structure are:

Which of the following code snippets will read all the data within the directory structure?
The year/month/day folders here are plain nested directories, so Spark needs the recursiveFileLookup option explicitly enabled to read files at every depth. According to the official Databricks documentation, for Parquet files nested in a directory tree like the one in this example, you should set:
df = spark.read.option('recursiveFileLookup', 'true').parquet('/path/events/data/')
This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.
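As a rough illustration of why the recursive lookup matters, the sketch below uses only the Python standard library (not Spark) to compare listing the files directly in the base directory with walking the whole tree. The folder names other than 2023/01/01, the file name part-0000.parquet, and the helper functions are hypothetical and exist only for this example. It mimics the idea of file discovery, not Spark's actual implementation.

```python
import tempfile
from pathlib import Path


def build_sample_tree(base):
    """Create empty placeholder files under hypothetical year/month/day folders."""
    for day in ("2023/01/01", "2023/01/02", "2023/02/01"):
        folder = base / day
        folder.mkdir(parents=True)
        # Empty stand-in file; a real pipeline would find actual Parquet data here.
        (folder / "part-0000.parquet").touch()


def direct_files(base):
    """Files sitting directly in base, without descending into subdirectories."""
    return sorted(p.name for p in base.iterdir() if p.is_file())


def recursive_files(base):
    """All .parquet files at any depth below base, as paths relative to base."""
    return sorted(p.relative_to(base).as_posix() for p in base.rglob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    build_sample_tree(base)
    top_level = direct_files(base)   # what a non-recursive listing sees
    nested = recursive_files(base)   # what a recursive lookup sees

print(top_level)
print(nested)
```

The non-recursive listing finds nothing because every file sits three levels down, while the recursive walk finds all three placeholder files.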
Option A is incorrect because inferSchema controls schema inference for formats such as CSV. Parquet files carry their own schema, and the option does nothing to enable recursive file reading.
Option C is incorrect because a wildcard such as * matches only a single directory level, so it does not reach files nested deeper in the year/month/day structure.
Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.
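The wildcard point about Option C can be sketched with Python's pathlib globbing, where a single * also stops at a directory separator. This is only an analogy: it shows Python's glob rules on a hypothetical placeholder file, not the behavior of the exact pattern used in Option C.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    # Hypothetical placeholder file under a year/month/day folder.
    folder = base / "2023" / "01" / "01"
    folder.mkdir(parents=True)
    (folder / "part-0000.parquet").touch()

    # One wildcard covers one directory level only, so nothing matches.
    one_level = sorted(
        p.relative_to(base).as_posix() for p in base.glob("*/*.parquet")
    )

    # One wildcard per level (year, month, day) reaches the file.
    three_levels = sorted(
        p.relative_to(base).as_posix() for p in base.glob("*/*/*/*.parquet")
    )

print(one_level)
print(three_levels)
```

A pattern with one wildcard per directory level can match the files, but it is tied to a fixed depth, whereas a recursive lookup works regardless of how deep the files sit.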
Databricks documentation reference:
'To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures' --- Databricks documentation on Parquet files ingestion and options.