Databricks Certified Associate Developer for Apache Spark 3.5 Exam - Topic 1 Question 5 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.5 exam
Question #: 5
Topic #: 1

1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt does read the text files, but each record contains a single line. This code is shown below:

from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"

df = spark.read.text(txt_path)  # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path

Which code change produces a DataFrame that meets the data scientist's requirements?

Suggested Answer: A

By default, the spark.read.text() method reads a text file one line per record. This means that each line in a text file becomes one row in the resulting DataFrame.

To read each file as a single record, Apache Spark provides the option wholetext, which, when set to True, causes Spark to treat the entire file contents as one single string per row.

Correct usage:

df = spark.read.option('wholetext', True).text(txt_path)

This way, each record in the DataFrame will contain the full content of one file instead of one line per record.

To also include the file path, the function input_file_name() can be used to create an additional column that stores the complete path of the file being read:

from pyspark.sql.functions import input_file_name

df = (
    spark.read.option('wholetext', True).text(txt_path)
    .withColumn('file_path', input_file_name())
)

This approach satisfies both requirements from the question:

Each record holds the entire contents of a file.

Each record also contains the file path from which the text was read.

Why the other options are incorrect:

B or D (lineSep) -- The lineSep option only defines the delimiter between lines. It does not combine the entire file content into a single record.

C (wholetext=False) -- This is the default behavior, which still reads one record per line rather than per file.

Reference (Databricks Apache Spark 3.5 -- Python / Study Guide):

PySpark API Reference: DataFrameReader.text -- describes the wholetext option.

PySpark Functions: input_file_name() -- adds a column with the source file path.

Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025): Section "Using Spark DataFrame APIs" -- covers reading files and handling DataFrames.


Contribute your Thoughts:

Chaya
9 hours ago
I disagree, B) seems more appropriate for handling lines.
upvoted 0 times
...
Nana
6 days ago
A) is the way to go! Whole text option is what we need.
upvoted 0 times
...
Stephaine
11 days ago
Who needs to read entire files when you can just read one line at a time? That's the true data scientist way!
upvoted 0 times
...
Reynalda
16 days ago
Option D is wrong. The lineSep option is used to specify the line separator, not to read the entire file.
upvoted 0 times
...
Novella
21 days ago
Option C is incorrect. Setting wholetext to False would defeat the purpose of reading the entire file.
upvoted 0 times
...
Lucia
26 days ago
Option B is not correct. The lineSep option is used to specify the line separator, not to read the entire file.
upvoted 0 times
...
Tamesha
1 month ago
I’m a bit confused about the options. I remember something about `lineSep` affecting how lines are interpreted, but I can't recall if it would help here.
upvoted 0 times
...
Lai
1 month ago
This question seems similar to one we practiced where we had to adjust how files were read. I think we used `wholetext` in that case too.
upvoted 0 times
...
Berry
1 month ago
I'm not entirely sure, but I feel like the `lineSep` option is more about how to split lines rather than reading the whole file. So, that might not be it.
upvoted 0 times
...
Felicia
2 months ago
Okay, I think I've got it. A) Add the option wholetext to the text() function should do the trick. That will read the entire contents of each file into a single row, which is what the data scientist is looking for. Feels pretty straightforward.
upvoted 0 times
...
James
2 months ago
Hmm, I'm not totally sure about this one. I'm leaning towards A, but I want to double-check the documentation to make sure I understand how the wholetext option works. Don't want to guess and get it wrong.
upvoted 0 times
...
Shaquana
2 months ago
The key here is that we want each record to contain the full text of a file, not just a single line. So I believe the wholetext option is the way to go. It should read the entire contents of each file into a single row.
upvoted 0 times
...
Cristal
2 months ago
I think I remember that the `wholetext` option is what we need to read the entire file as a single record. So, maybe option A?
upvoted 0 times
...
Daron
2 months ago
Option A is the correct answer. The wholetext option will read the entire contents of each file as a single record.
upvoted 0 times
...
Ligia
2 months ago
This question is tricky! I think A is the right choice.
upvoted 0 times
...
Lemuel
3 months ago
I agree, A makes sense. Whole text option is needed.
upvoted 0 times
...
Caren
3 months ago
I feel confident about A too. It’s clear what the data scientist wants.
upvoted 0 times
...
Raul
3 months ago
I'm a bit confused on this one. I'm not sure if the wholetext option is the right approach or if we need to do something else with the lineSep. I'll have to think it through carefully.
upvoted 0 times
...
Ettie
3 months ago
I think the answer is A) Add the option wholetext to the text() function. That should read the entire contents of each file into a single row.
upvoted 0 times
...