1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.
The first attempt does read the text files, but each record contains only a single line of text. That code is shown below:
from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)                      # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add the full source path
Which code change produces a DataFrame that meets the data scientist's requirements?
By default, the spark.read.text() method reads a text file one line per record. This means that each line in a text file becomes one row in the resulting DataFrame.
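The difference between the two reading modes can be illustrated outside Spark with plain Python (a hypothetical sketch using temporary files, not Spark itself): reading line by line yields one record per line, while reading each file whole yields one record per file, which mirrors what wholetext=True does.

```python
import os
import tempfile

# Create two small sample files (hypothetical data for illustration).
tmp = tempfile.mkdtemp()
for name, body in [("a.txt", "line1\nline2"), ("b.txt", "only line")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(body)

paths = sorted(os.path.join(tmp, n) for n in os.listdir(tmp))

# Default-style reading: one record per LINE across all files.
per_line = [line for p in paths for line in open(p).read().splitlines()]

# wholetext-style reading: one record per FILE, tagged with its path.
per_file = [(open(p).read(), p) for p in paths]

print(len(per_line))  # 3 records: two lines from a.txt, one from b.txt
print(len(per_file))  # 2 records: one per file
```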
To read each file as a single record, Apache Spark provides the wholetext option, which, when set to True, causes Spark to treat the entire contents of each file as one string, producing one row per file.
Correct usage:
df = spark.read.option('wholetext', True).text(txt_path)
This way, each record in the DataFrame will contain the full content of one file instead of one line per record.
To also include the file path, the function input_file_name() can be used to create an additional column that stores the complete path of the file being read:
from pyspark.sql.functions import input_file_name

df = (spark.read.option('wholetext', True).text(txt_path)
      .withColumn('file_path', input_file_name()))
This approach satisfies both requirements from the question:
Each record holds the entire contents of a file.
Each record also contains the file path from which the text was read.
Why the other options are incorrect:
B or D (lineSep) -- The lineSep option only defines the delimiter between lines. It does not combine the entire file content into a single record.
C (wholetext=False) -- This is the default behavior, which still reads one record per line rather than per file.
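The lineSep point can be made concrete with a plain-Python sketch (illustrative only, not Spark): changing the delimiter only changes where records are split, it never merges a file into one record.

```python
# A custom "line separator" of '||' still splits the text into multiple
# records -- it does not collapse everything into a single record.
text = "rec1||rec2||rec3"
records = text.split("||")  # analogous to setting lineSep='||'
print(len(records))         # 3 records, not 1
```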
Reference (Databricks Apache Spark 3.5 -- Python / Study Guide):
PySpark API Reference: DataFrameReader.text --- describes the wholetext option.
PySpark Functions: input_file_name() --- adds a column with the source file path.
Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025): Section ''Using Spark DataFrame APIs'' --- covers reading files and handling DataFrames.