1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.
The first attempt does read the text files, but each record contains a single line. This code is shown below:
from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)  # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path
Which code change produces a DataFrame that meets the data scientist's requirements?
By default, the spark.read.text() method reads a text file one line per record. This means that each line in a text file becomes one row in the resulting DataFrame.
To read each file as a single record, Apache Spark provides the option wholetext, which, when set to True, causes Spark to treat the entire file contents as one single string per row.
Correct usage:
df = spark.read.option('wholetext', True).text(txt_path)
This way, each record in the DataFrame will contain the full content of one file instead of one line per record.
To also include the file path, the function input_file_name() can be used to create an additional column that stores the complete path of the file being read:
from pyspark.sql.functions import input_file_name

df = (
    spark.read.option('wholetext', True).text(txt_path)
    .withColumn('file_path', input_file_name())  # full path of the source file
)
This approach satisfies both requirements from the question:
Each record holds the entire contents of a file.
Each record also contains the file path from which the text was read.
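To make the difference between the two read modes concrete without a cluster, here is a minimal plain-Python sketch. It is only a model of the behavior described above, not Spark code: the helper read_text, its wholetext parameter, and the sample files a.txt and b.txt are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def read_text(directory, wholetext=False):
    """Plain-Python model of the two read modes.

    Returns a list of (value, file_path) rows. This is an illustrative
    stand-in, not the Spark API.
    """
    rows = []
    for path in sorted(Path(directory).iterdir()):
        content = path.read_text()
        # wholetext=True: one row per file; otherwise one row per line
        values = [content] if wholetext else content.splitlines()
        rows.extend((value, str(path)) for value in values)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files: one with two lines, one with a single line
    Path(tmp, "a.txt").write_text("first line\nsecond line\n")
    Path(tmp, "b.txt").write_text("only line\n")

    per_line = read_text(tmp)                   # default: three rows in total
    per_file = read_text(tmp, wholetext=True)   # one row per file: two rows

print(len(per_line), len(per_file))
```

In the per-file result, each row pairs the complete file contents with the path it came from, which mirrors the two requirements in the question.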
Why the other options are incorrect:
B or D (lineSep) -- The lineSep option only defines the delimiter between lines. It does not combine the entire file content into a single record.
C (wholetext=False) -- This is the default behavior, which still reads one record per line rather than per file.
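The point about lineSep can be sketched the same way. The plain-Python model below is an assumption-laden illustration, not Spark code: split_records and line_sep are hypothetical names. It suggests that changing the separator only moves the record boundaries inside a file; a file that contains the separator still yields more than one record.

```python
def split_records(content, line_sep="\n"):
    """Hypothetical model: split one file's content into records on line_sep."""
    records = content.split(line_sep)
    # Drop the empty trailing piece left behind by a final separator
    if records and records[-1] == "":
        records.pop()
    return records


content = "alpha\nbeta;gamma\n"

default_records = split_records(content)       # split on the newline
custom_records = split_records(content, ";")   # split on a custom separator

# Both calls still produce two records for this single file
print(default_records)
print(custom_records)
```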
Reference (Databricks Apache Spark 3.5, Python / Study Guide):
PySpark API Reference: DataFrameReader.text: describes the wholetext option.
PySpark Functions: input_file_name(): adds a column with the source file path.
Databricks Certified Associate Developer for Apache Spark Exam Guide (June 2025): Section "Using Spark DataFrame APIs": covers reading files and handling DataFrames.