35 of 55.
A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.
How can this be achieved?
In Structured Streaming, checkpoints store state information (offsets, progress, and metadata) needed to resume a stream after a failure or restart.
Correct usage:
Set the checkpointLocation option when writing the streaming output:
(streaming_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/checkpoint/dir")
    .start("/path/to/output"))
Spark uses this checkpoint directory to track committed offsets and query state, so a restarted query automatically resumes from where it left off and maintains exactly-once semantics (given a replayable source and a supported sink such as Delta).
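The recovery mechanism can be illustrated with a toy sketch in plain Python (not Spark itself): a consumer persists the last committed offset to a checkpoint file, and a restarted consumer reads that file and skips everything already processed. The CheckpointedConsumer class and the offsets.json layout are hypothetical illustrations of the idea, not Spark APIs.

```python
import json
import os
import tempfile

class CheckpointedConsumer:
    """Toy illustration (not Spark): persist the last committed
    offset so a restarted consumer resumes where it left off."""

    def __init__(self, checkpoint_dir):
        # Hypothetical checkpoint file; Spark uses its own directory layout.
        self.path = os.path.join(checkpoint_dir, "offsets.json")

    def last_offset(self):
        # Read the committed offset; -1 means "start from the beginning".
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)["offset"]
        return -1

    def process(self, records):
        # Process only records past the checkpointed offset, committing
        # after each one (Spark commits per micro-batch, not per record).
        out = []
        for offset, value in records:
            if offset <= self.last_offset():
                continue  # already processed before the restart
            out.append(value)
            with open(self.path, "w") as f:
                json.dump({"offset": offset}, f)
        return out

# Simulate a run, a shutdown, and a restart over the same source.
source = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
ckpt = tempfile.mkdtemp()

first = CheckpointedConsumer(ckpt).process(source[:2])   # run, then "crash"
second = CheckpointedConsumer(ckpt).process(source)      # restart, full replay
print(first, second)  # ['a', 'b'] ['c', 'd'] -- no duplicates, no gaps
```

Because the offset is committed durably, replaying the full source after the restart produces each record exactly once, which is the same guarantee the checkpointLocation option gives a Structured Streaming query.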
Why the other options are incorrect:
A/D: recoveryLocation is not a valid Spark configuration option.
B: Checkpointing must be configured in writeStream, not during readStream.
PySpark Structured Streaming Programming Guide: checkpointing and recovery.
Databricks Exam Guide (June 2025), Section "Structured Streaming": explains checkpointing and fault-tolerant streaming recovery.