41 of 55. A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.
The DataFrame has columns:
id | Name    | count | timestamp
---|---------|-------|----------
 1 | USA     | 10    |
 2 | India   | 20    |
 3 | England | 50    |
 4 | India   | 50    |
 5 | France  | 20    |
 6 | India   | 10    |
 7 | USA     | 30    |
 8 | USA     | 40    |
Which code fragment should the engineer use to sort the data in the Name and count columns?
To sort a Spark DataFrame by multiple columns, use .orderBy() (or .sort()) with column expressions.
Correct syntax for descending and ascending mix:
from pyspark.sql.functions import col
df1.orderBy(col('count').desc(), col('Name').asc())
This sorts primarily by count in descending order and secondarily by Name in ascending order (alphabetically).
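The tie-breaking behavior can be illustrated with a pure-Python sketch over the sample rows from the question (no Spark session needed; negating the count mimics the descending direction):

```python
# Sample rows from the question: (id, Name, count)
rows = [
    (1, "USA", 10), (2, "India", 20), (3, "England", 50), (4, "India", 50),
    (5, "France", 20), (6, "India", 10), (7, "USA", 30), (8, "USA", 40),
]

# Sort by count descending (negated), then Name ascending for ties
ordered = sorted(rows, key=lambda r: (-r[2], r[1]))
print([(name, count) for _, name, count in ordered])
```

Rows with equal counts (England/India at 50, France/India at 20) come out alphabetically, matching what `orderBy(col('count').desc(), col('Name').asc())` produces.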
Why the other options are incorrect:
B/C: Default sort order is ascending; won't place highest counts first.
D: Reverses the sorting logic, sorting Name in descending order, which is not what is required.
PySpark DataFrame API --- orderBy() and col() for sorting with direction.
Databricks Exam Guide (June 2025): Section ''Using Spark DataFrame APIs'' --- sorting, ordering, and column expressions.
===========
A Data Analyst needs to retrieve employees with 5 or more years of tenure.
Which code snippet filters and shows the list?
To filter rows based on a condition and display them in Spark, use filter(...).show():
employees_df.filter(employees_df.tenure >= 5).show()
Option A is correct and shows the results.
Option B filters but doesn't display them.
Option C uses Python's built-in filter, not Spark.
Option D collects the results to the driver, which is unnecessary if .show() is sufficient.
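The predicate itself can be sketched in pure Python over hypothetical sample records (Spark's `filter` applies the same condition, but distributed across partitions):

```python
# Hypothetical sample data standing in for employees_df
employees = [
    {"name": "Ana", "tenure": 7},
    {"name": "Ben", "tenure": 3},
    {"name": "Chloe", "tenure": 5},
]

# Same condition as employees_df.filter(employees_df.tenure >= 5)
long_tenured = [e for e in employees if e["tenure"] >= 5]
print([e["name"] for e in long_tenured])
```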
Final Answer: A
An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.
The initial code is:

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')  # loaded on every UDF call
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is loaded?
The provided code defines a Pandas UDF of type Series-to-Series, where a new instance of the language model is created on each call, which happens per batch. This is inefficient and results in significant overhead due to repeated model initialization.
To reduce the frequency of model loading, the engineer should convert the UDF to an iterator-based Pandas UDF (Iterator[pd.Series] -> Iterator[pd.Series]). This allows the model to be loaded once per executor and reused across multiple batches, rather than once per call.
From the official Databricks documentation:
''Iterator of Series to Iterator of Series UDFs are useful when the UDF initialization is expensive... For example, loading a ML model once per executor rather than once per row/batch.''
--- Databricks Official Docs: Pandas UDFs
Correct implementation looks like:
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def translate_udf(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the translation model once, then reuse it for every batch in the iterator
    model = get_translation_model(target_lang='es')
    for batch in batch_iter:
        yield batch.apply(model)
This refactor ensures that get_translation_model() is invoked once per executor process rather than once per batch, significantly improving pipeline performance.
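The load-once-per-iterator pattern can be demonstrated without Spark at all; in this plain-Python sketch, get_translation_model is a hypothetical stand-in that counts how often it is called:

```python
LOAD_COUNT = 0

def get_translation_model(target_lang):
    # Hypothetical stand-in for the real model loader; counts its invocations
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda text: f"{text}-{target_lang}"

def translate_batches(batch_iter):
    # The iterator pattern: load the model once, before consuming any batch
    model = get_translation_model("es")
    for batch in batch_iter:
        yield [model(text) for text in batch]

batches = [["hello", "world"], ["goodbye"]]
results = list(translate_batches(iter(batches)))
print(LOAD_COUNT)  # 1: the model was loaded once for all batches
```

A Series-to-Series UDF corresponds to calling the loader inside the loop body, which is why it pays the initialization cost for every batch.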
44 of 55. A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming. They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.
Which code snippet fulfills this requirement?
A.
query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()
B.
query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()
C.
query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()
D.
query = df.writeStream \
    .outputMode("append") \
    .start()
To process data in fixed micro-batch intervals, use the .trigger(processingTime='interval') option in Structured Streaming.
Correct usage:
query = df.writeStream \
    .outputMode('append') \
    .trigger(processingTime='5 seconds') \
    .start()
This instructs Spark to process available data every 5 seconds.
Why the other options are incorrect:
B: continuous triggers are for continuous processing mode (different execution model).
C: once=True runs the stream a single time (batch mode).
D: Default trigger runs as fast as possible, not fixed intervals.
PySpark Structured Streaming Guide --- Trigger types: processingTime, once, continuous.
Databricks Exam Guide (June 2025): Section ''Structured Streaming'' --- controlling streaming triggers and batch intervals.
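Conceptually, a processing-time trigger behaves like the following toy loop (a plain-Python sketch, not Spark internals; `next_batch` and `process` are hypothetical stand-ins for the streaming source and sink):

```python
import time

def microbatch_loop(next_batch, process, interval_s, max_batches):
    """Toy fixed-interval micro-batch driver, illustrating trigger(processingTime=...)."""
    for _ in range(max_batches):
        started = time.monotonic()
        process(next_batch())
        # Sleep out the remainder of the interval so batches start
        # roughly every interval_s seconds
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_s - elapsed))

# Usage sketch with stand-in source and sink
seen = []
microbatch_loop(next_batch=lambda: [1, 2, 3],
                process=seen.extend,
                interval_s=0.01,   # the question's pipeline would use 5 seconds
                max_batches=3)
print(len(seen))
```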
===========
What is the risk associated with this operation when converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?
When you convert a large pyspark.pandas (Pandas API on Spark) DataFrame to a local Pandas DataFrame using .to_pandas(), Spark collects all partitions to the driver.
From the Spark documentation:
''Be careful when converting large datasets to Pandas. The entire dataset will be pulled into the driver's memory.''
Thus, for large datasets, this can cause memory overflow or out-of-memory errors on the driver.
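The risk can be illustrated without a cluster; in this pure-Python sketch the partition lists are hypothetical stand-ins for Spark partitions held by different executors:

```python
from itertools import chain, islice

# Hypothetical stand-in: three "partitions" of rows held by different executors
partitions = [list(range(100_000)) for _ in range(3)]

# A to_pandas()-style collect concatenates every partition in one process,
# so all 300,000 rows must fit in the driver's memory at once
collected = list(chain.from_iterable(partitions))

# A bounded alternative, analogous to df.limit(n).to_pandas(): pull only n rows
sample = list(islice(chain.from_iterable(partitions), 1_000))
print(len(collected), len(sample))
```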
Final Answer: D