Welcome to Pass4Success


Databricks Certified Associate Developer for Apache Spark 3.5 Exam - Topic 7 Question 8 Discussion

Actual exam question from the Databricks Certified Associate Developer for Apache Spark 3.5 exam
Question #: 8
Topic #: 7

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Suggested Answer: D

The provided code defines a Series-to-Series Pandas UDF, so a new instance of the language model is created on every call to the function, which happens once per batch. This repeated model initialization adds significant overhead to the pipeline.

To load the model less often, the engineer should convert the UDF to an iterator-based Pandas UDF (Iterator[pd.Series] -> Iterator[pd.Series]). With this UDF type, the function is invoked once per task and receives an iterator over that task's batches, so the model is loaded once and then reused for every batch the task processes, rather than being reloaded on each call.

From the official Databricks documentation:

''Iterator of Series to Iterator of Series UDFs are useful when the UDF initialization is expensive... For example, loading a ML model once per executor rather than once per row/batch.''

--- Databricks Official Docs: Pandas UDFs

A correct implementation looks like:

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def translate_udf(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per task, then reuse it for every batch.
    model = get_translation_model(target_lang='es')
    for batch in batch_iter:
        yield batch.apply(model)

This refactor ensures that get_translation_model() is invoked once per task rather than once per batch, significantly improving pipeline performance.
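The benefit of the iterator pattern can be demonstrated without Spark at all: the expensive initialization sits before the loop, so it runs once per iterator, while the per-batch work runs once per batch. The sketch below uses a hypothetical stand-in for get_translation_model() that counts how many times it is invoked.

```python
# Minimal sketch (no Spark required) of the once-per-iterator pattern.
# get_translation_model here is a hypothetical stand-in that counts loads.
from typing import Iterator

import pandas as pd

load_count = 0

def get_translation_model(target_lang: str):
    """Pretend this is an expensive model load."""
    global load_count
    load_count += 1
    return lambda s: f"[{target_lang}] {s}"

def translate(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang="es")  # runs once per iterator
    for batch in batch_iter:                         # runs once per batch
        yield batch.apply(model)

batches = [pd.Series(["hello", "world"]), pd.Series(["goodbye"])]
results = list(translate(iter(batches)))
print(load_count)  # 1 -- one load for two batches (three rows)
```

A Series-to-Series UDF corresponds to calling the load inside a function that Spark invokes once per batch; the iterator form hoists it out of that per-batch path.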


Contribute your Thoughts:

Nakita
9 hours ago
Wait, why load the model every time? That’s wild!
upvoted 0 times
...
Annice
6 days ago
D) seems like overkill for this problem.
upvoted 0 times
...
Jillian
11 days ago
C) sounds interesting, but will it really help?
upvoted 0 times
...
Mariann
16 days ago
I think B) is the way to go!
upvoted 0 times
...
Bonita
21 days ago
A) Convert to a PySpark UDF for better performance.
upvoted 0 times
...
Dudley
26 days ago
Ah, the joys of optimizing data pipelines! I hope the MLOps engineer has a good sense of humor.
upvoted 0 times
...
Abraham
1 month ago
D) looks interesting, but I'm not sure if it's the most straightforward solution here.
upvoted 0 times
...
Tarra
1 month ago
Hmm, I wonder if the model can be cached somehow to improve performance even further.
upvoted 0 times
...
Rikki
1 month ago
I’m a bit confused about the difference between Series-Scalar and Iterator UDFs. I think one of them might help with the model loading issue, but I can't recall which.
upvoted 0 times
...
Alpha
2 months ago
I practiced a similar question where we had to optimize UDFs, and I feel like using mapInPandas might be the right approach here.
upvoted 0 times
...
Gilma
2 months ago
The key here is to find a way to load the model only once, rather than on every call. I think options B or D are the most promising approaches to achieve that.
upvoted 0 times
...
Cyndy
2 months ago
Option C, running the function in a mapInPandas() call, could also be a good solution. That way the model is only loaded once per partition, which could improve performance.
upvoted 0 times
...
Edward
2 months ago
B) seems like the best option to reduce the number of times the language model is loaded.
upvoted 0 times
...
Gertude
2 months ago
I think converting the Pandas UDF to a PySpark UDF could help, but I need to double-check if that actually reduces model loading times.
upvoted 0 times
...
Derick
2 months ago
I remember we discussed how loading models repeatedly can slow down performance, but I'm not sure which option would best address that.
upvoted 0 times
...
Rolf
3 months ago
I'm leaning towards option D - converting to an Iterator[Series]-Iterator[Series] UDF. That might allow for even more optimization by loading the model once per batch of data.
upvoted 0 times
...
Magda
3 months ago
I think option A is the best. PySpark UDFs load models once, right?
upvoted 0 times
...
Effie
3 months ago
I'm a bit confused by the different UDF types mentioned. I'll need to review the differences between Series-Series, Series-Scalar, and Iterator[Series]-Iterator[Series] UDFs to decide the best approach here.
upvoted 0 times
...
Maryln
3 months ago
This looks like a performance optimization question. I'd start by considering option B - converting the Pandas UDF to a Series-Scalar UDF. That way, the model can be loaded once and reused for each row.
upvoted 0 times
...
