
Databricks Certified Associate Developer for Apache Spark 3.5 Exam - Topic 7 Question 8 Discussion

Actual exam question for Databricks's Databricks Certified Associate Developer for Apache Spark 3.5 exam
Question #: 8
Topic #: 7
[All Databricks Certified Associate Developer for Apache Spark 3.5 Questions]

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

def in_spanish_inner(df: pd.Series) -> pd.Series:
    # The model is re-loaded on every call to the UDF, i.e. once per batch.
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Suggested Answer: D

The provided code defines a Series-to-Series Pandas UDF, so a new instance of the language model is created on every call to the UDF, i.e. once per batch of data. This is inefficient and incurs significant overhead from repeated model initialization.

To reduce the frequency of model loading, the engineer should convert the UDF to an iterator-based Pandas UDF (Iterator[pd.Series] -> Iterator[pd.Series]). This allows the model to be loaded once per task (i.e., once per partition being processed) and reused across all of that partition's batches, rather than once per batch.

From the official Databricks documentation:

"Iterator of Series to Iterator of Series UDFs are useful when the UDF initialization is expensive... For example, loading a ML model once per executor rather than once per row/batch."

--- Databricks Official Docs: Pandas UDFs

Correct implementation looks like:

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('string')
def translate_udf(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Loaded once per task, then reused for every batch in the iterator.
    model = get_translation_model(target_lang='es')
    for batch in batch_iter:
        yield batch.apply(model)

This refactor ensures get_translation_model() is invoked once per task, not once per batch, significantly improving pipeline performance.
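To see concretely why the iterator form loads the model only once, here is a minimal, Spark-free sketch of the same pattern. The fake_translation_model function and the LOAD_COUNT counter are illustrative stand-ins (not part of the original pipeline) for the expensive get_translation_model call:

```python
from typing import Callable, Iterator

import pandas as pd

LOAD_COUNT = 0  # counts how many times the "model" is initialized

def fake_translation_model() -> Callable[[str], str]:
    """Stand-in for get_translation_model(target_lang='es'); pretend it is expensive."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return lambda s: "es:" + s

def translate_batches(batch_iter: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Same shape as the iterator-based Pandas UDF: load once, reuse per batch.
    model = fake_translation_model()
    for batch in batch_iter:
        yield batch.apply(model)

# Three batches flow through, but the model is loaded exactly once.
batches = [pd.Series(["hello", "world"]), pd.Series(["good"]), pd.Series(["morning"])]
results = list(translate_batches(iter(batches)))
```

With the naive Series-to-Series version, the load would have happened three times (once per batch). On a real cluster the same once-per-iterator behavior applies within each Spark task, and the UDF would be applied as usual, e.g. df.withColumn('spanish', translate_udf('english')).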


Contribute your Thoughts:

Henriette
6 days ago
Overall, I feel like reducing model loads is key. Any option that does that is worth considering!
upvoted 0 times
...
Francesco
12 days ago
D sounds interesting too. Iterators could help manage memory better.
upvoted 0 times
...
Davida
17 days ago
I’m leaning towards B. Scalar UDFs seem more efficient for single values.
upvoted 0 times
...
Frederica
22 days ago
Yeah, but option C could also work. It might optimize the pipeline better.
upvoted 0 times
...
Rosalyn
27 days ago
This question is tricky! I feel like it tests our understanding of UDFs.
upvoted 0 times
...
Nakita
2 months ago
Wait, why load the model every time? That’s wild!
upvoted 0 times
...
Annice
2 months ago
D) seems like overkill for this problem.
upvoted 0 times
...
Jillian
2 months ago
C) sounds interesting, but will it really help?
upvoted 0 times
...
Mariann
2 months ago
I think B) is the way to go!
upvoted 0 times
...
Bonita
2 months ago
A) Convert to a PySpark UDF for better performance.
upvoted 0 times
...
Dudley
2 months ago
Ah, the joys of optimizing data pipelines! I hope the MLOps engineer has a good sense of humor.
upvoted 0 times
...
Abraham
3 months ago
D) looks interesting, but I'm not sure if it's the most straightforward solution here.
upvoted 0 times
...
Tarra
3 months ago
Hmm, I wonder if the model can be cached somehow to improve performance even further.
upvoted 0 times
...
Rikki
3 months ago
I’m a bit confused about the difference between Series-Scalar and Iterator UDFs. I think one of them might help with the model loading issue, but I can't recall which.
upvoted 0 times
...
Alpha
3 months ago
I practiced a similar question where we had to optimize UDFs, and I feel like using mapInPandas might be the right approach here.
upvoted 0 times
...
Gilma
3 months ago
The key here is to find a way to load the model only once, rather than on every call. I think options B or D are the most promising approaches to achieve that.
upvoted 0 times
...
Cyndy
3 months ago
Option C, running the function in a mapInPandas() call, could also be a good solution. That way the model is only loaded once per partition, which could improve performance.
upvoted 0 times
...
Edward
4 months ago
B) seems like the best option to reduce the number of times the language model is loaded.
upvoted 0 times
...
Gertude
4 months ago
I think converting the Pandas UDF to a PySpark UDF could help, but I need to double-check if that actually reduces model loading times.
upvoted 0 times
...
Derick
4 months ago
I remember we discussed how loading models repeatedly can slow down performance, but I'm not sure which option would best address that.
upvoted 0 times
...
Rolf
4 months ago
I'm leaning towards option D - converting to an Iterator[Series]-Iterator[Series] UDF. That might allow for even more optimization by loading the model once per batch of data.
upvoted 0 times
...
Magda
4 months ago
I think option A is the best. PySpark UDFs load models once, right?
upvoted 0 times
...
Effie
5 months ago
I'm a bit confused by the different UDF types mentioned. I'll need to review the differences between Series-Series, Series-Scalar, and Iterator[Series]-Iterator[Series] UDFs to decide the best approach here.
upvoted 0 times
...
Maryln
5 months ago
This looks like a performance optimization question. I'd start by considering option B - converting the Pandas UDF to a Series-Scalar UDF. That way, the model can be loaded once and reused for each row.
upvoted 0 times
Catherin
1 day ago
I agree, option B seems like the best choice. Reusing the model will save time.
upvoted 0 times
...
...
