
Databricks Machine Learning Associate Exam: Topic 3, Question 22 Discussion

Actual exam question for Databricks's Databricks Machine Learning Associate exam
Question #: 22
Topic #: 3

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Suggested Answer: E

The least efficient of these tasks to distribute is imputing missing feature values with the true median. Finding an exact median requires a global ranking of the column's values, so the data has to be shuffled and sorted (or scanned repeatedly) across the cluster before a single statistic can be produced. The other tasks parallelize cleanly: mean imputation is a one-pass sum-and-count aggregation, one-hot encoding and binary missing-value indicators are simple per-row transformations, and target encoding reduces to a groupBy aggregation followed by a join.

This is also why Spark ships an approximate quantile method (DataFrame.approxQuantile, based on the Greenwald-Khanna algorithm) rather than relying on an exact distributed median.

PySpark DataFrame.approxQuantile documentation
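
For anyone who wants to see the contrast in code, here is a minimal PySpark sketch (the toy DataFrame and its age column are invented for illustration; approxQuantile is the real DataFrame method):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with a numeric column that contains missing values.
df = spark.createDataFrame(
    [(25.0,), (None,), (40.0,), (31.0,), (None,), (58.0,)], ["age"]
)

# Mean imputation: a single-pass sum/count aggregation that parallelizes cleanly.
mean_age = df.select(F.mean("age")).first()[0]
df_mean_imputed = df.fillna({"age": mean_age})

# True median: an exact answer needs a global ordering of the column, i.e. a
# cluster-wide shuffle/sort. Spark's practical workaround is an approximate
# quantile sketch (passing 0.0 as the relative error forces the exact, costly path).
approx_median = df.approxQuantile("age", [0.5], 0.01)[0]
df_median_imputed = df.fillna({"age": approx_median})

On a large table, the approxQuantile call sidesteps the full sort that an exact median would require, which is exactly the cost the question is pointing at.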

Contribute your Thoughts:

Daryl
2 months ago
Distributing feature engineering? Sounds like a job for the Avengers! Maybe we can get Thor to throw the one-hot encoders around the cluster for us. Or perhaps Hulk can just smash all the missing values into submission. Either way, this is gonna be an Endgame-level challenge.
upvoted 0 times
...
Tula
2 months ago
Ooh, true median imputation? Now we're talking! That's gotta be the most distributed-friendly option. Imagine all those servers just crunching away, finding the perfect median for each feature. It's like a mathematical orchestra!
upvoted 0 times
Lera
27 days ago
Yeah, it's like each server can independently calculate the median for different features efficiently.
upvoted 0 times
...
Cruz
1 month ago
I agree, it seems like a task that can easily be parallelized across multiple servers.
upvoted 0 times
...
Vernell
1 month ago
True median imputation is definitely the way to go for distributing feature engineering tasks.
upvoted 0 times
...
...
Dominga
2 months ago
Imputing missing values with the mean? That's a classic move, but I bet it's not the most efficient to distribute. Imagine all those little means flying around the cluster, colliding and causing mayhem. Nah, I'll go with the binary indicator features. Keeps things simple, you know?
upvoted 0 times
Honey
1 month ago
Creating binary indicator features does sound like a straightforward task to distribute.
upvoted 0 times
...
Julio
1 month ago
I think target encoding might be a bit tricky to distribute efficiently.
upvoted 0 times
...
Cyndy
2 months ago
I agree, one-hot encoding can be easily distributed across multiple nodes.
upvoted 0 times
...
...
Eveline
3 months ago
Target encoding? Really? That's going to be a nightmare to distribute. Imagine trying to coordinate all those little target values across the cluster. I'd rather just impute the missing values with the true median and call it a day.
upvoted 0 times
Ezekiel
1 month ago
Definitely, it's a more efficient way to handle missing values in a distributed pipeline.
upvoted 0 times
...
Leslie
1 month ago
I think imputing missing values with the true median is a simpler option.
upvoted 0 times
...
Evangelina
1 month ago
Yeah, it would be a nightmare to coordinate all those target values.
upvoted 0 times
...
Octavio
2 months ago
I agree, target encoding seems like a headache to distribute.
upvoted 0 times
...
...
Michell
3 months ago
I disagree. I believe creating binary indicator features for missing values would be the least efficient task to distribute because it involves checking for missing values in each feature separately.
upvoted 0 times
...
Wai
3 months ago
Hmm, one-hot encoding seems like the obvious choice here. I mean, how hard can it be to distribute that process? It's not like we're training a neural network or anything. Just slap it on a few more servers and voila!
upvoted 0 times
Tresa
2 months ago
Imputing missing feature values with the true median
upvoted 0 times
...
Kris
2 months ago
Imputing missing feature values with the mean
upvoted 0 times
...
Sena
2 months ago
Target encoding categorical features
upvoted 0 times
...
Theresia
2 months ago
One-hot encoding categorical features
upvoted 0 times
...
...
Dortha
4 months ago
I agree with Lizbeth. Target encoding involves calculating statistics based on the target variable, which can be tricky to distribute efficiently.
upvoted 0 times
...
Lizbeth
4 months ago
I think target encoding categorical features will be the least efficient to distribute.
upvoted 0 times
...
