Welcome to Pass4Success


Databricks Machine Learning Associate Exam - Topic 3 Question 22 Discussion

Actual exam question from Databricks's Machine Learning Associate exam
Question #: 22
Topic #: 3

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Suggested Answer: E

The row-wise tasks among the usual answer choices parallelize trivially: one-hot encoding and creating binary indicator features for missing values transform each record independently, so every partition can be processed with no cross-partition communication. Imputing with the mean is also cheap to distribute, because a global mean can be combined from small per-partition (sum, count) summaries in a single aggregation pass. Target encoding depends on per-category statistics of the target variable; that is a groupBy-style aggregation, still distributable, though it introduces a dependency on the full target column.

Imputing with the true (exact) median is different in kind: per-partition medians cannot be combined into a global median, so an exact result requires a full sort or a distributed selection over the entire column. That global data movement is why exact-median imputation is generally the least efficient of these tasks to distribute, and why Spark offers DataFrame.approxQuantile as a cheaper single-pass approximate alternative.
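The efficiency gap between mean and exact-median imputation can be seen without Spark at all: a mean is recoverable from tiny per-partition summaries, while an exact median needs a view of the whole column. A minimal plain-Python sketch (the partition contents are made-up example data):

```python
# Illustrative sketch (plain Python, no Spark): why mean imputation
# distributes cheaply while exact-median imputation does not.
# The partition contents below are hypothetical example data.

partitions = [
    [1.0, 2.0, 3.0],
    [10.0, 20.0],
    [4.0, 5.0, 6.0, 7.0],
]

# Mean: each partition emits a small (sum, count) summary; the driver
# combines the summaries without ever seeing the raw rows.
summaries = [(sum(p), len(p)) for p in partitions]
total = sum(s for s, _ in summaries)
n = sum(c for _, c in summaries)
mean = total / n

# Exact median: per-partition medians are NOT combinable -- the true
# median requires a global view of the column (here, a full sort).
all_values = sorted(v for p in partitions for v in p)
m = len(all_values)
median = (all_values[m // 2] if m % 2
          else (all_values[m // 2 - 1] + all_values[m // 2]) / 2)

print(mean, median)
```

This is the intuition behind Spark's DataFrame.approxQuantile: rather than paying for a global sort, it computes an approximate quantile from compact per-partition summaries in a single pass.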

Contribute your Thoughts:

Ty
3 months ago
I doubt that creating binary indicators is inefficient to distribute.
upvoted 0 times
...
Colene
3 months ago
Wait, why would one-hot encoding be less efficient?
upvoted 0 times
...
Casey
3 months ago
Imputing with the mean is quick, but not always accurate.
upvoted 0 times
...
Mitzie
4 months ago
I think target encoding is harder to distribute effectively.
upvoted 0 times
...
Georgene
4 months ago
One-hot encoding is pretty straightforward.
upvoted 0 times
...
Valentin
4 months ago
I practiced a similar question, and I think creating binary indicators for missing values is pretty efficient, so maybe it's not the least efficient task.
upvoted 0 times
...
Lindsey
4 months ago
Imputing missing values with the mean seems straightforward, but I feel like using the median might be less efficient to distribute due to its calculation.
upvoted 0 times
...
Shawnda
4 months ago
I think target encoding might be tricky to distribute because it relies on the target variable, which could create dependencies.
upvoted 0 times
...
Ezekiel
5 months ago
I remember we discussed how one-hot encoding can be done in parallel, but I'm not sure if it's the least efficient.
upvoted 0 times
...
Roselle
5 months ago
Imputing with the mean or median seems like it would be the most efficient to distribute, since it's a simple calculation that can be done independently on each partition of the data.
upvoted 0 times
...
Raina
5 months ago
I'm pretty confident that creating binary indicator features for missing values would be the easiest to distribute. That's a pretty straightforward task that can be parallelized nicely.
upvoted 0 times
...
Lashanda
5 months ago
Okay, let's see. I think the target encoding might be the least efficient to distribute since it requires calculating statistics across the entire dataset.
upvoted 0 times
...
Francene
5 months ago
Hmm, I'm a bit unsure about this. I know one-hot encoding can get computationally expensive, but I'm not sure how the other options compare.
upvoted 0 times
...
Vanna
5 months ago
This seems like a tricky one. I'll need to think through the different feature engineering tasks and how they might scale.
upvoted 0 times
...
Daryl
9 months ago
Distributing feature engineering? Sounds like a job for the Avengers! Maybe we can get Thor to throw the one-hot encoders around the cluster for us. Or perhaps Hulk can just smash all the missing values into submission. Either way, this is gonna be an Endgame-level challenge.
upvoted 0 times
...
Tula
10 months ago
Ooh, true median imputation? Now we're talking! That's gotta be the most distributed-friendly option. Imagine all those servers just crunching away, finding the perfect median for each feature. It's like a mathematical orchestra!
upvoted 0 times
Lera
8 months ago
Yeah, it's like each server can independently calculate the median for different features efficiently.
upvoted 0 times
...
Cruz
8 months ago
I agree, it seems like a task that can easily be parallelized across multiple servers.
upvoted 0 times
...
Vernell
9 months ago
True median imputation is definitely the way to go for distributing feature engineering tasks.
upvoted 0 times
...
...
Dominga
10 months ago
Imputing missing values with the mean? That's a classic move, but I bet it's not the most efficient to distribute. Imagine all those little means flying around the cluster, colliding and causing mayhem. Nah, I'll go with the binary indicator features. Keeps things simple, you know?
upvoted 0 times
Honey
9 months ago
Creating binary indicator features does sound like a straightforward task to distribute.
upvoted 0 times
...
Julio
9 months ago
I think target encoding might be a bit tricky to distribute efficiently.
upvoted 0 times
...
Cyndy
9 months ago
I agree, one-hot encoding can be easily distributed across multiple nodes.
upvoted 0 times
...
...
Eveline
10 months ago
Target encoding? Really? That's going to be a nightmare to distribute. Imagine trying to coordinate all those little target values across the cluster. I'd rather just impute the missing values with the true median and call it a day.
upvoted 0 times
Ezekiel
8 months ago
Definitely, it's a more efficient way to handle missing values in a distributed pipeline.
upvoted 0 times
...
Leslie
8 months ago
I think imputing missing values with the true median is a simpler option.
upvoted 0 times
...
Evangelina
8 months ago
Yeah, it would be a nightmare to coordinate all those target values.
upvoted 0 times
...
Octavio
10 months ago
I agree, target encoding seems like a headache to distribute.
upvoted 0 times
...
...
Michell
10 months ago
I disagree. I believe creating binary indicator features for missing values would be the least efficient task to distribute because it involves checking for missing values in each feature separately.
upvoted 0 times
...
Wai
11 months ago
Hmm, one-hot encoding seems like the obvious choice here. I mean, how hard can it be to distribute that process? It's not like we're training a neural network or anything. Just slap it on a few more servers and voila!
upvoted 0 times
Tresa
9 months ago
Imputing missing feature values with the true median
upvoted 0 times
...
Kris
9 months ago
Imputing missing feature values with the mean
upvoted 0 times
...
Sena
9 months ago
Target encoding categorical features
upvoted 0 times
...
Theresia
10 months ago
One-hot encoding categorical features
upvoted 0 times
...
...
Dortha
11 months ago
I agree with Lizbeth. Target encoding involves calculating statistics based on the target variable, which can be tricky to distribute efficiently.
upvoted 0 times
...
Lizbeth
11 months ago
I think target encoding categorical features will be the least efficient to distribute.
upvoted 0 times
...
