
Databricks Exam Databricks Machine Learning Associate Topic 2 Question 20 Discussion

Actual exam question from the Databricks Machine Learning Associate exam
Question #: 20
Topic #: 2
[All Databricks Machine Learning Associate Questions]

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Suggested Answer: A

In Spark ML, a transformer is an algorithm that can transform one DataFrame into another DataFrame. It takes a DataFrame as input and produces a new DataFrame as output. This transformation can involve adding new columns, modifying existing ones, or applying feature transformations. Examples of transformers in Spark MLlib include feature transformers like StringIndexer, VectorAssembler, and StandardScaler.


Databricks documentation on transformers: Transformers in Spark ML
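As a rough illustration of what transformers like StringIndexer and OneHotEncoder do conceptually, the sketch below mimics the two steps in plain Python: map each category string to an integer index (most frequent category first, as StringIndexer does by default), then expand each index into a 0/1 vector. This is a simplified stand-in for intuition only, not the Spark ML API, and the function names are made up for this example.

```python
from collections import Counter

def string_indexer(values):
    """Assign an integer index to each distinct category,
    most frequent category first (mirrors StringIndexer's default)."""
    order = [cat for cat, _ in Counter(values).most_common()]
    index = {cat: i for i, cat in enumerate(order)}
    return [index[v] for v in values], index

def one_hot(indices, num_categories):
    """Expand each index into a 0/1 vector of length num_categories
    (mirrors what OneHotEncoder produces, minus the sparse storage)."""
    return [[1 if i == idx else 0 for i in range(num_categories)]
            for idx in indices]

colors = ["red", "blue", "red", "green", "red"]
indices, mapping = string_indexer(colors)
vectors = one_hot(indices, len(mapping))
# "red" is most frequent, so it maps to index 0 and vector [1, 0, 0]
```

Note that the width of the one-hot vectors depends on how many distinct categories the encoder has seen, which is one reason encoding is usually applied per training pipeline rather than baked into shared feature data.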

Contribute your Thoughts:

Alex
2 days ago
Totally agree, it can lead to high dimensionality issues!
upvoted 0 times
...
Asuncion
8 days ago
One-hot encoding can be tricky for some algorithms.
upvoted 0 times
...
Tamar
13 days ago
I’m a bit confused about option D; I thought one-hot encoding was actually a common method for handling categorical variables.
upvoted 0 times
...
Herman
19 days ago
I feel like we practiced a question similar to this, and I think one-hot encoding can be computationally heavy, which might relate to option C.
upvoted 0 times
...
Yuonne
24 days ago
I think option A makes sense because some algorithms, like decision trees, might not need one-hot encoding at all.
upvoted 0 times
...
Lelia
1 month ago
I remember discussing one-hot encoding in class, but I’m not sure if it’s always the best choice for every algorithm.
upvoted 0 times
...
Jesusita
1 month ago
I'm a little confused by this question. The options don't seem to clearly explain why we shouldn't one-hot encode the categorical variables. I'll have to review my notes on feature engineering to make sure I understand the tradeoffs here.
upvoted 0 times
...
Flo
1 month ago
Okay, let me see. I think the key here is that the data scientist is suggesting we shouldn't one-hot encode the categorical variables in the feature repository. That makes me think option A is the best explanation - one-hot encoding can be problematic for some algorithms.
upvoted 0 times
...
Leeann
1 month ago
Hmm, I'm a bit unsure here. I know one-hot encoding is a common way to handle categorical variables, but the question is suggesting we shouldn't do it. I'll have to think this through carefully.
upvoted 0 times
...
Alida
1 month ago
I'm pretty confident about this one. I think the answer is A - one-hot encoding can be problematic for some machine learning algorithms.
upvoted 0 times
...
Emily
6 months ago
B) Yep, that makes the most sense. The target variable can change, so one-hot encoding shouldn't be in the feature repo.
upvoted 0 times
...
Felice
6 months ago
Ha! 'Not a common strategy', that's a funny way to put it. I wonder what the 'uncommon' strategies are.
upvoted 0 times
Galen
4 months ago
Yeah, it's not supported by most machine learning libraries either.
upvoted 0 times
...
Layla
4 months ago
It's computationally intensive and should only be used on small samples of training sets.
upvoted 0 times
...
Tony
5 months ago
I agree, it can be problematic for some machine learning algorithms.
upvoted 0 times
...
Hermila
5 months ago
One-hot encoding is not a common strategy for representing categorical variables numerically.
upvoted 0 times
...
...
India
6 months ago
C) Computationally intensive, huh? I guess one-hot encoding can be a bit heavy for the training set, so it's better to leave it for individual problems.
upvoted 0 times
...
Jettie
6 months ago
E) Ah, I see. Some machine learning algorithms may not play well with one-hot encoding, so it's a good idea to avoid it in the feature repository.
upvoted 0 times
Stephanie
5 months ago
C) One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
upvoted 0 times
...
Maurine
5 months ago
E) That's a good point. It's important to consider how the target variable values can impact the effectiveness of one-hot encoding.
upvoted 0 times
...
Gaston
5 months ago
B) One-hot encoding is dependent on the target variable's values which differ for each application.
upvoted 0 times
...
...
Glenna
7 months ago
B) Hmm, that makes sense. The target variable can vary across different applications, so one-hot encoding shouldn't be done at the feature repository level.
upvoted 0 times
Chaya
5 months ago
E) One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.
upvoted 0 times
...
Mitsue
5 months ago
C) One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.
upvoted 0 times
...
Fernanda
5 months ago
A) One-hot encoding is dependent on the target variable's values which differ for each application.
upvoted 0 times
...
...
Joaquin
7 months ago
I disagree. One-hot encoding is necessary for certain algorithms.
upvoted 0 times
...
Shawna
7 months ago
I agree with Arlette. It can be computationally intensive.
upvoted 0 times
...
Arlette
7 months ago
I think one-hot encoding is not always the best approach.
upvoted 0 times
...
