Databricks Machine Learning Associate Exam - Topic 2 Question 36 Discussion

Actual exam question from the Databricks Machine Learning Associate exam
Question #: 36
Topic #: 2

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

Suggested Answer: B

In Spark ML, LinearRegression expects the features column (featuresCol) to be a vector type. If the features column in train_df is not already a vector (for example, if it is an array of doubles, or the features are spread across separate numeric columns), the engineer must convert it before fitting. An array column can be converted with array_to_vector from pyspark.ml.functions (Spark 3.1+), and separate numeric columns can be combined with a transformer such as VectorAssembler. This step is required because Spark ML estimators take all input features as a single vector column.
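The conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the schema and code block from the question are not reproduced on this page, so the column names "features" and "price", the array<double> type, and the helper names is_array_column and fit_price_model are all hypothetical.

```python
def is_array_column(dtype: str) -> bool:
    """Return True for Spark SQL array type strings such as 'array<double>'."""
    return dtype.startswith("array<")


def fit_price_model(train_df, features_col="features", label_col="price"):
    """Fit a LinearRegression, converting an array features column to a vector first.

    Column names are assumptions; adjust them to the real schema of train_df.
    """
    # Spark imports are local so the helper above can be used without Spark installed.
    from pyspark.ml.functions import array_to_vector  # available in Spark 3.1+
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql.functions import col

    # DataFrame.dtypes is a list of (column name, simple type string) pairs.
    dtypes = dict(train_df.dtypes)
    if is_array_column(dtypes[features_col]):
        # Replace the array column with an ML vector column of the same name.
        train_df = train_df.withColumn(features_col, array_to_vector(col(features_col)))

    lr = LinearRegression(featuresCol=features_col, labelCol=label_col)
    return lr.fit(train_df)
```

Calling fit_price_model(train_df) returns a fitted LinearRegressionModel. If the features were instead stored as separate numeric columns, a VectorAssembler stage would take the place of array_to_vector.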

Reference

Spark MLlib documentation for LinearRegression: https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression


Contribute your Thoughts:

Iola
9 hours ago
A Pipeline would definitely help streamline the process!
upvoted 0 times
...
Margurite
6 days ago
Wait, do they really need to split the features column? Seems unnecessary.
upvoted 0 times
...
Cherry
11 days ago
Totally agree, option B is the way to go!
upvoted 0 times
...
Pearlene
16 days ago
Option C seems too easy. There's gotta be a catch somewhere. I'd double-check that code if I were the engineer.
upvoted 0 times
...
Sharmaine
21 days ago
Haha, I bet the engineer is wishing they had a crystal ball to see the right answer. Good thing we have these practice exams to prepare!
upvoted 0 times
...
Jose
26 days ago
I think Option D is the way to go. Using a Pipeline makes the whole process a lot cleaner and more organized.
upvoted 0 times
...
Elenore
1 month ago
I think splitting the features into separate columns could be an option, but it seems like overkill for a linear regression model.
upvoted 0 times
...
Lenna
1 month ago
I feel like we had a practice question about transforming DataFrames, but I can't recall if it was specifically about calling the transform method.
upvoted 0 times
...
Lizette
1 month ago
I'm not entirely sure, but I think using a Pipeline might be necessary for organizing the steps in the model training process.
upvoted 0 times
...
Whitney
2 months ago
I remember we discussed the importance of converting the features column into a vector format for Spark ML. That seems crucial here.
upvoted 0 times
...
Brent
2 months ago
Splitting the features column into individual columns seems like overkill for a linear regression model. I think I'll stick with option B and make sure the features column is a vector. That should do the trick.
upvoted 0 times
...
Ryan
2 months ago
I'm a little unsure about the Pipeline part. Should I be using a Pipeline to fit the model? Option D is making me second-guess myself. I'll need to research that a bit more.
upvoted 0 times
...
Magda
2 months ago
Option B looks good to me. Gotta get those features into a vector format for the model to work properly.
upvoted 0 times
...
Gertude
2 months ago
Okay, let me think this through. The code block looks good, so I don't think I need to make any changes. I'll go with option C and see if that works.
upvoted 0 times
...
Lemuel
2 months ago
They need to convert the features column to be a vector.
upvoted 0 times
...
Lauran
3 months ago
I think they need to convert the features column to be a vector.
upvoted 0 times
...
Royal
3 months ago
I'm a bit confused here. The code block shows that the features column is already a vector, so I'm not sure if I need to do anything else. Maybe I should double-check the documentation to be sure.
upvoted 0 times
...
Kimbery
3 months ago
Hmm, this looks like a pretty straightforward linear regression problem. I'd start by making sure the features column is a vector, so I'd go with option B.
upvoted 0 times
Cristal
3 months ago
But what about using a Pipeline? Option D could be useful too.
upvoted 0 times
...
...
