Welcome to Pass4Success


Databricks Machine Learning Associate Exam - Topic 1 Question 8 Discussion

Actual exam question from Databricks's Machine Learning Associate exam
Question #: 8
Topic #: 1
[All Databricks Machine Learning Associate Questions]

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0.

Which of the following code blocks will accomplish this task?

Suggested Answer: C

To filter rows in a Spark DataFrame based on a condition, use the filter method. Here the condition is that the value in the 'discount' column is less than or equal to 0. The correct syntax passes the filter method a Column expression built with the col function from pyspark.sql.functions.

Correct code:

from pyspark.sql.functions import col

filtered_df = spark_df.filter(col('discount') <= 0)

Options A and D use Pandas syntax, which is not applicable to a PySpark DataFrame. Option B is closer but misses the use of the col function.


PySpark SQL Documentation

Contribute your Thoughts:

In
3 months ago
D has a syntax error, not sure how that got in there.
upvoted 0 times
...
Christoper
3 months ago
Definitely going with C, it’s the cleanest!
upvoted 0 times
...
Yaeko
3 months ago
Wait, is A even valid for Spark? Seems off.
upvoted 0 times
...
Tamar
4 months ago
I think B looks good too, but it's not Spark syntax.
upvoted 0 times
...
Garry
4 months ago
Option C is the correct way to filter in Spark.
upvoted 0 times
...
Dottie
4 months ago
I think option A is definitely wrong because loc is not used in Spark, but I’m not completely confident about the others.
upvoted 0 times
...
Carmen
4 months ago
I feel like option B might be a common way to filter data, but I can't recall if it works with Spark DataFrames.
upvoted 0 times
...
Aliza
4 months ago
I'm not entirely sure, but I remember something about using loc in Pandas. Is that applicable here, or is it just for Pandas DataFrames?
upvoted 0 times
...
Reena
5 months ago
I think option C looks familiar because we practiced using the filter method in Spark. It seems like the right approach for this task.
upvoted 0 times
...
Dick
5 months ago
This is a great question to test our understanding of Spark DataFrame operations. I think option C is the best choice here, as it directly uses the Spark-specific `filter()` method to apply the condition on the 'discount' column.
upvoted 0 times
...
Ming
5 months ago
I'm a little confused by the different options presented here. They all look like they could potentially work, but I'm not sure which one is the most efficient or Spark-idiomatic approach. I'll have to think this through carefully.
upvoted 0 times
...
Erick
5 months ago
Okay, I've got this! Option B is the way to go - it's the classic way to filter a DataFrame in Pandas, and Spark has a similar syntax, so I'm confident that will work here.
upvoted 0 times
...
Rashida
5 months ago
Hmm, I'm a bit unsure about this one. I know we need to filter the DataFrame, but I'm not totally familiar with the Spark syntax. I might have to double-check the documentation to make sure I'm using the right method.
upvoted 0 times
...
Thad
5 months ago
This looks like a straightforward Spark DataFrame filtering task. I think I'll go with option C - it seems the most direct way to filter the DataFrame based on the condition in the question.
upvoted 0 times
...
Chantell
5 months ago
Hmm, I'm not sure about this one. I'll have to think it through carefully, the options seem pretty similar.
upvoted 0 times
...
Miesha
2 years ago
I agree with Sharen, option B seems like the most straightforward solution.
upvoted 0 times
...
Annalee
2 years ago
Hmm, I'm torn between B and C. Maybe I'll flip a coin to decide. Or, you know, use a random number generator. Data scientists love those, right?
upvoted 0 times
Mitsue
2 years ago
Tegan: Good idea. Let's test them out and compare the results.
upvoted 0 times
...
Tegan
2 years ago
Alisha: Maybe we should both try out our options and see which one works better.
upvoted 0 times
...
Alisha
2 years ago
I'm leaning towards C) spark_df.filter(col('discount') <= 0) actually.
upvoted 0 times
...
Theodora
2 years ago
I think B) spark_df[spark_df['discount'] <= 0] is the correct option.
upvoted 0 times
...
...
Sharen
2 years ago
But with option B, we can directly filter the DataFrame based on the condition.
upvoted 0 times
...
Lashon
2 years ago
I disagree, I believe the correct answer is C.
upvoted 0 times
...
Karl
2 years ago
As a data scientist, I'd choose C. It's more readable and maintainable than the other options.
upvoted 0 times
Mayra
2 years ago
I agree with User1, B looks like the right choice.
upvoted 0 times
...
Lakeesha
2 years ago
I think B is the correct option.
upvoted 0 times
...
...
Whitley
2 years ago
I think B is the way to go. It's a simple and straightforward indexing operation on the DataFrame.
upvoted 0 times
Eleonora
2 years ago
Agreed, it's a simple and straightforward indexing operation on the DataFrame.
upvoted 0 times
...
Alida
2 years ago
I think B is the way to go.
upvoted 0 times
...
...
Dalene
2 years ago
Option C looks good to me. It's a direct Spark DataFrame filter operation on the 'discount' column.
upvoted 0 times
Cherry
2 years ago
I would go with Option B as well. It seems like a straightforward way to filter the DataFrame.
upvoted 0 times
...
Lorrie
2 years ago
I think Option B would work too. It filters the DataFrame based on the condition in the 'discount' column.
upvoted 0 times
...
Ammie
2 years ago
Option C looks good to me. It's a direct Spark DataFrame filter operation on the 'discount' column.
upvoted 0 times
...
...
Sharen
2 years ago
I think the correct answer is B.
upvoted 0 times
...
