Databricks Certified Professional Data Scientist Exam - Topic 5 Question 22 Discussion

Actual exam question for the Databricks Certified Professional Data Scientist exam
Question #: 22
Topic #: 5

You are building a text classification model for a book written by HadoopExam Learning Resources, to determine whether the book is about Hadoop or cloud computing. To cut down the size of the feature space, you perform feature selection: you use the mutual information of each word with the label (hadoop or cloud) to select the 1000 best features as input to a Naive Bayes model. When you compare a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on the test data.

What would help you choose better features for your model?
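
For readers unsure what "mutual information of each word with the label" means in practice, here is a minimal sketch in plain Python. It assumes each document is reduced to a set of tokens and that a feature is simply whether a word is present. The toy corpus and the function names (`mutual_information`, `top_k_words`) are made up for illustration and are not part of the exam question.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (in bits) between presence of `word` and the label."""
    n = len(docs)
    joint = Counter((word in doc, label) for doc, label in zip(docs, labels))
    present = Counter(word in doc for doc in docs)
    label_counts = Counter(labels)
    mi = 0.0
    for (has_word, label), count in joint.items():
        p_xy = count / n
        p_x = present[has_word] / n
        p_y = label_counts[label] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_k_words(docs, labels, k):
    """Rank every word by its MI with the label and keep the k best."""
    vocab = set().union(*docs)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(vocab, key=lambda w: (-mutual_information(docs, labels, w), w))
    return ranked[:k]

# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"hadoop", "mapreduce", "cluster"},
    {"hadoop", "hdfs", "cluster"},
    {"cloud", "aws", "cluster"},
    {"cloud", "azure", "cluster"},
]
labels = ["hadoop", "hadoop", "cloud", "cloud"]

best = top_k_words(docs, labels, 2)
```

Note that this ranking scores each word in isolation. A word that appears in every document ("cluster" here) gets a score of zero, but two words that carry the same information both rank highly, which is one way a longer feature list can end up adding little.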

Suggested Answer: A

Contribute your Thoughts:

Lucia
4 months ago
Less redundancy in features sounds smart!
upvoted 0 times
...
Skye
4 months ago
I disagree, reducing training data doesn't seem right.
upvoted 0 times
...
Ollie
4 months ago
Surprised that 250 features outperformed 1000!
upvoted 0 times
...
Art
4 months ago
I think including word frequency could help too.
upvoted 0 times
...
Maryann
5 months ago
Mutual information is key for feature selection!
upvoted 0 times
...
Ahmed
5 months ago
I recall that sometimes fewer features can lead to better results, like in our last exercise. Maybe focusing on the top features is key, but I’m not entirely sure.
upvoted 0 times
...
Kerry
5 months ago
I’m a bit confused about whether including word frequency would actually help. It seems like it could add noise instead of clarity.
upvoted 0 times
...
Malcolm
5 months ago
I think we practiced a question where reducing redundancy among features improved model performance, so maybe option A could be the right choice.
upvoted 0 times
...
Myra
5 months ago
I remember studying that mutual information helps identify relevant features, but I'm not sure how to balance it with redundancy.
upvoted 0 times
...
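
On balancing mutual information with redundancy, as raised in the comments above: one common idea is a greedy selection that rewards relevance to the label and penalizes overlap with features already chosen (in the spirit of mRMR). The sketch below is only an illustration under assumptions: documents are sets of tokens, features are word presence, and the corpus and the names `mi` and `select_features` are hypothetical. It is not the exam's official method.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def select_features(docs, labels, k):
    """Greedily pick k words: high MI with the label, low MI with words already picked."""
    vocab = sorted(set().union(*docs))
    presence = {w: [w in d for d in docs] for w in vocab}
    relevance = {w: mi(presence[w], labels) for w in vocab}
    selected = []

    def score(word):
        if not selected:
            return relevance[word]
        # Average overlap with the features chosen so far.
        redundancy = sum(mi(presence[word], presence[s]) for s in selected) / len(selected)
        return relevance[word] - redundancy

    while len(selected) < min(k, len(vocab)):
        candidates = [w for w in vocab if w not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Hypothetical toy corpus: "hadoop" and "hdfs" always occur together.
docs = [
    {"hadoop", "hdfs"}, {"hadoop", "hdfs"}, {"hadoop", "hdfs"}, {"yarn"},
    {"aws"}, {"aws"}, {"gcp"}, {"azure"},
]
labels = ["hadoop"] * 4 + ["cloud"] * 4

chosen = select_features(docs, labels, 2)
```

In this toy corpus a plain top-2 ranking by mutual information would take both "hadoop" and "hdfs", even though the second adds nothing new. The greedy version keeps one of them and spends the second slot on a word that covers the other class.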
Paulene
5 months ago
Ugh, NetworkPolicies are not my strong suit. I'll need to review the documentation and examples carefully to make sure I implement this correctly.
upvoted 0 times
...
Iraida
5 months ago
Hmm, I'm not entirely sure about this one. The options seem to cover different aspects of Service Desk operations, but I'm not confident which one specifically relates to Quality Assurance. I'll have to think this through carefully.
upvoted 0 times
...
Ines
5 months ago
This question seems straightforward, but I want to make sure I understand the key differences between microcomputer-prepared and manually prepared data files. I'll need to carefully consider the potential disadvantages of each.
upvoted 0 times
...
Pok
5 months ago
Ah, I see what's going on here. The security zones need to match the interfaces on the target devices, or else the device will need to restart to apply the new configuration. I'm pretty confident option A is the correct answer.
upvoted 0 times
...