Databricks Certified Professional Data Scientist Exam - Topic 5 Question 22 Discussion

Actual exam question for Databricks's Databricks Certified Professional Data Scientist exam

Question #: 22
Topic #: 5

You are building a text classification model to determine whether a book written by HadoopExam Learning Resources is about Hadoop or cloud computing. To cut down the size of the feature space, you use the mutual information of each word with the label (hadoop or cloud) to select the 1000 best features as input to a Naive Bayes model. When you compare the performance of a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on your test data.

What would help you choose better features for your model?

A. Include least mutual information with other selected features as a feature selection criterion

B. Include the number of times each of the words appears in the book in your model

C. Decrease the size of our training data

D. Evaluate a model that only includes the top 100 words

Correlation measures the linear relationship (Pearson's correlation) or monotonic relationship (Spearman's correlation) between two variables, X and Y. Mutual information is more general and measures the reduction of uncertainty in Y after observing X. It is the Kullback-Leibler (KL) divergence between the joint distribution of X and Y and the product of their marginal distributions. So MI can measure non-monotonic relationships and other more complicated relationships.
Mutual information is a quantification of the dependency between random variables. It is sometimes contrasted with linear correlation since mutual information captures nonlinear dependence.
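To make the contrast with correlation concrete, here is a minimal sketch in plain Python. The helper names `mutual_information` and `pearson` and the toy data are assumptions made for illustration; they are not part of the exam material. X takes the values -2..2 and Y = X², so the relationship is fully deterministic but non-monotonic.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # KL-divergence term: p(x, y) * log(p(x, y) / (p(x) * p(y)))
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy data (illustrative only): Y is a non-monotonic function of X.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

print("Pearson correlation:", pearson(xs, ys))
print("Mutual information (nats):", mutual_information(xs, ys))
```

On this toy data the Pearson correlation is zero even though Y is completely determined by X, while the mutual information is clearly positive.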
Features with high mutual information with the predicted value are good. However, a feature may have high mutual information only because it is highly correlated with another feature that has already been selected, in which case it adds little new information. Choosing another feature with somewhat less mutual information with the predicted value, but low mutual information with the other selected features, may be more beneficial. Hence it may help to also prefer features that are less redundant with other selected features.
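The idea behind option A can be sketched as a greedy selection that scores each candidate by its relevance (mutual information with the label) minus its redundancy (average mutual information with the features already chosen). This is in the spirit of minimum-redundancy-maximum-relevance (mRMR) selection; it is one possible reading of the criterion, not necessarily the exact procedure the question has in mind. The function names, the word features `hdfs`, `namenode`, and `cloud`, and the eight toy documents are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def select_by_relevance(features, labels, k):
    """Baseline: keep the k features with the highest MI with the label."""
    return sorted(
        features,
        key=lambda name: mutual_information(features[name], labels),
        reverse=True,
    )[:k]


def select_greedy_low_redundancy(features, labels, k):
    """Greedy selection: relevance minus mean MI with already-selected features."""
    relevance = {
        name: mutual_information(col, labels) for name, col in features.items()
    }
    selected = []
    candidates = list(features)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy corpus: 8 documents, label 1 = hadoop, 0 = cloud.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
# Binary word-presence features; "namenode" duplicates "hdfs" exactly.
features = {
    "hdfs":     [1, 1, 1, 0, 0, 0, 0, 0],
    "namenode": [1, 1, 1, 0, 0, 0, 0, 0],
    "cloud":    [0, 0, 0, 0, 1, 1, 0, 0],
}

top_by_relevance = select_by_relevance(features, labels, 2)
top_low_redundancy = select_greedy_low_redundancy(features, labels, 2)
print("MI with label only:   ", top_by_relevance)
print("With redundancy term: ", top_low_redundancy)
```

In this toy setup, ranking by mutual information with the label alone keeps both `hdfs` and its exact duplicate `namenode`, whereas the redundancy-aware variant keeps `hdfs` and then prefers `cloud`, which is less informative on its own but carries information the first feature does not.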

Suggested Answer: A

by Sharen at May 04, 2022, 12:39 PM

6 months ago

Ah, I see what's going on here. The security zones need to match the interfaces on the target devices, or else the device will need to restart to apply the new configuration. I'm pretty confident option A is the correct answer.

...