Mutual Information — ML Feature Selection
What is mutual information?
Information gain measures the predictive power of a feature on an outcome (target).
https://en.wikipedia.org/wiki/Mutual_information
The mutual information of X and Y is the expected value, over the joint distribution, of the log ratio between the joint probability p(x, y) and the product of the marginals p(x)p(y); it is zero exactly when X and Y are independent.
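As a concrete illustration, the definition can be computed by hand for a small made-up joint distribution (the numbers below are purely illustrative):

import numpy as np

# I(X; Y) = sum over x, y of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
p_xy = np.array([[0.4, 0.1],   # illustrative joint probabilities p(x, y)
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)         # marginal p(x)
p_y = p_xy.sum(axis=0)         # marginal p(y)

mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
print(mi)   # > 0 because X and Y are dependent; 0 would mean independence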
Uses:
It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.
Information gain can also be used for feature selection, by evaluating the gain of each variable in the context of the target variable. In this slightly different usage, the calculation is referred to as mutual information between the two random variables.
Information gain is the reduction in entropy or surprise by transforming a dataset and is often used in training decision trees.
Information gain is calculated by comparing the entropy of the dataset before and after a transformation.
Mutual information measures the statistical dependence between two variables and is the name given to information gain when applied to variable selection.
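A minimal sketch of that before/after entropy comparison (the labels and the candidate split below are made up for illustration):

import numpy as np

def entropy(labels):
    # Shannon entropy of a label array, in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                         # labels before the split
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])   # labels in the two groups after the split

gain = entropy(y) - (len(left) / len(y)) * entropy(left) - (len(right) / len(y)) * entropy(right)
print(gain)   # information gain = reduction in entropy produced by the split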
Difference between Correlation and Mutual Information
Correlation analysis provides a quantitative means of measuring the strength of a linear relationship between two vectors of data.
Mutual information is essentially a measure of how much “knowledge” one can gain about one variable by knowing the value of another.
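For example (a made-up, purely nonlinear relationship), correlation can be near zero while mutual information is clearly positive:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2                                            # nonlinear dependence

print(np.corrcoef(x, y)[0, 1])                        # near 0: no linear relationship
print(mutual_info_regression(x.reshape(-1, 1), y))    # clearly > 0: strong dependence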
Python Implementation:
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
Step 1 — Get the Ranking of the Features
1. For classification, use mutual_info_classif
mi is the array of mutual information scores it returns
2. Convert the array to a DataFrame
3. Plot the features with the highest mutual information… the higher the value, the more predictive the feature (see the sketch below)
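A minimal sketch of these three steps, assuming a pandas DataFrame X_train and a target y_train (the variable names are illustrative, not from the article):

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# 1. Mutual information of each feature with the target
mi = mutual_info_classif(X_train, y_train)

# 2. Convert the array to a pandas object indexed by feature name
mi = pd.Series(mi, index=X_train.columns).sort_values(ascending=False)

# 3. Plot the features with the highest mutual information
mi.head(10).plot.bar(title='Top 10 features by mutual information')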
The bar plot shows the 10 features with the highest mutual information scores.
Step 2 — Get the Top Selected Features
The top 10 features chosen by SelectKBest are very similar to the ranking above.
Transform the data with the fitted SelectKBest object to remove the unselected features, as sketched below.
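A minimal sketch of this step, again with illustrative X_train/X_test/y_train names:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

sel = SelectKBest(mutual_info_classif, k=10).fit(X_train, y_train)   # fit on the training set only
selected = X_train.columns[sel.get_support()]                        # names of the kept features

X_train_sel = sel.transform(X_train)   # drop the unselected features
X_test_sel = sel.transform(X_test)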
Regression Data
Mutual information regression only applies to numerical features, so create a dataset containing only numeric values.
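With pandas this could look like the following (assuming X_train/X_test DataFrames):

X_train_num = X_train.select_dtypes(include='number')   # keep only numeric columns
X_test_num = X_test.select_dtypes(include='number')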
The regression implementation is exactly the same; we just use mutual_info_regression instead.
Step 1 — Get the Highest Ranking Features
You can plot the features with the highest mutual information… the higher the value, the more predictive the feature.
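The ranking step mirrors the classification case, just with the regression scorer (a sketch, assuming a numeric X_train and continuous y_train):

import pandas as pd
from sklearn.feature_selection import mutual_info_regression

mi = pd.Series(mutual_info_regression(X_train, y_train), index=X_train.columns)
mi.sort_values(ascending=False).head(10).plot.bar(title='Top features by mutual information')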
Step 2 — Select Among the Highest Ranking Features; for regression we use SelectPercentile
sel_ = SelectPercentile(mutual_info_regression, percentile=10).fit(X_train, y_train)
Transform (do not refit) X_train and X_test with the fitted selector to remove all the unselected features.
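Continuing from the sel_ object fitted above (variable names are illustrative):

selected = X_train.columns[sel_.get_support()]   # features that survive the 10th-percentile cut

X_train_sel = sel_.transform(X_train)   # transform, not fit, so train and test keep the same columns
X_test_sel = sel_.transform(X_test)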
Other Recommendations
Univariate statistical tests (the ANOVA F-test and chi2) are a recommended alternative to mutual information.
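A possible way to apply that recommendation with scikit-learn's univariate scorers (a sketch; X_train_nonneg is a hypothetical non-negative version of the data, since chi2 requires non-negative features):

from sklearn.feature_selection import SelectKBest, chi2, f_classif

sel_anova = SelectKBest(f_classif, k=10).fit(X_train, y_train)    # ANOVA F-test for classification
sel_chi2 = SelectKBest(chi2, k=10).fit(X_train_nonneg, y_train)   # chi2 needs non-negative features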