Title: Feature Selection Using Random Forest
Slug: feature_selection_using_random_forest
Summary: Feature Selection Using Random Forest with scikit-learn.
Date: 2016-12-01 12:00
Category: Machine Learning
Tags: Feature Selection
Authors: Chris Albon
Often in data science we have hundreds or even millions of features and we want a way to create a model that only includes the most important features. This has three benefits. First, we make our model simpler to interpret. Second, we can reduce the variance of the model, and therefore overfitting. Finally, we can reduce the computational cost (and time) of training a model. The process of identifying only the most relevant features is called "feature selection."
Random Forests are often used for feature selection in a data science workflow. The reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, that is, by the mean decrease in impurity (gini importance) averaged over all trees. Nodes with the greatest decrease in impurity happen at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
In this tutorial we will load the iris data, train a random forest classifier, rank the features by their gini importance, select the most important features, and compare the accuracy of the full model to a model trained only on those features.
Note: There are other definitions of feature importance; however, in this tutorial we limit our discussion to gini importance.
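As a quick illustration of the idea (not part of the original notebook), the gini impurity of a node can be computed from the class proportions of the samples it contains; a split's importance is the parent node's impurity minus the weighted impurity of its children. A minimal sketch:
import numpy as np

def gini_impurity(labels):
    # gini impurity = 1 - sum of squared class proportions in the node
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A pure node has impurity 0.0; a perfectly mixed two-class node has 0.5
print(gini_impurity(np.array([0, 0, 0, 0])))  # 0.0
print(gini_impurity(np.array([0, 0, 1, 1])))  # 0.5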
In [1]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
The dataset used in this tutorial is the famous iris dataset. It contains 50 samples from each of three species of iris; the species labels make up the target vector, y, and the four measurements make up the feature matrix, X.
In [2]:
# Load the iris dataset
iris = datasets.load_iris()
# Create a list of feature names
feat_labels = ['Sepal Length','Sepal Width','Petal Length','Petal Width']
# Create X from the features
X = iris.data
# Create y from output
y = iris.target
In [3]:
# View the features
X[0:5]
Out[3]:
In [4]:
# View the target data
y
Out[4]:
In [5]:
# Split the data into 40% test and 60% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
In [6]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
# Train the classifier
clf.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
The scores above are the importance scores for each variable. There are two things to note. First, the importance scores sum to 1.0. Second, Petal Length and Petal Width are far more important than the other two features. Combined, Petal Length and Petal Width have an importance of ~0.86! Clearly these are the most important features.
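If you prefer to see the features ranked rather than listed in column order, one quick way (reusing the feat_labels and clf objects defined above) is to sort the pairs by their importance score:
# Print the features sorted from most to least important
for name, importance in sorted(zip(feat_labels, clf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print('{:<15} {:.4f}'.format(name, importance))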
In [7]:
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)
# Train the selector
sfm.fit(X_train, y_train)
Out[7]:
In [8]:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(feat_labels[feature_list_index])
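Since clf is already trained, the second fit can be avoided by passing prefit=True to SelectFromModel; the selector then reads the importances straight from the fitted classifier. This is a sketch of that alternative, and the exact behavior may vary slightly between scikit-learn versions:
# Reuse the already-fitted classifier instead of refitting inside the selector
sfm_prefit = SelectFromModel(clf, threshold=0.15, prefit=True)
# With prefit=True we can call transform directly, without fitting the selector
X_important_train_alt = sfm_prefit.transform(X_train)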
In [9]:
# Transform the data to create a new dataset containing only the most important features
# Note: We have to apply the transform to both the training X and test X data.
X_important_train = sfm.transform(X_train)
X_important_test = sfm.transform(X_test)
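As an aside, the selection and classification steps can also be chained in a scikit-learn Pipeline, which applies the transform to the training and test data automatically. A rough sketch, using a smaller forest (n_estimators=100) just to keep it fast:
from sklearn.pipeline import Pipeline

# Chain feature selection and classification into a single estimator
pipe = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold=0.15)),
    ('classify', RandomForestClassifier(n_estimators=100, random_state=0))
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)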
In [10]:
# Create a new random forest classifier for the most important features
clf_important = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
# Train the new classifier on the new dataset containing the most important features
clf_important.fit(X_important_train, y_train)
Out[10]:
In [11]:
# Apply The Full Featured Classifier To The Test Data
y_pred = clf.predict(X_test)
# View The Accuracy Of Our Full Feature (4 Features) Model
accuracy_score(y_test, y_pred)
Out[11]:
In [12]:
# Apply The Limited Featured Classifier To The Test Data
y_important_pred = clf_important.predict(X_important_test)
# View The Accuracy Of Our Limited Feature (2 Features) Model
accuracy_score(y_test, y_important_pred)
Out[12]:
As can be seen by the accuracy scores, our original model, which contained all four features, is 93.3% accurate, while our 'limited' model, which contained only two features, is 88.3% accurate. Thus, for a small cost in accuracy we halved the number of features in the model.