Title: Select Important Features In Random Forest
Slug: select_important_features_in_random_forest
Summary: How to select important features in random forest in scikit-learn.
Date: 2017-09-21 12:00
Category: Machine Learning
Tags: Trees And Forests
Authors: Chris Albon


In [1]:
# Load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.feature_selection import SelectFromModel

Load Iris Flower Data

In [2]:
# Load data
iris = datasets.load_iris()
X = iris.data
y = iris.target

Create Random Forest Classifier

In [3]:
# Create random forest classifier
clf = RandomForestClassifier(random_state=0, n_jobs=-1)

Select Features With Importance Greater Than Threshold

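A fitted random forest assigns each feature an importance score: the higher the number, the more important the feature (all importance scores sum to one). SelectFromModel uses these scores to keep only the features whose importance is at or above a threshold, here 0.3. Plotting the scores can also make a random forest model easier to interpret.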
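
As a minimal sketch of where these scores come from, the forest can be fitted on its own and its feature_importances_ attribute inspected. The variable names forest and importances are illustrative and not part of the original notebook, and n_jobs is left at its default here.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

# Load the iris data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Fit a random forest directly (illustrative name, not from the notebook)
forest = RandomForestClassifier(random_state=0)
forest.fit(X, y)

# One importance score per feature
importances = forest.feature_importances_

# Print features from most to least important
ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

# The scores are normalized, so they should sum to (approximately) one
print("Sum of importances:", importances.sum())
```

Comparing the printed scores against a candidate threshold is a quick way to see how many features a given value would keep.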

In [4]:
# Create object that selects features with importance greater than or equal to a threshold
selector = SelectFromModel(clf, threshold=0.3)

# Create new feature matrix using selector
X_important = selector.fit_transform(X, y)
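
To check which columns survive the threshold, one option is the selector's get_support() method, which returns a boolean mask over the original features. The sketch below refits a selector from scratch so it runs on its own; the names mask and selected_names are illustrative, and n_jobs is left at its default.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Load the iris data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Same selector setup as above: keep features with importance >= 0.3
selector = SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.3)
selector.fit(X, y)

# Boolean mask: True where a feature was kept
mask = selector.get_support()

# Map the mask back to feature names
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]
print("Selected features:", selected_names)

# The fitted copy of the forest is available as estimator_
print("Importances:", selector.estimator_.feature_importances_)
```

SelectFromModel fits a clone of the estimator it is given, so the fitted forest is reached through selector.estimator_ rather than through the original clf object.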

View Selected Important Features

In [7]:
# View first five observations of the selected features
X_important[0:5]

array([[ 1.4,  0.2],
       [ 1.4,  0.2],
       [ 1.3,  0.2],
       [ 1.5,  0.2],
       [ 1.4,  0.2]])

Train Model With Selected Important Features

In [6]:
# Train random forest using most important features
model = clf.fit(X_important, y)
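
A model trained on the reduced matrix expects the same reduced columns at prediction time, so new observations have to pass through the selector first. One way to keep the two steps together, sketched here as an assumption rather than as part of the original notebook, is a scikit-learn Pipeline. The step names "select" and "forest" are arbitrary labels.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline

# Load the iris data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Chain feature selection and the final classifier (step names are arbitrary)
pipeline = Pipeline([
    ("select", SelectFromModel(RandomForestClassifier(random_state=0), threshold=0.3)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Fitting runs the selector first, then trains the forest on the kept features
pipeline.fit(X, y)

# Predicting applies the same selection to the full-width input automatically
predictions = pipeline.predict(X[:5])
print(predictions)
```

With this arrangement the caller always supplies the full feature matrix, and the selection step is applied consistently during both training and prediction.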