This notebook illustrates computing feature importances for the Iris dataset. It is a version of the scikit-learn example "Feature importances with forests of trees". The main point it shows is the convenience of using pandas structures throughout the code.
First, we load the dataset into a pandas.DataFrame.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import datasets
# Pandas-aware counterparts of the estimators in sklearn.ensemble.
from ibex.sklearn import ensemble as pd_ensemble

sns.set_style('whitegrid')
%matplotlib inline
In [2]:
iris = datasets.load_iris()
features = iris['feature_names']
# Stack the features and the target into a single labeled frame.
iris = pd.DataFrame(
    np.c_[iris['data'], iris['target']],
    columns=features + ['class'])
iris.head()
Out[2]:
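As a quick sanity check on the resulting frame (an optional aside, not part of the original example), it should hold the four measurement columns plus the class column, 150 rows in all:

iris.shape             # (150, 5)
iris.columns.tolist()  # the four feature names followed by 'class'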
Now that all the data is in a DataFrame, we can use the feature_importances_ attribute of a fitted gradient boosting classifier. Note that in Ibex, this is a pandas.Series.
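For contrast, with plain scikit-learn the same attribute comes back as a bare numpy array, so the feature names have to be carried around separately. A minimal sketch (the name clf and the manual relabeling are illustrative, not part of the example):

from sklearn.ensemble import GradientBoostingClassifier

# Plain scikit-learn operates on unlabeled arrays...
clf = GradientBoostingClassifier().fit(iris[features].values, iris['class'].values)
clf.feature_importances_  # a bare ndarray, with no feature names attached

# ...so the names have to be zipped back in by hand:
pd.Series(clf.feature_importances_, index=features)

The Ibex version needs none of that bookkeeping: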
In [3]:
pd_ensemble.GradientBoostingClassifier().fit(iris[features], iris['class']).feature_importances_
Out[3]:
Since the result is a Series, we can use its plot method directly, and it will handle all the labels for us.
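To see what that buys us, here is a rough sketch of the manual matplotlib equivalent, assuming clf is the plain scikit-learn classifier from the sketch above:

# With a bare ndarray, the bar positions and tick labels
# must be wired up by hand:
raw_importances = clf.feature_importances_
pos = np.arange(len(features))
plt.barh(pos, raw_importances / raw_importances.max(), color='0.75')
plt.yticks(pos, features)

With a labeled Series, none of that is needed: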
In [4]:
importances = pd_ensemble.GradientBoostingClassifier().fit(
    iris[features], iris['class']).feature_importances_
(importances / importances.max()).plot(kind='barh', color='0.75')
plt.xlabel('feature importance')
plt.ylabel('features')
plt.figtext(
    0,
    -0.1,
    'Relative feature importances for the Iris dataset, using gradient boosting classification');
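As a small optional refinement (not part of the original example), the Series can be sorted before plotting, so the bars appear in order of importance; being a pandas.Series, the labels follow along automatically:

(importances.sort_values() / importances.max()).plot(kind='barh', color='0.75');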