This tutorial will show you how to analyze predictions of an XGBoost classifier (regression for XGBoost and most scikit-learn tree ensembles are also supported by eli5). We will use the Titanic dataset, which is small and does not have too many features, but is still interesting enough.
We are using XGBoost 0.81 and data downloaded from https://www.kaggle.com/c/titanic/data (it is also bundled in the eli5 repo: https://github.com/TeamHG-Memex/eli5/blob/master/notebooks/titanic-train.csv).
Let's start by loading the data:
In [1]:
import csv
import numpy as np
with open('titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]
Out[1]:
Variable descriptions:
Next, shuffle the data and separate features from what we are trying to predict: survival.
In [2]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])
all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
We do just minimal preprocessing: convert the obviously continuous Age and Fare variables to floats, and SibSp and Parch to integers. Missing Age values are removed.
In [3]:
for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])
Let's first build a very simple classifier with xgboost.XGBClassifier and sklearn.feature_extraction.DictVectorizer, and check its accuracy with 10-fold cross-validation:
In [4]:
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, clf)
def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted
evaluate(pipeline)
There is one tricky bit about the code above: one may be tempted to just pass sparse=False to DictVectorizer: after all, in this case the matrices are small. But this is not a great solution, because we would lose the ability to distinguish features that are missing from features that have a zero value.
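To make the distinction concrete, here is a tiny sketch with made-up rows (not part of the original notebook): in the default sparse output a missing Age simply has no stored entry, while in a dense view it becomes 0.0 and is indistinguishable from a real zero.
from sklearn.feature_extraction import DictVectorizer

demo_vec = DictVectorizer()  # sparse output by default
X_demo = demo_vec.fit_transform([{'Age': 30.0, 'Parch': 1}, {'Parch': 1}])
print(demo_vec.get_feature_names())  # ['Age', 'Parch']
print(X_demo.toarray())              # dense view: the missing Age shows up as 0.0
print(X_demo.getrow(1).indices)      # sparse view: only the 'Parch' column is stored for row 1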
In order to calculate a prediction, XGBoost sums the predictions of all its trees. The number of trees is controlled by the n_estimators argument and is 100 by default. Each tree is not a great predictor on its own, but by summing across all trees, XGBoost is able to provide a robust estimate in many cases. Here is one of the trees:
In [5]:
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names
We see that this tree checks the Sex, Age, Pclass, Fare and SibSp features. leaf gives the decision of a single tree, and these values are summed over all trees in the ensemble.
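As a sanity check, the following sketch (not part of the original notebook) reproduces the classifier's probabilities from the booster's raw margin, which is the sum of the leaf values over all trees; the predicted probability is its logistic sigmoid (assuming the default binary:logistic objective with base_score=0.5).
import xgboost

X_valid = vec.transform(valid_xs)                     # sparse feature matrix
margin = booster.predict(xgboost.DMatrix(X_valid), output_margin=True)
proba = 1.0 / (1.0 + np.exp(-margin))                 # sigmoid of the summed leaf values
print(np.allclose(proba, clf.predict_proba(X_valid)[:, 1], atol=1e-5))  # expect True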
Let's check feature importances with eli5.show_weights:
In [6]:
from eli5 import show_weights
show_weights(clf, vec=vec)
Out[6]:
There are several different ways to calculate feature importances. By default, "gain" is used, that is the average gain of the feature when it is used in trees. Other types are "weight" - the number of times a feature is used to split the data, and "cover" - the average coverage of the feature. You can select the method with the importance_type argument.
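For example, assuming show_weights forwards the keyword to eli5's XGBoost explainer, split counts can be requested like this (a sketch, not part of the original notebook):
show_weights(clf, vec=vec, importance_type='weight')  # rank features by number of splits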
Now we know that the two most important features are Sex=female and Pclass=3, but we still don't know how XGBoost decides what prediction to make based on their values.
To get a better idea of how our classifier works, let's examine individual predictions with eli5.show_prediction:
In [7]:
from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)
Out[7]:
Weight means how much each feature contributed to the final prediction across all trees. The idea behind the weight calculation is described in http://blog.datadive.net/interpreting-random-forests/; eli5 provides an independent implementation of this algorithm for XGBoost and most scikit-learn tree ensembles.
Here we see that the classifier thinks it's good to be a female, but bad to travel third class.
Some features have "Missing" as their value (we are passing show_feature_values=True to view the values): that means the feature was missing, so in this case it's good not to have embarked in Southampton. This is where our decision to go with sparse matrices comes in handy: we can still see that Parch is zero, not missing.
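The same contributions can also be read programmatically with eli5.explain_prediction; the sketch below is not part of the original notebook and assumes eli5's Explanation object layout (targets[0].feature_weights with pos and neg lists):
import eli5

expl = eli5.explain_prediction(clf, valid_xs[1], vec=vec)
feature_weights = expl.targets[0].feature_weights
for fw in list(feature_weights.pos) + list(feature_weights.neg):
    print('{:+.3f}  {}'.format(fw.weight, fw.feature))  # includes the <BIAS> term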
It's possible to show only features that are present using the feature_filter argument: it's a function that accepts a feature name and value, and returns True for features that should be shown:
In [8]:
no_missing = lambda feature_name, feature_value: not np.isnan(feature_value)
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)
Out[8]:
Right now we treat the Name field as categorical, like other text features. But in this dataset each name is unique, so XGBoost does not use this feature at all, because it's such a poor discriminator: it's absent from the weights table in section 3.
But Name still might contain some useful information. We don't want to guess how to best pre-process it and what features to extract, so let's use the most general character ngram vectorizer:
In [9]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, clf2)
evaluate(pipeline2)
In this case the pipeline is more complex: we slightly improved our result, but the improvement is not significant. Let's look at the feature importances:
In [10]:
show_weights(clf2, vec=vec2)
Out[10]:
We see that now there are a lot of features that come from the Name field (in fact, a classifier based on Name alone gives about 0.79 accuracy; see the sketch at the end of this section). Name features listed in this way are not very informative; they make more sense when we check out individual predictions. We hide missing features here because there are a lot of missing features in text, but they are not very interesting:
In [11]:
from IPython.display import display
for idx in [4, 5, 7, 37, 81]:
    display(show_prediction(clf2, valid_xs[idx], vec=vec2,
                            show_feature_values=True, feature_filter=no_missing))
Text features from the Name field are highlighted directly in the text, and the sum of their weights is shown in the weights table as "Name: Highlighted in text (sum)".
It looks like the name classifier tried to infer both gender and status from the title: "Mr." is bad because women are saved first, and it's better to be "Mrs." (married) than "Miss.". The name classifier is also trying to pick up some parts of names and surnames, especially endings, perhaps as a proxy for social status. It's especially bad to be "Mary" if you are from the third class.
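As a follow-up, here is a rough sketch (not part of the original notebook) of the Name-only baseline mentioned earlier; the exact score depends on the cross-validation splits:
vec_name_only = CountVectorizer(
    analyzer='char_wb',
    ngram_range=(3, 4),
    preprocessor=lambda x: x['Name'],
    max_features=100,
)
evaluate(make_pipeline(vec_name_only, XGBClassifier()))  # roughly 0.79 mean accuracy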