This tutorial will show you how to analyze predictions of an XGBoost classifier (regression for XGBoost and most scikit-learn tree ensembles are also supported by eli5). We will use the Titanic dataset, which is small and does not have too many features, but is still interesting enough.
We are using XGBoost 0.81 and data downloaded from https://www.kaggle.com/c/titanic/data (it is also bundled in the eli5 repo: https://github.com/TeamHG-Memex/eli5/blob/master/notebooks/titanic-train.csv).
Let's start by loading the data:
In [1]:
import csv
import numpy as np
with open('titanic-train.csv', 'rt') as f:
    data = list(csv.DictReader(f))
data[:1]
Out[1]:
Variable descriptions:
Next, shuffle the data and separate features from what we are trying to predict: survival.
In [2]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
_all_xs = [{k: v for k, v in row.items() if k != 'Survived'} for row in data]
_all_ys = np.array([int(row['Survived']) for row in data])
all_xs, all_ys = shuffle(_all_xs, _all_ys, random_state=0)
train_xs, valid_xs, train_ys, valid_ys = train_test_split(
    all_xs, all_ys, test_size=0.25, random_state=0)
print('{} items total, {:.1%} true'.format(len(all_xs), np.mean(all_ys)))
We do just minimal preprocessing: convert the obviously continuous Age and Fare variables to floats, and SibSp and Parch to integers. Missing Age values are removed.
In [3]:
for x in all_xs:
    if x['Age']:
        x['Age'] = float(x['Age'])
    else:
        x.pop('Age')
    x['Fare'] = float(x['Fare'])
    x['SibSp'] = int(x['SibSp'])
    x['Parch'] = int(x['Parch'])
Let's first build a very simple classifier with xgboost.XGBClassifier and sklearn.feature_extraction.DictVectorizer, and check its accuracy with 10-fold cross-validation:
In [4]:
from xgboost import XGBClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
clf = XGBClassifier()
vec = DictVectorizer()
pipeline = make_pipeline(vec, clf)
def evaluate(_clf):
    scores = cross_val_score(_clf, all_xs, all_ys, scoring='accuracy', cv=10)
    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))
    _clf.fit(train_xs, train_ys)  # so that parts of the original pipeline are fitted
evaluate(pipeline)
There is one tricky bit about the code above: one may be tempted to just pass sparse=False to DictVectorizer: after all, in this case the matrices are small. But this is not a great solution, because we would lose the ability to distinguish features that are missing from features that have a zero value.
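To make the distinction concrete, here is a tiny sketch with made-up rows (not part of the original notebook): in the default sparse output a missing Age simply has no stored entry, while in a dense view it becomes 0.0 and is indistinguishable from a real zero.
from sklearn.feature_extraction import DictVectorizer

demo_vec = DictVectorizer()  # sparse output by default
X_demo = demo_vec.fit_transform([{'Age': 30.0, 'Parch': 1}, {'Parch': 1}])
print(demo_vec.get_feature_names())  # ['Age', 'Parch']
print(X_demo.toarray())              # dense view: the missing Age shows up as 0.0
print(X_demo.getrow(1).indices)      # sparse view: only the 'Parch' column is stored for row 1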
In order to calculate a prediction, XGBoost sums the predictions of all its trees. The number of trees is controlled by the n_estimators argument and is 100 by default. Each tree is not a great predictor on its own, but by summing across all trees, XGBoost is able to provide a robust estimate in many cases. Here is one of the trees:
In [5]:
booster = clf.get_booster()
original_feature_names = booster.feature_names
booster.feature_names = vec.get_feature_names()
print(booster.get_dump()[0])
# recover original feature names
booster.feature_names = original_feature_names
We see that this tree checks the Sex, Age, Pclass, Fare and SibSp features. leaf gives the decision of a single tree, and these values are summed over all trees in the ensemble.
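As a sanity check, the following sketch (not part of the original notebook) reproduces the classifier's probabilities from the booster's raw margin, which is the sum of the leaf values over all trees; the predicted probability is its logistic sigmoid (assuming the default binary:logistic objective with base_score=0.5).
import xgboost

X_valid = vec.transform(valid_xs)                     # sparse feature matrix
margin = booster.predict(xgboost.DMatrix(X_valid), output_margin=True)
proba = 1.0 / (1.0 + np.exp(-margin))                 # sigmoid of the summed leaf values
print(np.allclose(proba, clf.predict_proba(X_valid)[:, 1], atol=1e-5))  # expect True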
Let's check feature importances with eli5.show_weights:
In [6]:
from eli5 import show_weights
show_weights(clf, vec=vec)
Out[6]:
There are several different ways to calculate feature importances. By default, "gain" is used, that is the average gain of the feature when it is used in trees. Other types are "weight" - the number of times a feature is used to split the data, and "cover" - the average coverage of the feature. You can select the method with the importance_type argument.
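For example, assuming show_weights forwards the keyword to eli5's XGBoost explainer, split counts can be requested like this (a sketch, not part of the original notebook):
show_weights(clf, vec=vec, importance_type='weight')  # rank features by number of splits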
Now we know that the two most important features are Sex=female and Pclass=3, but we still don't know how XGBoost decides what prediction to make based on their values.
To get a better idea of how our classifier works, let's examine individual predictions with eli5.show_prediction:
In [7]:
from eli5 import show_prediction
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True)
Out[7]:
Weight means how much each feature contributed to the final prediction across all trees. The idea behind the weight calculation is described in http://blog.datadive.net/interpreting-random-forests/; eli5 provides an independent implementation of this algorithm for XGBoost and most scikit-learn tree ensembles.
Here we see that the classifier thinks it's good to be a female, but bad to travel third class.
Some features have "Missing" as their value (we are passing show_feature_values=True to view the values): that means the feature was missing, so in this case it's good not to have embarked in Southampton. This is where our decision to go with sparse matrices comes in handy: we can still see that Parch is zero, not missing.
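The same contributions can also be read programmatically with eli5.explain_prediction; the sketch below is not part of the original notebook and assumes eli5's Explanation object layout (targets[0].feature_weights with pos and neg lists):
import eli5

expl = eli5.explain_prediction(clf, valid_xs[1], vec=vec)
feature_weights = expl.targets[0].feature_weights
for fw in list(feature_weights.pos) + list(feature_weights.neg):
    print('{:+.3f}  {}'.format(fw.weight, fw.feature))  # includes the <BIAS> term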
It's possible to show only features that are present using the feature_filter argument: it's a function that accepts a feature name and value, and returns True for features that should be shown:
In [8]:
no_missing = lambda feature_name, feature_value: not np.isnan(feature_value)
show_prediction(clf, valid_xs[1], vec=vec, show_feature_values=True, feature_filter=no_missing)
Out[8]:
Right now we treat the Name field as categorical, like other text features. But in this dataset each name is unique, so XGBoost does not use this feature at all, because it's such a poor discriminator: it's absent from the weights table in section 3.
But Name still might contain some useful information. We don't want to guess how to best pre-process it and what features to extract, so let's use the most general character ngram vectorizer:
In [9]:
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
vec2 = FeatureUnion([
    ('Name', CountVectorizer(
        analyzer='char_wb',
        ngram_range=(3, 4),
        preprocessor=lambda x: x['Name'],
        max_features=100,
    )),
    ('All', DictVectorizer()),
])
clf2 = XGBClassifier()
pipeline2 = make_pipeline(vec2, clf2)
evaluate(pipeline2)
In this case the pipeline is more complex: we slightly improved our result, but the improvement is not significant. Let's look at the feature importances:
In [10]:
show_weights(clf2, vec=vec2)
Out[10]:
We see that now there are a lot of features that come from the Name field (in fact, a classifier based on Name alone gives about 0.79 accuracy; see the sketch at the end of this section). Name features listed in this way are not very informative; they make more sense when we check out individual predictions. We hide missing features here because there are a lot of missing features in text, but they are not very interesting:
In [11]:
from IPython.display import display
for idx in [4, 5, 7, 37, 81]:
    display(show_prediction(clf2, valid_xs[idx], vec=vec2,
                            show_feature_values=True, feature_filter=no_missing))
Text features from the Name field are highlighted directly in the text, and the sum of their weights is shown in the weights table as "Name: Highlighted in text (sum)".
It looks like the name classifier tried to infer both gender and status from the title: "Mr." is bad because women are saved first, and it's better to be "Mrs." (married) than "Miss.". The name classifier is also trying to pick up some parts of names and surnames, especially endings, perhaps as a proxy for social status. It's especially bad to be "Mary" if you are from the third class.
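As a follow-up, here is a rough sketch (not part of the original notebook) of the Name-only baseline mentioned earlier; the exact score depends on the cross-validation splits:
vec_name_only = CountVectorizer(
    analyzer='char_wb',
    ngram_range=(3, 4),
    preprocessor=lambda x: x['Name'],
    max_features=100,
)
evaluate(make_pipeline(vec_name_only, XGBClassifier()))  # roughly 0.79 mean accuracy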