TPOT tutorial on the Titanic dataset

The Titanic machine learning competition on Kaggle is one of the most popular beginner's competitions on the platform. We will use that competition here to demonstrate the implementation of TPOT.



In [1]:

    
# Import required libraries
from tpot import TPOT
from sklearn.cross_validation import train_test_split
import pandas as pd 
import numpy as np



In [2]:

    
# Load the data
titanic = pd.read_csv('data/titanic_train.csv')
titanic.head(5)









    Out[2]:






  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

Data Exploration



In [3]:

    
titanic.groupby('Sex').Survived.value_counts()









    Out[3]:





Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64



In [4]:

    
titanic.groupby(['Pclass','Sex']).Survived.value_counts()









    Out[4]:





Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64



In [5]:

    
id = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
id.div(id.sum(1).astype(float), 0)

Data Munging

The first and most important step in using TPOT on any data set is to rename the target class/response variable to class.



In [6]:

    
titanic.rename(columns={'Survived': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 5 categorical variables which contain non-numerical values: Name, Sex, Ticket, Cabin and Embarked.



In [7]:

    
titanic.dtypes









    Out[7]:





PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We then check the number of levels that each of the five categorical variables have.



In [8]:

    
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))









    



Number of levels in category 'Name':  891.00 
Number of levels in category 'Sex':  2.00 
Number of levels in category 'Ticket':  681.00 
Number of levels in category 'Cabin':  148.00 
Number of levels in category 'Embarked':  4.00

As we can see, Sex and Embarked have few levels. Let's find out what they are.



In [9]:

    
for cat in ['Sex', 'Embarked']:
    print("Levels for catgeory '{0}': {1}".format(cat, titanic[cat].unique()))









    



Levels for catgeory 'Sex': ['male' 'female']
Levels for catgeory 'Embarked': ['S' 'C' 'Q' nan]

We then code these levels manually into numerical values. For nan i.e. the missing values, we simply replace them with a placeholder value (-999). In fact, we perform this replacement for the entire data set.



In [10]:

    
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})



In [11]:

    
titanic = titanic.fillna(-999)
pd.isnull(titanic).any()









    Out[11]:





PassengerId    False
class          False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

Since Name and Ticket have so many levels, we drop them from our analysis for the sake of simplicity. For Cabin, we encode the levels as digits using Scikit-learn's MultiLabelBinarizer and treat them as new features.



In [12]:

    
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])



In [13]:

    
CabinTrans









    Out[13]:





array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

Drop the unused features from the dataset.



In [14]:

    
titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)



In [15]:

    
assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal" #check correct encoding done

We then add the encoded features to form the final dataset to be used with TPOT.



In [16]:

    
titanic_new = np.hstack((titanic_new.values,CabinTrans))



In [17]:

    
np.isnan(titanic_new).any()









    Out[17]:





False

Keeping in mind that the final dataset is in the form of a numpy array, we can check the number of features in the final dataset as follows.



In [18]:

    
titanic_new[0].size









    Out[18]:





156

Finally we store the class labels, which we need to predict, in a separate variable.



In [19]:

    
titanic_class = titanic['class'].values

Data Analysis using TPOT

To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error. The model selection and tuning is entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.



In [20]:

    
training_indices, validation_indices = training_indices, testing_indices = train_test_split(titanic.index, stratify = titanic_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size









    Out[20]:





(668, 223)

After that, we proceed to calling the fit, score and export functions on our training dataset. To get a better idea of how these functions work, refer the TPOT documentation here.

An important TPOT parameter to set is the number of generations. Since our aim is to just illustrate the use of TPOT, we have set it to 5. On a standard laptop with 4GB RAM, it roughly takes 5 minutes per generation to run. For each added generation, it should take 5 mins more. Thus, for the default value of 100, total run time could be roughly around 8 hours.



In [21]:

    
tpot = TPOT(generations=5, verbosity=2)
tpot.fit(titanic_new[training_indices], titanic_class[training_indices])









    









    



Generation 1 - Current best internal CV score: 0.76691






    









    



Generation 2 - Current best internal CV score: 0.77875






    









    



Generation 3 - Current best internal CV score: 0.80916






    









    



Generation 4 - Current best internal CV score: 0.80916






    









    



Generation 5 - Current best internal CV score: 0.80916






    









    



Best pipeline: _logistic_regression(input_df, 10.0, 32, False)



In [22]:

    
tpot.score(titanic_new[validation_indices], titanic.loc[validation_indices, 'class'].values)









    Out[22]:





0.76192497029366835



In [23]:

    
tpot.export('tpot_titanic_pipeline.py')

Let's have a look at the generated code. As we can see, the random forest classifier performed the best on the given dataset out of all the other models that TPOT currently evaluates on. If we ran TPOT for more generations, then the score should improve further.



In [ ]:

    
# %load tpot_titanic_pipeline.py
import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indices, testing_indices = train_test_split(tpot_data.index, stratify = tpot_data['class'].values, train_size=0.75, test_size=0.25)

result1 = tpot_data.copy()

# Perform classification with a logistic regression classifier
lrc1 = LogisticRegression(C=10.0, dual=False, penalty="l1")
lrc1.fit(result1.loc[training_indices].drop('class', axis=1).values, result1.loc[training_indices, 'class'].values)
result1['lrc1-classification'] = lrc1.predict(result1.drop('class', axis=1).values)

Make predictions on the submission data



In [25]:

    
# Read in the submission dataset
titanic_sub = pd.read_csv('data/titanic_test.csv')
titanic_sub.describe()









    



/Users/randal_olson/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3823: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[25]:






  
    
      
      PassengerId
      Pclass
      Age
      SibSp
      Parch
      Fare
    
  
  
    
      count
      418.000000
      418.000000
      332.000000
      418.000000
      418.000000
      417.000000
    
    
      mean
      1100.500000
      2.265550
      30.272590
      0.447368
      0.392344
      35.627188
    
    
      std
      120.810458
      0.841838
      14.181209
      0.896760
      0.981429
      55.907576
    
    
      min
      892.000000
      1.000000
      0.170000
      0.000000
      0.000000
      0.000000
    
    
      25%
      996.250000
      1.000000
      NaN
      0.000000
      0.000000
      NaN
    
    
      50%
      1100.500000
      3.000000
      NaN
      0.000000
      0.000000
      NaN
    
    
      75%
      1204.750000
      3.000000
      NaN
      1.000000
      0.000000
      NaN
    
    
      max
      1309.000000
      3.000000
      76.000000
      8.000000
      9.000000
      512.329200

The most important step here is to check for new levels in the categorical variables of the submission dataset that are absent in the training set. We identify them and set them to our placeholder value of '-999', i.e., we treat them as missing values. This ensures training consistency, as otherwise the model does not know what to do with the new levels in the submission dataset.



In [26]:

    
for var in ['Cabin']: #,'Name','Ticket']:
    new = list(set(titanic_sub[var]) - set(titanic[var]))
    titanic_sub.ix[titanic_sub[var].isin(new), var] = -999

We then carry out the data munging steps as done earlier for the training dataset.



In [27]:

    
titanic_sub['Sex'] = titanic_sub['Sex'].map({'male':0,'female':1})
titanic_sub['Embarked'] = titanic_sub['Embarked'].map({'S':0,'C':1,'Q':2})



In [28]:

    
titanic_sub = titanic_sub.fillna(-999)
pd.isnull(titanic_sub).any()









    Out[28]:





PassengerId    False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

While calling MultiLabelBinarizer for the submission data set, we first fit on the training set again to learn the levels and then transform the submission dataset values. This further ensures that only those levels that were present in the training dataset are transformed. If new levels are still found in the submission dataset then it will return an error and we need to go back and check our earlier step of replacing new levels with the placeholder value.



In [29]:

    
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
SubCabinTrans = mlb.fit([{str(val)} for val in titanic['Cabin'].values]).transform([{str(val)} for val in titanic_sub['Cabin'].values])
titanic_sub = titanic_sub.drop(['Name','Ticket','Cabin'], axis=1)



In [30]:

    
# Form the new submission data set
titanic_sub_new = np.hstack((titanic_sub.values,SubCabinTrans))



In [31]:

    
np.any(np.isnan(titanic_sub_new))









    Out[31]:





False



In [32]:

    
# Ensure equal number of features in both the final training and submission dataset
assert (titanic_new.shape[1] == titanic_sub_new.shape[1]), "Not Equal"



In [33]:

    
# Generate the predictions
submission = tpot.predict(titanic_sub_new)



In [34]:

    
# Create the submission file
final = pd.DataFrame({'PassengerId': titanic_sub['PassengerId'], 'Survived': submission})
final.to_csv('data/submission.csv', index = False)



In [35]:

    
final.shape









    Out[35]:





(418, 2)

There we go! We have successfully generated the predictions for the 418 data points in the submission dataset, and we're good to go ahead to submit these predictions on Kaggle.



In [ ]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

	Survived	0.0	1.0
Pclass	Sex
1	female	0.031915	0.968085
1	male	0.631148	0.368852
2	female	0.078947	0.921053
2	male	0.842593	0.157407
3	female	0.500000	0.500000
3	male	0.864553	0.135447

	PassengerId	Pclass	Age	SibSp	Parch	Fare
count	418.000000	418.000000	332.000000	418.000000	418.000000	417.000000
mean	1100.500000	2.265550	30.272590	0.447368	0.392344	35.627188
std	120.810458	0.841838	14.181209	0.896760	0.981429	55.907576
min	892.000000	1.000000	0.170000	0.000000	0.000000	0.000000
25%	996.250000	1.000000	NaN	0.000000	0.000000	NaN
50%	1100.500000	3.000000	NaN	0.000000	0.000000	NaN
75%	1204.750000	3.000000	NaN	1.000000	0.000000	NaN
max	1309.000000	3.000000	76.000000	8.000000	9.000000	512.329200