Titanic: this time with cross-validation

And also a Neural Net


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats as stats
matplotlib.style.use('ggplot')

nameToCategory

Here, I create a function that builds a "Title" feature by pulling the honorific (Mr, Miss, etc.) out of each passenger's name.

Titles that only a few passengers carry get bucketed into a catch-all Other group.
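
A minimal sketch of what this looks like, assuming the honorific sits between the comma and the period in Name (e.g. "Braund, Mr. Owen Harris") and that anything outside the four common titles falls into Other: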


In [3]:
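import re

def nameToCategory(name):
    # Sketch: the title names below are assumptions matching the Title_*
    # dummy columns that show up later, not code from the original cell.
    # The honorific sits between the comma and the period:
    # "Braund, Mr. Owen Harris" -> "Mr"
    match = re.search(r',\s*([^.]+)\.', name)
    title = match.group(1).strip() if match else ''
    if title in ('Mr', 'Mrs', 'Miss', 'Master'):
        return title
    # Rare titles (Dr, Rev, Col, ...) all land in the catch-all
    return 'Other'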

Scrub

This function reads in a CSV and cleans up the data: it reduces cabins to their deck letter and fills in missing fares, ages, and embarkation ports.


In [4]:
import math

def scrub(filePath):
    data = pd.read_csv(filePath)
    # Reduce Cabin to its first character (the deck letter); missing cabins
    # become 'n', the first character of str(nan)
    char_cabin = data['Cabin'].astype(str)
    new_cabin = np.array([cabin[0] for cabin in char_cabin])
    data['Cabin'] = pd.Categorical(new_cabin)

    # Fill the missing fares with the mean fare
    data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
    
    # Median age per passenger class, used to fill missing ages below
    c1Median = data.Age[data.Pclass == 1].median()
    c2Median = data.Age[data.Pclass == 2].median()
    c3Median = data.Age[data.Pclass == 3].median()

    def medianFor(row):
        if (row['Pclass'] == 1):
            return c1Median
        elif (row['Pclass'] == 2):
            return c2Median
        elif (row['Pclass'] == 3):
            return c3Median
        else:
            raise Exception('Goofed')
    
    def updateAge(row):
        if (math.isnan(row['Age'])):
            median = medianFor(row)
            row['Age'] = median
        return row
    
    # Update the missing ages with the median
    data = data.apply(updateAge, axis=1)
    
    # Fill the missing Embarked values with 'S', the most common port
    new_embarked = np.where(data['Embarked'].isnull()
                           , 'S'
                           , data['Embarked'])
    
    data['Embarked'] = new_embarked
    
    data['Title'] = data['Name'].apply(nameToCategory)
    
    
    return data
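
As an aside, the per-class median fill above can be collapsed into one vectorized line with groupby/transform. A sketch of an equivalent, not what the cell above uses:


In [ ]:
# Equivalent per-class median fill (df stands in for the frame inside scrub)
df = pd.read_csv('train.csv')
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda s: s.fillna(s.median()))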

Here, I make a few functions that munge the data into the shape the model expects: label-encode the categorical columns, then stack them into a feature frame.


In [5]:
from sklearn import linear_model
from sklearn import preprocessing

# One shared encoder; each fit_transform call below refits it for its column
label_encoder = preprocessing.LabelEncoder()

def trainFeaturesFor(df):
    encoded_sex = label_encoder.fit_transform(df["Sex"])
    encoded_class = label_encoder.fit_transform(df["Pclass"])
    encoded_cabin = label_encoder.fit_transform(df["Cabin"])
    encoded_title = label_encoder.fit_transform(df["Title"])
    encoded_parch = label_encoder.fit_transform(df["Parch"])

    train_features = pd.DataFrame([ encoded_class
                                  , encoded_cabin
                                  , encoded_sex
                                  , encoded_title
                                  , encoded_parch
                                  , df["Age"]
                                  ]).T
    return train_features

def trainModel(df):
    train_features = trainFeaturesFor(df)
    log_model = linear_model.LogisticRegression()
    log_model.fit( X = train_features
                 , y = df["Survived"])
    return log_model

Splitting the data into train and test


In [6]:
from sklearn.model_selection import train_test_split

completeDf = scrub('train.csv')
train, test = train_test_split( completeDf
                               , test_size=0.2
                               , random_state=1)

[len(train), len(test)]


Out[6]:
[712, 179]

Training and scoring the model


In [7]:
log_model = trainModel(train)

preds = log_model.predict(X=trainFeaturesFor(train))

log_train_score = log_model.score( X=trainFeaturesFor(train)
                                  , y=train['Survived'])

print(log_train_score)
pd.crosstab(preds,train["Survived"])


0.800561797753
Out[7]:
Survived    0    1
row_0
0         384   83
1          59  186

Scoring the model on the test data


In [8]:
test_preds = log_model.predict(X=trainFeaturesFor(test))

log_score = log_model.score( X=trainFeaturesFor(test)
                            , y=test['Survived'])
print(log_score)
pd.crosstab(test_preds, test["Survived"])


0.787709497207
Out[8]:
Survived   0   1
row_0
0         90  22
1         16  51
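
Since the title promises cross-validation, here is what a 5-fold cross-validated score of the logistic model would look like, as a sanity check on the single holdout number above. A minimal sketch, assuming the full scrubbed frame completeDf from earlier:


In [ ]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score( linear_model.LogisticRegression()
                           , trainFeaturesFor(completeDf)
                           , completeDf['Survived']
                           , cv=5)
print(cv_scores.mean())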

Neural Net


In [9]:
def nn_scrub(df):
    # Drop the columns the net won't use, then one-hot encode what's left
    df = df.drop(['Name', 'Ticket', 'Fare', 'PassengerId'], axis=1)
    df = pd.get_dummies(df)
    return df

In [10]:
nn_df = scrub('train.csv')
nn_df = nn_scrub(nn_df)

nn_df.tail()


Out[10]:
Survived Pclass Age SibSp Parch Sex_female Sex_male Cabin_A Cabin_B Cabin_C ... Cabin_T Cabin_n Embarked_C Embarked_Q Embarked_S Title_Master Title_Miss Title_Mr Title_Mrs Title_Other
886 0 2 27.0 0 0 0 1 0 0 0 ... 0 1 0 0 1 0 0 0 0 1
887 1 1 19.0 0 0 1 0 0 1 0 ... 0 0 0 0 1 0 1 0 0 0
888 0 3 24.0 1 2 1 0 0 0 0 ... 0 1 0 0 1 0 1 0 0 0
889 1 1 26.0 0 0 0 1 0 0 1 ... 0 0 1 0 0 0 0 1 0 0
890 0 3 32.0 0 0 0 1 0 0 0 ... 0 1 0 1 0 0 0 1 0 0

5 rows × 24 columns

Splitting into train and test


In [11]:
from sklearn.model_selection import train_test_split

nn_train, nn_test = train_test_split( nn_df
                                     , test_size=0.2
                                     , random_state=3)

[len(nn_train), len(nn_test)]


Out[11]:
[712, 179]

In [12]:
from sklearn.neural_network import MLPClassifier

X = nn_train.drop('Survived', axis=1)
y = nn_train.Survived

clf = MLPClassifier( solver='lbfgs'
                    , activation='logistic'
                    , alpha=1e-3
                    , hidden_layer_sizes=(25, 25)
                    , random_state=1)
clf.fit(X, y)


Out[12]:
MLPClassifier(activation='logistic', alpha=0.001, batch_size='auto',
       beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(25, 25), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [13]:
sizes = [5, 10, 15, 20, 25]

bestScore = 0
def classifierFor(alpha, hiddenLayer, X, y):
    tempClf = MLPClassifier( solver='lbfgs'
                            , activation='logistic'
                            , alpha=alpha
                            , hidden_layer_sizes=hiddenLayer
                            , random_state=1)
    tempClf.fit(X, y)
    return tempClf


for alpha in [3e-5, 1e-4, 3e-4, 1e-3]:
    maxForAlpha = 0
    for j in sizes:
        for k in sizes:
            for l in sizes:
                hiddenLayer = (j, k, l)
                runs = 5   # average each layout over 5 fresh random splits
                total = 0
                for i in range(runs):
                    nn_train, nn_test = train_test_split(nn_df, test_size=0.2)
                    X = nn_train.drop('Survived', axis=1)
                    y = nn_train.Survived
                    tempClf = classifierFor(alpha, hiddenLayer, X, y)
                    nn_score = tempClf.score( X=nn_test.drop('Survived', axis=1), y=nn_test.Survived)
                    total = total + nn_score
                averageScore = total / runs
                if (averageScore > bestScore):
                    bestScore = averageScore
                    print('averageScore=' + str(averageScore) + ' alpha=' + str(alpha) + ' hiddenLayers=' + str(hiddenLayer))
                    clf = tempClf
                if (nn_score > maxForAlpha):
                    maxForAlpha = nn_score
                    maxForAlphaHidden = hiddenLayer
    print('Max for alpha of ' + str(alpha) + ':::  nn_score= ' + str(maxForAlpha) + ' hiddenLayers=' + str(maxForAlphaHidden))

# Picking clf this way slightly overfits: the winner is chosen with the
# same random splits it is scored on. Is there a way to find a good
# hidden-layer makeup that's better than me just randomly guessing?
# (A cross-validated grid-search sketch follows the output below.)


averageScore=0.821229050279 alpha=3e-05 hiddenLayers=(5, 5, 5)
averageScore=0.840223463687 alpha=3e-05 hiddenLayers=(5, 5, 10)
averageScore=0.845810055866 alpha=3e-05 hiddenLayers=(5, 10, 20)
averageScore=0.846927374302 alpha=3e-05 hiddenLayers=(5, 10, 25)
averageScore=0.853631284916 alpha=3e-05 hiddenLayers=(10, 25, 15)
Max for alpha of 3e-05:::  nn_score= 0.882681564246 hiddenLayers=(5, 10, 25)
averageScore=0.856983240223 alpha=0.0001 hiddenLayers=(10, 20, 20)
Max for alpha of 0.0001:::  nn_score= 0.882681564246 hiddenLayers=(15, 5, 15)
Max for alpha of 0.0003:::  nn_score= 0.882681564246 hiddenLayers=(5, 20, 5)
Max for alpha of 0.001:::  nn_score= 0.888268156425 hiddenLayers=(5, 10, 15)
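
One answer to the question in the comment above: scikit-learn's GridSearchCV runs the same sweep with proper k-fold cross-validation, so every candidate is scored on held-out folds it never trained on. A minimal sketch, assuming the same alphas and layer sizes as the loop above:


In [ ]:
from sklearn.model_selection import GridSearchCV

param_grid = { 'alpha': [3e-5, 1e-4, 3e-4, 1e-3]
             , 'hidden_layer_sizes': [(j, k, l) for j in sizes
                                                for k in sizes
                                                for l in sizes]}

search = GridSearchCV( MLPClassifier( solver='lbfgs'
                                    , activation='logistic'
                                    , random_state=1)
                     , param_grid
                     , cv=5)
search.fit(nn_df.drop('Survived', axis=1), nn_df.Survived)
print(search.best_params_, search.best_score_)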

In [14]:
nn_preds = clf.predict(nn_test.drop('Survived', axis=1))

nn_train_score = clf.score( X=nn_train.drop('Survived', axis=1)
                          , y=nn_train['Survived'])

nn_score = clf.score( X=nn_test.drop('Survived', axis=1)
                     , y=nn_test["Survived"])

print(nn_score)

pd.crosstab(nn_preds, nn_test["Survived"])


0.877094972067
Out[14]:
Survived    0   1
row_0
0         111  16
1           6  46

In [15]:
print([log_train_score, log_score])
print([nn_train_score, nn_score])


[0.800561797752809, 0.78770949720670391]
[0.8455056179775281, 0.87709497206703912]

Setting up the data for the submission.


In [16]:
nn_titanic_test = scrub('test.csv')
nn_titanic_test = nn_scrub(nn_titanic_test)

## Adding a dummy column for the 'T' cabin, a level that never appears in the
## test data, at position 13 so it lands before Cabin_n like it does in the
## training columns. TODO: ask Jeremy if there is a way to do this better?
nn_titanic_test.insert(13, 'Cabin_T', 0)

submit_preds = clf.predict( X=nn_titanic_test)

submission = pd.DataFrame({ "PassengerId": scrub('test.csv')["PassengerId"]
                          , "Survived":submit_preds})

submission.to_csv( "submission.csv"
                 , index=False)

TODO

  1. Set up a voting classifier that combines the logistic and nn models (a hand-rolled soft-voting sketch is at the bottom).
  2. Is there a way to do pd.get_dummies where you can pass in all of the valid values? (See the sketch below.)
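
For item 2, declaring the column as Categorical with the full set of levels before calling pd.get_dummies makes it emit a column for every level, even ones absent from the file at hand. A minimal sketch, assuming the nine deck letters scrub can produce (including the 'n' for missing cabins):


In [ ]:
# Pin the category levels so get_dummies always emits the same columns,
# which would make the Cabin_T insert above unnecessary
cabin_levels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T', 'n']
df = scrub('test.csv')
df['Cabin'] = pd.Categorical(df['Cabin'], categories=cabin_levels)
df = nn_scrub(df)   # now always has Cabin_A through Cabin_n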

In [17]:
nnproba_s = clf.predict_proba(nn_test.drop('Survived', axis=1))
log_model.predict_proba(trainFeaturesFor(test))
print('I think I need to do something with these')


I think I need to do something with these
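
One thing to do with these: average them. Soft voting takes the mean of each model's P(survived) and cuts at 0.5, which is what sklearn's VotingClassifier(voting='soft') does internally. A minimal hand-rolled sketch, assuming both models score the same rows (here the full scrubbed training frame, each through its own feature pipeline):


In [ ]:
# Hand-rolled soft vote: mean of the two models' P(survived), cut at 0.5
common = scrub('train.csv')
log_p = log_model.predict_proba(trainFeaturesFor(common))[:, 1]
nn_p = clf.predict_proba(nn_scrub(common).drop('Survived', axis=1))[:, 1]
vote_preds = ((log_p + nn_p) / 2 > 0.5).astype(int)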

In [ ]: