In [43]:
import numpy as np
import pandas as pd

titanic=pd.read_csv('./train.csv')

First, take a look at the data types and non-null entry counts.


In [44]:
print(titanic.info())

print(titanic.describe().T)

print(titanic.head(5))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
             count        mean         std   min       25%       50%    75%  \
PassengerId  891.0  446.000000  257.353842  1.00  223.5000  446.0000  668.5   
Survived     891.0    0.383838    0.486592  0.00    0.0000    0.0000    1.0   
Pclass       891.0    2.308642    0.836071  1.00    2.0000    3.0000    3.0   
Age          714.0   29.699118   14.526497  0.42       NaN       NaN    NaN   
SibSp        891.0    0.523008    1.102743  0.00    0.0000    0.0000    1.0   
Parch        891.0    0.381594    0.806057  0.00    0.0000    0.0000    0.0   
Fare         891.0   32.204208   49.693429  0.00    7.9104   14.4542   31.0   

                  max  
PassengerId  891.0000  
Survived       1.0000  
Pclass         3.0000  
Age           80.0000  
SibSp          8.0000  
Parch          6.0000  
Fare         512.3292  
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

Some observations from looking at the above data include:

  • Age is missing for a relatively small share of passengers (714 of 891 present). We will need to impute these values somehow.
  • Cabin is missing for most passengers. The first character also likely encodes the deck location.
  • Embarked contains two missing values. We should account for this, likely through a simple missing-value dummy/factor.
  • There is quite a bit of potentially useful information in the Name field. For example, while there is a combined sibling/spouse count (SibSp), it may be helpful to separate siblings from spouses: the "Miss"/"Mrs." title distinguishes married women, and female spouses are registered under their husbands' names, so we can look for a matching name.
  • Variables to turn into factors include Pclass, Sex, and (potentially, given likely discontinuous breaks) Age.

Let's start by converting the Sex values into a boolean; it feels cleaner to start there, especially since we will be working with this variable when cleaning other variables.


In [45]:
titanic['Sex'] = titanic['Sex'].replace(['male', 'female'], [True, False])

Next, let's fill in the missing Age values. We have a pretty full dataset (about 7/8ths), so we can probably get away, in a first pass, with doing some simple stratification and assuming ages are missing-at-random within those strata. For now, let's stratify on gender and class.
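As a minimal illustration of the groupby/transform pattern used in the next cell (on toy data, not the Titanic frame): within each stratum, `transform` returns a value per row, so missing entries can be filled with their stratum mean in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sex': ['m', 'm', 'f', 'f'],
    'Pclass': [1, 1, 1, 1],
    'Age': [20.0, np.nan, 30.0, 40.0],
})

# Fill each missing Age with the mean Age of its (Sex, Pclass) stratum
df['Age'] = df.groupby(['Sex', 'Pclass'])['Age'].transform(
    lambda s: s.fillna(s.mean()))
print(df['Age'].tolist())  # [20.0, 20.0, 30.0, 40.0]
```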


In [46]:
#print(titanic.Age.mean())

titanic['Age'] = titanic.groupby(['Sex', 'Pclass'])['Age'].transform(lambda x: x.fillna(x.mean()))
titanic['Fare'] = titanic.groupby(['Pclass'])['Fare'].transform(lambda x: x.fillna(x.mean()))
#print(titanic.info())

Next, convert Pclass into something we can work with, and also create dummies for deck location and port of embarkation (where known).


In [47]:
titanic_class = pd.get_dummies(titanic.Pclass, prefix='Pclass', dummy_na=False)
titanic = titanic.join(titanic_class)
titanic = titanic.join(pd.get_dummies(titanic.Embarked, prefix='Emb', dummy_na=True))

titanic['Floor'] = titanic['Cabin'].str.extract(r'^([A-Z])', expand=False)
# 'T' only appears once, so let's just scrub that to NaN
titanic['Floor'] = titanic['Floor'].replace(to_replace='T', value=np.nan)
# Group adjacent decks so the dummies don't get too sparse
titanic['Floor'] = titanic['Floor'].replace(to_replace=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                            value=['AB', 'AB', 'CD', 'CD', 'EFG', 'EFG', 'EFG'])

titanic = titanic.join(pd.get_dummies(titanic.Floor, prefix='Fl', dummy_na=True))

In [48]:
titanic['Age_cut'] = pd.cut(titanic['Age'], [0, 14.9, 54.9, 99], labels=['C', 'A', 'S'])
titanic = titanic.join(pd.get_dummies(titanic.Age_cut, prefix='Age_ct', dummy_na=False))

Finally, before going forward I'd really like to be able to separate spouses from siblings in that variable. One clue: married women have their husbands' names outside the parentheses within their own Name entry. We can create a new variable containing just the words outside the parentheses and check whether it matches any other name in the dataset.

Additionally, from just scanning the names in the dataset, it appears that titles always come between the comma after the last name and a period. This is a perfect opportunity to use regular expressions to extract that title and turn it into a feature we can consider in analysis.
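As a quick sanity check of that extraction pattern (the second name below is made up to illustrate the shape, not taken from the dataset): the greedy `.*` runs to the *last* period, so names containing a middle initial inside the parentheses capture too much, which is exactly what the `Mrs\. .*` cleanup in a later cell corrects for.

```python
import re

# Titles sit between the comma after the surname and a period
pattern = re.compile(r', (.*)\.')

print(pattern.search("Braund, Mr. Owen Harris").group(1))
# -> 'Mr'
print(pattern.search("Doe, Mrs. John (Jane Q. Public)").group(1))
# -> 'Mrs. John (Jane Q'  (greedy match runs to the last period)
```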


In [49]:
import re

In [50]:
titanic['Title'] = titanic['Name'].str.extract(r', (.*)\.', expand=False)

So there is some cleaning to be done here. We'll do three things: (1) turn French ladies' titles into English ones; (2) aggregate military titles; (3) bin all remaining titles with a count below five into a remainder category.


In [51]:
titanic['Title'] = titanic['Title'].replace(to_replace=r'Mrs\. .*', value='Mrs', regex=True)
titanic.loc[titanic.Title.isin(['Col', 'Major', 'Capt']), 'Title'] = 'Mil'
titanic.loc[titanic.Title == 'Mlle', 'Title'] = 'Miss'
titanic.loc[titanic.Title == 'Mme', 'Title'] = 'Mrs'

print(titanic.Title.value_counts())


Mr              517
Miss            184
Mrs             126
Master           40
Dr                7
Rev               6
Mil               5
Jonkheer          1
Ms                1
Lady              1
Don               1
the Countess      1
Sir               1
Name: Title, dtype: int64

In [52]:
titanic['Title_ct'] = titanic.groupby(['Title'])['Title'].transform('count')
titanic.loc[titanic.Title_ct < 5, 'Title'] = 'Other'

titanic.Title.value_counts()

titanic = titanic.join(pd.get_dummies(titanic.Title, prefix='Ti', dummy_na=False))

print(titanic.info())


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 38 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null bool
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
Pclass_1       891 non-null float64
Pclass_2       891 non-null float64
Pclass_3       891 non-null float64
Emb_C          891 non-null float64
Emb_Q          891 non-null float64
Emb_S          891 non-null float64
Emb_nan        891 non-null float64
Floor          203 non-null object
Fl_AB          891 non-null float64
Fl_CD          891 non-null float64
Fl_EFG         891 non-null float64
Fl_nan         891 non-null float64
Age_cut        891 non-null object
Age_ct_C       891 non-null float64
Age_ct_A       891 non-null float64
Age_ct_S       891 non-null float64
Title          891 non-null object
Title_ct       891 non-null int64
Ti_Dr          891 non-null float64
Ti_Master      891 non-null float64
Ti_Mil         891 non-null float64
Ti_Miss        891 non-null float64
Ti_Mr          891 non-null float64
Ti_Mrs         891 non-null float64
Ti_Other       891 non-null float64
Ti_Rev         891 non-null float64
dtypes: bool(1), float64(24), int64(6), object(7)
memory usage: 265.4+ KB
None

In [53]:
titanic['NameTest'] = titanic.Name
# Strip the parenthetical (wife's own name) and the title through the period
titanic['NameTest'] = titanic['NameTest'].replace(to_replace=r' \(.*\)', value='', regex=True)
titanic['NameTest'] = titanic['NameTest'].replace(to_replace=r', M.*\.', value=', ', regex=True)

In [54]:
name_list = titanic[['PassengerId', 'NameTest']].copy()
name_list['Sp_ct'] = name_list.groupby('NameTest')['NameTest'].transform('count') - 1
titanic = pd.merge(titanic, name_list[['PassengerId', 'Sp_ct']], on='PassengerId', how='left')

In [55]:
titanic.to_csv('./titanic_clean_data.csv')

Typically, the next step would be a univariate analysis (and, for learners that do not naturally pick up feature interactions, potentially a bivariate one) to see which features best predict the outcome. Some learners (random forests, boosted trees) perform feature selection naturally; for others (SVM, logit, naive Bayes), we rely on a univariate analysis pipelined into the algorithm to make these decisions.
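The Final_setup_*.py scripts are not reproduced in this notebook, so as a rough sketch of what "univariate analysis pipelined into the algorithm" might look like (synthetic data and parameter choices are my own assumptions, not the scripts' actual settings):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the cleaned feature matrix
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    # Univariate selection: keep the 10 features with the best F-scores,
    # refit inside each CV fold so selection doesn't leak across folds
    ('select', SelectKBest(f_classif, k=10)),
    ('logit', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Running the selector inside the `Pipeline` (rather than on the full dataset beforehand) is what keeps the cross-validation estimate honest.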


In [56]:
exec(open('./Final_setup_Random_Forest.py').read())


500
gini
sqrt
9
12
0.0
None

In [57]:
exec(open('./Final_setup_GBoost.py').read())


exponential
0.29
100
3
5
11
0.0
1.0
None

In [60]:
exec(open('./Final_setup_SVM.py').read())


39
0.5
0.826336341616

In [61]:
exec(open('./Final_setup_Logit.py').read())

In [62]:
exec(open('./Final_setup_NB.py').read())

In [63]:
exec(open('./Final_ensemble.py').read())


   rf_pred  gboost_pred  svm_pred  nb_pred  log_pred
0        1            1         1        1         1
1        1            1         1        1         1
2        0            0         0        0         0
3        0            0         1        1         0
4        0            0         0        0         0
              rf_pred  gboost_pred  svm_pred   nb_pred  log_pred
rf_pred      1.000000     0.749622  0.796041  0.660271  0.840651
gboost_pred  0.749622     1.000000  0.668922  0.513727  0.723178
svm_pred     0.796041     0.668922  1.000000  0.568527  0.703526
nb_pred      0.660271     0.513727  0.568527  1.000000  0.541967
log_pred     0.840651     0.723178  0.703526  0.541967  1.000000

In [ ]: