Note! Work in Progress - This notebook is not yet finished
A Python implementation of an exploration of the Titanic dataset that closely follows the excellent Exploring Survival on the Titanic notebook by Megan L. Risdal, found at https://www.kaggle.com/mrisdal/titanic/exploring-survival-on-the-titanic/notebook. Data preprocessing largely follows her approach, though the predictive modeling attempts to explore more models than just the random forest she used.
As an aside, this also serves as an interesting look at how some of the tasks performed in her notebook might be done in Python and, in a way, shows both languages' relative strengths and weaknesses.
In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
In [40]:
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')
train.head(3)
test.head(3)
Out[40]:
In [4]:
# print(train.shape)
# print(test.shape)
print('Number of features: {}'.format(test.shape[1]))
print('Training samples: {}'.format(train.shape[0]))
print('Test samples: {}'.format(test.shape[0]))
print('Total number of samples: {}'.format(train.shape[0]+test.shape[0]))
The data contains the following features:

- Survived: survival (0 = No, 1 = Yes)
- Pclass: ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: passenger name
- Sex: sex
- Age: age in years
- SibSp: number of siblings / spouses aboard
- Parch: number of parents / children aboard
- Ticket: ticket number
- Fare: passenger fare
- Cabin: cabin number
- Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
It's time to explore the dataset to get a general idea of what it's like.
In [5]:
# First, combine datasets
total = pd.concat([train, test])
# View summary statistics
total.describe()
Out[5]:
Most numerical data appear to be fairly complete, with the exception of fare (which only has one missing value) and age (which has 263 missing values). We can deal with the missing values later.
Let's also visualize the data with histograms to get a sense of its general distribution.
In [6]:
# Generate histograms
sns.set_color_codes('muted')
total.hist(color='g')
plt.tight_layout()
plt.show()
Note that since PassengerId was read in as the index, it does not appear among the histograms; it is an arbitrary identifier with no predictive value anyway. The rest of the data is quite interesting, with most passengers being fairly young (around 20 to 30 years of age) and most traveling without much family.
Pclass serves as a proxy for the passengers' socioeconomic status. Interestingly, second class has the fewest passengers, though not many fewer than first class.
Looking at the data, the ticket number does not appear to be very informative, so we drop that column.
In [9]:
totalwithoutnas = total.dropna()
scattermatrix = sns.pairplot(totalwithoutnas)
plt.show()
In [11]:
total.drop('Ticket', axis=1, inplace=True)
A number of the variables in the data present opportunities to generate further meaningful features. One feature that appears to carry a lot of information is the passengers' names. As in Megan's notebook, we can extract titles (indicative of both gender and marital status) and families (given by shared surnames, under the assumption that unrelated people sharing a surname is rare enough to be negligible).
In [12]:
# Split names into surname ("Braund, Mr. Owen Harris" -> "Braund") and title ("Mr")
Surnames = pd.DataFrame(total['Name'].str.split(",").tolist(), columns=['Surname', 'Rest'])
Titles = pd.DataFrame(Surnames['Rest'].str.split(".").tolist(), columns=['Title', 'Rest1', 'Rest2'])
Surnames.drop('Rest', axis=1, inplace=True)
Titles = pd.DataFrame(Titles['Title'])
# str.strip() is not in-place, so assign the results back
Surnames['Surname'] = Surnames['Surname'].str.strip()
Titles['Title'] = Titles['Title'].str.strip()
# Reindex to match total's PassengerId index (1 through 1309) before assigning
total['Surname'] = Surnames.set_index(total.index)['Surname']
total['Title'] = Titles.set_index(total.index)['Title']
total.head()
Out[12]:
Let's tabulate our titles against sex to see the frequency of the various titles.
In [13]:
pd.crosstab(total['Sex'], total['Title'])
Out[13]:
We see that with the exception of Master, Mr, Miss, and Mrs, the other titles are relatively rare. We can group rare titles together to simplify our analysis. Also note that Mlle and Ms are synonymous with Miss, and Mme is synonymous with Mrs.
In [14]:
raretitles = ['Dona', 'Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer']
# .ix is deprecated; use .loc with a boolean mask instead
total.loc[total['Title'].str.contains('Mlle|Ms|Miss'), 'Title'] = 'Miss'
total.loc[total['Title'].str.contains('Mme|Mrs'), 'Title'] = 'Mrs'
total.loc[total['Title'].str.contains('|'.join(raretitles)), 'Title'] = 'Rare Title'
pd.crosstab(total['Sex'], total['Title'])
Out[14]:
In [17]:
total['Surname'].nunique()
Out[17]:
We have 875 unique surnames.
Family size may have an impact on survival. To this end, we create a family size attribute and plot the relationship.
In [25]:
total['FamilySize'] = total['SibSp'] + total['Parch'] + 1
total['Family'] = total['Surname'] + "_" + total['FamilySize'].apply(str)
total.head(1)
Out[25]:
In [45]:
# Plot family size
famsizebarplot = sns.countplot(x='FamilySize', hue='Survived', data=total.loc[:891])
famsizebarplot.set_xlabel('Family Size')
plt.show()
The chart above shows an interesting phenomenon: passengers traveling alone and families of more than 4 people had a noticeably lower chance of survival than those in small (2 to 4 person) families.
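To put numbers on this, here is a quick check (not in the original notebook) of survival rates by family size; the test rows have NaN for Survived and are excluded from the mean automatically:
In [ ]:
# Mean of the 0/1 Survived column per family size = survival rate
total.groupby('FamilySize')['Survived'].mean()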
In [49]:
# Categorize family size
total['FamSizeCat'] = 'small'
total.loc[(total['FamilySize'] == 1), 'FamSizeCat'] = 'singleton'
total.loc[(total['FamilySize'] > 4), 'FamSizeCat'] = 'large'
# Create mosaic plot
# To be done in the future using statsmodels
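In the meantime, here is a minimal sketch of how that mosaic plot might be drawn, assuming statsmodels is installed (its mosaic function lives in statsmodels.graphics.mosaicplot):
In [ ]:
from statsmodels.graphics.mosaicplot import mosaic
# Sketch: family size category against survival, training rows only
mosaic(total.loc[:891], ['FamSizeCat', 'Survived'])
plt.show()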
We first check which columns have missing values.
In [52]:
total.isnull().sum()
Out[52]:
It appears that Age, Cabin, Embarked, and Fare have missing values. Let's first work on Embarked and Fare, given that there are few enough NaNs for us to manually work out what the values should have been. As for Cabin, with more than 75% of the 1309 samples missing, we can probably just drop the column. It might have been useful, since location on the ship could plausibly influence a passenger's chance of survival, but the data is too sparse on this particular attribute.
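As a quick sanity check on that 75% figure:
In [ ]:
# Fraction of Cabin values that are missing across all samples
total['Cabin'].isnull().mean()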
In [59]:
total[(total['Embarked'].isnull()) | (total['Fare'].isnull())]
Out[59]:
Interestingly, Miss Icard and Mrs. Stone shared the same cabin, both survived, paid the same fare, and traveled in the same class. Mr. Storey, whose fare is missing, was in third class and embarked from Southampton.
Visualizing the fares by embarkation location may shed some light on where the two first class ladies embarked.
In [64]:
sns.boxplot(x='Embarked',y='Fare',data=train.dropna(),hue='Pclass')
plt.tight_layout()
plt.show()
In [87]:
trainwithoutnas = train.dropna()
print("Mean fares for passengers traveling in first class:")
print(trainwithoutnas[trainwithoutnas['Pclass']==1].groupby('Embarked')['Fare'].mean())
print("\nMedian fares for passengers traveling in first class:")
print(trainwithoutnas[trainwithoutnas['Pclass']==1].groupby('Embarked')['Fare'].median())
The $80 fare paid by both ladies is close to the mean fare paid by first class passengers embarking from Southampton, but it also aligns nicely with the median fare of those embarking from Cherbourg. Perhaps a swarm plot will better show how the passengers are distributed.
In [92]:
sns.swarmplot(x='Embarked',y='Fare',data=train.dropna(),hue='Pclass')
plt.show()
This is a tough call. Looking at the spread of the points, however, it seems that those who embarked from Southampton generally paid lower fares. The Cherbourg mean appears to be pulled up by the extreme outliers who paid more than $500 for their tickets, with the majority of first class passengers indeed paying around $80. As such, we classify the two ladies as having embarked from Cherbourg (C).
In [98]:
total.loc[[62, 830], 'Embarked'] = "C"
total.loc[[62, 830], 'Embarked']
Out[98]:
The swarm plot also shows that third class passengers embarking from Southampton paid roughly the same fare. It is therefore reasonable to use the mean fare of third class passengers from Southampton as Mr. Storey's fare value.
In [112]:
total.loc[1044,'Fare'] = total[(total['Embarked']=="S") & (total['Pclass']==3)]['Fare'].mean()
total.loc[1044, ['Name','Fare']]
Out[112]:
We can perform MICE imputation, as in Megan's notebook, via the fancyimpute package.
In [191]:
AgeHistogram = total['Age'].hist(bins=20, edgecolor="black")
AgeHistogram.set_xlabel("Age")
AgeHistogram.set_ylabel("Count")
AgeHistogram.set_title("Age (Prior to Missing Value Imputation)")
plt.show()
In [132]:
import fancyimpute
total.isnull().sum()
Out[132]:
In [202]:
totalforMICE = total.drop(['Survived','Cabin','FamSizeCat','Family','Name','Surname'], axis=1)
# totalforMICE.fillna(np.nan)
totalforMICE['Sex'] = pd.get_dummies(totalforMICE['Sex'])['male']
dummycodedTitles = pd.get_dummies(totalforMICE['Title']).drop('Rare Title', axis=1)
totalforMICE = pd.merge(totalforMICE, dummycodedTitles, left_index=True, right_index=True, how='outer')
totalforMICE = totalforMICE.drop(['Title'],axis=1)
dummycodedEmbarked = pd.get_dummies(totalforMICE['Embarked'])[['C','Q']]
totalforMICE = totalforMICE.join(dummycodedEmbarked).drop(['Embarked'],axis=1)
dummycodedPclass = pd.get_dummies(totalforMICE['Pclass']).drop(3, axis=1)
totalforMICE = totalforMICE.join(dummycodedPclass).drop('Pclass',axis=1)
MICEdtotal = fancyimpute.MICE().complete(totalforMICE.values.astype(float))
In [203]:
MICEdtotal = pd.DataFrame(MICEdtotal, columns=totalforMICE.columns)
MICEdtotal.isnull().sum()
Out[203]:
We see that the MICE'd data has no more missing Age values. Plotting these values in the histogram:
In [204]:
MICEAgeHistogram = MICEdtotal['Age'].hist(bins=20, edgecolor="black")
MICEAgeHistogram.set_xlabel("Age")
MICEAgeHistogram.set_ylabel("Count")
MICEAgeHistogram.set_title("Age (After Missing Value Imputation)")
plt.show()
In [236]:
AgeHists, AgeHistAxes = plt.subplots(nrows=1,ncols=2, figsize=(10,5), sharey=True)
AgeHistAxes[0].hist(total['Age'].dropna(), bins=20, edgecolor='black', density=True)
AgeHistAxes[0].set_xlabel("Age")
AgeHistAxes[0].set_ylabel("Density")
AgeHistAxes[0].set_title("Age Density (Original Data)")
AgeHistAxes[1].hist(MICEdtotal['Age'], bins=20, edgecolor='black', density=True)
AgeHistAxes[1].set_xlabel("Age")
AgeHistAxes[1].set_ylabel("Density")
AgeHistAxes[1].set_title("Age Density (After MICE)")
AgeHists.tight_layout()
AgeHists
Out[236]:
Most imputed ages fall in the 20 to 30 year-old range, which makes sense given the distribution of ages in the data we started with. Note that the fancyimpute version of MICE uses Bayesian Ridge Regression. The density is not perfectly preserved, but it is close enough to proceed with the analysis.
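As an aside (and not what was run above), a comparable imputation could be sketched with scikit-learn's IterativeImputer, available from scikit-learn 0.21 onward behind an experimental import, which also uses BayesianRidge as its default estimator:
In [ ]:
# Alternative sketch, assuming scikit-learn >= 0.21 is installed
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

sklearnMICEd = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(totalforMICE.values.astype(float)),
    columns=totalforMICE.columns, index=totalforMICE.index)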
We use the new Age column with the imputed values for our analysis.
In [237]:
newtotal = total.copy()
# MICEdtotal has a default integer index, so assign by position to align with the PassengerId index
newtotal['Age'] = MICEdtotal['Age'].values
We can create some additional categorical columns based on our complete age feature -- whether the person is a child (under 18) and whether a person is a mother (female, over 18, with children, and without the title "Miss").
In [327]:
AgeandSexHist = sns.FacetGrid(newtotal.iloc[0:891,:], col = 'Sex', hue='Survived', size=5)
# AgeandSexHist.map(sns.distplot, 'Age', kde=False, hist_kws={'edgecolor':'black','stacked':True})
AgeandSexHist.map(plt.hist, 'Age', alpha=0.5, bins=20)
AgeandSexHist.add_legend()
# plt.close('all')
plt.show()
In [342]:
AgeandSexHist, AgeandSexHistAxes = plt.subplots(nrows=1,ncols=2, figsize=(10,5), sharey=True)
AgeandSexHistAxes[0].hist([newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'male') & (newtotal['Survived'] == 1)],
                           newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'male') & (newtotal['Survived'] == 0)]],
                          stacked=True, edgecolor='black', label=['Survived', 'Did Not Survive'], bins=24)
AgeandSexHistAxes[1].hist([newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'female') & (newtotal['Survived'] == 1)],
                           newtotal.loc[0:891, 'Age'].loc[(newtotal['Sex'] == 'female') & (newtotal['Survived'] == 0)]],
                          stacked=True, edgecolor='black', bins=24)
AgeandSexHistAxes[0].set_title('Survival By Age for Males')
AgeandSexHistAxes[1].set_title('Survival By Age for Females')
for i in range(2):
    AgeandSexHistAxes[i].set_xlabel('Age')
AgeandSexHistAxes[0].set_ylabel('Count')
AgeandSexHistAxes[0].legend()
plt.show()
In [343]:
# Create the 'Child' variable
newtotal['Child'] = 1
newtotal.loc[newtotal['Age']>=18, 'Child'] = 0
pd.crosstab(newtotal['Child'],newtotal['Survived'])
Out[343]:
In [345]:
# Create the 'Mother' variable
newtotal['Mother'] = 0
newtotal.loc[(newtotal['Sex']=='female') & (newtotal['Parch'] > 0) & (newtotal['Age']>18) & (newtotal['Title'] != "Miss"), 'Mother'] = 1
pd.crosstab(newtotal['Mother'], newtotal['Survived'])
Out[345]:
Let's take a look at the dataset once again.
In [346]:
newtotal.head()
Out[346]:
In [349]:
newtotal.shape
Out[349]:
We ensure that all important categorical variables are dummy coded.
In [356]:
dummycodedFamSizeCat = pd.get_dummies(newtotal['FamSizeCat']).drop('large',axis=1)
newtotal = newtotal.drop(['Title', 'Embarked', 'Pclass', 'Cabin', 'Name', 'Family', 'Surname', 'FamSizeCat'], axis=1)
newtotal['Sex'] = pd.get_dummies(newtotal['Sex'])['male']
newtotal = newtotal.join(dummycodedEmbarked)
newtotal = newtotal.join(dummycodedPclass)
newtotal = newtotal.join(dummycodedTitles)
newtotal = newtotal.join(dummycodedFamSizeCat)
newtotal.head()
Out[356]:
After we split the data back into training and test sets, our data set will be ready to use for modeling.
In [358]:
newtrain = newtotal.loc[:891,:]
newtest = newtotal.loc[892:,:]
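Although the modeling section is still to come, here is a minimal sketch (a placeholder, not a final analysis) of how the classifiers imported at the top might be compared via cross-validation:
In [ ]:
# Sketch: 5-fold cross-validated accuracy for each imported classifier
X = newtrain.drop('Survived', axis=1)
y = newtrain['Survived']
for model in [LogisticRegression(), DecisionTreeClassifier(),
              RandomForestClassifier(), GaussianNB(), SVC()]:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print('{}: {:.3f} (+/- {:.3f})'.format(type(model).__name__,
                                           scores.mean(), scores.std()))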
Note! Work in Progress - This notebook is not yet finished