Titanic: ML From Disaster

Kaggle competition exploration notebook

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Imports


In [201]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import re

plt.rcParams['figure.figsize'] = (10.0, 8.0)
sns.set_context(rc={"figure.figsize": (10.0, 8.0)})

Read in train and test data

VARIABLE DESCRIPTIONS:

survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES)

1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)

If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.

Some children travelled only with a nanny, so parch=0 for them. Others travelled with very close friends or neighbors from their village; however, the definitions do not support such relations.
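The age conventions in the notes above (fractional ages for infants, xx.5 for estimated ages) can be turned into simple flags. The following is a minimal sketch on hypothetical sample values, not part of the original notebook. It assumes an age of exactly 0.5 should be treated as an infant age rather than an estimate, which the notes do not settle.

```python
import pandas as pd

# Hypothetical sample ages; None marks a missing value.
ages = pd.Series([22.0, 0.75, 28.5, 40.5, None, 0.5], name="age")

# Infants: age recorded as a fraction below one year.
is_infant = ages < 1

# Estimated ages have the form xx.5. Assumption: ages below one are
# excluded, because a fractional infant age can also equal 0.5.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

flags = pd.DataFrame({"age": ages, "infant": is_infant, "estimated": is_estimated})
print(flags)
```

Comparisons against a missing age evaluate to False, so missing values are flagged as neither infant nor estimated.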


In [246]:
train = pd.read_csv('../../data/train.csv', index_col=0)
test = pd.read_csv('../../data/test.csv', index_col=0)

def clean(df):

    column_names = dict(Survived='survived', Pclass='class', Name='name', Sex='sex', Age='age', SibSp='sip_sp', Parch='par_ch',
                        Ticket='ticket', Fare='fare', Cabin='cabin', Embarked='embarked')

    df = df.rename(columns=column_names)

    df["deck"] = df.cabin.str[0].map(lambda s: np.nan if s == "T" else s)
    df['embarked'] = df['embarked'].map({"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"})
    df['family'] = df['sip_sp'] + df['par_ch']
    df['alone'] = df['family'].apply(lambda d: 0 if d else 1)

    def fill_age(r):
        if pd.isnull(r['age']):
            m = df[(df['sex'] == r['sex']) & 
                   (df['class'] == r['class'])]['age'].dropna().median()
        else:
            m = r['age']
        return m
    
    def fill_embarked(r):
        if pd.isnull(r['embarked']):
            m = df[(df['sex'] == r['sex']) & (df['class'] == r['class'])]['embarked'].dropna().mode()[0]
        else:
            m = r['embarked']
        return m

    df['age'] = df.apply(fill_age, axis=1)
    df['embarked'] = df.apply(fill_embarked, axis=1)
    
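    # Extract the honorific (Mr, Mrs, Master, ...) after "Surname,"; [A-z] also matched punctuation between 'Z' and 'a'
    df['title'] = df['name'].apply(lambda name: re.search(r'\w*,[A-Za-z ]* (\w*)', name).group(1))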
    
    df = df.drop(['ticket', 'cabin', 'name'], axis=1)

    return df

train = clean(train)
test = clean(test)
train.info()

train.head(10)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 12 columns):
survived    891 non-null int64
class       891 non-null int64
sex         891 non-null object
age         891 non-null float64
sip_sp      891 non-null int64
par_ch      891 non-null int64
fare        891 non-null float64
embarked    891 non-null object
deck        203 non-null object
family      891 non-null int64
alone       891 non-null int64
title       891 non-null object
dtypes: float64(2), int64(6), object(4)
Out[246]:
             survived  class     sex  age  sip_sp  par_ch     fare     embarked  deck  family  alone   title
PassengerId
1                   0      3    male   22       1       0   7.2500  Southampton   NaN       1      0      Mr
2                   1      1  female   38       1       0  71.2833    Cherbourg     C       1      0     Mrs
3                   1      3  female   26       0       0   7.9250  Southampton   NaN       0      1    Miss
4                   1      1  female   35       1       0  53.1000  Southampton     C       1      0     Mrs
5                   0      3    male   35       0       0   8.0500  Southampton   NaN       0      1      Mr
6                   0      3    male   25       0       0   8.4583   Queenstown   NaN       0      1      Mr
7                   0      1    male   54       0       0  51.8625  Southampton     E       0      1      Mr
8                   0      3    male    2       3       1  21.0750  Southampton   NaN       4      0  Master
9                   1      3  female   27       0       2  11.1333  Southampton   NaN       2      0     Mrs
10                  1      2  female   14       1       0  30.0708    Cherbourg   NaN       1      0     Mrs
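The fill_age and fill_embarked helpers in clean() recompute a group statistic for every row. A vectorized alternative is sketched below using groupby and transform. This is a possible rewrite under stated assumptions, not the notebook's own method: the miniature frame is hypothetical, and the sketch assumes every (sex, class) group has at least one known embarked value, since mode() is empty otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame using the renamed columns from clean().
df = pd.DataFrame({
    "sex":      ["male", "male", "male", "female", "female", "female"],
    "class":    [3, 3, 3, 1, 1, 1],
    "age":      [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
    "embarked": ["Southampton", "Southampton", None,
                 "Cherbourg", "Cherbourg", None],
})

groups = df.groupby(["sex", "class"])

# Median age per (sex, class) group, broadcast back to every row.
group_median_age = groups["age"].transform("median")

# Most frequent port per group; mode() may return ties, so take the first.
group_mode_port = groups["embarked"].transform(lambda s: s.mode().iloc[0])

# Fill only the missing entries; known values are left untouched.
df["age"] = df["age"].fillna(group_median_age)
df["embarked"] = df["embarked"].fillna(group_mode_port)

print(df)
```

Each group statistic is computed once per group rather than once per row, which matters little for 891 passengers but scales better on larger data.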

Numbers


In [127]:
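# Passenger counts per class, split by sex (catplot with kind='count' replaces the older factorplot)
sns.catplot(x='class', hue='sex', data=train, kind='count');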



In [169]:
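# Age distribution by sex: normalized histogram with a KDE overlay
# (height, density, and fill replace the older size, normed, and shade arguments)
fg = sns.FacetGrid(train, height=5, aspect=3, hue='sex', palette=dict(male='blue', female='pink'))
fg.map(plt.hist, 'age', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'age', fill=True)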
fg.add_legend()



In [162]:
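# Fare distribution by class: normalized histogram with a KDE overlay
# (height, density, and fill replace the older size, normed, and shade arguments)
fg = sns.FacetGrid(train, height=5, aspect=3, hue='class')
fg.map(plt.hist, 'fare', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'fare', fill=True)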
fg.add_legend()
fg.set(xlim=(0,500))


Out[162]:
<seaborn.axisgrid.FacetGrid at 0x1124bfc18>

In [209]:
train.groupby('title')['survived'].count()


Out[209]:
title
Capt          1
Col           2
Countess      1
Don           1
Dr            7
Jonkheer      1
Lady          1
Major         2
Master       40
Miss        182
Mlle          2
Mme           1
Mr          517
Mrs         125
Ms            1
Rev           6
Sir           1
Name: survived, dtype: int64
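The titles counted above are parsed out of the raw name column inside clean(). As an illustration, the same extraction can be written with the vectorized str.extract accessor. This is a sketch with a slightly different regular expression from the one in clean(), run on a few sample names in the "Surname, Title. Given names" layout; it is assumed, not verified, that the two patterns agree on every name in the data.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the word immediately before the first period after the comma,
# skipping any leading words such as "the".
pattern = r",\s*(?:[A-Za-z]+\s+)*?([A-Za-z]+)\."
titles = names.str.extract(pattern, expand=False)

print(titles.tolist())
```

Names that do not match the pattern yield NaN instead of raising an error, which makes unexpected formats easier to spot.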

Survival


In [129]:
# Count passengers by sex and outcome; crosstab does not need the 'name' column, which clean() drops
gen = pd.crosstab(train['sex'], train['survived'])
gen.columns = ['died', 'survived']
gen['Total'] = gen.sum(axis=1)

gen.plot(kind='barh')
gen


Out[129]:
        died  survived  Total
sex
female    81       233    314
male     468       109    577
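The counts in the table above are easier to compare as proportions. A minimal sketch, with the counts copied by hand from that table rather than recomputed from train:

```python
import pandas as pd

# Counts copied from the table above.
gen = pd.DataFrame(
    {"died": [81, 468], "survived": [233, 109]},
    index=pd.Index(["female", "male"], name="sex"),
)

# Divide each row by its total so that every row sums to 1.
rates = gen.div(gen.sum(axis=1), axis=0)

print(rates.round(3))
```

On these counts, about 74% of the women and about 19% of the men in the training set survived.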

In [54]:
fig, (ax1, ax2) = plt.subplots(ncols=2)

# Count passengers by class and outcome; crosstab does not need the 'name' column, which clean() drops
cla = pd.crosstab(train['class'], train['survived'])
cla.columns=['died', 'survived']
cla['Total'] = cla.sum(axis=1)

cla.plot(ax=ax1, kind='barh', legend='reverse', title='Numbers')

cla['died'] /= cla['Total']
cla['survived'] /= cla['Total']

cla[['died', 'survived']].plot(ax=ax2, kind='barh', legend='reverse', title='Proportion')

ax2.set_xlim(0,1)


Out[54]:
(0, 1)

In [55]:
# Count passengers by port and outcome; crosstab does not need the 'name' column, which clean() drops
emb = pd.crosstab(train['embarked'], train['survived'])
emb.columns = ['died', 'survived']
emb['total'] = emb.sum(axis=1)
                
emb.plot(kind='barh')
emb


Out[55]:
             died  survived  total
embarked
Cherbourg      75        93    168
Queenstown     47        30     77
Southampton   427       217    644

In [153]:
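# Age distribution by sex for each (survived, class) combination
# (height, density, and fill replace the older size, normed, and shade arguments)
fg = sns.FacetGrid(train, height=5, aspect=1, row='survived', col='class', hue='sex')
fg.map(plt.hist, 'age', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'age', fill=True)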
fg.add_legend()
fg.set(xlim=(0,80));



In [251]:
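# Mean survival rate per class (catplot with kind='point' replaces the older factorplot)
sns.catplot(x='class', y='survived', data=train, kind='point', estimator=np.mean)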


Out[251]:
<seaborn.axisgrid.FacetGrid at 0x10f6b6978>