Titanic: ML From Disaster

Kaggle competition exploration notebook

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

Imports



In [201]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import re

plt.rcParams['figure.figsize'] = (10.0, 8.0)
sns.set_context(rc={"figure.figsize": (10.0, 8.0)})

Read in train and test data

VARIABLE DESCRIPTIONS:

survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES)

1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)

If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws.

Some children travelled only with a nanny, so parch=0 for them. Others travelled with very close friends or neighbors from their village, but the definitions above do not cover such relations.
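
The age conventions in the notes above can be turned into a simple flag. The sketch below is only an illustration on made-up values and is not part of the original analysis. It assumes that any age of at least one year with a fractional part of exactly .5 was estimated; the name `age_estimated` is hypothetical.

```python
import pandas as pd

# Made-up ages for illustration: whole years, an infant (fraction of a year),
# and two values in the "xx.5" form that the notes describe as estimated.
ages = pd.Series([22.0, 0.75, 28.5, 40.5, 35.0], name='age')

# Assumption: an age counts as "estimated" when it is at least 1 and ends in .5.
# Ages below 1 are fractional because they are recorded as fractions of a year.
age_estimated = (ages >= 1) & ((ages % 1) == 0.5)

print(pd.DataFrame({'age': ages, 'age_estimated': age_estimated}))
```

An infant recorded as 0.5 would not be flagged under this rule, since the notes do not say whether such a value is an estimate.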



In [246]:

    
train = pd.read_csv('../../data/train.csv', index_col=0)
test = pd.read_csv('../../data/test.csv', index_col=0)

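def clean(df):
    """Rename columns, derive features, fill missing values and drop unused columns."""

    column_names = dict(Survived='survived', Pclass='class', Name='name', Sex='sex', Age='age', SibSp='sip_sp', Parch='par_ch',
                        Ticket='ticket', Fare='fare', Cabin='cabin', Embarked='embarked')

    df = df.rename(columns=column_names)

    # Deck is the first letter of the cabin; deck "T" is treated as missing
    df["deck"] = df.cabin.str[0].map(lambda s: np.nan if s == "T" else s)
    df['embarked'] = df['embarked'].map({"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"})
    # Number of relatives aboard, and a flag for passengers travelling without any
    df['family'] = df['sip_sp'] + df['par_ch']
    df['alone'] = df['family'].apply(lambda d: 0 if d else 1)

    def fill_age(r):
        # Missing age: use the median age of passengers with the same sex and class
        if pd.isnull(r['age']):
            m = df[(df['sex'] == r['sex']) & 
                   (df['class'] == r['class'])]['age'].dropna().median()
        else:
            m = r['age']
        return m
    
    def fill_embarked(r):
        # Missing port: use the most common port among passengers with the same sex and class
        if pd.isnull(r['embarked']):
            m = df[(df['sex'] == r['sex']) & (df['class'] == r['class'])]['embarked'].dropna().mode()[0]
        else:
            m = r['embarked']
        return m

    df['age'] = df.apply(fill_age, axis=1)
    df['embarked'] = df.apply(fill_embarked, axis=1)
    
    # Title is the word before the first period, e.g. "Braund, Mr. Owen Harris" -> "Mr".
    # The character class is A-Za-z: the range A-z also matches punctuation such as [ and _.
    df['title'] = df['name'].apply(lambda name: re.search(r'\w*,[A-Za-z ]* (\w*)', name).group(1))
    
    df = df.drop(['ticket', 'cabin', 'name'], axis=1)

    return df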

train = clean(train)
test = clean(test)
# info() prints its summary itself and returns None, so it is not wrapped in print()
train.info()

train.head(10)



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 12 columns):
survived    891 non-null int64
class       891 non-null int64
sex         891 non-null object
age         891 non-null float64
sip_sp      891 non-null int64
par_ch      891 non-null int64
fare        891 non-null float64
embarked    891 non-null object
deck        203 non-null object
family      891 non-null int64
alone       891 non-null int64
title       891 non-null object
dtypes: float64(2), int64(6), object(4)






    Out[246]:

             survived  class     sex  age  sip_sp  par_ch     fare     embarked  deck  family  alone   title
PassengerId
1                   0      3    male   22       1       0   7.2500  Southampton   NaN       1      0      Mr
2                   1      1  female   38       1       0  71.2833    Cherbourg     C       1      0     Mrs
3                   1      3  female   26       0       0   7.9250  Southampton   NaN       0      1    Miss
4                   1      1  female   35       1       0  53.1000  Southampton     C       1      0     Mrs
5                   0      3    male   35       0       0   8.0500  Southampton   NaN       0      1      Mr
6                   0      3    male   25       0       0   8.4583   Queenstown   NaN       0      1      Mr
7                   0      1    male   54       0       0  51.8625  Southampton     E       0      1      Mr
8                   0      3    male    2       3       1  21.0750  Southampton   NaN       4      0  Master
9                   1      3  female   27       0       2  11.1333  Southampton   NaN       2      0     Mrs
10                  1      2  female   14       1       0  30.0708    Cherbourg   NaN       1      0     Mrs
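
The row-by-row age imputation in clean() can also be written with a grouped transform. The sketch below is an alternative formulation on a made-up toy frame, not the code that produced the output above. It assumes the same rule (median age within each sex and class group) and reuses the column names from clean().

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the same column names that clean() produces.
toy = pd.DataFrame({
    'sex':   ['male', 'male', 'male', 'female', 'female', 'female'],
    'class': [3, 3, 3, 1, 1, 1],
    'age':   [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age within each (sex, class) group, broadcast back to every row.
group_median = toy.groupby(['sex', 'class'])['age'].transform('median')

# Only the missing ages are replaced; known ages are left untouched.
toy['age'] = toy['age'].fillna(group_median)

print(toy)
```

This computes each group median once instead of once per row with a missing age, which matters little at 891 rows but scales better.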

Numbers



In [127]:

    
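# factorplot was renamed to catplot in later seaborn releases; kind='count' draws passenger counts per class
sns.catplot(x='class', hue='sex', data=train, kind='count');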



In [169]:

    
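# Age distribution by sex: normalised histogram with a kernel density estimate on top.
# height, density and fill replace the older size, normed and shade arguments.
fg = sns.FacetGrid(train, height=5, aspect=3, hue='sex', palette=dict(male='blue', female='pink'))
fg.map(plt.hist, 'age', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'age', fill=True)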
fg.add_legend()



In [162]:

    
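# Fare distribution by passenger class.
# height, density and fill replace the older size, normed and shade arguments.
fg = sns.FacetGrid(train, height=5, aspect=3, hue='class')
fg.map(plt.hist, 'fare', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'fare', fill=True)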
fg.add_legend()
fg.set(xlim=(0,500))









    Out[162]:





<seaborn.axisgrid.FacetGrid at 0x1124bfc18>



In [209]:

    
train.groupby('title')['survived'].count()









    Out[209]:





title
Capt          1
Col           2
Countess      1
Don           1
Dr            7
Jonkheer      1
Lady          1
Major         2
Master       40
Miss        182
Mlle          2
Mme           1
Mr          517
Mrs         125
Ms            1
Rev           6
Sir           1
Name: survived, dtype: int64
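
The titles counted above come from the regular expression in clean(). The sketch below shows how that pattern behaves on a few sample names written in the format of the Name column. It is an illustration, and it spells the character class as `A-Za-z`; the helper name `extract_title` is hypothetical.

```python
import re

# Skip the surname and comma, then capture the last word in the run of letters
# and spaces that follows. That word is the title, which ends at the period.
TITLE_PATTERN = re.compile(r'\w*,[A-Za-z ]* (\w*)')

def extract_title(name):
    """Return the title found in a 'Surname, Title. Given names' string."""
    return TITLE_PATTERN.search(name).group(1)

# Sample names in the format used by the Name column.
names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
]

for name in names:
    print(extract_title(name))
```

The greedy `[A-Za-z ]*` is what lets multi-word prefixes such as "the Countess" resolve to the final word before the period.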

Survival



In [129]:

    
# Count passengers by sex and outcome; 'title' is used for counting because clean() drops 'name'
gen = pd.pivot_table(train, index='sex', values='title', columns='survived', aggfunc='count')
gen.columns = ['died', 'survived']
gen['Total'] = gen.sum(axis=1)

gen.plot(kind='barh')
gen



In [54]:

    
fig, (ax1, ax2) = plt.subplots(ncols=2)

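# Count passengers by class and outcome; 'title' is used for counting because clean() drops 'name'
cla = pd.pivot_table(train, index='class', values='title', columns='survived', aggfunc='count')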
cla.columns=['died', 'survived']
cla['Total'] = cla.sum(axis=1)

cla.plot(ax=ax1, kind='barh', legend='reverse', title='Numbers')

cla['died'] /= cla['Total']
cla['survived'] /= cla['Total']

cla[['died', 'survived']].plot(ax=ax2, kind='barh', legend='reverse', title='Percentage')

ax2.set_xlim(0,1)









    Out[54]:





(0, 1)
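
The count-then-divide steps in the cell above can also be done in one call with a row-normalised cross-tabulation. The sketch below runs on a made-up toy frame that reuses the `class` and `survived` column names, so the numbers are illustrative and are not results from the Titanic data.

```python
import pandas as pd

# Toy frame (made-up values) using the column names from clean().
toy = pd.DataFrame({
    'class':    [1, 1, 1, 1, 3, 3, 3, 3],
    'survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# Row-normalised cross-tabulation: each row sums to 1.
rates = pd.crosstab(toy['class'], toy['survived'], normalize='index')
rates.columns = ['died', 'survived']

print(rates)
```

The values are proportions between 0 and 1, which matches the x-axis limits set on the second plot.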



In [55]:

    
# Count passengers by port and outcome; 'title' is used for counting because clean() drops 'name'
emb = pd.pivot_table(train, values='title', index='embarked', columns='survived', aggfunc='count')
emb.columns = ['died', 'survived']
emb['total'] = emb.sum(axis=1)
                
emb.plot(kind='barh')
emb









    Out[55]:

             died  survived  total
embarked
Cherbourg      75        93    168
Queenstown     47        30     77
Southampton   427       217    644
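
The counts above are easier to compare as rates. The sketch below rebuilds the table from the printed numbers and adds a survival rate per port; the column name `survival_rate` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the embarked table above.
emb = pd.DataFrame(
    {'died': [75, 47, 427], 'survived': [93, 30, 217]},
    index=pd.Index(['Cherbourg', 'Queenstown', 'Southampton'], name='embarked'),
)
emb['total'] = emb['died'] + emb['survived']

# Share of passengers from each port who survived.
emb['survival_rate'] = emb['survived'] / emb['total']

print(emb.round(3))
```

On these counts, Cherbourg is the only port where more than half of the passengers survived.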



In [153]:

    
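# Age distribution by sex, one panel per outcome (rows) and class (columns).
# height, density and fill replace the older size, normed and shade arguments.
fg = sns.FacetGrid(train, height=5, aspect=1, row='survived', col='class', hue='sex')
fg.map(plt.hist, 'age', bins=20, density=True, alpha=0.5)
fg.map(sns.kdeplot, 'age', fill=True)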
fg.add_legend()
fg.set(xlim=(0,80));



In [251]:

    
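# Mean survival per class as a point plot (factorplot was renamed to catplot; its default kind was 'point')
sns.catplot(x='class', y='survived', data=train, kind='point', estimator=np.mean)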









    Out[251]:





<seaborn.axisgrid.FacetGrid at 0x10f6b6978>
