Titanic: Machine Learning from Disaster

We would like to predict survival on the Titanic.

https://www.kaggle.com/c/titanic


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np

Let's explore the data.


In [84]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
train.shape


Out[4]:
(891, 12)

In [6]:
train.head()


Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

In [23]:
train.Age.hist()


Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x1150968d0>

In [29]:
train.Age.describe()


Out[29]:
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
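Note that `count` is 714 while the training set has 891 rows, so 177 ages are missing. A quick way to confirm the gap is `isnull().sum()`; a minimal sketch on a stand-in Series (the real code would use `train['Age']`):

```python
import pandas as pd
import numpy as np

# Stand-in for train['Age'], which has NaNs for unknown ages
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

missing = age.isnull().sum()
print(missing)  # 2 here; for train it would be 891 - 714 = 177
```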

In [32]:
train[train['Age'] > 60][['Sex', 'Pclass', 'Age', 'Survived']]


Out[32]:
Sex Pclass Age Survived
33 male 2 66.0 0
54 male 1 65.0 0
96 male 1 71.0 0
116 male 3 70.5 0
170 male 1 61.0 0
252 male 1 62.0 0
275 female 1 63.0 1
280 male 3 65.0 0
326 male 3 61.0 0
438 male 1 64.0 0
456 male 1 65.0 0
483 female 3 63.0 1
493 male 1 71.0 0
545 male 1 64.0 0
555 male 1 62.0 0
570 male 2 62.0 1
625 male 1 61.0 0
630 male 1 80.0 1
672 male 2 70.0 0
745 male 1 70.0 0
829 female 1 62.0 1
851 male 3 74.0 0

A pattern is emerging: almost all of the survivors over 60 are women. Let's see how many females survived overall.


In [43]:
females = train[train['Sex'] == 'female']
females_who_survived = females[females['Survived'] == 1]
females_who_survived.shape


Out[43]:
(233, 12)

In [42]:
males = train[train['Sex'] == 'male']
males_who_survived = males[males['Survived'] == 1]
males_who_survived.shape


Out[42]:
(109, 12)

Of the 342 survivors, 233 (about 68%) were female, even though females made up only 314 of the 891 passengers.
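The two counts above (and the survival rates behind them) can be computed in one pass with `groupby`; a minimal sketch on a toy frame shaped like `train`:

```python
import pandas as pd

# Toy frame with the two columns used above (stand-in for train)
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 0, 1, 0, 0],
})

# Number of survivors and survival rate per sex
summary = df.groupby('Sex')['Survived'].agg(['sum', 'mean'])
print(summary)
```

On the real training set this yields roughly 0.74 for females versus 0.19 for males.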

Random Forest


In [44]:
import pylab as pl
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score

In [85]:
test.head()


Out[85]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

In [67]:
cols = ['Age', 'Pclass']
notnull_age = train[cols][train['Age'].notnull()]
notnull_survived = train['Survived'][train['Age'].notnull()]
notnull_age.head()


Out[67]:
Age Pclass
0 22 3
1 38 1
2 26 3
3 35 1
4 35 3
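The boolean-mask filtering above can also be written with `dropna(subset=...)`, which keeps the feature and label rows aligned in one step; a sketch on a toy frame standing in for `train`:

```python
import pandas as pd
import numpy as np

# Toy frame with the columns used above (stand-in for train)
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Pclass': [3, 1, 3],
    'Survived': [0, 1, 1],
})

# Drop rows whose Age is unknown, then split features from labels
known = df.dropna(subset=['Age'])
X, y = known[['Age', 'Pclass']], known['Survived']
print(len(X))
```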

In [69]:
clf = RandomForestClassifier(n_estimators=20, max_features=2, min_samples_split=5)
clf.fit(notnull_age, notnull_survived)


Out[69]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=2, max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
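Before predicting on the test set, it is worth holding out part of the training data to estimate accuracy; `train_test_split` was imported above but never used. A minimal sketch on synthetic data (the real code would pass `notnull_age` and `notnull_survived`):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for the (Age, Pclass) features and Survived labels
X = np.column_stack([rng.uniform(0, 80, 200), rng.randint(1, 4, 200)])
y = (X[:, 1] == 1).astype(int)  # toy rule: first-class passengers survive

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=20, min_samples_split=5, random_state=0)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)  # mean accuracy on the held-out split
print(acc)
```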

In [79]:
notnull_test = test[cols][test['Age'].notnull()]

In [82]:
clf.predict(notnull_test)


Out[82]:
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 1, 0])
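Note that `notnull_test` drops every test passenger with an unknown age, so these predictions cover only part of the test set; a Kaggle submission needs a row for all 418 passengers. One common fix is to fill missing ages (for example with the median) before predicting; a sketch on a stand-in frame (the real code would use `test[cols]`):

```python
import pandas as pd

# Stand-in for the test features, with one unknown age
X_test = pd.DataFrame({'Age': [34.5, None, 62.0], 'Pclass': [3, 3, 2]})

# Impute the median age so every passenger gets a prediction
X_filled = X_test.fillna({'Age': X_test['Age'].median()})
print(X_filled['Age'].tolist())
```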

In [ ]: