This notebook explores a dataset containing information about the passengers of the Titanic. The dataset can be downloaded from Kaggle.
How to follow along:
git clone https://github.com/Dataweekends/odsc_intro_to_data_science
cd odsc_intro_to_data_science
ipython notebook
We start by importing the necessary libraries:
In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Load the CSV file into memory using Pandas
In [ ]:
df = pd.read_csv('titanic-train.csv')
What's the content of df?
In [ ]:
df.head(3)
Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)
In [ ]:
df.info()
Is Pclass a continuous or a discrete attribute?
In [ ]:
df['Pclass'].value_counts()
What about these: SibSp and Parch?
In [ ]:
df['SibSp'].value_counts()
In [ ]:
df['Parch'].value_counts()
And what about these: Ticket, Fare, Cabin, and Embarked?
In [ ]:
df[['Ticket', 'Fare', 'Cabin']].head(3)
In [ ]:
df['Embarked'].value_counts()
Ah, yes... Survival!
In [ ]:
df['Survived'].value_counts()
Check if any values are missing
In [ ]:
df.info()
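df.info() lists the non-null count for each column; a more direct way to tally the missing values per column is a one-liner. A small sketch, using only the pandas functions already imported above:
In [ ]:
# Count missing (NaN) values in each column, most incomplete first
df.isnull().sum().sort_values(ascending=False)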
From the output we can group the attributes:
1) Survived: the target we want to predict
2) PassengerId: an identifier
3) Pclass, Sex, Embarked: categorical
4) Age, SibSp, Parch, Fare: numerical
5) Name, Ticket, Cabin: text
Age is only available for 714 passengers
Cabin is only available for 204 passengers
Embarked is missing for 2 passengers
Plot the distribution of Age
In [ ]:
# Histogram of Age, with the median marked by a vertical red line
df['Age'].plot(kind='hist', figsize=(10, 6))
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
median_age = df['Age'].median()
plt.axvline(median_age, color='r')
median_age
Impute the missing values for Age using the median age
In [ ]:
# Replace missing ages with the median; assigning back to the column
# avoids pandas' chained-assignment pitfalls with inplace=True
df['Age'] = df['Age'].fillna(median_age)
df.info()
Check the influence of Age on Survival
In [ ]:
# Overlaid histograms of Age for survivors (green) and non-survivors (red)
df[df['Survived'] == 1]['Age'].plot(kind='hist', bins=10, range=(0, 100), figsize=(10, 6), alpha=0.3, color='g')
df[df['Survived'] == 0]['Age'].plot(kind='hist', bins=10, range=(0, 100), figsize=(10, 6), alpha=0.3, color='r')
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
plt.legend(['Survived', 'Dead'])
plt.show()
Check the influence of Sex on Survival
In [ ]:
# Count passengers by gender and survival outcome;
# values='PassengerId' is only used to count rows per group
survival_by_gender = df.pivot_table(index='Sex', columns='Survived', values='PassengerId', aggfunc='count')
survival_by_gender
In [ ]:
survival_by_gender.plot(kind='bar', stacked=True)
plt.show()
Check the influence of Pclass on Survival
In [ ]:
# Count passengers by ticket class and survival outcome
survival_by_Pclass = df.pivot_table(index='Pclass', columns='Survived', values='PassengerId', aggfunc='count')
survival_by_Pclass
In [ ]:
survival_by_Pclass.plot(kind='bar', stacked=True)
plt.show()
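Raw counts can mislead when the classes have different sizes; normalizing each row turns the counts into survival rates. A minimal sketch (the rate table is an addition, not part of the original notebook):
In [ ]:
# Divide each row by its total to get the survival rate per class
survival_by_Pclass.div(survival_by_Pclass.sum(axis=1), axis=0)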
OK, so Age, Sex, and Pclass seem to have some influence on the survival rate.
Let's build a simple model to test that.
Define a new feature called "Male" that is 1 if Sex = 'male' and 0 otherwise
In [ ]:
df['Male'] = df['Sex'].map({'male': 1, 'female': 0})
df[['Sex', 'Male']].head()
Define the simplest model as a benchmark
The simplest model predicts 0 for everybody, i.e. no survival.
How good is it?
In [ ]:
actual_dead = len(df[df['Survived'] == 0])
total_passengers = len(df)
ratio_of_dead = actual_dead / total_passengers
print("If I predict everybody dies, I'm correct %0.1f%% of the time" % (100 * ratio_of_dead))
df['Survived'].value_counts()
We need to do better than that
Define features (X) and target (y) variables
In [ ]:
X = df[['Male', 'Pclass', 'Age']]
y = df['Survived']
Initialize a decision tree model
In [ ]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=0)
model
Split the features and the target into Train and Test subsets, with an 80/20 ratio.
In [ ]:
# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
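As a quick sanity check (added for illustration), the shapes should reflect the 80/20 ratio:
In [ ]:
# Expect roughly 80% of the rows in train and 20% in test
X_train.shape, X_test.shape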
Train the model
In [ ]:
model.fit(X_train, y_train)
Calculate the model score
In [ ]:
my_score = model.score(X_test, y_test)
print("Classification Score: %0.2f" % my_score)
Print the confusion matrix for the decision tree model
In [ ]:
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
print("\n=======confusion matrix==========")
print(confusion_matrix(y_test, y_pred))
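The confusion matrix can be summarized into per-class precision and recall; a short sketch with scikit-learn's classification_report (added for illustration):
In [ ]:
# Precision, recall and F1 for each class (0 = dead, 1 = survived)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))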
Now you have a basic pipeline. How can you improve the score? Try:
- adding new features: could you add a feature for family? Could you use Embarked or others as dummies? Check the get_dummies function here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html (see the sketch after this list)
- changing the parameters of the model: check the documentation here: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- changing the model itself: check examples here: http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
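As a starting point for the first idea, here is a hedged sketch of a family-size feature (the name FamilySize is my choice, not from the notebook) and dummy variables for Embarked via get_dummies:
In [ ]:
# FamilySize: siblings/spouses + parents/children + the passenger themselves
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
# One-hot encode the port of embarkation (Embarked_C, Embarked_Q, Embarked_S)
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')
X = pd.concat([df[['Male', 'Pclass', 'Age', 'FamilySize']], embarked_dummies], axis=1)
X.head()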
Let's have a small competition....