Predicting survival of Titanic Passengers

This notebook explores a dataset containing information about the passengers of the Titanic. The dataset can be downloaded from Kaggle.

Tutorial goals

  1. Explore the dataset
  2. Build a simple predictive model
  3. Iterate and improve your score
  4. Optional: upload your prediction to Kaggle using the test dataset

How to follow along:

git clone https://github.com/Dataweekends/odsc_intro_to_data_science

cd odsc_intro_to_data_science

jupyter notebook

We start by importing the necessary libraries:


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

1) Explore the dataset

Numerical exploration

  • Load the csv file into memory using Pandas
  • Describe each attribute
    • is it discrete?
    • is it continuous?
    • is it a number?
    • is it text?
  • Identify the target
  • Check if any values are missing

Load the csv file into memory using Pandas


In [ ]:
df = pd.read_csv('titanic-train.csv')

What's the content of df ?


In [ ]:
df.head(3)

Describe each attribute (is it discrete? is it continuous? is it a number? is it text?)


In [ ]:
df.info()

Is Pclass continuous or discrete?


In [ ]:
df['Pclass'].value_counts()

What about these: ('SibSp', 'Parch')?


In [ ]:
df['SibSp'].value_counts()

In [ ]:
df['Parch'].value_counts()

And what about these: ('Ticket', 'Fare', 'Cabin', 'Embarked')?


In [ ]:
df[['Ticket', 'Fare', 'Cabin']].head(3)

In [ ]:
df['Embarked'].value_counts()

Identify the target

What are we trying to predict?

ah, yes... Survival!


In [ ]:
df['Survived'].value_counts()

Check if any values are missing


In [ ]:
df.info()
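
The non-null counts printed by df.info() can also be turned into explicit per-column counts of missing values. Here is a minimal sketch of that idea. It runs on a tiny made-up table (the `toy` DataFrame and its values are assumptions for illustration, not the real Titanic data) so that it works without the csv file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (made-up values, not the real dataset)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
    'Embarked': ['S', 'C', 'S', np.nan],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```

On the real data the same call would be df.isnull().sum().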

Mental notes so far:

  • Dataset contains 891 entries
  • 1 Target column (Survived)
  • 11 Features:
    • 6 numerical, 5 text
    • 1 useless (PassengerId)
    • 3 categorical (Pclass, Sex, Embarked)
    • 4 numerical, >= 0 (Age, SibSp, Parch, Fare)
    • 3 not sure how to treat (Name, Ticket, Cabin)
  • Age is only available for 714 passengers
  • Cabin is only available for 204 passengers
  • Embarked is missing for 2 passengers

Visual exploration

  • plot the distribution of Age
  • impute the missing values for Age using the median Age
  • check the influence of Age, Sex and Class on Survival

Plot the distribution of Age


In [ ]:
df['Age'].plot(kind='hist', figsize=(10, 6))
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
# Mark the median age with a vertical red line
median_age = df['Age'].median()
plt.axvline(median_age, color='r')
median_age

Impute the missing values for Age using the median Age


In [ ]:
# Assign the result back: an in-place fillna on a selected column may not update df in recent pandas
df['Age'] = df['Age'].fillna(median_age)
df.info()
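
To see what median imputation does, it can help to try it on a handful of values first. The sketch below uses a made-up `toy` table (an assumption for illustration, not the real dataset). The median skips the missing entries, and every gap is then filled with that single value.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (made-up values, not the real dataset)
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 40.0, np.nan]})

# median() ignores NaN, so it is computed on 22, 26 and 40 only
toy_median = toy['Age'].median()

# Replace every missing age with the median
toy['Age'] = toy['Age'].fillna(toy_median)

print(toy_median)
print(toy['Age'].tolist())
```

Filling with the median leaves the median itself unchanged, but it does squeeze the distribution towards the centre, which is worth keeping in mind when reading the histograms.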

Check the influence of Age on Survival


In [ ]:
df[df['Survived']==1]['Age'].plot(kind='hist', bins = 10, range = (0,100), figsize=(10,6), alpha = 0.3, color = 'g')
df[df['Survived']==0]['Age'].plot(kind='hist', bins = 10, range = (0,100), figsize=(10,6), alpha = 0.3, color = 'r')
plt.title('Distribution of Age', size=20)
plt.xlabel('Age', size=20)
plt.ylabel('Number of passengers', size=20)
plt.legend(['Survived', 'Dead'])
plt.show()

Check the influence of Sex on Survival


In [ ]:
# Count the passengers for each combination of Sex and Survived
survival_by_gender = pd.crosstab(df['Sex'], df['Survived'])
survival_by_gender

In [ ]:
survival_by_gender.plot(kind = 'bar', stacked = True)
plt.show()

Check the influence of Pclass on Survival


In [ ]:
# Count the passengers for each combination of Pclass and Survived
survival_by_Pclass = pd.crosstab(df['Pclass'], df['Survived'])
survival_by_Pclass

In [ ]:
survival_by_Pclass.plot(kind = 'bar', stacked = True)
plt.show()
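
The stacked bars show counts. A complementary view is the survival rate per group. Because Survived is coded as 0/1, its mean within a group is the fraction of survivors. The following is a minimal sketch on a made-up `toy` table (values invented for illustration, not the real dataset).

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up values, not the real dataset)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Pclass': [3, 1, 3, 1, 3, 2],
    'Survived': [0, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_pclass = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_pclass)
```

On the real data the same idea would read df.groupby('Sex')['Survived'].mean().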

OK, so Age, Sex and Pclass all seem to have some influence on the survival rate.

2) Build a simple predictive model

Let's build a simple model to test that.

Define a new feature called "Male" that is 1 if Sex = 'male' and 0 otherwise


In [ ]:
df['Male'] = df['Sex'].map({'male': 1, 'female': 0})
df[['Sex', 'Male']].head()

Define the simplest model as a benchmark

The simplest model is a model that predicts 0 for everybody, i.e. no survival.

How good is it?


In [ ]:
actual_dead = len(df[df['Survived'] == 0])
total_passengers = len(df)
ratio_of_dead = actual_dead / float(total_passengers)

print("If I predict everybody dies, I'm correct %0.1f %% of the time" % (100 * ratio_of_dead))

df['Survived'].value_counts()
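
The same "always predict the majority class" benchmark can also be expressed with scikit-learn's DummyClassifier, so it can be scored exactly like any other model. This is a sketch on made-up labels (`X_toy` and `y_toy` are assumptions for illustration, not the real data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 6 of 10 passengers died (0), 4 survived (1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# This baseline ignores the features, so a column of zeros is enough
X_toy = np.zeros((len(y_toy), 1))

# 'most_frequent' always predicts the most common class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_score = baseline.score(X_toy, y_toy)
print("Baseline accuracy: %0.2f" % baseline_score)
```

Any real model should beat this number before it is worth a closer look.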

We need to do better than that

Define features (X) and target (y) variables


In [ ]:
X = df[['Male', 'Pclass', 'Age']]
y = df['Survived']

Initialize a decision tree model


In [ ]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=0)
model

Split the features and the target into Train and Test subsets.

The ratio should be 80/20.


In [ ]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                            test_size = 0.2, random_state=0)
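
To check that test_size = 0.2 really gives an 80/20 split, here is a small sketch on made-up arrays (`X_toy` and `y_toy` are assumptions for illustration). With 10 rows, 8 should end up in the train set and 2 in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 feature columns, alternating 0/1 labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)
```

Fixing random_state makes the split reproducible, which matters when comparing scores between iterations.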

Train the model


In [ ]:
model.fit(X_train, y_train)

Calculate the model score


In [ ]:
my_score = model.score(X_test, y_test)

print("Classification Score: %0.2f" % my_score)

Print the confusion matrix for the decision tree model


In [ ]:
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)

print("\n=======confusion matrix==========")
print(confusion_matrix(y_test, y_pred))
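
A confusion matrix is easier to read once its four cells are named. In scikit-learn, rows are the actual classes and columns are the predicted classes, so for a 0/1 target the layout is [[TN, FP], [FN, TP]]. The sketch below uses made-up labels and predictions (`y_true` and `y_hat` are assumptions for illustration) and derives accuracy, precision and recall from the cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual labels and predictions (not output of the model above)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_hat)
# For a binary 0/1 target, ravel() returns the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / float(cm.sum())   # fraction of correct predictions
precision = tp / float(tp + fp)          # predicted survivors who did survive
recall = tp / float(tp + fn)             # actual survivors the model found

print(cm)
print("accuracy=%0.2f precision=%0.2f recall=%0.2f" % (accuracy, precision, recall))
```

The accuracy computed this way is the same quantity that model.score returns for a classifier.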

3) Iterate and improve

Now you have a basic pipeline. How can you improve the score? Try:

Let's have a small competition...
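
One possible direction, offered only as a suggestion and not as part of the original exercise, is to compare a few tree depths using cross-validation instead of a single train/test split. The sketch below runs on synthetic data generated on the spot: the `X_toy` and `y_toy` arrays and the rule that produces them are invented for illustration, so the printed scores say nothing about the real Titanic data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (Male, Pclass, Age); all values are made up
rng = np.random.RandomState(0)
n = 200
male = rng.randint(0, 2, size=n)
pclass = rng.randint(1, 4, size=n)
age = rng.uniform(1, 80, size=n)
X_toy = np.column_stack([male, pclass, age])

# Invented rule: women and first-class passengers survive more often
p_survive = 0.2 + 0.5 * (1 - male) + 0.2 * (pclass == 1)
y_toy = (rng.uniform(size=n) < p_survive).astype(int)

# Mean 5-fold cross-validated accuracy for a few tree depths
cv_scores = {}
for depth in [1, 3, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_scores[depth] = cross_val_score(tree, X_toy, y_toy, cv=5).mean()

for depth, score in cv_scores.items():
    print("max_depth=%s: %0.2f" % (depth, score))
```

On the real data, X and y from the cells above would take the place of X_toy and y_toy. Averaging over several folds gives a less noisy score than one split of a small dataset.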

4) Optional: upload your prediction to Kaggle using the test dataset

https://www.kaggle.com/c/titanic/submissions/attach