In 1912, the ship RMS Titanic struck an iceberg on its maiden voyage and sank, resulting in the deaths of most of its passengers and crew. In this project, we will explore a subset of the RMS Titanic passenger manifest to determine which features best predict whether someone survived or did not survive.
In [3]:
import numpy as np
import pandas as pd
import pylab as pl
# RMS Titanic data visualization code
from titanic_visualizations import survival_stats
from IPython.display import display
%matplotlib inline
# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)
# Print the first few entries of the RMS Titanic data
display(full_data.head())
From a sample of the RMS Titanic data, we can see the various features present for each passenger on the ship:
Since we're interested in the outcome of survival for each passenger or crew member, we can remove the Survived feature from this dataset and store it as its own separate variable, outcomes.
In [4]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
data = full_data.drop('Survived', axis = 1)
# Show the new dataset with 'Survived' removed
display(data.head())
The very same sample of the RMS Titanic data now shows the Survived feature removed from the DataFrame. Note that data (the passenger data) and outcomes (the outcomes of survival) are now paired. That means for any passenger data.loc[i], they have the survival outcome outcomes[i].
To measure the performance of our predictions, we need a metric to score our predictions against the true outcomes of survival. Since we are interested in how accurate our predictions are, we will calculate the proportion of passengers where our prediction of their survival is correct.
In [5]:
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """

    # Ensure that the number of predictions matches the number of outcomes
    if len(truth) == len(pred):

        # Calculate and return the accuracy as a percent
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)

    else:
        return "Number of predictions does not match number of outcomes!"

# Test the 'accuracy_score' function
predictions = pd.Series(np.ones(5, dtype=int))
print(accuracy_score(outcomes[:5], predictions))
If we were asked to make a prediction about any passenger aboard the RMS Titanic whom we knew nothing about, then the best prediction we could make would be that they did not survive. This is because we can assume that a majority of the passengers (more than 50%) did not survive the ship sinking.
The predictions_0 function below will always predict that a passenger did not survive.
In [6]:
def predictions_0(data):
    """ Model with no features. Always predicts a passenger did not survive. """

    predictions = []
    for _, passenger in data.iterrows():

        # Predict the survival of 'passenger'
        predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions = predictions_0(data)
In [7]:
print(accuracy_score(outcomes, predictions))
Using the RMS Titanic data, predicting that none of the passengers survived would be 61.62% accurate.
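This baseline accuracy is simply the proportion of non-survivors in the data. A minimal sketch of that idea, using a small made-up outcomes Series rather than the real manifest:

```python
import pandas as pd

# A small synthetic set of outcomes (1 = survived, 0 = did not survive)
outcomes_example = pd.Series([0, 0, 0, 1, 1])

# The always-predict-0 baseline is correct exactly when the true outcome is 0,
# so its accuracy equals the proportion of non-survivors
baseline_accuracy = (outcomes_example == 0).mean() * 100
print("Baseline accuracy: {:.2f}%".format(baseline_accuracy))  # 60.00%
```

On the real outcomes Series, the same computation yields the 61.62% figure quoted above.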
Let's take a look at whether the feature Sex has any indication of survival rates among passengers using the survival_stats function, which is defined in titanic_visualizations.py. The first two parameters passed to the function are the RMS Titanic data and passenger survival outcomes, respectively. The third parameter indicates which feature we want to plot survival statistics across.
In [8]:
survival_stats(data, outcomes, 'Sex')
Examining the survival statistics, a large majority of males did not survive the ship sinking. However, a majority of females did survive the ship sinking. Let's build on our previous prediction: If a passenger was female, then we will predict that they survived. Otherwise, we will predict the passenger did not survive.
In [9]:
def predictions_1(data):
    """ Model with one feature:
            - Predict a passenger survived if they are female. """

    predictions = []
    for _, passenger in data.iterrows():

        # Predict survival based on the passenger's sex
        if passenger['Sex'] == 'female':
            predictions.append(1)
        else:
            predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions = predictions_1(data)
In [10]:
print(accuracy_score(outcomes, predictions))
Therefore, predicting that all female passengers survived and the remaining passengers did not would be 78.68% accurate.
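As an aside, the same one-feature rule can be expressed without an explicit loop; a sketch using a vectorized pandas comparison, demonstrated here on a tiny made-up frame rather than the real manifest:

```python
import pandas as pd

def predictions_1_vectorized(data):
    """ Same one-feature model, written as a vectorized comparison. """
    # Comparing the whole column at once yields a boolean Series;
    # casting to int maps female -> 1, everyone else -> 0
    return (data['Sex'] == 'female').astype(int)

# Tiny illustrative frame (not the real passenger data)
sample = pd.DataFrame({'Sex': ['female', 'male', 'female']})
print(predictions_1_vectorized(sample).tolist())  # [1, 0, 1]
```

The loop version above is easier to extend with more conditions, which is why the project keeps it.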
Using just the Sex feature for each passenger, we are able to increase the accuracy of our predictions by a significant margin. Now, let's consider using an additional feature to see if we can further improve our predictions. For example, consider all of the male passengers aboard the RMS Titanic: Can we find a subset of those passengers that had a higher rate of survival? Let's start by looking at the Age of each male, by again using the survival_stats function. This time, we'll use a fourth parameter to filter the data so that only passengers with the Sex 'male' will be included.
In [11]:
survival_stats(data, outcomes, 'Age', ["Sex == 'male'"])
Examining the survival statistics, the majority of males younger than 10 survived the ship sinking, whereas most males age 10 or older did not survive the ship sinking. Let's continue to build on our previous prediction: If a passenger was female, then we will predict they survived. If a passenger was male and younger than 10, then we will also predict they survived. Otherwise, we will predict they did not survive.
In [12]:
def predictions_2(data):
    """ Model with two features:
            - Predict a passenger survived if they are female.
            - Predict a passenger survived if they are male and younger than 10. """

    predictions = []
    for _, passenger in data.iterrows():

        # Females survive; males younger than 10 also survive
        if passenger['Sex'] == 'female':
            predictions.append(1)
        elif passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions = predictions_2(data)
Prediction: all female passengers and all male passengers younger than 10 survived
In [13]:
print(accuracy_score(outcomes, predictions))
Thus, the accuracy of the above prediction increases to 79.35%.
Adding the feature Age as a condition in conjunction with Sex improves the accuracy by a small margin over using the feature Sex alone.
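One subtlety worth noting: some passengers have no recorded Age (NaN). Any comparison with NaN evaluates to False, so the `passenger['Age'] < 10` condition quietly sends males with a missing age into the "did not survive" branch. A quick check of that behavior:

```python
import numpy as np

# Comparisons against NaN are always False, so a male passenger with a
# missing Age falls through to the final else branch of the model
missing_age = np.nan
print(missing_age < 10)   # False
print(missing_age >= 10)  # False
```

This is why the later model handles missing ages explicitly (via the 'Master' title) rather than relying on the age comparison.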
In [14]:
survival_stats(data, outcomes, 'Sex')
survival_stats(data, outcomes, 'Pclass')
survival_stats(data, outcomes, 'Pclass',["Sex == 'female'"])
survival_stats(data, outcomes, 'SibSp', ["Sex == 'female'", "Pclass == 3"])
survival_stats(data, outcomes, 'Age', ["Sex == 'male'", "Age < 18"])
survival_stats(data, outcomes, 'Pclass', ["Sex == 'male'", "Age < 15"])
survival_stats(data, outcomes, 'Age',["Sex == 'female'"])
survival_stats(data, outcomes, 'Age', ["Sex == 'male'", "Pclass == 1"] )
survival_stats(data, outcomes, 'Sex', ["Age < 10", "Pclass == 1"] )
survival_stats(data, outcomes, 'SibSp', ["Sex == 'male'"])
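The same exploration can also be done numerically rather than visually; a sketch using pandas groupby on a small made-up frame (the column names match the manifest, but the numbers here are invented for illustration):

```python
import pandas as pd

# Invented mini-manifest for illustration only
df = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male'],
    'Pclass':   [1, 3, 1, 3, 3],
    'Survived': [1, 0, 1, 0, 0],
})

# Survival rate per feature value: the numeric analogue of survival_stats
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby(['Sex', 'Pclass'])['Survived'].mean())
```

Grouping on the real data and outcomes this way gives the same rates the bar charts display, which can make threshold choices (e.g. Pclass < 3) easier to justify.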
In [15]:
def predictions_3(data):
    """ Model with multiple features. Makes a prediction with an accuracy of at least 80%. """

    predictions = []
    for _, passenger in data.iterrows():

        # Young boys ('Master' title) with a missing age are predicted to survive
        if 'Master' in passenger['Name'] and np.isnan(passenger['Age']):
            predictions.append(1)
            continue

        # First-class males between 20 and 41 are predicted to survive
        if (passenger['Sex'] == 'male' and passenger['Age'] > 20
                and passenger['Age'] < 41 and passenger['Pclass'] == 1):
            predictions.append(1)
            continue

        # Females survive unless they are in third class with siblings/spouse aboard
        if passenger['Sex'] == 'female':
            if passenger['Pclass'] < 3:
                predictions.append(1)
            elif passenger['SibSp'] < 1:
                predictions.append(1)
            else:
                predictions.append(0)
        # Remaining males younger than 10 survive
        elif passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)

    # Return our predictions
    return pd.Series(predictions)
# Make the predictions
predictions = predictions_3(data)
In [16]:
print(accuracy_score(outcomes, predictions))
With the above features, this model obtains a prediction accuracy of 81.03%.
Description: We used the features in the following order:

1) Sex: Fig. 1 shows that the survival rate for females is higher than that of males.

2) Pclass: Fig. 2 shows that the survival rate is higher than 50% in Class 1 and less than 50% in Class 2. We then looked at the survival rate of women in different classes. Fig. 3 shows that almost all the women in Class 1 and Class 2 survived. Therefore, we used the filter:

    if passenger['Sex'] == 'female':
        if passenger['Pclass'] < 3:
            predictions.append(1)

In Class 3, women have roughly a 50% chance of survival. To see which of those women survived, we looked at the SibSp feature. Fig. 5 shows that single women have a better chance of survival, so we added, in conjunction with the filter above:

        elif passenger['SibSp'] < 1:
            predictions.append(1)

3) Age: The next feature we looked at is the age of males. The following figure shows that males younger than 10 have a higher chance of survival, so we added:

        elif passenger['Age'] < 10:
            predictions.append(1)
        else:
            predictions.append(0)

Fig. 8 shows that almost all the males between the ages of 20 and 41 in Class 1 survived, which motivates:

    if (passenger['Sex'] == 'male' and passenger['Age'] > 20
            and passenger['Age'] < 41 and passenger['Pclass'] == 1):
        predictions.append(1)
        continue

At each step, we are trying to decrease the entropy of each partition. Since we are asked to improve the accuracy to at least 80%, we stopped there.
There are ways to improve the accuracy further, for example: by looking at the age of the single women in Filter 2 above, together with their Parch information, one could refine that filter to:

        elif passenger['SibSp'] < 1 and passenger['Age'] < 15:
            predictions.append(1)
After several iterations of exploring and conditioning on the data, I have built a useful algorithm for predicting the survival of each passenger aboard the RMS Titanic. The technique applied in this project is a manual implementation of a simple machine learning model, the decision tree. A decision tree splits a set of data into smaller and smaller groups (called nodes), by one feature at a time. Each time a subset of the data is split, our predictions become more accurate if each of the resulting subgroups are more homogeneous (contain similar labels) than before.
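The "more homogeneous" idea can be made concrete with the entropy measure mentioned above; a minimal sketch using only the standard library, on invented labels:

```python
from math import log2

def entropy(labels):
    """ Shannon entropy of a sequence of 0/1 labels; 0 means perfectly pure. """
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # proportion of 1s
    if p in (0.0, 1.0):
        return 0.0       # a pure node has zero entropy
    return -p * log2(p) - (1 - p) * log2(1 - p)

# A 50/50 node is maximally impure; a good split produces purer children,
# which is what each hand-written rule above is trying to achieve
parent = [0, 0, 1, 1]
left, right = [1, 1], [0, 0]
print(entropy(parent))                 # 1.0
print(entropy(left), entropy(right))   # 0.0 0.0
```

A decision-tree learner automates exactly this: at each node it picks the feature split that most reduces the weighted entropy of the children.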
A decision tree is just one of many models that come from supervised learning. In supervised learning, we attempt to use features of the data to predict or model things with objective outcome labels. That is to say, each of our data points has a known outcome value, such as a categorical, discrete label like 'Survived', or a numerical, continuous value like the price of a house.
Another use of supervised learning is building spam-filtering models, where the outcome we are predicting is a binary label: spam or not spam. Misspellings and the sender's email address can be used as two important features for making predictions.
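As a toy illustration of those two features, a hand-written rule along the lines of the Titanic models above might look as follows (the trigger words, domain, and threshold here are all invented for the sketch):

```python
def looks_like_spam(sender, body, suspicious_words=('v1agra', 'fr3e')):
    """ Toy rule-based spam check using the two features from the text:
        the sender's address and misspelled trigger words. """
    # Both the domain and the word list below are made up for illustration
    from_bad_domain = sender.endswith('@spam-domain.example')
    misspellings = sum(word in body.lower() for word in suspicious_words)
    return from_bad_domain or misspellings >= 1

print(looks_like_spam('friend@mail.example', 'lunch tomorrow?'))   # False
print(looks_like_spam('promo@spam-domain.example', 'fr3e offer'))  # True
```

A real spam filter would learn such rules (and their weights) from labeled examples, just as a decision tree learns its splits from the Titanic manifest.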