We conduct a data analysis on the Titanic data set from the Kaggle website. This notebook collects the code used to perform the analysis and presents our findings. The goal of our analysis is to identify factors that made the passengers more likely to survive.
We address the following questions - note that passenger class is a proxy for socio-economic status:
In [1]:
## Set up the environment
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
In [2]:
## Load the data set
titanic_df = pd.read_csv('titanic_data.csv')
In [3]:
## Let's see what the data looks like
## First let's see what the columns are.
titanic_df.info()
In [4]:
## Let's look at the first few rows of data.
titanic_df.head(10)
Out[4]:
In [5]:
## Let's look at some basic statistics for the data.
titanic_df.describe()
Out[5]:
An initial inspection of the data shows that we have data for 891 passengers, not all of whose ages are known. Passengers in 2nd and 3rd class do not appear to have cabins assigned to them. The majority of the passengers (in the data set) were in 3rd class: the 50th percentile (median) of Pclass is 3, so at least half of the passengers were in 3rd class and the 1st and 2nd class passengers combined make up at most half. It would appear that the most interesting/relevant factors to look at are Survived, Sex, Pclass, and Age. The variables SibSp and Parch can help us address the third question.
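Rather than inferring the class split from a percentile, the passengers in each class can be counted directly. The sketch below is only an illustration: `toy_df` is a small made-up stand-in for `titanic_df`, and its values are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({'Pclass': [3, 1, 3, 2, 3, 3, 1, 3]})

# Number of passengers in each class, ordered by class label.
class_counts = toy_df['Pclass'].value_counts().sort_index()

# Share of passengers in each class.
class_shares = class_counts / class_counts.sum()

# The median is 3 whenever at least half of the rows are in 3rd class.
class_median = toy_df['Pclass'].median()

print(class_counts)
print(class_shares)
print(class_median)
```

Applied to the real data frame, the same two lines would show the exact size of each class.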
We look at correlations between the numerical features to see if there are any especially strong correlations. See the Seaborn documentation here for details about the correlation matrix.
In [6]:
# Please note that I copied and pasted the code below from the above link
# Compute the correlation matrix
corr = titanic_df.corr()
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=None, cmap=cmap, vmax=.3,
            square=True, xticklabels=True, yticklabels=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
Out[6]:
There are no surprisingly strong correlations. There is a negative correlation between Survived and Pclass.
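The correlation matrix only makes sense for numerical columns, and recent pandas versions raise an error if text columns such as Sex are passed to `corr()`. A minimal sketch of restricting the computation to numeric columns first is shown here; `toy_df` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Survived': [0, 1, 1, 0, 1, 0],
    'Pclass':   [3, 1, 2, 3, 1, 3],
    'Sex':      ['male', 'female', 'female', 'male', 'female', 'male'],
    'Fare':     [7.25, 71.28, 13.0, 8.05, 53.1, 8.46],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric_df = toy_df.select_dtypes(include='number')
corr = numeric_df.corr()

# A single entry of the matrix can be read off by label.
print(corr.loc['Survived', 'Pclass'])
```

The resulting `corr` can be passed to `sns.heatmap` exactly as in the cell above.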
The next area to explore is differences between males and females.
According to maritime custom, women and children were to be given priority for lifeboat seating when abandoning ship. Therefore, we should expect to see some differences in survival rate for females and males.
In [7]:
# Let's start by counting how many females and males we have.
print(titanic_df['Sex'].value_counts())
# Let's also look at a bar graph for comparison.
mf_plot = sns.factorplot('Sex', data=titanic_df, kind='count')
mf_plot.set_ylabels("Number of Passengers")
Out[7]:
We see that there are almost twice as many male passengers as female passengers in the data set. Keep this in mind when we look at survival rates for males and females.
Next, we break down the passenger list by gender and class.
In [8]:
# Create a bar chart breaking down the passengers by gender and class
mf_class_plot= sns.factorplot('Pclass',order=[1,2,3], data=titanic_df, hue='Sex', kind='count')
mf_class_plot.set_xlabels("Passenger Class")
mf_class_plot.set_ylabels("Number of Passengers")
Out[8]:
For this data set, the majority of males were in 3rd class, whereas the females appear to be more evenly spread among the three classes, although there are still more females in 3rd class than in 1st or 2nd class.
We should also look at passenger age as a factor in survival. Since children were also given priority for seats on lifeboats, we should expect a higher survival rate for children than for adults. For the purpose of this analysis, I will assume that the maximum age for a child was 12, based on the discussion here.
In [9]:
# Spread of passenger ages
print("Youngest Passenger Age: " + str(titanic_df['Age'].min()))
print("Oldest Passenger Age: " + str(titanic_df['Age'].max()))
print("Mean Passenger Age: " + str(titanic_df['Age'].mean()))
print("Median Passenger Age: " + str(titanic_df['Age'].median()))
# Histogram of ages
age_hist = titanic_df['Age'].hist(bins=80)
age_hist.set(xlabel="Age", ylabel="Number of Passengers")
Out[9]:
As noted earlier, some of the ages are missing - 177 records to be exact. We cannot ignore this many rows so we must come up with a way to impute the missing ages. One solution would be to fill in the missing ages with the median ages according to gender. This is not ideal, but it's better than excluding those rows outright.
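The number of missing values can be counted directly rather than read off the `info()` output. The following is a small sketch on a made-up `toy_df`; the column names match the data set, but the values do not.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan, 4.0],
    'Sex': ['male', 'female', 'female', 'male', 'male'],
})

# Number of missing values in every column.
missing_per_column = toy_df.isnull().sum()

# Fraction of rows with a missing age (the mean of a boolean mask).
missing_age_share = toy_df['Age'].isnull().mean()

print(missing_per_column)
print(missing_age_share)
```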
In [10]:
# Determine the median age for males and for females
print("Median age for males: " + str(titanic_df[titanic_df.Sex == 'male']['Age'].median()))
print("Median age for females: " + str(titanic_df[titanic_df.Sex == 'female']['Age'].median()))
Since the median age of all passengers, 28.0, is the midpoint of the above two median ages, we will fill in the NaNs with 28.0.
In [12]:
# Use the above value to impute the missing ages.
# If the age is missing replace `NaN` with 28.0.
# Otherwise, keep the age that is already there.
imputed_age = np.where(titanic_df['Age'].isnull(), 28.0, titanic_df['Age'])
# Write the imputed ages back to the `Age` column.
titanic_df['Age'] = imputed_age
titanic_df.describe()
Out[12]:
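For comparison, the per-gender imputation mentioned earlier (which this notebook does not use) could be sketched with `groupby` and `transform`. The `toy_df` below is made up for illustration, and the approach is an assumption about how one might implement that idea, not the method applied above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [20.0, 40.0, np.nan, 10.0, 30.0, np.nan],
})

# For every row, look up the median age of that row's gender group.
# transform returns a Series aligned with the original rows.
median_by_sex = toy_df.groupby('Sex')['Age'].transform('median')

# Fill only the missing ages; known ages are left untouched.
imputed_age = toy_df['Age'].fillna(median_by_sex)

print(imputed_age.tolist())
```

When the group medians are close to the overall median, as they are here according to the notebook, the two approaches give nearly the same result.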
I want to determine whether a passenger was an adult or a child. To this end, I add a new column to the data set indicating each passenger's status. Note that passengers whose age was imputed as 28.0 are therefore classified as adults.
In [14]:
# Define a function that determines whether a passenger is an adult or a child.
# If a passenger is an adult, then that passenger is labeled according to `Sex`.
# Note that if the age is missing (entered as `NaN` in the Age column), the passenger will be considered an adult.
def adult_or_child(passenger):
    age, sex = passenger
    if age <= 12:
        return 'child'
    else:
        return sex
In [15]:
# Add a column called 'Majority' to the data set indicating whether each passenger is an adult or a child.
titanic_df['Majority'] = titanic_df[['Age','Sex']].apply(adult_or_child, axis=1)
In [16]:
# Let's see what our numbers look like now.
titanic_df['Majority'].value_counts()
Out[16]:
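The same labeling can also be done without a row-by-row `apply`. The sketch below uses `np.where` on a made-up `toy_df`; the constant name `CHILD_MAX_AGE` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Age': [4.0, 12.0, 13.0, 35.0, 28.0],
    'Sex': ['male', 'female', 'female', 'male', 'female'],
})

# Hypothetical constant for the age cutoff assumed in this analysis.
CHILD_MAX_AGE = 12

# Where the age is at most the cutoff, label 'child'; otherwise keep `Sex`.
# A missing age compares as False, so such rows would be labeled by `Sex`.
toy_df['Majority'] = np.where(toy_df['Age'] <= CHILD_MAX_AGE, 'child', toy_df['Sex'])

print(toy_df['Majority'].tolist())
```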
Let's see what the breakdown is for adults and children by class.
In [17]:
# Create a bar chart breaking down the passengers by adult/child status and class.
majority_class_plot= sns.factorplot('Pclass',order=[1,2,3], data=titanic_df, hue='Majority', kind='count')
majority_class_plot.set_xlabels("Passenger Class")
majority_class_plot.set_ylabels("Number of Passengers")
Out[17]:
We look at the data to see how many passengers were part of a family and how many traveled alone. I assume that families would try to stay together, which could increase the chance of survival of the family group but might decrease the chance of survival of individuals within the family.
In [18]:
# Define a function that determines whether a passenger is traveling with a family or alone.
def family_or_alone(passenger):
    sibsp, parch = passenger
    if sibsp + parch > 0:
        return 'family'
    else:
        return 'alone'
In [19]:
# Add a column called 'FamAlone' to the data set indicating whether each passenger traveled with a family or alone.
titanic_df['FamAlone'] = titanic_df[['SibSp','Parch']].apply(family_or_alone, axis=1)
In [20]:
# Let's see what the numbers for family affiliation look like.
titanic_df['FamAlone'].value_counts()
Out[20]:
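As a cross-check, the family flag can be derived from a family-size column computed in one step. This is a sketch on a made-up `toy_df`; the `FamilySize` column is a hypothetical helper that does not exist in the original data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'SibSp': [1, 0, 0, 3, 0],
    'Parch': [0, 0, 2, 1, 0],
})

# Hypothetical helper column: number of relatives traveling with the passenger.
toy_df['FamilySize'] = toy_df['SibSp'] + toy_df['Parch']

# Any relative on board counts as traveling with a family.
toy_df['FamAlone'] = np.where(toy_df['FamilySize'] > 0, 'family', 'alone')

print(toy_df['FamAlone'].value_counts())
```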
Let's break down family affiliation by class.
In [21]:
# Create a bar chart breaking down the passengers by class and family affiliation.
family_plot= sns.factorplot('Pclass',order=[1,2,3], data=titanic_df, hue='FamAlone', kind='count')
family_plot.set_xlabels("Passenger Class")
family_plot.set_ylabels("Number of Passengers")
Out[21]:
Looking at the above graph, we see that in 3rd class more passengers traveled alone than with families, whereas in 1st and 2nd class roughly the same number of passengers traveled with families as traveled alone.
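Because the classes differ in size, proportions within each class can be easier to compare than raw counts. One way to get them is a row-normalized cross-tabulation, sketched here on a made-up `toy_df` whose values are for illustration only.

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'FamAlone': ['family', 'alone', 'family', 'alone',
                 'alone', 'alone', 'alone', 'family'],
})

# Rows are classes, columns are family affiliation;
# normalize='index' makes each row sum to 1.
share_by_class = pd.crosstab(toy_df['Pclass'], toy_df['FamAlone'], normalize='index')

print(share_by_class)
```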
In [22]:
# Create a new column `Survivor` which translates the values of `Survived` as follows: 0 becomes 'No' and 1 becomes 'Yes'
titanic_df['Survivor'] = titanic_df.Survived.map({0: 'No', 1: 'Yes'})
# Calculate the proportions of survivors and fatalities.
# Note that normalize=False would return the raw counts rather than the proportions.
titanic_df['Survivor'].value_counts(normalize=True)
Out[22]:
In [23]:
# Plot the numbers of those who survived and didn't.
survival_plot = sns.factorplot('Survivor', data=titanic_df, kind='count')
survival_plot.set_xlabels("Survived")
survival_plot.set_ylabels("Number of Passengers")
Out[23]:
Clearly, the majority of passengers in the data set perished: roughly 38% of the passengers survived and 62% didn't.
Next, we'll break down the survival rates by gender, class, and familial affiliation.
In [24]:
# Calculate the proportions of survivors by gender.
gender_survival = titanic_df.groupby(['Sex'])['Survivor']
gender_survival.value_counts(normalize = True)
Out[24]:
In [25]:
# Graph the above results.
gender_survival_plot = sns.factorplot("Sex", "Survived", data=titanic_df, kind="bar")
There is a notable difference in survival rates between genders with a rate of roughly 74.2% for females and only 18.9% for males.
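Since Survived is coded as 0 and 1, the survival rate of a group is simply the mean of that column within the group. A minimal sketch on a made-up `toy_df` (the values are for illustration only):

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'female',
                 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column = fraction of ones = survival rate per group.
survival_rate = toy_df.groupby('Sex')['Survived'].mean()

print(survival_rate)
```

This yields one number per group, which is the same quantity the bar plot above draws.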
In [27]:
# Calculate the proportions of survivors by class.
class_survival = titanic_df.groupby(['Pclass'])['Survivor']
class_survival.value_counts(normalize = True)
Out[27]:
In [28]:
# Graph the above results.
class_survival_plot = sns.factorplot("Pclass", "Survived", data=titanic_df, kind="bar")
There are notable differences in survival rates between passenger classes with a survival rate of greater than 50% for first class passengers, less than 50% for second class passengers, and less than 25% for third class passengers.
In [29]:
# Calculate the proportions of survivors by class and gender.
class_gender_survival = titanic_df.groupby(['Pclass','Sex'])['Survivor']
class_gender_survival.value_counts(normalize = True)
Out[29]:
In [30]:
# Graph the results from above
sns.factorplot("Pclass", "Survived",order=[1,2,3],data=titanic_df,hue='Sex', kind='bar')
Out[30]:
We observe that females in a particular class had noticeably greater survival rates than males in the same class. In first class, the rate of survival for females was greater than 96% (but roughly 37% for males). In second class, the rate of survival for females was greater than 92% (but slightly under 16% for males). In third class, the rate of survival for females was roughly 50% (but under 14% for males).
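A two-way breakdown like this is often easier to read as a table with one row per class and one column per gender. The sketch below builds such a table with `pivot_table` on a made-up `toy_df`; the values are for illustration and are not the real survival rates.

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'female', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Each cell is the mean of Survived (the survival rate)
# for one combination of class and gender.
rate_table = toy_df.pivot_table(values='Survived', index='Pclass',
                                columns='Sex', aggfunc='mean')

print(rate_table)
```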
In [31]:
# Calculate the proportions of survivors by family affiliation.
family_survival = titanic_df.groupby(['FamAlone'])['Survivor']
family_survival.value_counts(normalize = True)
Out[31]:
In [32]:
# Graph the results from above.
family_survival_plot = sns.factorplot("FamAlone", "Survived", data=titanic_df, kind="bar")
Those who traveled alone had a survival rate of roughly 30.4%, whereas those who traveled with families had a survival rate of roughly 50.6%. Let's break this down further by class.
In [33]:
# Calculate the proportions of survivors by class and familial affiliation.
class_family_survival = titanic_df.groupby(['Pclass','FamAlone'])['Survivor']
class_family_survival.value_counts(normalize = True)
Out[33]:
In [34]:
# Graph the results from above
sns.factorplot("Pclass", "Survived",order=[1,2,3],data=titanic_df,hue='FamAlone', kind='bar')
Out[34]:
We observe that passengers in families survived at higher rates than those who traveled alone, irrespective of class.
In [46]:
# Graph the box plots for the ages.
age_survival_box_plot = sns.factorplot("Survived", "Age", data=titanic_df, kind="box", size=6)
age_survival_box_plot.set( xlabel='', xticklabels = ['Died', 'Survived'])
Out[46]:
Let's compare survival rates by class and by adult/child status.
In [53]:
# Calculate the proportions of survivors by class and adult/child status.
class_majority_survival = titanic_df.groupby(['Pclass','Majority'])['Survivor']
class_majority_survival.value_counts(normalize = True)
Out[53]:
Children in first and second classes had much higher survival rates than children in third class. Similarly women in first and second classes fared better than women in third class. Men in first class had a notably higher rate of survival than men in second and third classes.
In [54]:
# Graph the box plots for the ages by class.
majority_survival_box_plot = sns.factorplot("Survived", "Age", data=titanic_df, kind="box", size=8, hue='Pclass')
majority_survival_box_plot.set( xlabel='', xticklabels = ['Died', 'Survived'])
Out[54]:
Comparing the box plots, it appears that in each class the survivors tended to be younger than the passengers who died.
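A reading taken from box plots can be backed up with the numbers behind them, for example the median age for each combination of class and outcome. This is a sketch on a made-up `toy_df`; its ages are for illustration only and say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for titanic_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [0, 0, 1, 1, 0, 0, 1, 1],
    'Age':      [50.0, 60.0, 30.0, 40.0, 30.0, 40.0, 10.0, 20.0],
})

# Median age per (class, outcome), reshaped so that each class is a row
# and the columns are 0 (died) and 1 (survived).
median_age = (toy_df.groupby(['Pclass', 'Survived'])['Age']
              .median()
              .unstack('Survived'))

print(median_age)
```

Keep in mind that in the notebook's data the missing ages were all set to 28.0, which pulls every group median toward that value.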
We were supposed to analyze data for the passengers of the Titanic; however, we were given an incomplete set of data. There were over 2200 people on board (around 1300 passengers and around 900 crew), so it is hard to draw conclusions based on the 891 people in the given data set. Also, many ages were missing, which makes drawing conclusions based on age difficult.
Not knowing most of the cabin numbers prevents us from exploring whether cabin location had an effect on survival, although we can't really assume that most of the passengers were in their cabins when the ship struck the iceberg. To really do this analysis, we would need to know where passengers were at the time the ship struck the iceberg.
Lastly, there are other factors which are more intangible. For example, despite depictions in film, third class passengers were not locked below deck, nor were women and children denied access to lifeboats.