This notebook investigates the Titanic dataset containing demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.
The analysis of the Titanic dataset deals mainly with the relationship between survival of an individual and variables such as his:
Therefore we are investigating the following main question: Which factors made survival of an individual more likely?
During the course of analysis we are also looking at the following specific questions:
In [130]:
# load required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# display plots inside the notebook
%matplotlib inline
# ensure compatibility with Python 2.x
# from __future__ import print_function
In [131]:
# load dataset from local file system
titanic = pd.read_csv("titanic_data.csv")
Let's explore the dataset by printing its shape and its first and last 5 rows, and by calculating some summary statistics.
In [132]:
# print shape (rows, columns) of data set
titanic.shape
Out[132]:
In [133]:
# show first 5 rows of dataset
titanic.head()
Out[133]:
In [134]:
# show last 5 rows of dataset
titanic.tail()
Out[134]:
Looking at each variable independently, the summary statistics tell us that:
Given the summary statistics we might investigate the following questions:
In [135]:
# calculate summary statistics
titanic.describe()
Out[135]:
In [136]:
# helper functions to print rows containing the min/max value of a variable
def titanic_min(variable):
    """
    Given a variable present in the titanic data set, print the rows containing its min value.
    """
    print("Information for min values of %s:" % variable)
    # .loc replaces the removed .ix indexer; Series.min() skips missing values
    print(titanic.loc[titanic[variable] == titanic[variable].min()])

def titanic_max(variable):
    """
    Given a variable present in the titanic data set, print the rows containing its max value.
    """
    print("Information for max values of %s:" % variable)
    # .loc replaces the removed .ix indexer; Series.max() skips missing values
    print(titanic.loc[titanic[variable] == titanic[variable].max()])
Let's look at survival first. From the bar chart below we can see that only about 350 of 891 passengers survived their trip.
In [137]:
# plot survival data
titanic["Survived"].plot(kind="hist", title="Distribution of (non)-survivors", bins=2, xticks=(0,1)).set_xticklabels(["not survived", "survived"])
Out[137]:
What about passenger class? Apparently half of the passengers were travelling in third class. The other half was split almost equally between second and first class.
In [138]:
# plot passenger class data
titanic["Pclass"].plot(kind="hist", title="Distribution of passenger class", bins=3, xticks=(1,2,3))
Out[138]:
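The class split read off the histogram can also be computed directly. The following is a minimal sketch on a small made-up frame (`toy` is hypothetical illustration data, not the Titanic file); on the real data the same call would presumably be applied to `titanic["Pclass"]`.

```python
import pandas as pd

# toy stand-in for the Pclass column (made-up values, for illustration only)
toy = pd.DataFrame({"Pclass": [3, 3, 1, 2, 3, 1, 3, 2]})

# share of passengers per class, ordered by class label
class_share = toy["Pclass"].value_counts(normalize=True).sort_index()
print(class_share)
```

With `normalize=True` the counts are divided by the number of non-missing values, so the shares sum to 1.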
Next, let's investigate the age distribution of Titanic passengers. Apparently most of the passengers were between 20 and 30 years old. From the histogram it is evident that there were some very old passengers, too.
In [139]:
# plot age data
titanic["Age"].plot(kind="hist", title="Distribution of age")
Out[139]:
Now let's have a look at extreme ages. How old were the youngest/oldest passengers? As can be seen from the output below, the youngest passenger was not even 1 year old, while the oldest passenger was already 80. Interestingly both survived, despite travelling in different passenger classes.
In [140]:
# print information about youngest passenger
titanic_min("Age")
In [141]:
# print information about oldest passenger
titanic_max("Age")
Let's look at the distribution of siblings/spouses of Titanic passengers. Interestingly, most passengers had either no or just one sibling/spouse on board, while there was one family (or someone with a lot of spouses) with 8 relatives on board.
In [142]:
# plot sibling/spouse data
titanic["SibSp"].plot(kind="hist", title="Distribution of siblings/spouses on board")
Out[142]:
Now who was the family with the maximum number of relatives? As can be seen from the table below, it was the Sage family, which unfortunately did not survive their journey.
In [143]:
# print information about max/min values
titanic_max("SibSp")
What about the distribution of parents/children on board the Titanic? The figure below shows that the majority of passengers did not have any parents or children on board. As with the siblings/spouses data, there is one extreme case which we investigate below.
In [144]:
# plot parent/child data
titanic["Parch"].plot(kind="hist", title="Distribution of parents/children on board")
Out[144]:
Apparently, Mrs. Goodwin was accompanied by 6 children and unfortunately did not survive her trip.
In [145]:
# print information about min/max parent/child data
titanic_max("Parch")
Finally, let's dig into the distribution of fare prices. Most passengers paid well below USD 100 for their ticket. Some passengers paid more, e.g. between USD 100 and USD 300, while a few paid as much as USD 500.
In [146]:
# plot fare data
titanic["Fare"].plot(kind="hist", title="Distribution of fare prices")
Out[146]:
From the boxplot below we can see that the median fare price was well below USD 100 (from cell 5 we actually know that the average is USD 32 with a standard deviation of approximately USD 50). Furthermore, the fare price of roughly USD 500 seems to be an outlier.
In [147]:
# plot fare data as box plot
ax = sns.boxplot(x=titanic["Fare"], orient="h")
ax.set_title("Fare prices")
Out[147]:
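To make "seems to be an outlier" a bit more concrete, one common convention is Tukey's rule: points more than 1.5 times the interquartile range above the third quartile are flagged, which is also the default whisker length used by matplotlib and seaborn boxplots. The sketch below applies that rule to a handful of made-up fares (`fares` is hypothetical data, not taken from the dataset); it is one possible criterion, not necessarily the one used in this notebook.

```python
import pandas as pd

# made-up fares for illustration only (not taken from the dataset)
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 52.0, 80.0, 512.0])

# first and third quartile, and the interquartile range between them
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Tukey's upper fence: values beyond it are drawn as individual points in a boxplot
upper_fence = q3 + 1.5 * iqr
outliers = fares[fares > upper_fence]
print(upper_fence, outliers.tolist())
```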
Now let's check the minimum fare price. Interestingly, the minimum price is USD 0, meaning that 15 passengers did not pay for their ticket at all.
In [148]:
# print passengers who paid the minimum fare price
titanic_min("Fare")
In [149]:
# print number of passengers with minimum ticket price
len(titanic[titanic["Fare"] == 0])
Out[149]:
What about the maximum fare price? Three passengers were willing to pay the maximum price of USD 512, which is 16 times higher than the average price of USD 32. At least all three got a ticket for first class!
In [150]:
# print passengers who paid the maximum fare price
titanic_max("Fare")
Before moving on to the actual analysis, the data needs to be cleaned. During the exploration phase we discovered missing values for Age and Cabin. Furthermore, some passengers were not assigned a proper ticket ID, but the value "LINE". Another candidate for cleaning could be the various extreme values in fare price, siblings/spouses or parents/children. How do we decide which values to keep and which to clean? One approach is to go back to our initial question and check whether missing values in particular columns could impede the analysis. As we are primarily interested in factors influencing survival, e.g. sex, age, passenger class and other socio-economic variables, we should focus on these during data cleaning.
Let's start with investigating real missing values: Age information is missing for 20% of all passengers, while cabin information is missing for 77% of all passengers. Why do we have so little information on cabins?
In [151]:
# for each column print number of records where information is missing
titanic.isnull().sum()
Out[151]:
In [152]:
# for each column print missing values as a fraction of total values
titanic.isnull().sum() / titanic.shape[0]
Out[152]:
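The fraction of missing values per column can equivalently be obtained as the mean of the boolean mask, since `True` counts as 1. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# toy frame with some missing entries (made-up values, for illustration only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, None, "C85", None],
    "Fare": [7.25, 8.05, 53.1, 8.46],
})

# mean of the boolean mask = fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)
```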
Let's dig deeper into missing age and cabin data. Checking passengers travelling in third class for missing data reveals that most of our issues can be found there. 77% of missing age and 70% of missing cabin values are attached to passengers in the third class.
In [153]:
# print a subset of records with missing age information
titanic.loc[titanic["Age"].isnull()].head()
Out[153]:
In [154]:
# print a subset of records with missing cabin information
titanic.loc[titanic["Cabin"].isnull()].head()
Out[154]:
In [155]:
# for each column where PClass is equal to 3, print number of records where information is missing
titanic[titanic["Pclass"] == 3].isnull().sum()
Out[155]:
In [156]:
# for each column, print the share of all missing values that belongs to passengers with Pclass equal to 3
titanic[titanic["Pclass"] == 3].isnull().sum() / titanic.isnull().sum()
Out[156]:
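The same kind of breakdown can be produced for all classes at once by grouping the missing-value mask by passenger class. This is a sketch on made-up data (`toy` is hypothetical), shown for the Age column only:

```python
import numpy as np
import pandas as pd

# toy frame (made-up values, for illustration only)
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 3, 1],
    "Age": [38.0, np.nan, np.nan, np.nan, 20.0, 54.0],
})

# boolean mask of missing ages, grouped by class and divided by the total number missing
age_missing = toy["Age"].isnull()
share_by_class = age_missing.groupby(toy["Pclass"]).sum() / age_missing.sum()
print(share_by_class)
```

Each entry is the share of all missing ages that falls into the given class, so the entries sum to 1.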
What could be a possible explanation for that? Apparently third class had bunk beds for 4-6 people. Maybe data was not rigorously recorded for this class, see: https://nmni.com/titanic/On-Board/Sleeping.aspx
Although there does not seem to be a substantial problem with Embarked and Ticket information, let's have a brief look at the missing values:
In [157]:
# print records with missing embarked information
titanic.loc[titanic["Embarked"].isnull()]
Out[157]:
In [158]:
# print records where the value for ticket is "LINE"
titanic.loc[titanic["Ticket"] == "LINE"]
Out[158]:
Now, which data do we want to omit from the analysis? Since our analysis mainly focuses on personal data like age and on socio-economic information, we first remove all columns containing data which does not help to investigate these variables, such as:
In [159]:
# remove PassengerId, Name, Ticket, Cabin and Embarked column
titanic_cleaned = titanic.drop(labels=["PassengerId", "Name", "Ticket", "Cabin", "Embarked"], axis=1, inplace=False)
Further we remove the outlier values (max values) for fare prices:
In [160]:
# remove fare price outliers
titanic_cleaned.drop(titanic_cleaned.index[[258, 679, 737]], inplace=True)
# verify that used-to-be max values for Fare have been removed
titanic_cleaned["Fare"].max()
Out[160]:
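Hard-coded row positions only work as long as the file and its row order stay the same. As an alternative sketch, the rows holding the maximum fare can be filtered by condition instead. The frame below is made up (`toy` and `toy_cleaned` are hypothetical names), and the approach assumes that only the rows at the maximum should be removed:

```python
import pandas as pd

# toy fares (made-up values, for illustration only)
toy = pd.DataFrame({"Fare": [7.25, 512.33, 26.0, 512.33, 80.0]})

# keep every row whose fare is strictly below the current maximum
max_fare = toy["Fare"].max()
toy_cleaned = toy[toy["Fare"] < max_fare]
print(toy_cleaned["Fare"].max())
```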
We also drop all rows not containing age information:
In [161]:
# remove any records with missing age information
titanic_cleaned.dropna(subset=["Age"], inplace=True)
Finally, we check whether any missing values remain, which is not the case. We are left with 711 rows of cleaned data:
In [162]:
# check whether rows with missing age information were successfully removed
titanic_cleaned.isnull().sum()
Out[162]:
In [163]:
# print number of rows remaining after cleaning
titanic_cleaned.shape[0]
Out[163]:
After exploring and cleaning the data, we are finally able to analyze our main question: Which factors made survival of an individual more likely? To start investigating this question, we would like to know how survival is correlated with the other variables in the dataset. Correlation does not imply causation, however: a strong positive correlation between survival and travelling in first class does not prove that passengers survived because they travelled in first class. Maybe first class passengers were particularly wealthy and could afford personnel who saved them in case of emergency. With this caveat in mind, correlation between variables is still a good starting point for deeper analysis.
In [164]:
titanic_cleaned.head()
Out[164]:
Let's start with age. Instead of correlating Survived with individual ages, we form three age groups (young, middle and old) and use these for analysis:
In [165]:
# bin age into young, middle, old buckets
titanic_cleaned["age"] = pd.cut(titanic_cleaned["Age"], bins=3, labels=["young", "middle", "old"])
titanic_cleaned.head()
Out[165]:
We apply the same logic to fare prices:
In [166]:
# bin fare prices into low, medium, high buckets
# (bin on the cleaned column so the removed max fares do not stretch the bin edges)
titanic_cleaned["fare"] = pd.cut(titanic_cleaned["Fare"], bins=3, labels=["low", "medium", "high"])
titanic_cleaned.head()
Out[166]:
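Note that `pd.cut` with an integer `bins` argument builds equal-width intervals between the minimum and the maximum, so a few extreme values can leave some groups nearly empty. The sketch below contrasts this with `pd.qcut`, which builds equal-frequency groups from quantiles. The fares are made up (`fares` is hypothetical data), and `pd.qcut` is shown only as a possible alternative, not as what this notebook uses.

```python
import pandas as pd

# made-up fares with one extreme value (for illustration only)
fares = pd.Series([0.0, 10.0, 20.0, 30.0, 300.0])
labels = ["low", "medium", "high"]

# equal-width bins: edges span min..max, so one large value stretches them
equal_width = pd.cut(fares, bins=3, labels=labels)

# equal-frequency bins: edges are quantiles, so groups are similar in size
equal_freq = pd.qcut(fares, q=3, labels=labels)

print(equal_width.tolist())
print(equal_freq.tolist())
```

Which variant is preferable depends on whether the groups should reflect the value range or the ranking of passengers.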
Further, instead of correlating individual values for siblings/spouses and parents/children, we create the dummy variable "family" to indicate whether a passenger had at least 1 sibling/spouse or 1 parent/child on board.
In [167]:
# create dummy variable "family" to indicate whether a passenger had at least one sibling/spouse OR parent/child on board
titanic_cleaned["family"] = (titanic_cleaned["SibSp"] >= 1) | (titanic_cleaned["Parch"] >= 1)
titanic_cleaned.head()
Out[167]:
Finally, we convert our recently created variables (age, fare, family), as well as Sex and passenger class into dummy variables.
In [168]:
# convert categorical variables (Sex, Pclass, age, family, fare) into dummy variables for analysis
titanic_dummies = pd.get_dummies(data=titanic_cleaned, columns=["Sex", "Pclass", "age", "family", "fare"])
titanic_dummies.drop(labels=["Age", "SibSp", "Parch", "Fare"], axis=1, inplace=True)
titanic_dummies.head()
Out[168]:
Now we are ready to calculate the correlation matrix.
In [169]:
# calculate correlation matrix
titanic_dummies.corr()
Out[169]:
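Since we mainly care about how each variable relates to survival, it can help to pull out just the `Survived` column of the correlation matrix and sort it. A minimal sketch on a made-up frame (`toy` is hypothetical; `dtype=int` is passed so the dummy columns are numeric regardless of pandas version):

```python
import pandas as pd

# toy frame (made-up values, for illustration only)
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Sex": ["female", "female", "male", "male", "female", "male"],
    "Pclass": [1, 2, 3, 3, 1, 2],
})

# one 0/1 column per category
toy_dummies = pd.get_dummies(toy, columns=["Sex", "Pclass"], dtype=int)

# correlation of every dummy column with Survived, from most negative to most positive
survival_corr = toy_dummies.corr()["Survived"].drop("Survived").sort_values()
print(survival_corr)
```

For two 0/1 variables the Pearson correlation computed by `corr()` coincides with the phi coefficient, so the values can be read as a measure of association between two binary indicators.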
Since the correlation matrix is hard to read, we visualize its result using a heatmap.
In [170]:
# visualize correlation matrix
sns.heatmap(titanic_dummies.corr())
Out[170]:
Interestingly, we can observe a strong (>= 0.4) positive correlation between survival and being female, whereas the opposite is true for being male. Further, the correlation matrix shows a moderate positive relation between survival and travelling in first class, while the opposite is true for third class. As far as age groups are concerned, no correlation can be observed, while having at least one family member on board is modestly correlated with survival. Finally, fare price groups do not seem to have a particularly strong positive or negative correlation with survival.
Given the positive relation between survival and being female, between survival and passenger class, as well as between survival and family, we look at these variables using pivot tables.
Comparing survival by sex alone reveals that women were much more likely to survive than men. Further, passengers travelling in first class were much more likely to survive than those travelling in third class. Finally, half of the passengers with a family member on board survived their trip, whereas only 1/3 of those without family support did.
In [171]:
# pivot table displaying survival vs. sex
pd.pivot_table(titanic_cleaned, values=["Survived"], index=["Sex"], aggfunc=[np.sum, np.mean, np.std], margins=True)
Out[171]:
In [172]:
# pivot table displaying survival vs. passenger class
pd.pivot_table(titanic_cleaned, values=["Survived"], index=["Pclass"], aggfunc=[np.sum, np.mean, np.std], margins=True)
Out[172]:
In [173]:
# pivot table displaying survival vs. family
pd.pivot_table(titanic_cleaned, values=["Survived"], index=["family"], aggfunc=[np.sum, np.mean, np.std], margins=True)
Out[173]:
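For a single grouping variable, the same numbers can also be produced with `groupby`, which some readers find easier to follow than `pivot_table`. A minimal sketch on made-up data (`toy` is hypothetical); the mean of the 0/1 `Survived` column is the survival rate of each group:

```python
import pandas as pd

# toy frame (made-up values, for illustration only)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# number of survivors, survival rate and group size per sex
survival_by_sex = toy.groupby("Sex")["Survived"].agg(["sum", "mean", "count"])
print(survival_by_sex)
```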
Let's go one step further and investigate survival, sex and passenger class together. The table shows that 96% of females travelling in first class and 92% of those travelling in second class survived their trip, whereas only 40% of males in first class and only 15% in second and third class did.
In [174]:
# pivot table displaying survival vs. sex and passenger class
pd.pivot_table(titanic_cleaned, values=["Survived"], index=["Sex", "Pclass"], aggfunc=[np.sum, np.mean, np.std], margins=True)
Out[174]:
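When only the survival rate is of interest, a two-way table with one variable in the rows and the other in the columns can be easier to scan than a stacked index. The sketch below uses `pd.crosstab` on made-up data (`toy` is hypothetical) as one way to build such a table:

```python
import pandas as pd

# toy frame (made-up values, for illustration only)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# survival rate per (sex, class) cell: rows are sexes, columns are classes
rate_table = pd.crosstab(
    index=toy["Sex"],
    columns=toy["Pclass"],
    values=toy["Survived"],
    aggfunc="mean",
)
print(rate_table)
```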
Repeating the same analysis using family and passenger class, we can observe relatively high survival rates for passengers travelling in first class, regardless of their family status, although the difference between first class passengers with a family member on board and those without family support is 10%. However, the major difference here is among passengers travelling in second class: while 63% of those with a family member on board survived the trip, only 34% of those without family support did.
In [175]:
# pivot table displaying survival vs. family and passenger class
pd.pivot_table(titanic_cleaned, values=["Survived"], index=["family", "Pclass"], aggfunc=[np.sum, np.mean, np.std], margins=True)
Out[175]:
In order to make it easier to consume our results, we visualize the analysis above:
In [176]:
# visualize differences between (non)-survivors given sex and passenger class
grid = sns.FacetGrid(titanic_cleaned, row="Sex", col="Pclass")
grid.map(plt.hist, "Survived", bins=2).set(xticks=(0,1)).set_xticklabels(["False", "True"])
Out[176]:
In [177]:
# visualize differences between (non)-survivors given sex and family on board
grid = sns.FacetGrid(titanic_cleaned, row="Sex", col="family")
grid.map(plt.hist, "Survived", bins=2).set(xticks=(0,1)).set_xticklabels(["False", "True"])
Out[177]:
Our general analysis revealed the following points:
Further, the investigation between survival and more than one variable revealed that:
Despite our findings, we cannot be sure that the variables we found to be related to survival really caused it. For instance, there may have been reasons other than gender for women to be more likely to survive than men.