From Wikipedia, "the passengers of the RMS Titanic were among the estimated 2,223 people who sailed on the maiden voyage of the second of the White Star Line's Olympic class ocean liners, from Southampton to New York City. Halfway through the voyage, the ship struck an iceberg and sank in the early morning of 15 April 1912, resulting in the deaths of over 1,500 people, including approximately 703 of the passengers."
In [1]:
from IPython.display import YouTubeVideo; YouTubeVideo("9xoqXVjBEF8")
Out[1]:
The goal will be to analyze the passenger list, look for patterns and, ultimately, find out whether women and children really went first.
Your mission is to complete the Python code in the cells below and execute it until the output looks similar or identical to the output shown. I recommend using a temporary notebook to work with the dataset and, once the code is ready and producing the expected output, copying and pasting it into this notebook. Once it is done and validated, just copy the file elsewhere, as the notebook will be the only file sent for evaluation. Of course, everything starts by downloading this notebook.
February $24^{th}$.
The Titanic's passengers were divided into three separate classes, determined not only by the price of their ticket but by wealth and social class: those travelling in first class, the wealthiest passengers on board, were prominent members of the upper class and included businessmen, politicians, high-ranking military personnel, industrialists, bankers and professional athletes. Second class passengers were middle class travellers and included professors, authors, clergymen and tourists. Third class or steerage passengers were primarily immigrants moving to the United States and Canada.
In the file titanic.xls you will find part of the original list of passengers. The variables or columns are described below:
survival: Survival (0 = No; 1 = Yes)
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare in pound sterling (£)
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Boat number used for survival
home.dest: Home / Final destination
Consider that pclass is a proxy for socio-economic status:
And that age is given in years, with a couple of exceptions:
If age is less than 1, it is given as a fraction.
If age is an estimation, it is in the form xx.5.
With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village; however, the definitions do not support such relations.
When loaded into a DataFrame, the dataset looks like this:
In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Set some Pandas options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 25)
In [3]:
titanic = pd.read_excel("data/titanic.xls")
titanic.head()
Out[3]:
Before we can actually start playing with the data, we need to add a few columns. First, create a new column, cabin_type, with the letter of the cabin if known. For example, if the cabin is 'C22 C26', cabin_type would be C; for something like 'A90 B11' (which never happens), only the first code is used, so cabin_type would be A.
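As a hint for the cell below, here is a minimal sketch of taking the first character of a string column with the pandas `.str` accessor. It uses a toy Series (not the real passenger list), and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data (an assumption for illustration): two cabin strings and one unknown cabin.
cabins = pd.Series(["C22 C26", "A90 B11", None])

# .str[0] picks the first character of every string;
# entries that are missing stay missing instead of raising an error.
first_letter = cabins.str[0]
print(first_letter)
```

Missing cabins remain missing, which matches the "if known" condition in the description above.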
In [4]:
titanic["cabin_type"] = titanic["cabin"].
titanic.head()
Out[4]:
We also need a column, name_title, with the title included in the name but in lower case. For example, if the name is Allison, Mrs. Hudson J C (Bessie Waldo Daniels), name_title would be mrs. Note the dot is excluded. If the title is none of Master, Miss, Ms or Mrs, it will be classified as other.
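One possible way to approach this, sketched on toy names rather than the dataset: the helper below assumes every name follows the "Surname, Title. Given names" layout. The function name `extract_title` and the two "Doe" names are hypothetical; only the first name comes from the text above.

```python
import pandas as pd

KNOWN_TITLES = {"master", "miss", "ms", "mrs"}

def extract_title(name):
    """Return the lower-cased title of a 'Surname, Title. Rest' name, or 'other'."""
    # Text after the first comma, then text before the first dot.
    # str.partition never raises, so a name without a comma simply yields ''.
    title = name.partition(",")[2].partition(".")[0].strip().lower()
    return title if title in KNOWN_TITLES else "other"

names = pd.Series([
    "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",
    "Doe, Master. John",   # made-up name
    "Doe, Col. James",     # made-up name with a title outside the known set
])
titles = names.apply(extract_title)
print(titles.tolist())
```

Using `partition` instead of `split` keeps the helper from failing on a name that does not follow the assumed layout; such names fall into other.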
In [5]:
def assign_name(name):
    pass
titanic["name_title"] = titanic["name"].apply(assign_name)
titanic.head()
Out[5]:
Finally, a third new column, age_cat, that maps the age to one of the following values: children (under 14 years), adolescents (14-20), adult (21-64), and senior (65+).
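The boundaries above leave two details open, so the sketch below makes two explicit assumptions: an estimated age such as 20.5 still counts as adolescents (everything below 21), and an unknown age gets no category at all. The function name `age_category` is hypothetical.

```python
import math

def age_category(age):
    """Map an age in years to one of the four categories described above."""
    # Assumption: unknown ages (None or NaN) are left uncategorised.
    if age is None or math.isnan(age):
        return None
    if age < 14:
        return "children"
    if age < 21:   # assumption: 20.5 (an estimated age) is still an adolescent
        return "adolescents"
    if age < 65:
        return "adult"
    return "senior"

for sample in (0.75, 14, 20.5, 21, 65, float("nan")):
    print(sample, age_category(sample))
```

Comparing with strict upper bounds (`< 21`, `< 65`) avoids gaps for the fractional ages mentioned in the column description.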
In [6]:
def assign_age(age):
    pass
titanic["age_cat"] = titanic["age"].apply(assign_age)
titanic.head()
Out[6]:
The last cosmetic change will be to replace the letters in the variable embarked with the actual names of the cities:
C: Cherbourg
Q: Queenstown
S: Southampton
Note that in this case we won't create a new column but replace the original one.
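A dictionary lookup is one way to do this kind of replacement. The sketch below applies `Series.map` to a toy Series; the variable names are made up, and the behaviour for unknown ports (they stay missing) is simply what `map` does with keys that are not in the dictionary.

```python
import pandas as pd

PORTS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Toy data (an assumption for illustration), including one unknown port.
embarked = pd.Series(["S", "C", "Q", None])

# map() looks each value up in the dictionary; values without a match become missing.
embarked_named = embarked.map(PORTS)
print(embarked_named)
```

Assigning the result back to the same column name, as the cell below does, overwrites the letters instead of creating a new column.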
In [7]:
titanic["embarked"] = titanic["embarked"].
titanic.head()
Out[7]:
In [8]:
titanic.groupby()[["fare"]].aggregate().sort_values("fare")
Out[8]:
In [9]:
titanic.groupby()[[]].aggregate().sort_values()[:10]
Out[9]:
Let's now consider proportions (ratios) of survival. Calculate the proportion of passengers that survived by sex, and the same proportion by age category.
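To make the survived / total pattern concrete, here is a small sketch on a made-up five-row table (not the real data). Because the survival flag is 0 or 1, summing it counts the survivors and counting it gives the group size.

```python
import pandas as pd

# Toy data (an assumption for illustration): 2 women, 3 men.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "survived": [1, 1, 0, 1, 0],
})

survived = toy.groupby("sex")[["survived"]].sum()    # survivors per group
total = toy.groupby("sex")[["survived"]].count()     # passengers per group
ratio = 100 * survived / total                       # percentage that survived
print(ratio)
```

The division aligns the two results on their shared index (here sex), so the groups are matched by label rather than by position.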
In [10]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio
Out[10]:
In [11]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio
Out[11]:
Calculate the same proportion, but by sex and class.
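Grouping by more than one column works the same way, with a list of keys. The sketch below, again on made-up rows, also shows what `unstack()` does to the result: the innermost grouping key moves from the row index to the columns.

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "male", "female"],
    "pclass":   [1, 3, 1, 3, 3, 3],
    "survived": [1, 0, 1, 0, 1, 1],
})

grouped = toy.groupby(["sex", "pclass"])["survived"]
ratio = 100 * grouped.sum() / grouped.count()   # Series indexed by (sex, pclass)

# unstack() pivots the last index level (pclass) into columns.
wide = ratio.unstack()
print(wide)
```

The wide layout is easier to read when comparing one grouping key across the other, which is why some of the later cells call `ratio.unstack()`.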
In [12]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio
Out[12]:
Calculate survival proportions by age category, class and sex.
In [13]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()
Out[13]:
Calculate survival proportions by age category and sex.
In [14]:
survived = titanic.groupby()[[]].aggregate()
total = titanic.groupby()[[]].aggregate()
ratio = 100 * survived / total
ratio.unstack()
Out[14]:
So, women and children first?
Let's now look at price. In order to see distributions of price, the first thing we need to do is calculate the average fare per class, as well as the average age and the number of people in each class.
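As an illustration of how `pd.pivot_table` accepts a different aggregate per column, the sketch below uses a made-up five-row table. The choice of "mean" and "count" here is an assumption about what "average" and "number of people" translate to; `margins=True` appends an "All" row computed over the whole table.

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "name":   ["a", "b", "c", "d", "e"],
    "fare":   [80.0, 100.0, 8.0, 7.0, 9.0],
    "age":    [40.0, 50.0, 20.0, 30.0, 25.0],
})

table = pd.pivot_table(
    toy,
    index=["pclass"],
    values=["fare", "age", "name"],
    # One aggregate per column: averages for numbers, a count for names.
    aggfunc={"fare": "mean", "age": "mean", "name": "count"},
    margins=True,
)
print(table)
```

Counting a column that is never missing, such as name, is a simple way to get the group size alongside the averages.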
In [15]:
pd.pivot_table(titanic,
index=[],
values=[],
aggfunc={"fare": , "age": , "name": },
margins=True)
Out[15]:
We can even split the latter per cabin type and add an aggregate counting the number of survivors.
In [16]:
pd.pivot_table(titanic,
index=[],
values=[],
aggfunc={"fare": , "age": , "name": , "survived": },
margins=)
Out[16]:
In [17]:
ax = titanic.boxplot(column=, by=, grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("") # Do not change this statement
Out[17]:
And also, if we define a new column alone that is True if the passenger was travelling without family, and False if with family, let's plot that into a box plot by age.
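One reading of "without family", given the columns available, is that both family counters are zero. The sketch below encodes that assumption on a made-up table; it only builds the boolean column and leaves the plotting to the cell that follows.

```python
import pandas as pd

# Toy data (an assumption for illustration).
toy = pd.DataFrame({
    "sibsp": [0, 1, 0],
    "parch": [0, 0, 2],
    "age":   [30.0, 28.0, 4.0],
})

# Assumption: a passenger is alone when no siblings/spouses
# and no parents/children are recorded aboard.
toy["alone"] = (toy["sibsp"] + toy["parch"]) == 0
print(toy)
```

Keep in mind the caveat from the variable description: a child travelling with a nanny has parch=0, so this definition would mark that child as alone.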
In [18]:
titanic["alone"] =
ax = titanic.boxplot(column="", by="", grid=False)
ax.set_title("")
ax.set_xlabel("")
ax.set_ylabel("")
ax.get_figure().suptitle("") # Do not change this statement
Out[18]:
On the other hand, to better see the relationship between proportion of survived and fare, we are going to create a table that has, in rows, two indices, sex and age category, and as the values, the average age, the average fare, and the proportion of passengers that survived. Define and use the function ratio as an aggregate to calculate the proportion of people that survived.
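A custom aggregate is just a function that receives the values of one group and returns a single number. The sketch below uses a hypothetical name, `survival_ratio`, and assumes the proportion is expressed as a percentage, as in the earlier cells; the table is made up.

```python
import pandas as pd

def survival_ratio(values):
    """Percentage of non-missing 0/1 values in the group that are 1."""
    return 100 * values.sum() / values.count()

# Toy data (an assumption for illustration).
toy = pd.DataFrame({
    "sex":      ["female", "female", "male", "male", "male"],
    "fare":     [80.0, 20.0, 10.0, 30.0, 8.0],
    "survived": [1, 1, 0, 1, 0],
})

table = pd.pivot_table(
    toy,
    index=["sex"],
    values=["fare", "survived"],
    # A string names a built-in aggregate; a function is called once per group.
    aggfunc={"fare": "mean", "survived": survival_ratio},
)
print(table)
```

Since the flag is 0 or 1, the same number could be obtained as 100 times the mean; writing it out as sum over count mirrors the survived / total cells above.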
In [19]:
def ratio(values):
    pass
In [20]:
pt_titanic = pd.pivot_table(titanic,
index=[],
values=[],
aggfunc={
"fare": ,
"age": ,
"survived":
},
margins=)
pt_titanic
Out[20]:
Now let's create two plots, one for females and the other for males, comparing the series of average fare and survival proportion, sorted by fare.
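A pivot table indexed by two keys has a MultiIndex, and `DataFrame.query` can filter on the names of the index levels as if they were columns. The sketch below builds such a table by hand with made-up numbers, then selects one sex and sorts by fare with `sort_values`.

```python
import pandas as pd

# Hand-built stand-in for a pivot table (made-up numbers, for illustration only).
index = pd.MultiIndex.from_tuples(
    [("female", "adult"), ("female", "children"),
     ("male", "adult"), ("male", "children")],
    names=["sex", "age_cat"],
)
summary = pd.DataFrame(
    {"fare": [50.0, 30.0, 25.0, 35.0], "survived": [75.0, 60.0, 20.0, 45.0]},
    index=index,
)

# 'sex' is an index level here, not a column, but query() can still refer to it by name.
female = summary.query("sex == 'female'").sort_values("fare")[["fare", "survived"]]
print(female)
```

Filtering this way also drops any "All" margin row, since that row's sex level does not equal 'female' or 'male'.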
In [21]:
female = pt_titanic.query("sex == 'female'").sort_values()[[]]
male = pt_titanic.query().sort_values()[[]]
fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot
ax2 = fig.add_subplot
female.plot(ax=)
ax1.set_ylabel()
male.plot(ax=)
ax2.set_xlabel()
ax2.set_ylabel()
ax2.set_xticklabels(["children", "", "adolescent", "", "adult", "", "senior"])
Out[21]:
At least for women, it looks like the richer the woman, the greater her chances to survive. For men, on the other hand, it does not look like that; if anything, it would be the opposite.
But let's take a closer look at the distribution of ages (ignoring the people with unknown age), by creating a histogram of the age distribution.
In [22]:
ax = titanic[].dropna()...(bins=30, alpha=0.8, grid=False, figsize=(12, 4))
ax.set_title()
Out[22]:
To further support this idea, let's create another pivot table, but instead of using the age categories, let's just calculate the corresponding values for fare and proportion of survival for each individual age.
In [23]:
pt_titanic2 = pd.pivot_table(titanic,
index=[],
values=[],
aggfunc={
},
margins=True)
pt_titanic2[[]]
Out[23]:
And now plot that information in a scatter plot, with the average fare in pound sterling on the X-axis and the average ratio of survival on the Y-axis.
In [24]:
female2 = pt_titanic2.query
male2 = pt_titanic2.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(1, 1, 1)
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title()
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])
Out[24]:
The pattern, if any, is hard to see using this visualization. However, we can use pandas.cut() to create bigger age groups and try again.
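To see how the binning behaves at the edges, the sketch below runs `pd.cut` with the same bins, labels and `right=False` setting as the next cell, but on a handful of made-up ages. With `right=False` each bin includes its left edge and excludes its right edge.

```python
import pandas as pd

# Toy ages (an assumption for illustration), chosen to sit on bin edges.
ages = pd.Series([0.5, 9.0, 10.0, 35.0, 79.0, 80.0])

labels = ["{}-{}".format(i, i + 9) for i in range(0, 71, 10)]   # "0-9", ..., "70-79"
groups = pd.cut(ages, list(range(0, 81, 10)), right=False, labels=labels)
print(groups)
```

An age of exactly 10 lands in "10-19", while an age of exactly 80 falls outside the last bin [70, 80) and comes out as missing.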
In [25]:
# This cell is complete
labels = ["{}-{}".format(i, i+9) for i in range(0, 71, 10)]
titanic["age_group"] = pd.cut(titanic["age"].dropna(), range(0, 81, 10), right=False, labels=labels)
titanic.head()
Out[25]:
In [26]:
pt_titanic3 = pd.pivot_table(titanic,
index=[],
values=[],
aggfunc={
},
margins=True)
female3 = pt_titanic3.query
male3 = pt_titanic3.query
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot
ax.scatter(, , color='m')
ax.scatter(, , color='b')
ax.set_title("")
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.legend(["female", "male"])
Out[26]:
Indeed, once grouped in groups of 10 years, the more expensive the fare, the more likely you were to survive. But only if you were a woman, and actually the trend gets stronger for older women rather than for children. And for the males, it doesn't matter; if anything, the trend seems to be the poorer the better your chances for survival.
And now for a video.
In [27]:
YouTubeVideo("FHG2oizTlpY")
Out[27]:
In [28]:
import statsmodels.api as sm
First, we calculate the ordinary least squares fit for both male and female, using the fare as our independent variable (X).
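For intuition about what the fit computes, here is a NumPy-only sketch of ordinary least squares with an intercept, on made-up points that lie exactly on a line. It is not the statsmodels code the cells below expect. Two things are worth knowing when moving to `sm.OLS`: it takes the dependent variable first and the independent variable second, and it does not add an intercept on its own (`sm.add_constant` can add that column).

```python
import numpy as np

# Made-up points lying exactly on ratio = 2 * fare + 5 (for illustration only).
fare = np.array([10.0, 20.0, 30.0, 40.0])
ratio = np.array([25.0, 45.0, 65.0, 85.0])

# Design matrix with a column of ones so the model includes an intercept.
X = np.column_stack([np.ones_like(fare), fare])

# Least-squares solution of X @ coef ~= ratio.
coef, *_ = np.linalg.lstsq(X, ratio, rcond=None)
intercept, slope = coef

# R^2 = 1 - (residual sum of squares / total sum of squares).
fitted = X @ coef
ss_res = np.sum((ratio - fitted) ** 2)
ss_tot = np.sum((ratio - ratio.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(intercept, slope, r_squared)
```

Whether to include the intercept changes both the fitted line and the reported R², so it is worth deciding deliberately before comparing the male and female fits.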
In [29]:
female_model3 = sm.OLS(, )
female_results3 = female_model3.fit()
print(female_results3.summary())
In [30]:
male_model3 = sm.OLS(, )
male_results3 = male_model3.fit()
print(male_results3.summary())
Now, we plot everything together.
In [31]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(female3["fare"], female_results3.fittedvalues, 'r--.', label='$R^2$: {:.2}'.format(female_results3.rsquared))
ax.plot(, , 'bo', label="Male")
ax.plot(male3["fare"], male_results3.fittedvalues, 'g--.', label='$R^2$: {:.2}'.format(male_results3.rsquared))
ax.legend(loc='best')
ax.set_xlabel("... (£)")
ax.set_ylabel("")
ax.set_title("")
Out[31]:
And repeat the whole process for age group and proportion of survivors.
In [32]:
pt_titanic4 = pd.pivot_table(titanic,
index=,
values=,
aggfunc=,
margins=True)
female4 = pt_titanic4.query
male4 = pt_titanic4.query
female_model4 = sm.OLS(, )
female_results4 = female_model4.
print(female_results4.)
male_model4 = sm.OLS(, )
male_results4 = male_model4.
print(male_results4.)
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(, , 'mo', label="Female")
ax.plot(, , 'r--.', label='$R^2$: {:.2}'.format(female_results4.))
ax.plot(, , 'bo', label="Male")
ax.plot(, , 'g--.', label='$R^2$: {:.2}'.format(male_results4.))
ax.legend(loc='best')
ax.set_xlabel()
ax.set_ylabel()
ax.set_title()
Out[32]:
Who survived in the end? What about the correlation between fare and ratio of survival for women and men? Is OLS the best approach for men?
What about class? And cabins?
Did any age category have better chances of survival? Which one(s)?
Why? Did the children survive more than the older people?