This analysis was conducted on passengers traveling aboard the Titanic during its maiden voyage in April 1912. The key objective was to identify any possible variables that may have had an impact on survival.
Multiple variables were found to be associated with survival aboard the Titanic, notably sex, age, and class. The analysis identified that women and children traveling in either 1st or 2nd class were the most likely to survive. Adult men and 3rd class passengers of all ages and sexes were the least likely to survive.
The results found within the Discussion section below represent a subset of the complete analysis. Additional details on how the data was prepared and a more comprehensive investigation of individual variables can be found in Appendix A and Appendix B.
Please note that this analysis was performed as an academic exercise, and additional analysis is required to draw any meaningful conclusions. This study did not include an inferential investigation, nor did it explore causality through experimentation.
This analysis is a retrospective study of 891 Titanic passengers. The data was provided by Kaggle as part of the Udacity Nanodegree Program. The purpose of this investigation is to identify correlations in the data related to life and death aboard the Titanic, specifically whether sex, age, or money influenced survival.
In [1]:
%pylab inline
import numpy as np
import pandas as pd
import seaborn as sns
from dataworkflow.data import get_data
from dataworkflow.visualize import (freq1_display, freq2_display, freq3_display, freq4_display,
freq1_store, freq2_store, freq3_store, freq4_store,
ez_bar, ez_graph1, ez_graph2, ez_graph3)
titanic = get_data()
In [2]:
sex_a = freq2_store(titanic['Sex'], titanic['age_group'])
sex_tab1 = sex_a[2].loc[['female', 'male'],['Child','Adult','RowTotals']]
sex_tab1.index = ['Female', 'Male']
sex_tab1.columns = ['Child','Adult','Population']
sex_tab1
Out[2]:
In [3]:
sex_tab2 = sex_a[1].loc[['female', 'male', 'Totals'], ['Child', 'Adult', 'Unknown']]
sex_tab2.index = ['Female', 'Male', 'Population']
sex_tab2.columns = ['Child', 'Adult', 'Age Unknown']
sex_tab2.transpose()
Out[3]:
Figure 1a: (1) Frequency table depicting the percentages of males and females within each age group (Child < 18, Adult >= 18), as well as the general population. (2) Frequency table depicting the percentage of children and adults within each of the passenger sexes. The overall percentage of adults and children in the population has also been included for comparison.
Note: The ship's population consisted of 65% male and 35% female passengers. Male passengers also made up the majority of both adults (66%) and children (51%). Children made up a larger percentage of female passengers (18%) than male passengers (10%). Each sex had roughly the same percentage of adults – female: 66%, male: 68%.
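The `freq2_store` helper lives in the `dataworkflow.visualize` module and is not shown in this report. As a rough illustration only, two-way tables of this kind (counts, row percentages, and column percentages, each with totals) can be built with `pd.crosstab`. The function name `two_way_tables` and the toy data below are hypothetical; they are not the module's implementation and not the Titanic data.

```python
import pandas as pd

def two_way_tables(rows, cols):
    """Return (counts, row_pct, col_pct) for two categorical Series.

    Hypothetical stand-in for a two-way frequency helper; not the
    implementation used in dataworkflow.visualize.
    """
    counts = pd.crosstab(rows, cols, margins=True)
    # Each row sums to 1; the extra 'All' row is the overall distribution.
    row_pct = pd.crosstab(rows, cols, normalize='index', margins=True)
    # Each column sums to 1; the extra 'All' column is the overall distribution.
    col_pct = pd.crosstab(rows, cols, normalize='columns', margins=True)
    return counts, row_pct, col_pct

# Toy data (not the Titanic dataset): 3 female and 5 male passengers.
sex = pd.Series(['female'] * 3 + ['male'] * 5, name='Sex')
age_group = pd.Series(['Child', 'Adult', 'Adult',
                       'Child', 'Adult', 'Adult', 'Adult', 'Adult'],
                      name='age_group')

counts, row_pct, col_pct = two_way_tables(sex, age_group)
print(row_pct)  # share of children/adults within each sex
print(col_pct)  # share of each sex within each age group
```

Reading across a row of `row_pct` answers "what fraction of this sex were children?", while reading down a column of `col_pct` answers "what fraction of this age group were female?". These are the two views described in Figure 1a.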
In [4]:
sex_c = freq2_store(titanic['Sex'],
titanic['Pclass'])[1][['1st','2nd', '3rd']]*100
sex_c.index = ['Female','Male', 'Population']
ez_bar(sex_c, 'Class Distribution', '% of Total')
sex_c.transpose()/100
Out[4]:
Figure 1b: Frequency table and bar graph showing how each sex and the overall population were spread across the 3 classes on the Titanic.
Notes: The largest share of both sexes traveled in 3rd class, followed by 1st and then 2nd. The same pattern can be seen in the general population as well. Female passengers were fairly evenly spread across the classes, with the largest portion in 3rd class (45%). Male passengers were more skewed towards 3rd class. In fact, there were more male passengers in 3rd class than in 1st and 2nd combined.
In [5]:
sex_surv = freq2_store(titanic['Survived'],
titanic['Sex'])[2].loc[['Died', 'Lived']]*100
sex_surv.columns = ['Female', 'Male', 'Population']
ez_bar(sex_surv.transpose(), 'Death & Survival Rates', '% of Total')
sex_surv/100
Out[5]:
Figure 1c: Frequency table & bar graph depicting passenger survival & death rates for both sexes and the overall population.
Notes: Overall, the survival rate for female passengers (74%) was more than 3 times that of male passengers (19%). Roughly 3/4 of female passengers survived aboard the Titanic, compared to only 1/5 of male passengers.
In [6]:
#frq = freq2_store(titanic['Sex'],titanic['Survived'])[2].ix[:-1]
sex_s = freq2_store(titanic['Survived'],
titanic['Sex'])[1][['female','male']]*100
sex_s.index = ['Died', 'Lived', 'Population']
sex_s.columns = ['Female','Male']
ez_bar(sex_s, 'Distribution of Sexes', '% of Total')
sex_s.transpose()/100
Out[6]:
Figure 1d: Frequency table & bar graph depicting the percentage of each sex who died, survived, and traveled aboard the Titanic.
Notes: Male passengers represented 65% of passengers aboard the Titanic, and only 32% of the survivors. Female passengers made up 35% of the population, but accounted for 68% of all surviving passengers.
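Figures 1c and 1d are two views of the same counts, so they can be checked against each other: a group's share of the survivors equals its population share times its survival rate, divided by the overall survival rate. A quick sketch using the rounded percentages quoted in the notes above (so the results are approximate):

```python
# Rounded figures quoted in the notes for Figures 1c and 1d.
pop_share = {'Female': 0.35, 'Male': 0.65}   # share of all passengers
surv_rate = {'Female': 0.74, 'Male': 0.19}   # survival rate within each sex

# Overall survival rate: weighted average of the per-sex rates.
overall = sum(pop_share[s] * surv_rate[s] for s in pop_share)

# Share of survivors belonging to each sex.
survivor_share = {s: pop_share[s] * surv_rate[s] / overall for s in pop_share}

# Representation ratio: share among survivors vs. share in the population.
representation = {s: survivor_share[s] / pop_share[s] for s in pop_share}

for s in pop_share:
    print(f"{s}: {survivor_share[s]:.0%} of survivors, "
          f"{representation[s]:.2f}x their population share")
```

With these rounded inputs the calculation reproduces the 68% / 32% survivor split, and the female representation ratio comes out a little under 2.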
In [7]:
ez_graph1(titanic['Sex'],titanic['Pclass'], titanic['Survived'],'1st', iVals=['Female', 'Male', 'Population'])
ez_graph1(titanic['Sex'],titanic['Pclass'], titanic['Survived'],'2nd', iVals=['Female', 'Male', 'Population'])
ez_graph1(titanic['Sex'],titanic['Pclass'], titanic['Survived'],'3rd', iVals=['Female', 'Male', 'Population'])
Out[7]:
Figure 1e: Bar graphs depicting survival rates across each class for the sexes and general population.
Note: Female passengers had the highest survival rates in each class. Male passengers faced low survival rates across all classes; however, 1st class male passengers were more than twice as likely to survive as those traveling in 2nd or 3rd class. The data on class survival rates for the female population is also notable: female passengers in 1st (97%) and 2nd (92%) class were almost certain to survive, while 3rd class travelers lost 50% of their population. The impact of class on survival will be explored further in Section 3.
Female passengers were found to have higher survival rates across all measured categorizations of the data. Female passengers were also represented in the surviving group at nearly twice the rate they existed in the general population. This evidence strongly suggests that females were more likely than males to survive. Therefore, the conclusion of this analysis is that sex did play a role in survival aboard the Titanic.
In [8]:
sns.boxplot(data=titanic['Age'])
ylabel('Passenger Age')
Out[8]:
Figure 2a: Box & whisker plot of Titanic passenger ages.
Notes: The average age of passengers on the Titanic was 30. Passenger ages ranged from just under 6 months to 80 years old.
In [9]:
age_sex = freq2_store(titanic['age_group'],
titanic['Sex'])[2].loc[['Child','Adult','Unknown'],['female','male','RowTotals']]*100
age_sex.index = ['Child', 'Adult', 'Age Unknown']
age_sex.columns = ['Female','Male', 'Population']
ez_bar(age_sex.transpose(), 'Sex Distribution', '% of Total')
age_sex/100
Out[9]:
Figure 2b: Stacked bar graph depicting the percentage of each age group across the sexes and general population.
Notes: Adults (18+) made up 67% of the population, children (under 18) made up 13%, and 20% of passengers' ages were unknown. We know from the previous section that the majority of adults were male (66%) and that children were split more evenly between females (49%) and males (51%).
In [10]:
# rows are selected by position (0 = Child, 1 = Adult, 3 = totals); columns by label
age_cls = freq2_store(titanic['age_group'],
titanic['Pclass'])[1].iloc[[0, 1, 3]][['1st', '2nd', '3rd']]*100
age_cls.index = ['Child', 'Adult', 'Population']
ez_bar(age_cls, 'Class Distribution', '% of Total')
print('Class Distribution')
age_cls.transpose()/100
Out[10]:
Figure 2c: Frequency table and stacked bar graph depicting the class distribution of each age group and of the overall population.
Notes: The largest share of both adults and children traveled in 3rd class. There were more than twice as many children in 3rd class (69%) as in the 1st (11%) and 2nd (20%) classes combined. Adults were distributed across the classes in a pattern similar to the general population, with the largest share in 3rd class (46%), followed by 1st (29%) and then 2nd (25%).
In [11]:
age_surv = freq2_store(titanic['Survived'], titanic['age_group'])
pt = age_surv[2].loc[['Died','Lived'],['Child','Adult','RowTotals']]
pt.columns = ['Child', 'Adult', 'Population']
print('Death & Survival Rates')
pt
Out[11]:
In [12]:
ag =age_surv[1][['Child', 'Adult', 'Unknown']].transpose()
ag.columns = ['Died','Lived','Population']
print('Age Group Distributions')
ag
Out[12]:
Figure 2d: Frequency tables depicting (1) death and survival rates for each age group and (2) distribution of age groups across dead & surviving passengers. Population statistics were added to each table for comparison.
Notes: On average, children had higher survival rates (54%) than adults (38%). Surviving passengers were made up of 18% children, 67% adults, and 15% of an unknown age. Adults were represented among both the dead and the survivors at approximately the same rate as in the general population (~67%). Children were overrepresented in the surviving group (by ~5 percentage points) and underrepresented among the dead (by ~3 percentage points).
In [13]:
#create tables
tab = freq3_store(titanic['Survived'],titanic['Pclass'], titanic['age_group'])[1]
tot_by_class = freq2_store(titanic['Survived'], titanic['Pclass'])[2].loc[['Died', 'Lived'],['1st', '2nd', '3rd']]
#1st class
tab_1st = tab['1st'].loc[['Died','Lived'],['Child','Adult']]*100
tab_1st.index.name = None
tab_1st.columns.name = None
tab_1st['Population'] = tot_by_class['1st']*100
#2nd class
tab_2nd = tab['2nd'].loc[['Died','Lived'],['Child','Adult']]*100
tab_2nd.index.name = None
tab_2nd.columns.name = None
tab_2nd['Population'] = tot_by_class['2nd']*100
#3rd class
tab_3rd = tab['3rd'].loc[['Died','Lived'],['Child','Adult']]*100
tab_3rd.index.name = None
tab_3rd.columns.name = None
tab_3rd['Population'] = tot_by_class['3rd']*100
#stacked bar graphs
ez_bar(tab_1st.transpose(), '1st Class Death & Survival Rates', '% of Total')
ez_bar(tab_2nd.transpose(), '2nd Class Death & Survival Rates', '% of Total')
ez_bar(tab_3rd.transpose(), '3rd Class Death & Survival Rates', '% of Total')
Out[13]:
Figure 2e: Stacked bar graphs displaying death and survival rates by class for each age group and population.
Notes: Children had the highest rates of survival in each class. The largest gap in survival statistics between age groups was found in the upper classes, where adults were more than 4 times as likely to die. The lowest rates of survival for both adults and children existed in 3rd class. Interestingly, the same pattern emerges for children as seen previously with female passengers. Children traveling in the upper classes had a ~91% rate of survival, while those traveling in 3rd class experienced a much lower survival rate of 37%. The impact of class will be explored further in Section 3.
In [14]:
age_sex_surv = freq3_store(titanic['Survived'],
titanic['age_group'], titanic['Sex'])
fem_ch = age_sex_surv[1]['Child'].loc[['Died', 'Lived'], ['female']]
fem_a = age_sex_surv[1]['Adult'].loc[['Died', 'Lived'], ['female']]
fem_total = freq2_store(titanic['Survived'],
titanic['Sex'])[2].loc[['Died', 'Lived'], ['female']]
fem_surv = pd.concat([fem_ch, fem_a, fem_total], axis=1)*100
fem_surv.columns = ['Child', 'Adult', 'Population']
fem_surv.index.name = None
ez_bar(fem_surv.transpose(), 'Female Death & Survival Rates', '% of Total')
'----------------------------------------------'
male_ch = age_sex_surv[1]['Child'].loc[['Died', 'Lived'], ['male']]
male_a = age_sex_surv[1]['Adult'].loc[['Died', 'Lived'], ['male']]
male_total = freq2_store(titanic['Survived'],
titanic['Sex'])[2].loc[['Died', 'Lived'], ['male']]
mal_surv = pd.concat([male_ch, male_a, male_total], axis=1)*100
mal_surv.columns = ['Child', 'Adult', 'Population']
mal_surv.index.name = None
ez_bar(mal_surv.transpose(), 'Male Survival & Death Rates', '% of Total')
Out[14]:
Figure 2f: Stacked bar graphs depicting male and female survival rates within each age group. The survival & death rates for each population have also been included for comparison.
Notes: The data on the male population shows that children (40%) had higher survival rates than adults (18%), which is consistent with the previous findings in this analysis. Surprisingly, the data from the female population breaks this trend and provides the first example of adults (77%) with higher survival rates than children (69%). This will be explored further below.
In [15]:
data = freq4_store(titanic['Survived'],titanic['Pclass'],titanic['Sex'], titanic['age_group'])[1]
data_totals = freq3_store(titanic['Survived'], titanic['Pclass'], titanic['Sex'])[1]
for sex in ['female', 'male']:
    for val in ['1st', '2nd', '3rd']:
        # create data table
        table = data[val][sex].loc[['Died', 'Lived'], ['Child', 'Adult']]
        table['Population'] = data_totals[val][sex]
        table.index.name = None
        table.columns.name = None
        # convert proportions to percentages to match the '% of Total' axis
        table = table * 100
        graph = ez_bar(table.transpose(),
                       '{} Class, {} Death & Survival Rates'.format(val.capitalize(), sex.capitalize()),
                       '% of Total')
Figure 2g: Bar graphs of death & survival rates for age groups within each class/sex grouping. The death and survival rates for each class/sex population are also included.
Note: The data shows children had the highest rates of survival in each sex/class population, with the exception of 1st class, female passengers. Figure 2f revealed that on average, female adults had higher survival rates than female children. However, we now see that this only occurred within the 1st class population.
The numbers also show that children and adults traveling in 3rd class had the lowest rates of survival, with the exception of adult males. Adult male passengers traveling in 2nd class had lower survival rates than those in 3rd class, which expands on the findings concerning male averages in Figure 1e.
In [16]:
data = freq3_store(titanic['Pclass'],
titanic['Sex'],titanic['age_group'])
counts = data[0]['female'].loc[['1st', '2nd', '3rd', 'All'], ['Child', 'Adult']]
perc = data[1]['female'].loc[['1st', '2nd', '3rd', 'All'], ['Child', 'Adult']]
perc.columns = ['% of Children', '% of Adults']
table = pd.concat((counts['Child'], perc['% of Children'], counts['Adult'], perc['% of Adults']), axis=1)
table.index.name = None
table
Out[16]:
In [17]:
fem_data = freq4_store(titanic['Pclass'],
titanic['Sex'], titanic['Survived'],
titanic['age_group'])[0]['female']
child = fem_data['Died']['Child']
child_death = child / table['Child']
adult = fem_data['Died']['Adult']
adult_death = adult / table['Adult']
ftable = pd.concat((child, child_death, adult, adult_death), axis=1)
ftable.columns = ['Child Deaths', 'Death Rate', 'Adult Deaths', 'Death Rate']
ftable
Out[17]:
Figure 2h: (1) Frequency table depicting the class distribution of female adults and children. Values have been listed in counts and as a percent of the total. (2) Frequency table depicting the number of dead passengers and overall death rates of female adults and children in each class.
Note: There were almost 4 times as many female adults as children aboard the Titanic. Adult females in first class outnumbered children nearly 10 to 1. This explains the higher adult female survival rates seen in Figure 2g. With only 8 total female children in first class, a single death dropped the survival rate by 13%, while 2 deaths out of 77 total adults decreased the survival rate by only 3%.
Female adults were spread fairly evenly across the classes, while children were primarily located in 3rd class. Females from both age groups traveling in 3rd class experienced the highest death rates by a wide margin. The numbers show that 67% of adult females traveled in 1st & 2nd class, where survival rates exceeded 90%, while 63% of female children traveled in 3rd class, where the survival rate was 54%. Because each overall rate is a weighted average of the class rates, these differences in population distribution and the lopsided survival rates explain why we saw higher survival rates for adult females in Figure 2f.
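The effect described in this note is a weighted average at work: an overall rate depends on how a group is spread across the classes as well as on the per-class rates. The counts below are hypothetical, chosen only to illustrate the mechanism; they are not the Titanic figures.

```python
# Hypothetical counts per class band: (total passengers, survivors).
children = {'upper': (10, 9), '3rd': (30, 16)}
adults = {'upper': (60, 54), '3rd': (30, 15)}

def class_rates(group):
    """Survival rate within each class band."""
    return {cls: lived / total for cls, (total, lived) in group.items()}

def overall_rate(group):
    """Overall survival rate = total survivors / total passengers."""
    total = sum(t for t, _ in group.values())
    lived = sum(l for _, l in group.values())
    return lived / total

print(class_rates(children), overall_rate(children))
print(class_rates(adults), overall_rate(adults))
```

In this toy example children do at least as well as adults within every class band, yet the adults' overall rate is higher because most of them sit in the high-survival band.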
On average, passengers under the age of 18 were more likely to survive than adults (18+). Children represented 18% of survivors, while only representing 13% of the population. Children experienced higher survival rates across each sub-division of the population, with the exception of one. Upon further examination, the exception can be traced to a small sample size and a lurking variable (i.e., class). The conclusion of these findings is that passengers under the age of 18 (i.e., children) were more likely to survive aboard the Titanic.
In [18]:
tab_class = freq1_store(titanic['Pclass'])[1].loc[['1st','2nd','3rd'], ['Count']]
tab_class.index.name = None
print('\n Titanic Class Distribution')
tab_class
Out[18]:
Figure 3a: Frequency table displaying percentage of population located within each class.
Note: Passengers aboard the Titanic were separated into 3 classes. The majority of passengers traveled in 3rd class (55%), followed by 1st (24%) and 2nd (21%). Previously, we saw that 3rd class also contained the largest proportion of men (60%), women (46%), children (69%) and adults (46%).
In [19]:
graph_data = (freq2_store(titanic['Survived'],titanic['Pclass'])
[1][['1st', '2nd', '3rd']])*100
graph_data.index = ['Died', 'Lived', 'Population']
ez_bar(graph_data, 'Class Distribution', '% of total')
graph_data.transpose()
Out[19]:
Figure 3b: Frequency table & bar graph showing class distribution within the dead & surviving populations. The class breakdown for the entire population has been included for comparison.
Note: The majority of passengers who died aboard the Titanic were traveling in 3rd class (68%). When comparing the dead & surviving passenger distributions to the population, we see that 3rd class passengers accounted for a larger portion of the dead (+13%) and a smaller portion of the living (-20%). 1st and 2nd class followed the opposite pattern, accounting for less of the dead (-9%, -3%) and more of the survivors (+16%, +4%).
In [20]:
tb = freq2_store(titanic['Pclass'],
titanic['Survived'])[1][['Died','Lived']]*100
tb.index = ['1st', '2nd', '3rd', 'Population']
ez_bar(tb,'Death & Survival Rates', '% of total')
tb.transpose()/100
Out[20]:
Figure 3c: Frequency table & bar graph depicting death & survival rates for each class and the overall population.
Note: Passengers traveling in 1st class were most likely to survive (63%) and passengers traveling in 3rd class were least likely to survive (24%). The death rate for 3rd class passengers (76%) was slightly more than double that of 1st class passengers (37%). A familiar trend appears when comparing class rates against the population: 3rd class passengers have lower survival and higher death rates, while 1st and 2nd class passengers experience the opposite.
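The comparison of 3rd class and 1st class death rates in this note is a ratio of two rates (sometimes called a risk ratio). A quick check using the rounded rates quoted in the note:

```python
# Rounded death rates quoted in the note for Figure 3c.
death_rate = {'1st': 0.37, '3rd': 0.76}

# Ratio of death rates: how many times more likely a 3rd class passenger
# was to die than a 1st class passenger.
risk_ratio = death_rate['3rd'] / death_rate['1st']

# Absolute gap between the two rates, in percentage points.
risk_difference = (death_rate['3rd'] - death_rate['1st']) * 100

print(f"risk ratio: {risk_ratio:.2f}, difference: {risk_difference:.0f} points")
```

Reporting both the ratio and the percentage-point gap avoids the ambiguity of saying a rate is "x% higher".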
In [21]:
ez_graph3(titanic['Survived'],
titanic['age_group'],titanic['Pclass'],
'Child', 'Child Survival & Death Rates')
ez_graph3(titanic['Survived'],
titanic['Sex'],titanic['Pclass'],
'female', 'Female Survival & Death Rates')
ez_graph3(titanic['Survived'],
titanic['age_group'],titanic['Pclass'],
'Adult', 'Adult Survival & Death Rates')
ez_graph3(titanic['Survived'],
titanic['Sex'],titanic['Pclass'],
'male', 'Male Survival & Death Rates')
Out[21]:
Figure 3d: Bar graphs depicting death and survival rates across classes within each sex and age group. The population rates for each sex and age group were also added for comparison.
Note: Again we see that 3rd class passengers experienced the lowest survival rates across each sex and age group. The largest imbalance is found among children, who survived at rates of 92% (1st) and 91% (2nd) in the upper classes and just 37% in 3rd class. A similar trend is seen in the female population, with 1st and 2nd class survival rates of 97% and 92%, respectively, compared to only 50% in 3rd class. Adult data followed suit, with survival rates of 64% in 1st class, 41% in 2nd, and 20% in 3rd. Interestingly, the male passenger data deviated slightly from this pattern. The highest survival rates still belonged to 1st class passengers (37%); however, the survival rates for 2nd class (16%) were nearly as low as those for 3rd class (14%).
In [22]:
ez_graph2(titanic['Pclass'],
titanic['age_group'], titanic['Survived'],
'Adult', 'Adult Class Distribution')
Out[22]:
In [23]:
ez_graph2(titanic['Pclass'],
titanic['age_group'], titanic['Survived'],
'Child', 'Child Class Distribution')
Out[23]:
In [24]:
ez_graph2(titanic['Pclass'],
titanic['Sex'], titanic['Survived'],
'male', 'Male Class Distribution')
Out[24]:
In [25]:
ez_graph2(titanic['Pclass'],
titanic['Sex'], titanic['Survived'],
'female', 'Female Class Distribution')
Out[25]:
Figure 3e: Frequency tables & bar graphs depicting the class distribution of passengers who died, survived, and existed within each sex and age group.
Note: The numbers show the majority of dead passengers from each group were traveling in 3rd class - adults (59%), children (94%), males (64%), females (88%). When comparing these numbers to the population statistics, we see that 3rd class passengers were overrepresented in each case. The most lopsided numbers are found with children, whose 3rd class travelers accounted for 94% of the dead and only 68% of the population.
The data also shows that 1st class passengers made up the largest portion of surviving adults (48%) and females (39%). In both cases, first class passengers appeared at higher percentages than they did in their respective populations. Men, on the other hand, had nearly as many 1st class (41%) as 3rd class (43%) survivors. However, this is misleading, as there were nearly 3 times as many male passengers traveling in 3rd class (60%) as there were in 1st (21%). The majority of surviving children were traveling in 2nd (34%) and 3rd (48%) class, but this too is misleading. In comparison to the rate they appeared in the population, both 1st (+7%) and 2nd (+14%) class survivors were overrepresented, while 3rd class (-21%) survivors were underrepresented.
In [26]:
#Death & Survival rates by class for each Sex/Age Group
data = freq4_store(titanic['Survived'],titanic['age_group'],titanic['Sex'],titanic['Pclass'])[1]
data_totals = freq3_store(titanic['Survived'], titanic['age_group'], titanic['Sex'])[1]
for sex in ['female', 'male']:
    for val in ['Child', 'Adult']:
        # create data table (values are printed as proportions, 0-1)
        table = data[val][sex].loc[['Died', 'Lived'], ['1st', '2nd', '3rd']]
        table['Population'] = data_totals[val][sex]
        table.index.name = None
        table.columns.name = None
        print("\n\n {} ({}) Death & Survival Rates \n\n".format(sex.capitalize(), val), table)
Figure 3f: Frequency tables of death & survival rates by class for each sex/age group pairing. The death and survival rates for each population are also included for comparison.
Note: The 3rd class passengers from each sex/age grouping experienced higher than average death rates and lower than average survival rates in comparison to their populations. The 1st class passengers from each of these groups experienced the opposite, with below average death rates and above average survival rates.
Upper class (1st & 2nd) passengers experienced the highest survival rates across each sex/age grouping, with the exception of adult males. This is in line with our findings from Figure 3d. However, we now see that male passengers didn't fit the pattern because of the high mortality rate of 2nd class adult males. In fact, this is the only instance where 3rd class passengers had higher survival rates than 2nd class. The effect of this on the overall male population was partially offset by the children, whose survival rates along class lines matched the trend seen in the other population sub-groups.
In [27]:
#Death/Survival Rates Per Sex/Age Group
data = freq4_store(titanic['Pclass'], titanic['Sex'], titanic['age_group'],titanic['Survived'])[1]
totals = freq3_store(titanic['Pclass'], titanic['Sex'], titanic['age_group'])[1]
for sex in ['female', 'male']:
    for val in ['Child', 'Adult']:
        # create data table (values are printed as proportions, 0-1)
        table = data[sex][val].loc[['1st', '2nd', '3rd']]
        table['Population'] = totals[sex][val]
        table.index.name = None
        table.columns.name = None
        print("\n Class Distribution of {} ({}) Passengers \n\n".format(sex.capitalize(), val), table)
Figure 3g: Frequency tables depicting class distribution of the dead and surviving passengers for each sex/age grouping. The class distribution for each sex/age group's population has also been included for comparison.
Note: Passengers from 3rd class represent a larger portion of the dead and smaller portion of the survivors than they do in their respective populations for each of the sex/age groupings. The most dramatic swing is seen in the adult female population, whose 3rd class passengers accounted for 83% of the dead and only 33% of the population. The data for 1st class passengers in each sex/age group was reversed. They accounted for a smaller percentage of the dead and larger percentage of the survivors than existed in their entire populations. The largest discrepancy was found in adult males, with 51% of the survivors traveling in 1st class compared to only 25% of the overall population.
A clear pattern emerged along class lines when analyzing the data set. In nearly every instance, we saw the highest chances of survival existed in 1st class, followed by 2nd, and then 3rd. We also saw that 1st and 2nd class passengers represented a larger percentage of survivors and a lower percentage of the dead than they did in the population. The 3rd class passengers experienced the opposite: they were underrepresented in the survivors and overrepresented in the dead compared to their population rates. This data indicates that class played a major role in survival aboard the Titanic. The 1st and 2nd classes yielded much higher chances for survival, while traveling in 3rd class resulted in a higher likelihood of death.
While the findings in this study were interesting, additional analysis is required to validate the assumptions included in this report. The results will need to be tested for statistical significance, and experimentation will be required to confirm which (if any) of the variables identified in this analysis are causal.
There were also limitations in the dataset provided, which may have impacted the outcome. For example, we did not have passenger ages for about 20% of the population. I elected to separate these passengers into their own category for analysis, but there are alternative approaches for handling these situations and each decision may influence results.
Lastly, the data provided did not include all possible variables surrounding death and survival aboard the Titanic. It would have been useful to know details on how the ship sank, where passengers were located at the time of sinking, and whether passengers made it onto a lifeboat, to name a few.
Of the 891 passengers in this dataset, 549 (62%) perished aboard the Titanic. The analysis of this data indicates that survival was closely tied to a passenger's age, sex, and class. Passengers who were female, under the age of 18, or traveling in 1st class were much more likely to survive than their counterparts. Adult males and passengers traveling in 3rd class were the least likely to survive aboard the Titanic. The next step required for this analysis is to identify whether any of these findings are statistically significant. Any results found to be statistically significant will require further testing/experimentation to assess causation.
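As one possible starting point for the significance testing mentioned here, a chi-square test of independence could be applied to a table such as sex by survival. The sketch below computes the statistic by hand. The four cell counts are an assumption: they are the values commonly reported for the Kaggle training file and agree with the totals in this report (891 passengers, 549 deaths), but they should be regenerated from the data before being relied on.

```python
# Assumed observed counts (verify against the dataset).
# Rows = sex, columns = (Died, Lived).
observed = {
    'female': (81, 233),
    'male': (468, 109),
}

row_totals = {sex: sum(cells) for sex, cells in observed.items()}
col_totals = [sum(cells[i] for cells in observed.values()) for i in range(2)]
n = sum(row_totals.values())

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence.
chi2 = 0.0
for sex, cells in observed.items():
    for i, obs in enumerate(cells):
        expected = row_totals[sex] * col_totals[i] / n
        chi2 += (obs - expected) ** 2 / expected

# Critical value for 1 degree of freedom at the 5% level.
CRITICAL_5PCT_DF1 = 3.841
print(f"chi2 = {chi2:.1f}, exceeds critical value: {chi2 > CRITICAL_5PCT_DF1}")
```

No continuity correction is applied in this sketch. A library routine such as `scipy.stats.chi2_contingency` applies one by default for 2x2 tables, so its statistic will differ slightly. A significant result would still only indicate association, not causation.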
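A minimal sketch of the "separate category" approach described in this paragraph, next to the simplest alternative (dropping passengers with no recorded age). The toy ages and the exact bin edges are assumptions for illustration, not the code in `dataworkflow.data`.

```python
import numpy as np
import pandas as pd

# Toy ages (not the Titanic data); NaN marks a missing age.
ages = pd.Series([0.5, 17, 18, 45, np.nan, 80])

# Left-closed bins: [0, 18) -> Child, [18, inf) -> Adult. Missing ages stay NaN.
groups = pd.cut(ages, bins=[0, 18, np.inf], right=False,
                labels=['Child', 'Adult'])

# Option 1: keep every passenger and label missing ages 'Unknown'.
with_unknown = groups.cat.add_categories('Unknown').fillna('Unknown')

# Option 2: drop passengers whose age is missing.
known_only = groups.dropna()

print(with_unknown.value_counts(normalize=True))
print(known_only.value_counts(normalize=True))
```

The two options give different denominators, so the child and adult percentages shift depending on which one is chosen. This is one way the handling of missing ages can influence results.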
This appendix contains the initial exploration of passenger data from Kaggle and all modifications made to the dataset for this analysis. Changes to the original dataset have been packaged into the dataworkflow.data module.
In [28]:
path = '/Users/jco/Desktop/data_science/Udacity/project_2/Titanic/train.csv'
titanic = pd.read_csv(path)
Dimension/Type
In [29]:
titanic.shape
Out[29]:
In [30]:
titanic.info()
Sample
In [31]:
titanic.head()
Out[31]:
Summary Stats
In [32]:
#Numerical Summary
titanic.describe()
Out[32]:
In [33]:
#Categorical Summary
cat_data = titanic.dtypes[titanic.dtypes == 'object'].index
titanic[cat_data].describe()
Out[33]:
PassengerId
Decision: Delete
In [34]:
titanic['PassengerId'].describe()
Out[34]:
Survived
Decision: Modify. Convert to category; replace numbers with text (Died, Lived).
In [35]:
titanic['Survived'].describe()
Out[35]:
Pclass
Decision: Modify. Convert to ordered category; replace numbers with text (1st, 2nd, 3rd).
In [36]:
titanic['Pclass'].describe()
Out[36]:
Name
Decision: Delete
In [37]:
titanic['Name'].describe()
Out[37]:
Sex
Decision: Modify. Keep values; convert dtype to category.
In [38]:
titanic['Sex'].describe()
Out[38]:
In [39]:
titanic['Sex'].dtype
Out[39]:
Age
Decision: Transform. Keep current values and leave missing data empty for now. Create a new variable with the categories Child (<18), Adult (18+), and Unknown.
In [40]:
titanic['Age'].describe()
Out[40]:
SibSp
Decision: Merge. Combine with Parch to create new variables identifying passengers traveling with family and how many family members.
In [41]:
titanic['SibSp'].describe()
Out[41]:
Parch
Decision: Merge. Combine with SibSp to create new variables identifying passengers traveling with family and the family count.
In [42]:
titanic['Parch'].describe()
Out[42]:
Ticket
Decision: Delete
In [43]:
titanic['Ticket'].describe()
Out[43]:
In [44]:
titanic['Ticket'].head(15)
Out[44]:
Fare
Decision: Keep
In [45]:
titanic['Fare'].describe()
Out[45]:
Cabin
Decision: Delete
In [46]:
titanic['Cabin'].describe()
Out[46]:
In [47]:
titanic['Cabin'].head(15)
Out[47]:
Embarked
Decision: Modify. Keep values; update dtype to category.
In [48]:
titanic['Embarked'].describe()
Out[48]:
In [49]:
titanic['Embarked'].dtype
Out[49]:
Numerical Category Data
In [50]:
#Survived
new_survived = pd.Categorical(titanic['Survived'])
new_survived = new_survived.rename_categories(['Died', 'Lived'])
new_survived.describe()
Out[50]:
In [51]:
#Pclass
new_pclass = pd.Categorical(titanic['Pclass'], ordered=True)
new_pclass = new_pclass.rename_categories(['1st','2nd','3rd'])
new_pclass.describe()
Out[51]:
In [52]:
#Replace with categorical data
titanic['Survived'] = new_survived
titanic['Pclass'] = new_pclass
#sanity check
print (titanic['Survived'].dtype, titanic['Pclass'].dtype)
Update dtype
In [53]:
titanic['Sex'] = pd.Series(titanic['Sex'], dtype='category')
titanic['Embarked'] = pd.Series(titanic['Embarked'], dtype='category')
#sanity check
print(titanic['Sex'].dtype, titanic['Embarked'].dtype)
In [54]:
titanic['Sex'].unique()
titanic['Sex'].cat.categories
Out[54]:
Create New Variables
In [55]:
#Family Count
titanic['FamilyTot'] = titanic['SibSp'] + titanic['Parch']
In [56]:
#Single or Family Variable
bins = [-1,0,np.inf]
labels = ['Single','Family']
fam_status = pd.cut(titanic['FamilyTot'], bins, labels=labels)
fam_status.describe() #sanity check
Out[56]:
In [57]:
#Add FamStatus
titanic['FamStatus'] = fam_status
In [58]:
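#Adult (18+), Child (<18), or Unknown (age missing)
#Bins are right-inclusive: (0, 17] -> Child, (17, 1000] -> Adult, (1000, inf] -> Unknown
#Missing ages are filled with np.inf for binning only, so they land in 'Unknown' instead of staying NaN
bins = [0, 17, 1000, np.inf]
labels = ['Child', 'Adult', 'Unknown']
age_groups = pd.cut(titanic['Age'].fillna(np.inf), bins, labels=labels)
age_groups.describe() #sanity check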
Out[58]:
In [59]:
#Add age_groups
titanic['age_group'] = age_groups
In [60]:
titanic.head() #sanity check
Out[60]:
In [61]:
# Remove PassengerId, Name, Ticket, Cabin
titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
In [62]:
titanic.head() #sanity
Out[62]:
This appendix contains the exploration of relationships between different variables. The majority of the analysis was centered on the search for correlations to death and/or survival. The dataworkflow.visualize module contains the majority of the functions used in this section for manipulating and displaying the data.
In [63]:
titanic = get_data()
In [64]:
titanic.head(10) #sanity
Out[64]:
Survival
Sex v. Survival
Age_Group v. Survival
Class v. Survival
FamilyStatus v. Survival
Port v. Survival
Fare v. Survival
Class & Sex v. Survival
AgeGroup & Sex v. Survival
Class & AgeGroup v. Survival
Class, AgeGroup, Sex v. Survival
In [65]:
titanic['Survived'].describe()
Out[65]:
In [66]:
titanic['Survived'].unique()
Out[66]:
In [67]:
freq1_display(titanic['Survived'])
Survived by Sex
In [68]:
freq2_display(titanic['Survived'], titanic['Sex'])
Sex & Class
In [69]:
freq2_display(titanic['Sex'], titanic['Pclass'])
Sex & Age Group
In [70]:
freq2_display(titanic['Sex'], titanic['age_group'])
Survived by AgeGroup
In [71]:
freq2_display(titanic['Survived'], titanic['age_group'])
Class & Age Group
In [72]:
freq2_display(titanic['Pclass'], titanic['age_group'])
Survived by Age
In [73]:
lived_age = titanic['Age'][titanic['Survived'] == 'Lived'].reset_index(drop=True)
died_age = titanic['Age'][titanic['Survived'] == 'Died'].reset_index(drop=True)
print('--------Lived------\n', lived_age.describe(), end='\n\n')
print('--------Died--------\n',died_age.describe())
In [74]:
sns.boxplot(data=[lived_age, died_age])
Out[74]:
Survival by Class
In [75]:
freq2_display(titanic['Survived'], titanic['Pclass'])
Survival by FamilyStatus
In [76]:
freq2_display(titanic['Survived'], titanic['FamStatus'])
Survival by Port
In [77]:
freq2_display(titanic['Survived'], titanic['Embarked'])
Survival by Fare
In [78]:
lived_fares = titanic['Fare'][titanic['Survived']=='Lived'].reset_index(drop=True)
died_fares = titanic['Fare'][titanic['Survived']=='Died'].reset_index(drop=True)
print('\n--------------Lived-----------\n', lived_fares.describe())
print('\n--------------Died-------------\n', died_fares.describe())
In [79]:
sns.boxplot(data=[lived_fares, died_fares])
Out[79]:
In [80]:
sns.distplot(lived_fares, rug=True)
Out[80]:
In [81]:
sns.distplot(died_fares, rug=True)
Out[81]:
Survival by Class & Sex
In [82]:
freq3_display(titanic['Survived'],titanic['Pclass'],titanic['Sex'])
Survival by AgeGroup & Sex
In [83]:
freq3_display(titanic['Survived'], titanic['age_group'], titanic['Sex'])
Survival by Class & AgeGroup
In [84]:
freq3_display(titanic['Survived'], titanic['Pclass'], titanic['age_group'])
Survival by Class, Sex, AgeGroup
In [85]:
freq4_display(titanic['Survived'],titanic['Pclass'],titanic['age_group'],titanic['Sex'])
In [86]:
surv_class_sex_age = freq4_store(titanic['Survived'],titanic['Pclass'],titanic['age_group'],titanic['Sex'])
surv_class_sex_age[1]['1st']/surv_class_sex_age[1]['1st'].loc['All']
Out[86]:
In [87]:
surv_class_sex_age[1]['2nd']/surv_class_sex_age[1]['2nd'].loc['All']
Out[87]:
In [88]:
surv_class_sex_age[1]['3rd']/surv_class_sex_age[1]['3rd'].loc['All']
Out[88]: