In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style(style="darkgrid")
# plt.rcParams['figure.figsize'] = [12.0, 8.0] # make plots size, double of the notebook normal
plt.rcParams['figure.figsize'] = [9.0, 6.0] # make plots larger than the notebook default
from IPython.display import display, HTML # to use display() to always have well formatted html table output
Variable Name | Definition |
---|---|
PassengerId | A unique ID to each Passenger; 1-891 |
Survived | A boolean variable; 1 - Survived, 0 - Dead |
Pclass | Ticket Class; 1 - 1st, 2 - 2nd, 3 - 3rd class |
Name | Passenger Name |
Sex | Sex of Passenger |
Age | Age in Years |
SibSp | Number of Siblings / Spouses Aboard |
Parch | Number of parents / children aboard the titanic |
Ticket | Ticket number |
Fare | Passenger Fare |
Cabin | Cabin number |
Embarked | Port of Embarkation; C - Cherbourg, Q - Queenstown, S - Southampton |
Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
*Source: [Kaggle's - Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data)*
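As a minimal sketch of how the Age note above could be checked, the snippet below flags infant ages (fractional, below 1) and ages that look estimated (1 or more, ending in .5). The sample values and the names `ages`, `is_infant` and `is_estimated` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Illustrative ages (assumed values, not read from titanic-data.csv)
ages = pd.Series([0.42, 0.83, 22.0, 28.5, 40.5, 35.0])

# Infants: ages below 1 are recorded as fractions of a year
is_infant = ages < 1

# Estimated ages: recorded in the form xx.5 (for ages of 1 or more)
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(ages[is_infant].tolist())
print(ages[is_estimated].tolist())
```

Once missing ages are filled with group means, the filled values will generally not match either pattern, so a check like this is only meaningful on the raw column.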
In [3]:
titanic_df = pd.read_csv("titanic-data.csv", index_col=["PassengerId"])
titanic_df.head()
Out[3]:
In [4]:
titanic_df.info()
The dataset summary above shows that there are 891 entries.
However, it also shows that there are missing values in the Age, Cabin, and Embarked columns.
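The gaps can also be counted per column directly. The following is a small sketch on a made-up frame; `toy_df` and its values are assumptions for illustration only, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns that have gaps (made-up values)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# isnull() marks missing cells; summing the booleans counts them per column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

Applied to `titanic_df`, the same call would list the number of missing values for every column at once.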
To review the data by distribution and to tackle the various questions, we first need to deal with the missing ages.
If we assume that the missing ages are distributed similarly to the ages that are present, we can substitute values that represent the existing distribution.
For this we can replace the missing values with the mean.
To populate the most representative values, we take the mean by Sex and Pclass. In other words, each missing age is replaced with the mean age of passengers of the same Sex within the same Pclass.
In [5]:
mean_ages = titanic_df.groupby(['Sex','Pclass'])['Age'].mean()
display(mean_ages)
In [6]:
def replace_nan_age(row):
    if pd.isnull(row['Age']):
        return mean_ages[row['Sex'], row['Pclass']]
    else:
        return row['Age']

titanic_df['Age'] = titanic_df.apply(replace_nan_age, axis=1)
titanic_df.info()
From the above we can see that the missing ages have been filled.
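The same group-mean fill can probably be written without a row-wise `apply`. The sketch below uses `groupby(...).transform('mean')` with `fillna` on a made-up frame; `toy_df` and `group_mean_age` are hypothetical names, and this is an alternative formulation rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up passengers: one missing age in each (Sex, Pclass) group
toy_df = pd.DataFrame({
    'Sex':    ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean age of each (Sex, Pclass) group, aligned row by row with toy_df
group_mean_age = toy_df.groupby(['Sex', 'Pclass'])['Age'].transform('mean')

# Fill only the missing ages; existing ages are left untouched
toy_df['Age'] = toy_df['Age'].fillna(group_mean_age)
print(toy_df)
```

Because `transform` returns a Series with the same index as the original frame, the fill lines up with each row automatically.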
In [7]:
titanic_df.describe()
Out[7]:
In [8]:
titanic_df.Parch.hist()
plt.xlabel('Parch')
plt.ylabel('Passengers')
plt.title('Number of parents / children aboard')
Out[8]:
In [9]:
titanic_df.SibSp.hist()
plt.xlabel('SibSp')
plt.ylabel('Passengers')
plt.title('Number of Siblings / Spouses aboard')
Out[9]:
From the above we notice
In [10]:
## SUBSET DATAFRAME TO JUST THE REQUIRED DATA
survived_plass_df = titanic_df[['Survived', 'Pclass']] # works - just have to say the columns required
survived_plass_df.head()
## GROUP DATA TO CALCULATE SURVIVED & TOTAL BY PCLASS
## calculate survived by pclass
survived_by_pclass = survived_plass_df.groupby(['Pclass']).sum()
total_by_pclass = survived_plass_df.groupby(['Pclass']).count()
# total are showed as survived - so change to column name Total
total_by_pclass.rename(columns = {'Survived':'Total'}, inplace = True)
# merge separate data into one dataframe
survived_total_by_pclass = pd.merge(survived_by_pclass, total_by_pclass, left_index=True, right_index=True) # merge by index
survived_total_by_pclass
Out[10]:
In [11]:
percent_survived = (survived_total_by_pclass['Survived'] / survived_total_by_pclass['Total']) * 100
survived_total_by_pclass['Percentage'] = percent_survived
survived_total_by_pclass
Out[11]:
In [12]:
x = survived_total_by_pclass.index.values
ht = survived_total_by_pclass.Total
hs = survived_total_by_pclass.Survived
pht = plt.bar(x, ht)
phs = plt.bar(x, hs)
plt.xticks(x, x)
plt.xlabel('Pclass')
plt.ylabel('Passengers')
plt.title('Survivors by Class')
plt.legend([pht,phs],['Died', 'Survived'])
Out[12]:
Conclusion
As can be seen from the visualization and from the table above, 1st class passengers had the highest survival rate, followed by 2nd class passengers, and 3rd class passengers had the lowest. A large number of passengers were travelling in 3rd class (491), but only 24.24% of them survived.
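The survived / total / percentage table is built several times in this notebook, so the steps could be wrapped in one helper. This is a sketch under assumptions: `survival_summary` is a hypothetical function name and `toy_df` holds made-up rows, not the Titanic data.

```python
import pandas as pd

def survival_summary(df, by):
    """Return survivors, totals and survival percentage for each group of `by`."""
    # sum of the 0/1 Survived column = survivors, count = passengers in the group
    summary = df.groupby(by)['Survived'].agg(['sum', 'count'])
    summary = summary.rename(columns={'sum': 'Survived', 'count': 'Total'})
    summary['Percentage'] = summary['Survived'] / summary['Total'] * 100
    return summary

# Made-up rows for illustration
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 0],
})
by_class = survival_summary(toy_df, 'Pclass')
print(by_class)
```

The same helper would accept `'Sex'` or any other grouping column, since the group labels end up in the index of the result.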
In [13]:
## CALCULATE SURVIVED AND TOTAL BY SEX
# groupby Sex
group_by_sex = titanic_df.groupby('Sex')
# calculate survived by sex
survived_by_sex = group_by_sex['Survived'].sum()
survived_by_sex.name = 'Survived'
display(survived_by_sex)
# calculate total by sex
total_by_sex = group_by_sex['Survived'].size()
total_by_sex.name = 'Total'
display(total_by_sex)
# concat the separate results into one dataframe
survived_total_by_sex = pd.concat([survived_by_sex, total_by_sex], axis=1)
survived_total_by_sex
Out[13]:
In [14]:
percent_survived = (survived_total_by_sex['Survived'] / survived_total_by_sex['Total']) * 100
survived_total_by_sex['Percentage'] = percent_survived
survived_total_by_sex
Out[14]:
Now let's visualize it
In [15]:
x = range(len(survived_total_by_sex.index.values))
ht = survived_total_by_sex.Total
hs = survived_total_by_sex.Survived
pht = plt.bar(x, ht)
phs = plt.bar(x, hs)
plt.xticks(x, survived_total_by_sex.index.values)
plt.xlabel('Sex')
plt.ylabel('Passengers')
plt.title('Survivors by Gender')
plt.legend([pht,phs],['Died', 'Survived'])
Out[15]:
Conclusion
From the visualization and the survival percentages in the table above, we can see that females had a very high rate of survival. The female survival rate was 74.3% and the male survival rate was 18.9%, so the female survival rate was about 4 times that of males.
This suggests that females were given preference in the rescue operations, possibly with males giving up their places so that females could survive.
Let's first review the distribution of those who were alone and those who were in company.
In [16]:
is_not_alone = (titanic_df.SibSp + titanic_df.Parch) >= 1
passengers_not_alone = titanic_df[is_not_alone]
is_alone = (titanic_df.SibSp + titanic_df.Parch) == 0
passengers_alone = titanic_df[is_alone]
print('Not alone - describe')
display(passengers_not_alone.describe())
print('Alone - describe')
display(passengers_alone.describe())
In [17]:
passengers_not_alone.Age.hist(label='Not alone')
passengers_alone.Age.hist(label='Alone', alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Passengers')
plt.legend(loc='best')
plt.title('Alone & Not Alone Passenger\'s Ages')
Out[17]:
From the above distribution we can see that
Now let's review these by their survival
In [18]:
notalone = np.where((titanic_df.SibSp + titanic_df.Parch) >= 1, 'Not Alone', 'Alone')
# keep the group labels in the index: they are used as x tick labels in the plot below
loneliness_summary = titanic_df.groupby(notalone)['Survived'].agg(['sum', 'size'])
loneliness_summary = loneliness_summary.rename(columns={'sum':'Survived', 'size':'Total'})
loneliness_summary
Out[18]:
In [19]:
loneliness_summary['Percent survived'] = (loneliness_summary.Survived / loneliness_summary.Total) * 100
loneliness_summary
Out[19]:
Now let's visualize
In [20]:
x = range(len(loneliness_summary.index.values))
ht = loneliness_summary.Total
hs = loneliness_summary.Survived
pht = plt.bar(x, ht)
phs = plt.bar(x, hs)
plt.xticks(x, loneliness_summary.index.values)
plt.xlabel('Alone / Not Alone')
plt.ylabel('Passengers')
plt.title('Survivors by Alone / Not Alone')
plt.legend([pht,phs],['Died', 'Survived'])
Out[20]:
Conclusion
The percentages and the visualization above indicate that passengers who had company had a higher survival rate than those travelling alone.
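The alone / not alone comparison can also be expressed with a derived column, which keeps the grouping label next to each passenger. The sketch below runs on made-up rows; the column names `Relatives` and `Company` are hypothetical and are not columns of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
toy_df = pd.DataFrame({
    'SibSp':    [0, 1, 0, 3, 0],
    'Parch':    [0, 0, 2, 1, 0],
    'Survived': [0, 1, 1, 0, 1],
})

# Number of relatives aboard (hypothetical helper column)
toy_df['Relatives'] = toy_df['SibSp'] + toy_df['Parch']
toy_df['Company'] = np.where(toy_df['Relatives'] >= 1, 'Not Alone', 'Alone')

# The mean of a 0/1 column is the survival rate of each group
rate_by_company = toy_df.groupby('Company')['Survived'].mean() * 100
print(rate_by_company)
```

Taking the mean of `Survived` gives the percentage in one step, without computing the survived and total counts separately.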
First, let's review the age distribution by gender
In [21]:
male_ages = (titanic_df[titanic_df.Sex == 'male'])['Age']
male_ages.describe()
Out[21]:
In [22]:
female_ages = (titanic_df[titanic_df.Sex == 'female'])['Age']
female_ages.describe()
Out[22]:
In [23]:
male_ages.hist(label='Male')
female_ages.hist(label='Female', alpha=0.6) # semi-transparent so the male bars stay visible
plt.xlabel('Age')
plt.ylabel('Passengers')
plt.title('Male & Female passenger ages')
plt.legend(loc='best')
Out[23]:
From above distribution, we can see that:
Now let's do the survival analysis by age group
In [24]:
def age_group(age):
    if age >= 80:
        return '80-89'
    if age >= 70:
        return '70-79'
    if age >= 60:
        return '60-69'
    if age >= 50:
        return '50-59'
    if age >= 40:
        return '40-49'
    if age >= 30:
        return '30-39'
    if age >= 20:
        return '20-29'
    if age >= 10:
        return '10-19'
    if age >= 0:
        return '0-9'

titanic_df['AgeGroup'] = titanic_df.Age.apply(age_group)
titanic_df.head()
Out[24]:
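The decade bins could also be produced with `pd.cut` instead of a chain of `if` statements. This is a sketch on made-up ages, not the approach used above; `bin_edges`, `bin_labels` and `age_groups` are hypothetical names. Note one difference: with these edges, an age of 90 or more would fall outside the bins and become NaN, whereas the function above labels it '80-89'.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 9.0, 10.0, 29.9, 30.0, 80.0])

# Decade edges 0, 10, ..., 90; right=False makes each bin [low, high)
bin_edges = list(range(0, 100, 10))
bin_labels = ['{}-{}'.format(low, low + 9) for low in bin_edges[:-1]]

age_groups = pd.cut(ages, bins=bin_edges, right=False, labels=bin_labels)
print(age_groups.tolist())
```

The result is an ordered categorical, so grouping and plotting follow the age order without sorting the labels by hand.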
In [25]:
# keep the age group labels in the index: they are used as x tick labels in the plot below
age_group_summary = titanic_df.groupby(['AgeGroup'])['Survived'].agg(['sum', 'size'])
age_group_summary = age_group_summary.rename(columns={'sum':'Survived', 'size':'Total'})
age_group_summary
Out[25]:
In [26]:
x = range(len(age_group_summary.index.values))
ht = age_group_summary.Total
hs = age_group_summary.Survived
pht = plt.bar(x, ht)
phs = plt.bar(x, hs)
plt.xticks(x, age_group_summary.index.values)
plt.xlabel('Age groups')
plt.ylabel('Passengers')
plt.title('Survivors by Age group')
plt.legend([pht,phs],['Died', 'Survived'])
Out[26]:
In [27]:
age_group_summary['SurvivedPercent'] = (age_group_summary.Survived / age_group_summary.Total) * 100
age_group_summary['DiedPercent'] = ((age_group_summary.Total - age_group_summary.Survived) / age_group_summary.Total) * 100
age_group_summary
Out[27]:
From the above visualization and percentages we can see that the largest number of survivors came from the 20-29 age group.
Interestingly, however, the survival percentage of the 0-9 age group is best, at 61.29%.
We have also seen above that females had a better survival rate, so these survival rates must be a mix of male and female survival rates. To get a better view, the gender aspect should also be taken into consideration.
In [28]:
sex_agegroup_summary = titanic_df.groupby(['Sex','AgeGroup'], as_index=False)['Survived'].mean()
sex_agegroup_summary
Out[28]:
In [29]:
male_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'male']
male_agegroup_summary
Out[29]:
In [30]:
female_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'female']
female_agegroup_summary
Out[30]:
In [31]:
age_groups = titanic_df.AgeGroup.unique() # renamed so that it does not shadow the age_group() function
age_labels = sorted(age_groups)
print(age_labels)
In [32]:
ax = sns.barplot(x='AgeGroup', y='Survived', data=titanic_df, hue='Sex', order=age_labels)
ax.set_title('Survivors by Gender by Age groups')
Out[32]:
Conclusion
Taking gender and age group into consideration, the proportions and the visualization above suggest that females and children were given preference in the rescue operations. In the 0-9 age group, both male and female children had a very high rate of survival.
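The heights of the bars in the plot above are, by default, the mean of `Survived` within each Sex and AgeGroup combination, so the same numbers can be laid out as a small table. The sketch below uses made-up rows; `toy_df` and `rate_table` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration
toy_df = pd.DataFrame({
    'Sex':      ['male', 'male', 'male', 'female', 'female', 'female'],
    'AgeGroup': ['0-9', '20-29', '20-29', '0-9', '20-29', '20-29'],
    'Survived': [1, 0, 0, 1, 1, 0],
})

# Mean of the 0/1 Survived column = survival rate; unstack puts Sex in columns
rate_table = toy_df.groupby(['AgeGroup', 'Sex'])['Survived'].mean().unstack('Sex')
print(rate_table)
```

With one row per age group and one column per sex, the male and female rates can be compared side by side.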
In [33]:
sns.swarmplot(x='Pclass', y='Age', data=titanic_df, hue='Sex', dodge=True).set_title('Male and Female Passenger Ages by Class')
Out[33]:
Above we can see that, compared to first and second class, there was a large number of passengers in third class. Males in particular were numerous; judging by the swarm, many of them were between 18 and 32 years old.
A box plot is better suited to understanding the age distribution of males and females in the different classes.
In [34]:
sns.boxplot(x='Pclass', y='Age', data=titanic_df, hue='Sex').set_title('Comparison of Male and Female Passenger Ages by Class')
Out[34]:
From the above we can make out that the median age (the line inside each box) of males and females in 3rd class was lower than that of males and females in 2nd and 1st class. The highest median age was that of males in 1st class.
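A box plot draws the median rather than the mean, and the two can differ when a group has a few very old or very young passengers. The sketch below compares both per class and sex on made-up ages; `toy_df` and `age_stats` are hypothetical names.

```python
import pandas as pd

# Made-up ages: one old passenger pulls the 1st class mean above the median
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex':    ['male', 'male', 'male', 'male', 'male', 'male'],
    'Age':    [30.0, 40.0, 80.0, 20.0, 22.0, 24.0],
})

# Median (what the box plot shows) next to the mean, per class and sex
age_stats = toy_df.groupby(['Pclass', 'Sex'])['Age'].agg(['median', 'mean'])
print(age_stats)
```

Running the same aggregation on `titanic_df` would give the exact figures behind the boxes.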
But the box plot only gives an idea of the age distribution of males and females per class.
Now let's try to understand the survival of males and females per class
In [35]:
def scatter(passengers, marker='o', legend_prefix=''):
    survived = passengers[passengers.Survived == 1]
    died = passengers[passengers.Survived == 0]

    x = survived.Age
    y = survived.Fare
    plt.scatter(x, y, c='blue', alpha=0.5, marker=marker, label=legend_prefix + ' Survived')

    x = died.Age
    y = died.Fare
    plt.scatter(x, y, c='red', alpha=0.5, marker=marker, label=legend_prefix + ' Died')

def scatter_by_class(pclass):
    class_passengers = titanic_df[titanic_df.Pclass == pclass]
    male_passengers = class_passengers[class_passengers.Sex == 'male']
    female_passengers = class_passengers[class_passengers.Sex == 'female']

    scatter(male_passengers, marker='o', legend_prefix='Male')
    scatter(female_passengers, marker='^', legend_prefix='Female')

    plt.legend(bbox_to_anchor=(0,1), loc='best') # bbox - to move legend out of plot/scatter
    plt.xlabel('Age')
    plt.ylabel('Fare')
    plt.title('Gender survival by Age, for Pclass = ' + str(pclass))
In [36]:
scatter_by_class(1)
In [37]:
scatter_by_class(2)
In [38]:
scatter_by_class(3)
The three scatter plots above give a view of male and female age and survival in each class.
But for better clarity and understanding, we can separate the scatter plots for male and female, per class. This is what we will do next.
In [39]:
def sns_scatter_by_class(pclass):
    fg = sns.FacetGrid(titanic_df[titanic_df['Pclass'] == pclass],
                       col='Sex',
                       col_order=['male', 'female'],
                       hue='Survived',
                       hue_kws=dict(marker=['v', '^']),
                       height=6, # this parameter was called size in seaborn versions before 0.9
                       palette='Set1')
    fg = (fg.map(plt.scatter, 'Age', 'Fare', edgecolor='w', alpha=0.7, s=80).add_legend())
    plt.subplots_adjust(top=0.9)
    fg.fig.suptitle('Gender survival by Age, for CLASS {}'.format(pclass))

# plotted separately because male and female data in the same scatter plot are difficult to compare
sns_scatter_by_class(1)
sns_scatter_by_class(2)
sns_scatter_by_class(3)
The scatter plots just above give a much clearer view of the age spread and survival of males and females.
We can notice the following
We can confirm the above observation with the following bar plot, which shows the rate of survival by class and by sex
In [40]:
sns.barplot(x='Pclass', y='Survived', data=titanic_df, hue='Sex').set_title('Gender Survival by Class')
Out[40]:
The female survival rate was 74.3% and the male survival rate was 18.9%, so the female survival rate was about 4 times that of males. This suggests that females and children were given preference in the rescue operations, possibly with the help of other male passengers.
62.96% of 1st class passengers survived, whereas the survival rate of 3rd class passengers was 24.24%, which is about two-fifths of the 1st class rate. The high survival rate of 1st class passengers suggests that they were given preference because of their social class.
50% of passengers travelling with family survived, whereas the survival percentage was 30% for those travelling alone. Hence the survival rate was higher for passengers travelling with family than for those travelling alone.
Children had a higher rate of survival than adults
Future work or potential areas to explore
Only 3 parameters were used in the analysis (Age, Sex, and Passenger class), so more analysis is possible
More questions that can be asked are