In [2]:

    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style(style="darkgrid")
# plt.rcParams['figure.figsize'] = [12.0, 8.0]   # alternative: larger plots
plt.rcParams['figure.figsize'] = [9.0, 6.0]   # default plot size (width, height) in inches

from IPython.display import display, HTML      # to use display() to always have well formatted html table output

Dataset Description

Variable Name	Definition
PassengerId	A unique ID to each Passenger; 1-891
Survived	A boolean variable; 1 - Survived, 0 - Dead
Pclass	Ticket Class; 1 - 1st, 2 - 2nd, 3 - 3rd class
Name	Passenger Name
Sex	Sex of Passenger
Age	Age in Years
SibSp	Number of Siblings / Spouses Aboard
Parch	Number of parents / children aboard the titanic
Ticket	Ticket number
Fare	Passenger Fare
Cabin	Cabin number
Embarked	Port of Embarkation; C - Cherbourg, Q - Queenstown, S - Southampton

Some Notes Regarding Dataset

Pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

SibSp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

Parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

*Source: [Kaggle's - Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data)*

Loading data & Preview



In [3]:

    
titanic_df = pd.read_csv("titanic-data.csv", index_col=["PassengerId"])
titanic_df.head()









    Out[3]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S



In [4]:

    
titanic_df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

The dataset summary above shows that there are 891 entries.

It also shows missing values in three columns: Age (714 of 891 present), Cabin (204 present) and Embarked (889 present).

Missing values of Cabin and Embarked will not be fixed, because none of the questions below depend on these columns.
Missing values of Age will be fixed now, because Age is used in several of the questions and analyses below.
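
The per-column gaps can also be counted directly instead of being read off the `info()` listing. The following is a minimal sketch on a small made-up frame; `sample_df` and its values are illustrative assumptions, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; the values are made up for this sketch
sample_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# isnull() marks missing cells as True; summing counts them per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)
```

Applied to `titanic_df`, the same call would be expected to report 177 missing ages (891 - 714), 687 missing cabins (891 - 204) and 2 missing ports of embarkation (891 - 889).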

Fix missing ages

To review the data by distributions, and to tackle the questions below, we first need to deal with the missing ages.

If we assume that the missing ages are distributed similarly to the ages that are present, then we can substitute values that represent the existing distribution.

For this we can replace the missing values with the mean.

To make the substituted values as representative as possible, we will take the mean age per Sex and Pclass group. Each missing age is then replaced by the mean age of passengers with the same Sex and the same Pclass.



In [5]:

    
mean_ages = titanic_df.groupby(['Sex','Pclass'])['Age'].mean()
display(mean_ages)









    





Sex     Pclass
female  1         34.611765
        2         28.722973
        3         21.750000
male    1         41.281386
        2         30.740707
        3         26.507589
Name: Age, dtype: float64



In [6]:

    
def replace_nan_age(row):
    if pd.isnull(row['Age']):
        return mean_ages[row['Sex'], row['Pclass']]
    else:
        return row['Age']
    
titanic_df['Age'] = titanic_df.apply(replace_nan_age, axis=1)
titanic_df.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

From the above we can see that the missing ages have been filled.
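
The row-wise `apply` used above can also be written without a helper function. The sketch below shows one possible vectorized form using `groupby(...).transform('mean')` followed by `fillna`; it runs on a small made-up frame (`sample_df` is an assumption for illustration), and is an alternative to the notebook's approach, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Illustrative frame only; ages are made up for this sketch
sample_df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})

# transform('mean') returns a Series aligned with the original rows,
# holding each row's (Sex, Pclass) group mean (NaN values are skipped)
group_mean_age = sample_df.groupby(["Sex", "Pclass"])["Age"].transform("mean")

# fillna with a Series fills each gap from the row with the same index
sample_df["Age"] = sample_df["Age"].fillna(group_mean_age)
print(sample_df)
```

Because the group means are aligned with the rows by index, no per-row lookup is needed.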



In [7]:

    
titanic_df.describe()



In [8]:

    
titanic_df.Parch.hist()
plt.xlabel('Parch')
plt.ylabel('Passengers')
plt.title('Number of parents / children aboard')









    Out[8]:





<matplotlib.text.Text at 0x962ffd0>



In [9]:

    
titanic_df.SibSp.hist()
plt.xlabel('SibSp')
plt.ylabel('Passengers')
plt.title('Number of Siblings / Spouses aboard')









    Out[9]:





<matplotlib.text.Text at 0xb58d710>

From the above we notice

The oldest passenger was 80 years old
The youngest passenger was about 5 months old (age 0.42)
The average age of passengers was 29.32 - note that this includes the imputed ages
The mean of Survived is 0.3838, that is, about 38% of passengers survived
The maximum fare charged was $512.33
The maximum number of siblings / spouses aboard was 8
The maximum number of parents / children aboard was 6

Questions in mind

Did passenger class make any difference to survival?
Which gender had the higher survival rate?
Did passengers travelling with others have a better chance of survival?
Which age group had a better chance of survival?
What was male and female survival per class and by age?

Question 1 - Did passenger class make any difference to survival?



In [10]:

    
## SUBSET DATAFRAME TO JUST THE REQUIRED DATA

survived_pclass_df = titanic_df[['Survived', 'Pclass']]          # select only the columns needed

## GROUP DATA TO CALCULATE SURVIVED & TOTAL BY PCLASS

# Survived is 0/1, so sum() counts the survivors and count() counts all passengers
survived_by_pclass = survived_pclass_df.groupby(['Pclass']).sum()
total_by_pclass = survived_pclass_df.groupby(['Pclass']).count()

# count() keeps the column name 'Survived' - so rename it to Total
total_by_pclass.rename(columns = {'Survived':'Total'}, inplace = True)

# merge the separate results into one dataframe, joined on the Pclass index
survived_total_by_pclass = pd.merge(survived_by_pclass, total_by_pclass, left_index=True, right_index=True)
survived_total_by_pclass



In [11]:

    
percent_survived = (survived_total_by_pclass['Survived'] / survived_total_by_pclass['Total']) * 100
survived_total_by_pclass['Percentage'] = percent_survived

survived_total_by_pclass



In [12]:

    
x = survived_total_by_pclass.index.values
ht = survived_total_by_pclass.Total
hs = survived_total_by_pclass.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, x)
plt.xlabel('Pclass')
plt.ylabel('Passengers')
plt.title('Survivors by Class')


plt.legend([pht,phs],['Died', 'Survived'])









    Out[12]:





<matplotlib.legend.Legend at 0xb74e748>

Conclusion

As can be seen from the visualization and from the dataframe table, 1st class passengers had the highest rate of survival (62.96%), followed by 2nd class passengers (47.28%), and 3rd class passengers had the lowest (24.24%). The largest group of passengers travelled in 3rd class (491), but only 24.24% of them survived.
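
Since `Survived` is a 0/1 column, its mean per group is already the survival rate, so the sum, count, merge and division steps can be collapsed into one aggregation. The sketch below rebuilds a stand-in frame from the per-class counts reported in this notebook (`sample_df` is an assumption; it is not the original data file) and is only one possible way to write this.

```python
import pandas as pd

# Rebuild a Survived/Pclass frame from the per-class counts in this notebook
# (survivors, total): 1st 136/216, 2nd 87/184, 3rd 119/491
counts = {1: (136, 216), 2: (87, 184), 3: (119, 491)}
rows = []
for pclass, (survived, total) in counts.items():
    rows += [(pclass, 1)] * survived + [(pclass, 0)] * (total - survived)
sample_df = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
summary = sample_df.groupby("Pclass")["Survived"].agg(["sum", "count", "mean"])
summary["Percentage"] = summary["mean"] * 100
print(summary)
```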

Question 2 - Which gender had the higher survival rate?



In [13]:

    
## CALCULATE SURVIVED AND TOTAL BY SEX

# groupby Sex
group_by_sex = titanic_df.groupby('Sex')

# calculate survived by sex
survived_by_sex = group_by_sex['Survived'].sum()
survived_by_sex.name = 'Survived'
display(survived_by_sex)

# calculate total by sex
total_by_sex = group_by_sex['Survived'].size()
total_by_sex.name = 'Total'
display(total_by_sex)

# concat the separate results into one dataframe
survived_total_by_sex = pd.concat([survived_by_sex, total_by_sex], axis=1)
survived_total_by_sex









    





Sex
female    233
male      109
Name: Survived, dtype: int64






    





Sex
female    314
male      577
Name: Total, dtype: int64






    Out[13]:

	Survived	Total
Sex
female	233	314
male	109	577


In [14]:

    
percent_survived = (survived_total_by_sex['Survived'] / survived_total_by_sex['Total']) * 100
survived_total_by_sex['Percentage'] = percent_survived

survived_total_by_sex

Now let's visualize it



In [15]:

    
x = range(len(survived_total_by_sex.index.values))
ht = survived_total_by_sex.Total
hs = survived_total_by_sex.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, survived_total_by_sex.index.values)
plt.xlabel('Sex')
plt.ylabel('Passengers')
plt.title('Survivors by Gender')

plt.legend([pht,phs],['Died', 'Survived'])









    Out[15]:





<matplotlib.legend.Legend at 0xb94b5c0>

Conclusion

From the visualization and the percentages of survival in the dataframe printout above, we can see that females had a very high rate of survival. The female survival rate was 74.2% (233 of 314), and the male survival rate was 18.9% (109 of 577) - so the female survival rate was about 4 times that of males.

This suggests that females were given preference in the rescue operations, with many males giving up their places to let the females survive.
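
The "about 4 times" figure follows directly from the counts in the Survived / Total table. As a small worked check (plain arithmetic, with variable names chosen here for illustration):

```python
# Counts taken from the Survived / Total by Sex table
female_survived, female_total = 233, 314
male_survived, male_total = 109, 577

female_rate = female_survived / float(female_total)
male_rate = male_survived / float(male_total)

# Ratio of the two rates: how many times higher the female survival rate was
rate_ratio = female_rate / male_rate

print("female: {:.1%}, male: {:.1%}, ratio: {:.2f}".format(
    female_rate, male_rate, rate_ratio))
```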

Question 3 - Did passengers travelling with others have a better chance of survival?

Let's first review the distribution of those who were alone, and those who were in company.



In [16]:

    
is_not_alone = (titanic_df.SibSp + titanic_df.Parch) >= 1
passengers_not_alone = titanic_df[is_not_alone]

is_alone = (titanic_df.SibSp + titanic_df.Parch) == 0
passengers_alone = titanic_df[is_alone]

print('Not alone - describe')
display(passengers_not_alone.describe())
print('Alone - describe')
display(passengers_alone.describe())









    



Not alone - describe

	Survived	Pclass	Age	SibSp	Parch	Fare
count	354.000000	354.000000	354.000000	354.000000	354.000000	354.000000
mean	0.505650	2.169492	26.316614	1.316384	0.960452	48.832275
std	0.500676	0.864520	14.901225	1.420774	1.039512	55.307615
min	0.000000	1.000000	0.420000	0.000000	0.000000	6.495800
25%	0.000000	1.000000	17.000000	1.000000	0.000000	18.000000
50%	1.000000	2.000000	26.000000	1.000000	1.000000	27.750000
75%	1.000000	3.000000	36.000000	1.000000	2.000000	59.044800
max	1.000000	3.000000	70.000000	8.000000	6.000000	512.329200

Alone - describe

	Survived	Pclass	Age	SibSp	Parch	Fare
count	537.000000	537.000000	537.000000	537.0	537.0	537.000000
mean	0.303538	2.400372	31.297634	0.0	0.0	21.242689
std	0.460214	0.804511	11.694910	0.0	0.0	42.223510
min	0.000000	1.000000	5.000000	0.0	0.0	0.000000
25%	0.000000	2.000000	23.000000	0.0	0.0	7.775000
50%	0.000000	3.000000	27.000000	0.0	0.0	8.137500
75%	1.000000	3.000000	36.000000	0.0	0.0	15.000000
max	1.000000	3.000000	80.000000	0.0	0.0	512.329200


In [17]:

    
passengers_not_alone.Age.hist(label='Not alone')
passengers_alone.Age.hist(label='Alone', alpha=0.6)

plt.xlabel('Age')
plt.ylabel('Passengers')
plt.legend(loc='best')
plt.title('Alone & Not Alone Passenger\'s Ages')









    Out[17]:





<matplotlib.text.Text at 0xbcfb400>

From the above distribution we can see that

Those in the age range 0-10, that is children, were not alone - which makes sense
There is, however, one child aged 5 who was alone
There was also an 80 year old passenger who was alone
537 passengers were alone, whereas 354 were in company
Except for age group 0-10, in all other age groups those travelling alone outnumbered those travelling in company

Now let's review these by their survival



In [18]:

    
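# label each passenger by whether any sibling, spouse, parent or child was aboard
notalone = np.where((titanic_df.SibSp + titanic_df.Parch) >= 1, 'Not Alone', 'Alone')
# string aggregation names give the same 'sum' and 'size' columns as np.sum / np.size
loneliness_summary = titanic_df.groupby(notalone)['Survived'].agg(['sum', 'size'])
loneliness_summary = loneliness_summary.rename(columns={'sum':'Survived', 'size':'Total'})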

loneliness_summary









    Out[18]:

	Survived	Total
Alone	163	537
Not Alone	179	354


In [19]:

    
loneliness_summary['Percent survived'] = (loneliness_summary.Survived / loneliness_summary.Total) * 100

loneliness_summary









    Out[19]:

	Survived	Total	Percent survived
Alone	163	537	30.353818
Not Alone	179	354	50.564972
Now let's visualize



In [20]:

    
x = range(len(loneliness_summary.index.values))
ht = loneliness_summary.Total
hs = loneliness_summary.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, loneliness_summary.index.values)
plt.xlabel('Alone / Not Alone')
plt.ylabel('Passengers')
plt.title('Survivors by Alone / Not Alone')


plt.legend([pht,phs],['Died', 'Survived'])









    Out[20]:





<matplotlib.legend.Legend at 0xb9bd6a0>

Conclusion

The percentages and the visualization above indicate that passengers with company had a higher survival rate (50.6%) than those travelling alone (30.4%).
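
The same comparison can be expressed as a row-normalized cross-tabulation, which gives the died and survived proportions for each group in one step. This is a sketch on a stand-in frame rebuilt from the counts in Out[19]; `sample_df` and the `Company` column name are assumptions made for illustration.

```python
import pandas as pd

# Rebuild alone / not-alone survival flags from the counts in Out[19]
# (survivors, total): Alone 163/537, Not Alone 179/354
counts = {"Alone": (163, 537), "Not Alone": (179, 354)}
rows = []
for label, (survived, total) in counts.items():
    rows += [(label, 1)] * survived + [(label, 0)] * (total - survived)
sample_df = pd.DataFrame(rows, columns=["Company", "Survived"])

# normalize='index' turns each row of counts into proportions summing to 1
rates = pd.crosstab(sample_df["Company"], sample_df["Survived"], normalize="index")
print(rates)
```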

Question 4 - Which age group had a better chance of survival?

First let's review the age distribution by gender



In [21]:

    
male_ages = (titanic_df[titanic_df.Sex == 'male'])['Age']
male_ages.describe()









    Out[21]:





count    577.000000
mean      30.423672
std       13.264336
min        0.420000
25%       23.000000
50%       27.000000
75%       37.000000
max       80.000000
Name: Age, dtype: float64



In [22]:

    
female_ages = (titanic_df[titanic_df.Sex == 'female'])['Age']
female_ages.describe()









    Out[22]:





count    314.000000
mean      27.288063
std       13.091327
min        0.750000
25%       21.000000
50%       24.000000
75%       35.000000
max       63.000000
Name: Age, dtype: float64



In [23]:

    
male_ages.hist(label='Male')
female_ages.hist(label='Female')

plt.xlabel('Age')
plt.ylabel('Passengers')
plt.title('Male & Female passenger ages')
plt.legend(loc='best')









    Out[23]:





<matplotlib.legend.Legend at 0xb72a400>

From the above distribution, we can see that:

In every age group there were fewer females than males
The oldest female was 63, whereas the oldest male was 80

Now let's do the survival analysis by age group



In [24]:

    
def age_group(age):
    if age >= 80:
        return '80-89'
    if age >= 70:
        return '70-79'
    if age >= 60:
        return '60-69'
    if age >= 50:
        return '50-59'
    if age >= 40:
        return '40-49'
    if age >= 30:
        return '30-39'
    if age >= 20:
        return '20-29'
    if age >= 10:
        return '10-19'
    if age >= 0:
        return '0-9'
    
titanic_df['AgeGroup'] = titanic_df.Age.apply(age_group)
titanic_df.head()









    Out[24]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	AgeGroup
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	20-29
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	30-39
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	20-29
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	30-39
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	30-39
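
The chain of `if` statements in `age_group` amounts to binning ages into ten-year intervals. As an alternative sketch (not the method used to produce the output above), `pd.cut` can do the same binning; the `ages` Series below is made up for illustration.

```python
import pandas as pd

ages = pd.Series([0.42, 9.0, 10.0, 29.9, 35.0, 80.0])

# Bin edges 0, 10, ..., 90; right=False makes each bin [lower, upper),
# so an age of exactly 10 falls in '10-19', matching age_group()
edges = list(range(0, 100, 10))
labels = ["{}-{}".format(low, low + 9) for low in edges[:-1]]

age_groups = pd.cut(ages, bins=edges, right=False, labels=labels)
print(age_groups.tolist())
```

One design difference: `pd.cut` returns an ordered categorical, so the groups sort in bin order without relying on string sorting.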



In [25]:

    
# string aggregation names give the same 'sum' and 'size' columns as np.sum / np.size
age_group_summary = titanic_df.groupby(['AgeGroup'])['Survived'].agg(['sum', 'size'])
age_group_summary = age_group_summary.rename(columns={'sum':'Survived', 'size':'Total'})
age_group_summary



In [26]:

    
x = range(len(age_group_summary.index.values))
ht = age_group_summary.Total
hs = age_group_summary.Survived

pht = plt.bar(x, ht)
phs = plt.bar(x, hs)

plt.xticks(x, age_group_summary.index.values)
plt.xlabel('Age groups')
plt.ylabel('Passengers')
plt.title('Survivors by Age group')


plt.legend([pht,phs],['Died', 'Survived'])









    Out[26]:





<matplotlib.legend.Legend at 0xc6574a8>



In [27]:

    
age_group_summary['SurvivedPercent'] = (age_group_summary.Survived / age_group_summary.Total) * 100
age_group_summary['DiedPercent'] = ((age_group_summary.Total - age_group_summary.Survived) / age_group_summary.Total) * 100
age_group_summary









    Out[27]:

	Survived	Total	SurvivedPercent	DiedPercent
AgeGroup
0-9	38	62	61.290323	38.709677
10-19	41	102	40.196078	59.803922
20-29	113	358	31.564246	68.435754
30-39	84	185	45.405405	54.594595
40-49	39	110	35.454545	64.545455
50-59	20	48	41.666667	58.333333
60-69	6	19	31.578947	68.421053
70-79	0	6	0.000000	100.000000
80-89	1	1	100.000000	0.000000

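From the above visualization and percentages we can see that the largest number of survivors (113) came from the 20-29 age group. This is also the largest group overall (358 passengers), partly because three of the six group means used to fill the missing ages (21.75, 26.51 and 28.72) fall in this range.

Interestingly, the survival percentage of the 0-9 age group is the best of the well-populated groups, at 61.29%. (The 80-89 group shows 100%, but it contains a single passenger.)

We have also seen above that females had a better survival rate, so these survival rates are a mix of male and female survival rates. To get a better view, gender should also be taken into consideration.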
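
One way to put the male and female rates side by side is a pivot table with age groups as rows and sexes as columns. The sketch below uses a few made-up rows (`sample_df` is illustrative, not the Titanic data) and only shows the shape of the result.

```python
import pandas as pd

# Illustrative rows only; not the Titanic data
sample_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "AgeGroup": ["0-9", "0-9", "0-9", "20-29", "20-29", "20-29"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Rows are age groups, columns are sexes, cells are mean survival (0 to 1)
rate_table = pd.pivot_table(sample_df, values="Survived",
                            index="AgeGroup", columns="Sex", aggfunc="mean")
print(rate_table)
```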



In [28]:

    
sex_agegroup_summary = titanic_df.groupby(['Sex','AgeGroup'], as_index=False)['Survived'].mean()

sex_agegroup_summary



In [29]:

    
male_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'male']

male_agegroup_summary



In [30]:

    
female_agegroup_summary = sex_agegroup_summary[sex_agegroup_summary['Sex'] == 'female']

female_agegroup_summary



In [31]:

    
# named age_groups so the age_group() function defined earlier is not overwritten
age_groups = titanic_df.AgeGroup.unique()
age_labels = sorted(age_groups)
print(age_labels)









    



['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89']



In [32]:

    
ax = sns.barplot(x='AgeGroup', y='Survived', data=titanic_df, hue='Sex', order=age_labels)
ax.set_title('Survivors by Gender by Age groups')









    Out[32]:





<matplotlib.text.Text at 0xc984080>

Conclusion

From the proportions above and the visualization, taking gender and age group into consideration, females had a higher survival rate than males in every age group that contains both. The gap is smallest in the 0-9 age group, where boys (59.4%) survived at nearly the same rate as girls (63.3%). This pattern is consistent with women and children being given preference in the rescue operations.

Question 5 - What was male and female survival per class and by age?

Male and Female per Pclass

Let's review the males and females per passenger class



In [33]:

    
sns.swarmplot(x='Pclass', y='Age', data=titanic_df, hue='Sex', dodge=True).set_title('Male and Female Passenger Ages by Class')









    Out[33]:





<matplotlib.text.Text at 0xcc60e80>

Above we can see that, compared to first and second class, there were a large number of passengers in third class. Males in particular were numerous; judging by the swarm, most of them were aged roughly 18 to 32.

To compare the age distribution of males and females in the different classes, a box plot is the better choice.



In [34]:

    
sns.boxplot(x='Pclass', y='Age', data=titanic_df, hue='Sex').set_title('Comparison of Male and Female Passenger Ages by Class')









    Out[34]:





<matplotlib.text.Text at 0xd0e6fd0>

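From the above we can see that the median age of both males and females in 3rd class was lower than that of males and females in 2nd and 1st class. Note that a box plot marks the median rather than the mean; the group means computed earlier show the same ordering, with the highest mean age (41.28) for 1st class males.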
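
To read the box plot numerically, the mean and median per group can be computed side by side. The sketch below uses made-up ages (`sample_df` is an illustrative assumption) in which one large value pulls the mean away from the median.

```python
import pandas as pd

# Illustrative ages only; one large value shows how mean and median can differ
sample_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male"] * 6,
    "Age": [30.0, 40.0, 80.0, 20.0, 22.0, 24.0],
})

# The line inside each box of a box plot is the median, not the mean
age_stats = sample_df.groupby(["Pclass", "Sex"])["Age"].agg(["mean", "median"])
print(age_stats)
```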

But this plot only gives an idea of the distribution of ages of males and females per class.

Now let's try to understand the survival of males and females, per class

Male and Female Survival per Pclass, and by Age



In [35]:

    
def scatter(passengers, marker='o', legend_prefix=''):
    survived = passengers[passengers.Survived == 1]
    died = passengers[passengers.Survived == 0]

    x = survived.Age
    y = survived.Fare
    plt.scatter(x, y, c='blue', alpha=0.5, marker=marker, label=legend_prefix + ' Survived')

    x = died.Age
    y = died.Fare
    plt.scatter(x, y, c='red', alpha=0.5, marker=marker, label=legend_prefix + ' Died')

def scatter_by_class(pclass):
    class_passengers = titanic_df[titanic_df.Pclass == pclass]
    
    male_passengers = class_passengers[class_passengers.Sex == 'male']
    female_passengers = class_passengers[class_passengers.Sex == 'female']
    
    scatter(male_passengers, marker='o', legend_prefix='Male')
    scatter(female_passengers, marker='^', legend_prefix='Female')

    plt.legend(bbox_to_anchor=(0,1), loc='best') # bbox - to move legend out of plot/scatter
    plt.xlabel('Age')
    plt.ylabel('Fare')
    plt.title('Gender survival by Age, for Pclass = ' + str(pclass))



In [36]:

    
scatter_by_class(1)



In [37]:

    
scatter_by_class(2)



In [38]:

    
scatter_by_class(3)

The above three scatter plots give a view of male and female age and survival in each class.

But for better clarity and understanding, we can separate the scatter plots for male and female, per class. This is what we will do next.



In [39]:

    
def sns_scatter_by_class(pclass):
    fg = sns.FacetGrid(titanic_df[titanic_df['Pclass'] == pclass], 
                      col='Sex',
                      col_order=['male', 'female'],
                      hue='Survived', 
                      hue_kws=dict(marker=['v', '^']), 
                      height=6,   # facet height in inches; this parameter was called `size` before seaborn 0.9
                      palette='Set1')
    fg = (fg.map(plt.scatter, 'Age', 'Fare', edgecolor='w', alpha=0.7, s=80).add_legend())
    plt.subplots_adjust(top=0.9)
    fg.fig.suptitle('Gender survival by Age, for CLASS {}'.format(pclass))

# plotted separately because male and female data in the same scatter plot are difficult to compare
sns_scatter_by_class(1)
sns_scatter_by_class(2)
sns_scatter_by_class(3)

Conclusion

The scatter plots just above give a much clearer picture of the male and female age spread, and of survival.

We can notice the following

Almost all females in first and second class survived.
In first and second class, almost all male and female children (age group 0-10) survived
In third class the survival rate of females was also higher than that of males - but it was lower than the survival rate of females in 1st and 2nd class

We can confirm the above observations with the following barplot, showing the rate of survival by class and by sex



In [40]:

    
sns.barplot(x='Pclass', y='Survived', data=titanic_df, hue='Sex').set_title('Gender Survival by Class')









    Out[40]:





<matplotlib.text.Text at 0xf27cfd0>

Overall Conclusion

Findings

The female survival rate was 74.2%, and the male survival rate was 18.9% - so the female survival rate was about 4 times that of males. This suggests that females and children were given preference in the rescue operations, with other male passengers giving up their places.
62.96% of 1st class passengers survived, whereas the survival rate of 3rd class passengers was 24.24%, which is less than half of the 1st class rate. The high survival rate of 1st class passengers suggests that they were given preference because of their social class.
About 50.6% of passengers travelling with family survived, whereas the survival percentage was 30.4% for those travelling alone. Hence the survival rate was higher for passengers travelling with family than for those travelling alone.
Children (age group 0-9) had a higher rate of survival (61.29%) than adults

Limitations

Limitations in the analysis
- Only visualizations, proportions, and percentages have been used above to come to conclusions. The analysis could be improved by using statistical tests.
Limitations of the data
- The dataset itself has some limitations - data is missing for certain features/properties of passengers, like age
- This data is also only a sample of the full Titanic passenger list.
- Missing data and sample size could skew the results, for example because of the missing ages
- We also do not know if this sample was randomly selected
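
As an example of the kind of statistical test mentioned above, a Pearson chi-square test of independence could be applied to survival by passenger class. The sketch below computes the statistic by hand from the per-class counts in this notebook and compares it with the standard 5% critical value for 2 degrees of freedom (5.991). It is an illustration of the idea, not part of the original analysis; in practice a library routine such as `scipy.stats.chi2_contingency` could be used instead.

```python
# Observed counts (survived, died) per class, from the Pclass table:
# 1st 136 of 216, 2nd 87 of 184, 3rd 119 of 491
observed = [[136, 216 - 136], [87, 184 - 87], [119, 491 - 119]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = float(sum(row_totals))

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 5.991  # chi-square critical value for dof = 2 at alpha = 0.05

print("chi2 = {:.2f}, dof = {}, reject independence: {}".format(
    chi2, dof, chi2 > critical_value))
```

A statistic far above the critical value would indicate that the differences in survival rate between classes are unlikely to be due to chance alone, subject to the sampling caveats listed above.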

Future plans

Future work or potential areas to explore

Only a few parameters were used in the analysis - mainly Age, Sex, and Passenger class - whereas more analysis is possible
More questions that can be asked are
- Did passengers who paid a higher fare have a higher chance of survival?
- Did the deck of the cabin affect the rate of survival?
- Did survival have any relation to the title of a person (like Mr., Mrs., Miss, etc.)?
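
The title and deck questions would first need those values to be derived from the `Name` and `Cabin` columns. The sketch below shows one possible way, using names and cabins copied from the data preview; the `Title` and `Deck` column names and the regular expression are assumptions for illustration, and the pattern has only been checked against the names shown here.

```python
import numpy as np
import pandas as pd

# Names and cabins copied from the first rows of the data preview
sample_df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "Cabin": [np.nan, np.nan, "C123"],
})

# Title: the text between the comma and the first full stop after it
sample_df["Title"] = sample_df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Deck: first character of the cabin number; missing cabins stay NaN
sample_df["Deck"] = sample_df["Cabin"].str[0]
print(sample_df[["Title", "Deck"]])
```

Since only 204 of the 891 passengers have a cabin recorded, any deck-based analysis would cover a small part of the data.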

References

Methods for handling missing values | Cortana Intelligence Gallery

	Survived	Total
Pclass
1	136	216
2	87	184
3	119	491

	Survived	Total	Percentage
Pclass
1	136	216	62.962963
2	87	184	47.282609
3	119	491	24.236253

	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	891.000000	891.000000	891.000000
mean	0.383838	2.308642	29.318643	0.523008	0.381594	32.204208
std	0.486592	0.836071	13.281103	1.102743	0.806057	49.693429
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	0.000000	2.000000	21.750000	0.000000	0.000000	7.910400
50%	0.000000	3.000000	26.507589	0.000000	0.000000	14.454200
75%	1.000000	3.000000	36.000000	1.000000	0.000000	31.000000
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

	Survived	Total
AgeGroup
0-9	38	62
10-19	41	102
20-29	113	358
30-39	84	185
40-49	39	110
50-59	20	48
60-69	6	19
70-79	0	6
80-89	1	1

	Sex	AgeGroup	Survived
0	female	0-9	0.633333
1	female	10-19	0.755556
2	female	20-29	0.681034
3	female	30-39	0.855072
4	female	40-49	0.687500
5	female	50-59	0.888889
6	female	60-69	1.000000
7	male	0-9	0.593750
8	male	10-19	0.122807
9	male	20-29	0.140496
10	male	30-39	0.215517
11	male	40-49	0.217949
12	male	50-59	0.133333
13	male	60-69	0.133333
14	male	70-79	0.000000
15	male	80-89	1.000000
