The Titanic Project

For this project, I want to investigate the unfortunate tragedy of the sinking of the Titanic. The movie "Titanic"- which I watched when I was still a child left a strong memory for me. The event occurred in the early morning of 15 April 1912, when the ship collided with an iceberg, and out of 2,224 passengers, more than 1500 died.

The dataset I am working with contains the demographic information, and other information including ticket class, cabin number, fare price of 891 passengers. The main question I am curious about: What are the factors that correlate with the survival outcome of passengers?

Load the dataset

First of all, I want to get an overview of the data and identify whether there is additional data cleaning/wrangling to be done before diving deeper. I start off by reading the CSV file into a Pandas Dataframe.



In [2]:

    
#load the libraries that I might need to use
%matplotlib inline
import pandas as pd 
import numpy as np
import csv
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
#read the csv file into a pandas dataframe 
titanic_original = pd.DataFrame.from_csv('titanic-data.csv', index_col=None)
titanic_original









    Out[2]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S
    
    
      5
      6
      0
      3
      Moran, Mr. James
      male
      NaN
      0
      0
      330877
      8.4583
      NaN
      Q
    
    
      6
      7
      0
      1
      McCarthy, Mr. Timothy J
      male
      54.0
      0
      0
      17463
      51.8625
      E46
      S
    
    
      7
      8
      0
      3
      Palsson, Master. Gosta Leonard
      male
      2.0
      3
      1
      349909
      21.0750
      NaN
      S
    
    
      8
      9
      1
      3
      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
      female
      27.0
      0
      2
      347742
      11.1333
      NaN
      S
    
    
      9
      10
      1
      2
      Nasser, Mrs. Nicholas (Adele Achem)
      female
      14.0
      1
      0
      237736
      30.0708
      NaN
      C
    
    
      10
      11
      1
      3
      Sandstrom, Miss. Marguerite Rut
      female
      4.0
      1
      1
      PP 9549
      16.7000
      G6
      S
    
    
      11
      12
      1
      1
      Bonnell, Miss. Elizabeth
      female
      58.0
      0
      0
      113783
      26.5500
      C103
      S
    
    
      12
      13
      0
      3
      Saundercock, Mr. William Henry
      male
      20.0
      0
      0
      A/5. 2151
      8.0500
      NaN
      S
    
    
      13
      14
      0
      3
      Andersson, Mr. Anders Johan
      male
      39.0
      1
      5
      347082
      31.2750
      NaN
      S
    
    
      14
      15
      0
      3
      Vestrom, Miss. Hulda Amanda Adolfina
      female
      14.0
      0
      0
      350406
      7.8542
      NaN
      S
    
    
      15
      16
      1
      2
      Hewlett, Mrs. (Mary D Kingcome)
      female
      55.0
      0
      0
      248706
      16.0000
      NaN
      S
    
    
      16
      17
      0
      3
      Rice, Master. Eugene
      male
      2.0
      4
      1
      382652
      29.1250
      NaN
      Q
    
    
      17
      18
      1
      2
      Williams, Mr. Charles Eugene
      male
      NaN
      0
      0
      244373
      13.0000
      NaN
      S
    
    
      18
      19
      0
      3
      Vander Planke, Mrs. Julius (Emelia Maria Vande...
      female
      31.0
      1
      0
      345763
      18.0000
      NaN
      S
    
    
      19
      20
      1
      3
      Masselmani, Mrs. Fatima
      female
      NaN
      0
      0
      2649
      7.2250
      NaN
      C
    
    
      20
      21
      0
      2
      Fynney, Mr. Joseph J
      male
      35.0
      0
      0
      239865
      26.0000
      NaN
      S
    
    
      21
      22
      1
      2
      Beesley, Mr. Lawrence
      male
      34.0
      0
      0
      248698
      13.0000
      D56
      S
    
    
      22
      23
      1
      3
      McGowan, Miss. Anna "Annie"
      female
      15.0
      0
      0
      330923
      8.0292
      NaN
      Q
    
    
      23
      24
      1
      1
      Sloper, Mr. William Thompson
      male
      28.0
      0
      0
      113788
      35.5000
      A6
      S
    
    
      24
      25
      0
      3
      Palsson, Miss. Torborg Danira
      female
      8.0
      3
      1
      349909
      21.0750
      NaN
      S
    
    
      25
      26
      1
      3
      Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...
      female
      38.0
      1
      5
      347077
      31.3875
      NaN
      S
    
    
      26
      27
      0
      3
      Emir, Mr. Farred Chehab
      male
      NaN
      0
      0
      2631
      7.2250
      NaN
      C
    
    
      27
      28
      0
      1
      Fortune, Mr. Charles Alexander
      male
      19.0
      3
      2
      19950
      263.0000
      C23 C25 C27
      S
    
    
      28
      29
      1
      3
      O'Dwyer, Miss. Ellen "Nellie"
      female
      NaN
      0
      0
      330959
      7.8792
      NaN
      Q
    
    
      29
      30
      0
      3
      Todoroff, Mr. Lalio
      male
      NaN
      0
      0
      349216
      7.8958
      NaN
      S
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      861
      862
      0
      2
      Giles, Mr. Frederick Edward
      male
      21.0
      1
      0
      28134
      11.5000
      NaN
      S
    
    
      862
      863
      1
      1
      Swift, Mrs. Frederick Joel (Margaret Welles Ba...
      female
      48.0
      0
      0
      17466
      25.9292
      D17
      S
    
    
      863
      864
      0
      3
      Sage, Miss. Dorothy Edith "Dolly"
      female
      NaN
      8
      2
      CA. 2343
      69.5500
      NaN
      S
    
    
      864
      865
      0
      2
      Gill, Mr. John William
      male
      24.0
      0
      0
      233866
      13.0000
      NaN
      S
    
    
      865
      866
      1
      2
      Bystrom, Mrs. (Karolina)
      female
      42.0
      0
      0
      236852
      13.0000
      NaN
      S
    
    
      866
      867
      1
      2
      Duran y More, Miss. Asuncion
      female
      27.0
      1
      0
      SC/PARIS 2149
      13.8583
      NaN
      C
    
    
      867
      868
      0
      1
      Roebling, Mr. Washington Augustus II
      male
      31.0
      0
      0
      PC 17590
      50.4958
      A24
      S
    
    
      868
      869
      0
      3
      van Melkebeke, Mr. Philemon
      male
      NaN
      0
      0
      345777
      9.5000
      NaN
      S
    
    
      869
      870
      1
      3
      Johnson, Master. Harold Theodor
      male
      4.0
      1
      1
      347742
      11.1333
      NaN
      S
    
    
      870
      871
      0
      3
      Balkic, Mr. Cerin
      male
      26.0
      0
      0
      349248
      7.8958
      NaN
      S
    
    
      871
      872
      1
      1
      Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
      female
      47.0
      1
      1
      11751
      52.5542
      D35
      S
    
    
      872
      873
      0
      1
      Carlsson, Mr. Frans Olof
      male
      33.0
      0
      0
      695
      5.0000
      B51 B53 B55
      S
    
    
      873
      874
      0
      3
      Vander Cruyssen, Mr. Victor
      male
      47.0
      0
      0
      345765
      9.0000
      NaN
      S
    
    
      874
      875
      1
      2
      Abelson, Mrs. Samuel (Hannah Wizosky)
      female
      28.0
      1
      0
      P/PP 3381
      24.0000
      NaN
      C
    
    
      875
      876
      1
      3
      Najib, Miss. Adele Kiamie "Jane"
      female
      15.0
      0
      0
      2667
      7.2250
      NaN
      C
    
    
      876
      877
      0
      3
      Gustafsson, Mr. Alfred Ossian
      male
      20.0
      0
      0
      7534
      9.8458
      NaN
      S
    
    
      877
      878
      0
      3
      Petroff, Mr. Nedelio
      male
      19.0
      0
      0
      349212
      7.8958
      NaN
      S
    
    
      878
      879
      0
      3
      Laleff, Mr. Kristo
      male
      NaN
      0
      0
      349217
      7.8958
      NaN
      S
    
    
      879
      880
      1
      1
      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
      female
      56.0
      0
      1
      11767
      83.1583
      C50
      C
    
    
      880
      881
      1
      2
      Shelley, Mrs. William (Imanita Parrish Hall)
      female
      25.0
      0
      1
      230433
      26.0000
      NaN
      S
    
    
      881
      882
      0
      3
      Markun, Mr. Johann
      male
      33.0
      0
      0
      349257
      7.8958
      NaN
      S
    
    
      882
      883
      0
      3
      Dahlberg, Miss. Gerda Ulrika
      female
      22.0
      0
      0
      7552
      10.5167
      NaN
      S
    
    
      883
      884
      0
      2
      Banfield, Mr. Frederick James
      male
      28.0
      0
      0
      C.A./SOTON 34068
      10.5000
      NaN
      S
    
    
      884
      885
      0
      3
      Sutehall, Mr. Henry Jr
      male
      25.0
      0
      0
      SOTON/OQ 392076
      7.0500
      NaN
      S
    
    
      885
      886
      0
      3
      Rice, Mrs. William (Margaret Norton)
      female
      39.0
      0
      5
      382652
      29.1250
      NaN
      Q
    
    
      886
      887
      0
      2
      Montvila, Rev. Juozas
      male
      27.0
      0
      0
      211536
      13.0000
      NaN
      S
    
    
      887
      888
      1
      1
      Graham, Miss. Margaret Edith
      female
      19.0
      0
      0
      112053
      30.0000
      B42
      S
    
    
      888
      889
      0
      3
      Johnston, Miss. Catherine Helen "Carrie"
      female
      NaN
      1
      2
      W./C. 6607
      23.4500
      NaN
      S
    
    
      889
      890
      1
      1
      Behr, Mr. Karl Howell
      male
      26.0
      0
      0
      111369
      30.0000
      C148
      C
    
    
      890
      891
      0
      3
      Dooley, Mr. Patrick
      male
      32.0
      0
      0
      370376
      7.7500
      NaN
      Q
    
  

891 rows × 12 columns

Data Dictionary

Variables Definitions

survival (Survival 0 = No, 1 = Yes)
pclass (Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd)
sex (Sex)
Age (Age in years)
sibsp (# of siblings / spouses aboard the Titanic)
parch (# of parents / children aboard the Titanic)
ticket (Ticket number)
fare (Passenger fare price)
cabin (Cabin number)
embarked(Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton)

Note:

pclass: A proxy for socio-economic status (SES)
```
-1nd = Upper
-2nd = Middle
-3rd = Lower
```
age: Age is fractional if less than 1.

sibsp: number of siblings and spouse

-Sibling = brother, sister, stepbrother, stepsister
-Spouse = husband, wife (mistresses and fiancés were ignored)

parch: number of parents and children

-Parent = mother, father
-Child = daughter, son, stepdaughter, stepson 
-Some children travelled only with a nanny, therefore parch=0 for them.

Data Cleaning

I want to check if there is duplicated data. By using the unique(), I checked the passenger ID to see there is duplicated entries.



In [3]:

    
#check if there is duplicated data by checking passenger ID.
len(titanic_original['PassengerId'].unique())









    Out[3]:





891

Looks like there are no duplicated entries based on passengers ID. We have in total 891 passengers in the dataset. However I have noticed there is a lot of missing values in 'Cabin' feature, and the 'Ticket' feature does not provide useful information for my analysis. I decided to remove them from the dataset by using drop() function

There are also some missing values in the 'Age', I can either removed them or replace them with the mean. Considering there is still a good sample size (>700 entries) after removal, I decide to remove the missing values with dropNa()



In [4]:

    
#make a copy of dataset
titanic_cleaned=titanic_original.copy()
#remove ticket and cabin feature from dataset
titanic_cleaned=titanic_cleaned.drop(['Ticket','Cabin'], axis=1)
#Remove missing values. 
titanic_cleaned=titanic_cleaned.dropna()
#Check to see if the cleaning is successful
titanic_cleaned.head()









    Out[4]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Fare
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      7.2500
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      71.2833
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      7.9250
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      53.1000
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      8.0500
      S

Take a look at Survived and Pclass columns. They are not very descriptive, so I decided to add two additional columns called Survival and Class with more descriptive values.



In [5]:

    
# Create Survival Label Column
titanic_cleaned['Survival'] = titanic_cleaned.Survived.map({0 : 'Died', 1 : 'Survived'})
titanic_cleaned.head()









    Out[5]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Fare
      Embarked
      Survival
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      7.2500
      S
      Died
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      71.2833
      C
      Survived
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      7.9250
      S
      Survived
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      53.1000
      S
      Survived
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      8.0500
      S
      Died



In [6]:

    
# Create Class Label Column
titanic_cleaned['Class'] = titanic_cleaned.Pclass.map({1 : 'Upper Class', 2 : 'Middle Class', 3 : 'Lower Class'})
titanic_cleaned.head()









    Out[6]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Fare
      Embarked
      Survival
      Class
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      7.2500
      S
      Died
      Lower Class
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      71.2833
      C
      Survived
      Upper Class
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      7.9250
      S
      Survived
      Lower Class
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      53.1000
      S
      Survived
      Upper Class
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      8.0500
      S
      Died
      Lower Class

Data overview

Now with a clean dataset, I am ready to formulate my hypothesis. I want to get a general overview of statistics for the dataset first. I use the describe() function on the data set. The useful statistic to look at is the mean, which gives us a general idea what the average value is for each feature. The standard deviation provides information on the spread of the data. The min and max give me information regarding whether there are outliers in the dataset. We should be careful and take these outliers into account when analyzing our data. I also calculate the median for each column in case there are outliers.



In [7]:

    
#describe() provides a statistical overview of the dataset
titanic_cleaned.describe()









    Out[7]:







  
    
      
      PassengerId
      Survived
      Pclass
      Age
      SibSp
      Parch
      Fare
    
  
  
    
      count
      712.000000
      712.000000
      712.000000
      712.000000
      712.000000
      712.000000
      712.000000
    
    
      mean
      448.589888
      0.404494
      2.240169
      29.642093
      0.514045
      0.432584
      34.567251
    
    
      std
      258.683191
      0.491139
      0.836854
      14.492933
      0.930692
      0.854181
      52.938648
    
    
      min
      1.000000
      0.000000
      1.000000
      0.420000
      0.000000
      0.000000
      0.000000
    
    
      25%
      222.750000
      0.000000
      1.000000
      20.000000
      0.000000
      0.000000
      8.050000
    
    
      50%
      445.000000
      0.000000
      2.000000
      28.000000
      0.000000
      0.000000
      15.645850
    
    
      75%
      677.250000
      1.000000
      3.000000
      38.000000
      1.000000
      1.000000
      33.000000
    
    
      max
      891.000000
      1.000000
      3.000000
      80.000000
      5.000000
      6.000000
      512.329200



In [8]:

    
#calculate the median for each column
titanic_cleaned.median()









    Out[8]:





PassengerId    445.00000
Survived         0.00000
Pclass           2.00000
Age             28.00000
SibSp            0.00000
Parch            0.00000
Fare            15.64585
dtype: float64

Looking at the means and medians, we see that the biggest difference is between mean and median of fare price. The mean is 34.57 while the median is only 15.65. It is likely due to the presence of outliers, the wealthy individuals who could afford the best suits. For example, the highest price fare is well over 500 dollars. I also see that the lowest fare price is 0, I suspect that those are the ship crews.

Now let's study the distribution of variables of interest. The countplot() from seaborn library plots a barplot that shows the counts of the variables. Let's take a look at our dependent variable - "Survived"



In [9]:

    
#I am using seaborn.countplot() to count and show the distribution of a single variable
sns.set(style="darkgrid")

ax = sns.countplot(x="Survival", data=titanic_cleaned)
plt.title("Distribution of Survival")









    Out[9]:





<matplotlib.text.Text at 0x111291a90>

We see that there were 342 passengers survived the disaster or around 38% of the sample.

Now, we also want to look at the distribution of some of other data including gender, socioeconomic class, age, and fare price. Gender, socioeconomic class, age are all categorical data, and barplot is best suited to show their count distribution. Fare price is a continuous variable, and a frequency distribution plot is used to study it.



In [10]:

    
#plt.figure() allows me to specify the size of the graph. 
#using fig.add_subplot allows me to display two subplots side by side
fig = plt.figure(figsize=(10,5)) 

ax1 = fig.add_subplot(121)
ax1=sns.countplot(x="Sex", data=titanic_cleaned)
plt.title("Distribution of Gender")

fig.add_subplot(122)
ax2 = sns.countplot(x="Class", data=titanic_cleaned)
plt.title('Distributrion of Class')









    Out[10]:





<matplotlib.text.Text at 0x11474b3d0>

It is now a good idea to combine the two graph to see how is gender and socioeconomic class intertwined. We see that among men, there is a much higher number of lower socioeconomic class individuals compared to women. For middle and upper class, the number of men and women are very similar. It is likely that families made up of the majority middle and upper-class passengers, while the lower class passengers are mostly single men.



In [11]:

    
#By using hue argument, we can study the another variable, combine with our original variable
sns.countplot(x='Sex', hue='Class', data=titanic_cleaned)
plt.title('Gender and Socioeconomic class')









    Out[11]:





<matplotlib.text.Text at 0x114818a50>

Fare price is a continuous variable, and for this type of variable, we use seaborn.distplot() to study its frequency distribution.

In comparison, age is a discrete variable and can be plotted by seaborn.countplot() which plots a bar plot that shows the counts.

We align the two plots horizontal using add_subplot to better demonstrate this difference.



In [12]:

    
#Use fig to store plot dimension
#use add_subplot to display two plots side by side
fig = plt.figure(figsize=(10,5)) 

ax1 = fig.add_subplot(121)
sns.distplot(titanic_cleaned.Fare)
plt.title('Distribution of fare price')
plt.ylabel('Density')

#for this plot, kde must be explicitly turn off for the y axis to counts instead of frequency density
axe2=fig.add_subplot(122)
sns.distplot(titanic_cleaned.Age,bins=40,hist=True, kde=False)
plt.title('Distribution of age')
plt.ylabel('Count')









    Out[12]:





<matplotlib.text.Text at 0x114b22f10>

We can see that the shape of two plots is quite different.

The fare price distribution plot shows a positively skewed curve, as most of the prices are concentrated below 30 dollars, and highest prices are well over 500 dollars
The age distribution plot demonstrates more of a bell-shaped curve (Gaussian distribution) with a slight mode for infants and young children. I suspect the slight spike for infants and young children to due to the presence of young families.

Observations on the dataset

342 passengers or roughly 38% of total survived.
There were significantly more men than women on board.
There are significantly higher numbers of lower class passengers compared to the mid and upper class.
The majority of fares sold are below 30 dollars, however, the upper price range of fare is very high, the most expensive ones are over 500 dollars, which should be considered outliers.

Hypothesis

Based on the overview of the data, I formulated 3 potential features that may have influenced the survival.

Fare price: What is the effect of fare price on survival rate? Are passengers who could afford more expensive tickets more likely to survive?
Gender: Does gender plays a role in survival? Are women more likely to survive than men?
Age: What age groups of the passengers are more likely to survive?

Fare Price and survival

Let's investigate fare price a bit deeper. First I am interested in looking at its relationship with socioeconomic class. Considering the large range of fare price, we use boxplot to better demonstrate the spread and confidence intervals of the data. The strip plot is used to show the density of data points, and more importantly the outliers.



In [13]:

    
#multiple plots can be overlayed. Boxplot() and striplot() turned out to be a good combination
sns.boxplot(x="Class", y="Fare", data=titanic_cleaned)
sns.stripplot(x="Class", y="Fare", data=titanic_cleaned, color=".25")
plt.title('Class and fare price')









    Out[13]:





<matplotlib.text.Text at 0x114dfd450>

This is not surprising that the outliers existed exclusively in the high socioeconomic class group, as only the wealthy individuals can afford the higher fare price.

This is clear that the upper class were able to afford more expensive fares, with highest fares above 500 dollars. To look at the survival rate, I break down the fare data into two groups:

Passengers with fare <=35 dollars
passengers with fare >35 dollars



In [14]:

    
#make a copy of the dataset and named it titanic_fare
#copy is used instead of assigning is to perserve the dataset in case anything goes wrong
#add a new column stating whether the fare >35 (value=1) or <=35 dollars (value=0)

titanic_fare = titanic_cleaned.copy()
titanic_fare['Fare>35'] = np.where(titanic_cleaned['Fare']>35,'Yes','No')

#check to see if the column creation is succesful
titanic_fare.head()









    Out[14]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Fare
      Embarked
      Survival
      Class
      Fare>35
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      7.2500
      S
      Died
      Lower Class
      No
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      71.2833
      C
      Survived
      Upper Class
      Yes
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      7.9250
      S
      Survived
      Lower Class
      No
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      53.1000
      S
      Survived
      Upper Class
      Yes
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      8.0500
      S
      Died
      Lower Class
      No



In [15]:

    
#Calculate the survival rate for passenger who holds fare > $35. 
#float() was used to forced a decimal result due to the limitation of python 2

high_fare_survival=titanic_fare.loc[(titanic_fare['Survived'] == 1)&(titanic_fare['Fare>35']=='Yes')]

high_fare_holder=titanic_fare.loc[(titanic_fare['Fare>35']=='Yes')]

high_fare_survival_rate=len(high_fare_survival)/float(len(high_fare_holder))

print high_fare_survival_rate









    



0.641176470588



In [16]:

    
#Calculate the survival rate for passenger who holds fare <= $35. 

low_fare_survival=titanic_fare.loc[(titanic_fare['Survived'] == 1)&(titanic_fare['Fare>35']=='No')]

low_fare_holder=titanic_fare.loc[(titanic_fare['Fare>35']=='No')]

low_fare_survival_rate=len(low_fare_survival)/float(len(low_fare_holder))

print low_fare_survival_rate









    



0.330258302583



In [17]:

    
#plot a barplot for survival rate for fare price > $35 and <= $35

fare_survival_table=pd.DataFrame({'Fare Price':pd.Categorical(['No','Yes']),
                                 'Survival Rate':pd.Series([0.32,0.62], dtype='float64')
                                 })
bar=fare_survival_table.plot(kind='bar', x='Fare Price', rot=0)
plt.ylabel('Survival Rate')
plt.xlabel('Fare>35')
plt.title('Fare price and survival rate')









    Out[17]:





<matplotlib.text.Text at 0x114fc5bd0>

The bar plot using matplotlib.pyplot does a reasonable job of showing the difference in survival rate between the two groups.

However with seaborn.barplot(), confidence intervals are directly calculated and displayed. This is an advantage of seaborn library.



In [18]:

    
#seaborn.barplot() can directly calculate/display the survival rate and confidence interval from the dataset

sns.barplot(x='Fare>35',y='Survived',data=titanic_fare, palette="Blues_d")
plt.title('Fare price and survival rate')
plt.ylabel('Survival Rate')









    Out[18]:





<matplotlib.text.Text at 0x114e91b90>

As seen from the graph, taking into account of confidence intervals, higher fare group is associated with significantly higher survival rate (~0.62) compared to lower fare group (~0.31).

How about if we just look at fare price as the continuous variable in relation to survival outcome?

When the Y variable is binary like survival outcome in this case, the statistical analysis suitable is "logistic Regression", where x variable is used as an estimator for the binary outcome of Y variable.

Fortunately, Seaborn.lmplot() allows us to graph the logistic regression function using fare price as an estimator for survival, the function displays a sigmoid shape and higher fare price is indeed associated with the better chance of survival.

Note: the area around the line shows the confidence interval of the function.



In [19]:

    
#use seaborn.lmplot to graph the logistic regression function
sns.lmplot(x="Fare", y="Survived", data=titanic_fare,
           logistic=True, y_jitter=.03)
plt.title('Logistic regression using fare price as estimator for survival outcome')
plt.yticks([0, 1], ['Died', 'Survived'])









    Out[19]:





([<matplotlib.axis.YTick at 0x11527fc10>,
  <matplotlib.axis.YTick at 0x11519aa90>],
 <a list of 2 Text yticklabel objects>)



In [35]:

    
fare_bins = np.arange(0,500,10)
sns.distplot(titanic_cleaned.loc[(titanic_cleaned['Survived']==0) & (titanic_cleaned['Fare']),'Fare'], bins=fare_bins)
sns.distplot(titanic_cleaned.loc[(titanic_cleaned['Survived']==1) & (titanic_cleaned['Fare']),'Fare'], bins=fare_bins)
plt.title('fare distribution among survival classes')
plt.ylabel('frequency')
plt.legend(['did not survive', 'survived']);

The fare distribution between survivors and non-survivors shows that there is peak in mortality for low fare price.

Gender and Survival

For this section, I am interested in investigation gender and survival rate. I will first calculate the survival rate for both female and male. Then plot a few graphs to visualize the relationship between gender and survival, and combine with other factors such as fare price and socioeconomic class.



In [ ]:

    
#Calculate the survival rate for female
female_survived=titanic_fare.loc[(titanic_cleaned['Survived'] == 1)&(titanic_cleaned['Sex']=='female')]

female_total=titanic_fare.loc[(titanic_cleaned['Sex']=='female')]

female_survival_rate=len(female_survived)/(len(female_total)*1.00)

print female_survival_rate



In [ ]:

    
#Calculate the survival rate for male
male_survived=titanic_fare.loc[(titanic_cleaned['Survived'] == 1)&(titanic_cleaned['Sex']=='male')]

male_total=titanic_fare.loc[(titanic_cleaned['Sex']=='male')]

male_survival_rate=len(male_survived)/(len(male_total)*1.00)

print male_survival_rate



In [ ]:

    
#plot a barplot for survival rate for female and male
#we can see that seaborn.barplot 
sns.barplot(x='Sex',y='Survived',data=titanic_fare)
plt.title('Gender and survival rate')
plt.ylabel('Survival Rate')



In [ ]:

    
##plot a barplot for survival rate for female and male, combine with fare price group
sns.barplot(x='Sex',y='Survived', hue='Fare>35',data=titanic_fare)
plt.title('Gender and survival rate')
plt.ylabel('Survival Rate')



In [ ]:

    
#plot a barplot for survival rate for female and male, combine with socioeconomic class
sns.barplot(x='Sex',y='Survived', hue='Class',data=titanic_fare)
plt.title('Socioeconomic class and survival rate')
plt.ylabel('Survival Rate')

Therefore, being a female is associated with significantly higher survival rate compared to male.

In addition, being in the higher socioeconomic group and higher fare group are associated with a higher survival rate in both male and female.

The difference is that in the male the survival rates are similar for class 2 and 3 with class 1 being much higher, while in the female the survival rates are similar for class 1 and 2 with class 3 being much lower.

Age and Survival

To study the relationship between age and survival rate. First, I seperate age into 6 groups number from 1 to 6:

1. newborn to 10 years old 
2. 10 to 20 years old 
3. 20 to 30 years old 
4. 30 to 40 years old 
5. 40 to 50 years old
6. over 50 years old

Then, I added the age group number as a new column to the dataset.



In [ ]:

    
#create a age_group function
def age_group(age):
    age_group=0
    if age<10:
        age_group=1
    elif age <20:
        age_group=2
    elif age <30:
        age_group=3
    elif age <40:
        age_group=4
    elif age <50:
        age_group=5
    else:
        age_group=6
    return age_group

#create a series of age group number by applying the age_group function to age column
ageGroup_column = titanic_fare['Age'].apply(age_group)

#make a copy of titanic_fare and name it titanic_age
titanic_age=titanic_fare.copy()

#add age group column
titanic_age['Age Group'] = ageGroup_column

#check to see if age group column was added properly
titanic_age.head()

Now, we want to plot a bar graph showing the relationship between age group and survival rate. Age group is used here instead of age because visually age group is easier to observe than using age variable when dealing with survival rate.



In [ ]:

    
#Seaborn.barplot is used to plot a bargraph and confidence intervals for survival rate
sns.barplot(x='Age Group', y='Survived',data=titanic_age)
plt.title('Age group and survival rate')
plt.ylabel('Survival Rate')

Age Group 1: < 10
Age Group 2: >= 10 and < 20
Age Group 3: >= 20 and < 30
Age Group 4: >= 30 and < 40
Age Group 5: >= 40 and < 50
Age Group 6: >= 50



In [ ]:

    
#draw bargram and bring additional factors including gender and class
fig = plt.figure(figsize=(10,5)) 

ax1 = fig.add_subplot(121)
sns.barplot(x='Age Group', y='Survived', hue='Sex',data=titanic_age)
plt.title('Age group, gender and survival rate')
plt.ylabel('Survival Rate')

ax1 = fig.add_subplot(122)
sns.barplot(x='Age Group', y='Survived',hue='Pclass',data=titanic_age)
plt.title('Age group, class and survival rate')
plt.ylabel('Survival Rate')

Age Group 1: < 10
Age Group 2: >= 10 and < 20
Age Group 3: >= 20 and < 30
Age Group 4: >= 30 and < 40
Age Group 5: >= 40 and < 50
Age Group 6: >= 50

The bar graphs demonstrate that only age group 1 (infants/young children) is associated with significantly higher survival rate. There are no clear distinctions on survival rate between the rest of age groups.

How about using age instead of age group. Is there a linear relationship between age and survival outcome? By using seaborn.lmplot(), We can perform a logistic regression on survival outcome using age as an estimator. Let's take a look.



In [ ]:

    
#use seaborn.lmplot to graph the logistic regression function
sns.lmplot(x="Age", y="Survived", data=titanic_age,
           logistic=True, y_jitter=.03)
plt.title('Logistic regression using age as the estimator for survival outcome')
plt.yticks([0, 1], ['Died', 'Survived'])

From the graph, we can see there is a negative linear relationship between age and survival outcome.



In [39]:

    
fare_bins = np.arange(0,100,2)
sns.distplot(titanic_cleaned.loc[(titanic_cleaned['Survived']==0) & (titanic_cleaned['Age']),'Age'], bins=fare_bins)
sns.distplot(titanic_cleaned.loc[(titanic_cleaned['Survived']==1) & (titanic_cleaned['Age']),'Age'], bins=fare_bins)
plt.title('age distribution among survival classes')
plt.ylabel('frequency')
plt.legend(['did not survive', 'survived']);

The age distribution comparison between survivors and non-survivors confirmed the survival spike in young children.

Limitations

There are limitations on our analysis:

Missing values: due to too much missing values(688) for the cabin. I decided to remove this column from my analysis. However, the 178 missing values for age data posed problems for us. In my analysis, I decided to drop the missing values because I felt we still had a reasonable sample size of >700, but selection bias definitely increased as the sample size decreased. Another option could be using the mean of existing age data to fill in for the missing values, this approach could be a good option if we had lots of missing value and still wants to incorporate age variable into our analysis. In this case, bias also increases as we are making assumptions for the passengers with missing age.
Survival bias: the data was partially collected from survivors of the disaster, and there could be a lot of data that were missing for people who did not survive. This leads the dataset becomes more representative toward the survivors. This limitation is difficult to overcome, as data we have today is the best of what we could gather due to the disaster has been happened over 100 years ago.
Outliers: for the fare price analysis, we saw that the fare prices had a large difference between mean(34.57) and median(15.65). The highest fares were well over 500 dollars. As a result, the distribution of our fare prices distribution is very positive skewed. This can affect the validity and the accuracy of our analysis. However, because I really wanted to see the survival outcome for the wealthier individuals, I decided to incorporate those outliers into my analysis. An alternative approach is to drop the outliers (e.g. fare prices >500) from our analysis, especially if we are only interested in studying the majority of the sample.



In [ ]:

    
# using the apply function and lambda to count missing values for each column

print titanic_original.apply(lambda x: sum(x.isnull().values), axis = 0)

The table shows the number of missing data in the data set, which is an important factor when considering the limitations of the analysis.

Conclusion

In conclusion, the dataset on Titanic's 891 passengers provided valuable insights for us. Through data analysis and visualizations, we saw that factors such as being in a higher socioeconomic class, higher fare price, being a female, being a young child/infant were all associated with significantly higher survival rate.

However, the missing values, survivor bias, and outliers introduced bias and affected the validity and accuracy of our study. For future studies, if we can incorporate several datasets of disasters that are similar to Titanic into one large dataset, it may allow us to gain additional insights on what features separate survivors and non-survivors, and possibly allow us to make predictions on disaster survival outcomes.

References

"RMS Titanic", 2017, Wikipedia, from https://en.wikipedia.org/wiki/RMS_Titanic

"Titanic: Machine Learning from Disaster", 2017, Kaggle, from https://www.kaggle.com/c/titanic

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S
12	13	0	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S
13	14	0	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S
14	15	0	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S
15	16	1	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S
16	17	0	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q
17	18	1	2	Williams, Mr. Charles Eugene	male	NaN	0	0	244373	13.0000	NaN	S
18	19	0	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S
19	20	1	3	Masselmani, Mrs. Fatima	female	NaN	0	0	2649	7.2250	NaN	C
20	21	0	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S
21	22	1	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S
22	23	1	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q
23	24	1	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S
24	25	0	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S
25	26	1	3	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	female	38.0	1	5	347077	31.3875	NaN	S
26	27	0	3	Emir, Mr. Farred Chehab	male	NaN	0	0	2631	7.2250	NaN	C
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.0000	C23 C25 C27	S
28	29	1	3	O'Dwyer, Miss. Ellen "Nellie"	female	NaN	0	0	330959	7.8792	NaN	Q
29	30	0	3	Todoroff, Mr. Lalio	male	NaN	0	0	349216	7.8958	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
861	862	0	2	Giles, Mr. Frederick Edward	male	21.0	1	0	28134	11.5000	NaN	S
862	863	1	1	Swift, Mrs. Frederick Joel (Margaret Welles Ba...	female	48.0	0	0	17466	25.9292	D17	S
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
864	865	0	2	Gill, Mr. John William	male	24.0	0	0	233866	13.0000	NaN	S
865	866	1	2	Bystrom, Mrs. (Karolina)	female	42.0	0	0	236852	13.0000	NaN	S
866	867	1	2	Duran y More, Miss. Asuncion	female	27.0	1	0	SC/PARIS 2149	13.8583	NaN	C
867	868	0	1	Roebling, Mr. Washington Augustus II	male	31.0	0	0	PC 17590	50.4958	A24	S
868	869	0	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
869	870	1	3	Johnson, Master. Harold Theodor	male	4.0	1	1	347742	11.1333	NaN	S
870	871	0	3	Balkic, Mr. Cerin	male	26.0	0	0	349248	7.8958	NaN	S
871	872	1	1	Beckwith, Mrs. Richard Leonard (Sallie Monypeny)	female	47.0	1	1	11751	52.5542	D35	S
872	873	0	1	Carlsson, Mr. Frans Olof	male	33.0	0	0	695	5.0000	B51 B53 B55	S
873	874	0	3	Vander Cruyssen, Mr. Victor	male	47.0	0	0	345765	9.0000	NaN	S
874	875	1	2	Abelson, Mrs. Samuel (Hannah Wizosky)	female	28.0	1	0	P/PP 3381	24.0000	NaN	C
875	876	1	3	Najib, Miss. Adele Kiamie "Jane"	female	15.0	0	0	2667	7.2250	NaN	C
876	877	0	3	Gustafsson, Mr. Alfred Ossian	male	20.0	0	0	7534	9.8458	NaN	S
877	878	0	3	Petroff, Mr. Nedelio	male	19.0	0	0	349212	7.8958	NaN	S
878	879	0	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
879	880	1	1	Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)	female	56.0	0	1	11767	83.1583	C50	C
880	881	1	2	Shelley, Mrs. William (Imanita Parrish Hall)	female	25.0	0	1	230433	26.0000	NaN	S
881	882	0	3	Markun, Mr. Johann	male	33.0	0	0	349257	7.8958	NaN	S
882	883	0	3	Dahlberg, Miss. Gerda Ulrika	female	22.0	0	0	7552	10.5167	NaN	S
883	884	0	2	Banfield, Mr. Frederick James	male	28.0	0	0	C.A./SOTON 34068	10.5000	NaN	S
884	885	0	3	Sutehall, Mr. Henry Jr	male	25.0	0	0	SOTON/OQ 392076	7.0500	NaN	S
885	886	0	3	Rice, Mrs. William (Margaret Norton)	female	39.0	0	5	382652	29.1250	NaN	Q
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	712.000000	712.000000	712.000000	712.000000	712.000000	712.000000	712.000000
mean	448.589888	0.404494	2.240169	29.642093	0.514045	0.432584	34.567251
std	258.683191	0.491139	0.836854	14.492933	0.930692	0.854181	52.938648
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	222.750000	0.000000	1.000000	20.000000	0.000000	0.000000	8.050000
50%	445.000000	0.000000	2.000000	28.000000	0.000000	0.000000	15.645850
75%	677.250000	1.000000	3.000000	38.000000	1.000000	1.000000	33.000000
max	891.000000	1.000000	3.000000	80.000000	5.000000	6.000000	512.329200