Author: Navina Govindaraj
Date: April 2017
The Titanic dataset has been chosen for this project. It contains demographics and passenger information from 891 of the 2224 passengers and crew on board the Titanic.
Source: Kaggle
1) Did survival differ by age and gender?
2) Did class play a role in survival?
In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
%matplotlib inline
In [2]:
titanic_data = pd.read_csv('./titanic_data.csv')
titanic_data.head()
Out[2]:
In [3]:
# Checking data types by column
titanic_data.dtypes
Out[3]:
In [4]:
# Checking for duplicate entries
duplicates = titanic_data.duplicated().sum()
print('Duplicate Entries = {}'.format(duplicates))
In [5]:
# Removing variables that are not relevant to the analysis
titanic_clean = titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
titanic_clean.head()
Out[5]:
In [6]:
# Checking for missing values
titanic_clean.isnull().sum()
Out[6]:
In [7]:
# Group those with missing age based on Sex
titanicNullAge = titanic_clean[titanic_clean['Age'].isnull()]
titanicNullAge.groupby('Sex').size()
Out[7]:
To answer the research questions, 'Age' is the only variable with missing values that needs to be dealt with. The 177 missing values will be filled in with the mean age for Sex = "male" and Sex = "female" separately. Rows with missing ages are not dropped from the analysis, since they constitute about 20% of the dataset (177 of 891 rows) and losing this data would interfere with the results.
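The group-wise fill described above can be illustrated on a small example. This is only a sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration, not taken from the Titanic file); it shows that filling each missing age with the mean of its own sex group leaves the group means unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Mean age per sex, computed from the non-missing values only
means_before = toy.groupby("Sex")["Age"].mean()

# transform("mean") returns one value per row, aligned with the original index
group_mean = toy.groupby("Sex")["Age"].transform("mean")
toy["Age"] = toy["Age"].fillna(group_mean)

# Mean age per sex after the fill
means_after = toy.groupby("Sex")["Age"].mean()

print(toy)
print(means_after)
```

Note that while the group means stay the same, the spread of ages within each group shrinks, because every filled value sits exactly at the group mean.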
In [8]:
titanic_clean.isnull().sum()
Out[8]:
In [9]:
# Find mean age for each group (based on Sex)
mean_age = titanic_clean.groupby("Sex")["Age"].mean()
mean_age
Out[9]:
In [10]:
# Populating NA with mean ages for "male" and "female"
titanic_clean["Age"] = titanic_clean["Age"].fillna(
    titanic_clean.groupby("Sex")["Age"].transform("mean"))
In [11]:
# Checking if ["Age"] has been populated with the mean values
if titanic_clean['Age'].isnull().sum() != 0:
    print("Some missing Age values were not filled!")
In [12]:
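# Checking if the mean remains unchanged after populating missing ages
# (recomputed here, since mean_age still holds the values from before the fill)
titanic_clean.groupby("Sex")["Age"].mean()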
Out[12]:
In [13]:
titanic_clean.describe()
Out[13]:
The table above gives an overview of the dataset. Here are some important points to note.
Looking at the survival data graphically:
In [14]:
f, axs = plt.subplots(figsize=(18, 5), ncols = 3)
sns.set_palette("Set2")
# Fig 1 - Survival Distribution
sns.countplot(x="Survived", data=titanic_clean, alpha=.65,
ax=axs[0]).set_title("Fig 1: Survival Distribution, (1 = Survived)")
# Fig 2 - Survival by Age
sns.boxplot(x="Survived", y="Age",data=titanic_clean,
ax=axs[1]).set_title("Fig 2: Survival by Age, (1 = Survived)")
# Fig 3 - Survival by Gender
sns.countplot(y="Survived", hue="Sex", palette={"male":"m","female":"orange"}, data=titanic_clean,
alpha=.55, ax=axs[2]).set_title("Fig 3: Survival by Gender, 1 = Survived")
Out[14]:
Drilling down into survival by class:
In [15]:
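# Survival counts by sex, one panel per travel class
# (newer seaborn releases rename factorplot to catplot and size to height)
g = sns.factorplot(x="Sex", hue="Survived", col="Pclass",
                   data=titanic_clean, kind="count", size=4, aspect=1, alpha=.65)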
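The counts in the plot can also be summarised as survival rates per group. The following is a minimal sketch on hypothetical toy rows (the `toy` frame is made up; only the column names mirror the notebook's), relying on the fact that `Survived` is coded 0/1, so its mean is the share of survivors.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the notebook's
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male",
                 "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is coded 0/1, so its mean is the share of survivors
overall_rate = toy["Survived"].mean()

# Survival rate for each combination of class and sex (rows: class, columns: sex)
rate_by_group = toy.groupby(["Pclass", "Sex"])["Survived"].mean().unstack()

print(overall_rate)
print(rate_by_group)
```

Applied to `titanic_clean`, the same two calls would give the overall survival rate and a class-by-sex table of rates.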
Chi-Squared Test for Independence: 'Pclass' vs. 'Survived'
$H_0$: Survival is NOT dependent on the travel class of the passenger
$H_a$: Survival IS dependent on the travel class of the passenger
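To make the test concrete, the sketch below computes the expected frequencies and the chi-squared statistic by hand and compares them with `stats.chi2_contingency`. The `observed` counts are hypothetical and chosen only for illustration; they are not the Titanic counts.

```python
import numpy as np
from scipy import stats

# Hypothetical 3x2 table of counts (rows: class, columns: died / survived)
observed = np.array([
    [20, 30],
    [30, 20],
    [50, 10],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / n

# Pearson chi-squared statistic and its degrees of freedom
chi_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: upper tail of the chi-squared distribution
p_manual = stats.chi2.sf(chi_manual, dof_manual)

# The same quantities from SciPy, for comparison
chi, p, dof, expected = stats.chi2_contingency(observed)
print(chi_manual, chi)
print(p_manual, p)
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so a hand computation like this one would differ slightly unless `correction=False` is passed.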
In [16]:
# Chi-squared test of independence between two categorical columns
def chi_squared_test(col1, col2, isPrint=False):
    contingency_table = pd.crosstab(col1, col2)
    chi, p, dof, expected = stats.chi2_contingency(contingency_table)
    if isPrint:
        print(contingency_table)
        print("Chi square: {}".format(chi))
        print("p-value: {}".format(np.round(p, decimals=4)))
        print("Degrees of freedom: {}".format(dof))
        print("\nExpected frequency:\n{}".format(expected))
    return contingency_table, chi, p, dof, expected
In [17]:
#Pclass vs. Survived (Chi-squared test)
contingency_table, chi, p, dof, expected = chi_squared_test(titanic_clean["Pclass"],
titanic_clean["Survived"], isPrint=True)
Chi-Squared Test for Independence: 'Sex' vs. 'Survived'
$H_0$: Survival is NOT dependent on the passenger's sex
$H_a$: Survival IS dependent on the passenger's sex
In [18]:
#Sex vs. Survived (Chi-squared test)
contingency_table, chi, p, dof, expected = chi_squared_test(titanic_clean["Sex"],
titanic_clean["Survived"],
isPrint=True)
Independent two-sample t-test: 'Age' vs. 'Survived'
$H_0 : \mu_0 = \mu_1 $ There is no difference in mean age between the survivors and the victims
$H_a : \mu_0 \neq \mu_1 $ The mean age of survivors is significantly different from the mean age of victims
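The test run later in this notebook passes `equal_var=False`, which is Welch's variant of the two-sample t-test. As a rough sketch of what that computes, the code below derives the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand and compares them with `stats.ttest_ind`. The samples `a` and `b` are synthetic and used only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples, for illustration only (not the Titanic ages)
rng = np.random.RandomState(0)
a = rng.normal(loc=30.0, scale=14.0, size=200)
b = rng.normal(loc=28.0, scale=12.0, size=120)

# Sample variance (ddof=1) of each group divided by its sample size
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over the unpooled standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from Student's t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# The same test from SciPy, for comparison
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

Unlike the pooled-variance t-test, this version does not assume that the two groups share the same variance.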
In [19]:
# Visualizing both the distributions
h = sns.FacetGrid(titanic_clean, col="Survived", hue="Survived").map(sns.distplot, "Age")
h.set_axis_labels("Age", "KDE")
Out[19]:
In [20]:
# Variance of age and group size for passengers who died vs. survived (survived=1)
print("Variance:")
print(titanic_clean.groupby("Survived")["Age"].var())
print("\n")
print("Number of passengers:")
print(titanic_clean.groupby("Survived")["Age"].size())
Assumptions for Welch's t-test:
In [21]:
x = titanic_clean[titanic_clean["Survived"]==0]["Age"]
y = titanic_clean[titanic_clean["Survived"]==1]["Age"]
stats.ttest_ind(x, y, equal_var=False)
Out[21]:
In the given dataset, only 38.4% of the 891 passengers survived the sinking of the Titanic. The analysis helps us understand whether any variable or combination of variables was associated with a person's chance of survival.
To answer the research questions:
1) Did survival differ by age and gender?
2) Did class play a role in survival?