The data set I analyzed is the Titanic data.
First we need to understand the data, and then ask questions.
The description of this CSV file gives the following definitions of its variables.
VARIABLE DESCRIPTIONS:
survival    Survival (0 = No; 1 = Yes)
pclass      Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name        Name
sex         Sex
age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Then we can ask questions.
Kaggle suggests that some groups of people were more likely to survive, such as children, women, and the upper class.
So I will ask: are these factors really related to the survival rate?
Add: Do different sexes in the same class have different survival rates?
Or does the same sex have different survival rates in different classes?
Furthermore, when I searched online for the structure of the Titanic and the location of its cabins,
I found that the cabin may also be connected to the survival rate,
since some cabins were far from the boat deck and housed many people together.
Therefore, I will ask: did people living in different cabins have different survival rates?
Revise: What is the connection between fare and survival rate?
Let's wrangle the data.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%pylab inline
# Get a glimpse of data
titanic_df = pd.read_csv('../input/train.csv')
titanic_df.head()
In [2]:
# Check the information of our data
titanic_df.info()
# Check Cabin column
# print(titanic_df['Cabin'])
As we can see, unfortunately, there is too little data about the cabins.
Some entries even list several cabin names.
We need to change the question, or find a way around the problem.
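To make the size of the problem concrete, here is a minimal sketch of how the missing and multi-cabin entries could be counted. The small `sample` frame is made-up illustration data, not rows from `train.csv`; on the real data the same calls would presumably be applied to `titanic_df['Cabin']`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample standing in for titanic_df['Cabin']
sample = pd.DataFrame({'Cabin': ['C85', np.nan, 'C23 C25 C27', np.nan, 'E46', np.nan]})

# Share of passengers with no cabin recorded
missing_share = sample['Cabin'].isnull().mean()

# Number of cabin names listed per passenger (missing values stay NaN)
cabins_per_row = sample['Cabin'].str.split().str.len()

# Count rows that list more than one cabin (NaN > 1 evaluates to False)
multi_cabin = int((cabins_per_row > 1).sum())

print(missing_share, multi_cabin)
```

A high missing share is what makes a direct cabin-based analysis impractical here.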
First, I tried to work around it.
Passengers of different classes lived in different areas and rooms of the ship, as the source I found describes.
Ticket prices also differed by class, for example 3-8 pounds for 3rd class and 12 pounds for 2nd class.
So I came up with an idea: can we guess passengers' rooms from their ticket price?
However, when I searched for the rooms corresponding to each class,
I found that for some decks, such as D, E, and F, it is hard to determine which class lived there.
Still, 1st class mainly lived on decks A to E, 2nd class on D to F, and 3rd class on F to G.
Therefore, people with different fares lived in different areas.
I change my question to: what is the connection between fare and survival rate?
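The step from cabin to fare rests on the assumption that fare tracks class (and so location on the ship). A minimal sketch of how that assumption could be checked is below. The fares in `sample` are invented for illustration; with the real data one would presumably group `titanic_df` by `Pclass` before any columns are dropped.

```python
import pandas as pd

# Hypothetical fares (in pounds), for illustration only
sample = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'Fare':   [80.0, 52.0, 30.0, 26.0, 13.0, 12.0, 8.0, 7.5, 7.0],
})

# Lowest, median, and highest fare within each class
fare_by_class = sample.groupby('Pclass')['Fare'].agg(['min', 'median', 'max'])
print(fare_by_class)
```

If the fare ranges of neighbouring classes overlap heavily, fare is only a rough stand-in for location.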
In [3]:
# First, drop the columns that seem useless for this analysis:
# ID, name, ticket number, port of embarkation, cabin, SibSp, and Parch
titanic_df = titanic_df.drop(['PassengerId','Name','Ticket','Embarked','Cabin','SibSp','Parch'],axis=1)
titanic_df.head()
In [4]:
# First, analyze by sex and age
# Relabel passengers aged 16 or younger as 'child', separate from 'male' and 'female'
titanic_df.loc[titanic_df['Age'] <= 16, 'Sex'] = 'child'
titanic_df = titanic_df.drop(['Age'],axis=1)
titanic_df.head()
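One detail of the relabelling step is worth spelling out: a comparison against a missing age is `False`, so passengers with no recorded age keep their original `Sex` label and are counted as non-children. The sketch below shows this on invented rows (the `sample` frame is hypothetical, not taken from the data set).

```python
import numpy as np
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [10.0, 30.0, np.nan, 16.0],
})

# NaN <= 16 evaluates to False, so the row with a missing age is not relabelled
is_child = sample['Age'] <= 16
sample.loc[is_child, 'Sex'] = 'child'

labels = sample['Sex'].tolist()
n_unknown_age = int(sample['Age'].isnull().sum())
print(labels, n_unknown_age)
```

Counting the missing ages before dropping the `Age` column shows how many passengers fall into this ambiguous group.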
In [5]:
# Plot the relation between survival and the sex/age groups
# (factorplot was renamed catplot, and size renamed height, in seaborn 0.9)
sns.catplot(data=titanic_df,x='Sex',y='Survived',kind="violin",height=4,aspect=3)
# Plot basic information about sex and age
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
sns.countplot(data=titanic_df, x='Sex',ax=axis1)
sns.countplot(data=titanic_df,x='Survived',hue='Sex',order=[0,1],ax=axis2)
fig, (axis3,axis4) = plt.subplots(1,2,figsize=(15,5))
# Group data by sex and whether child
sex_survi_groups = titanic_df[['Sex','Survived']].groupby(['Sex'],as_index=True)
#Divide into three groups
men_group = sex_survi_groups.get_group('male')
women_group = sex_survi_groups.get_group('female')
children_group = sex_survi_groups.get_group('child')
# Plot survive rate between different sex
men_women_group = pd.concat([men_group,women_group])
# sns.barplot(data=men_women_group,x='Sex',y='Survived',order=['male','female'],ax=axis3)
sns.barplot(data=titanic_df[['Sex','Survived']],x='Sex',y='Survived',order=['male','female','child'],ax=axis3)
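# Child vs. non-child: build a 0/1 indicator column named 'child'
# (dtype=int keeps 0/1 values; newer pandas versions default to True/False)
child_dummy = pd.get_dummies(titanic_df['Sex'], dtype=int)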
child_dummy.drop(['male','female'],axis=1,inplace=True)
child_dummy['Survived'] = titanic_df['Survived']
child_groups = child_dummy.groupby(['child'],as_index=False)
child_group_perc = child_groups.mean()
sns.barplot(data=child_group_perc,x='child',y='Survived',order=[0,1],ax=axis4)
axis3.set_title('Survival rate by sex')
axis4.set_title('Survival rate by child vs. non-child')
In [6]:
# Statistical hypothesis test
# T-test for men and women
# H0: Men and women have the same survival rate, mean(men)=mean(women)
from scipy.stats import ttest_ind
ttest_ind(men_group['Survived'],women_group['Survived'])
In [7]:
# T-test for children and non-children
# H0: Children and non-children have the same survival rate, mean(child)=mean(non-child)
ttest_ind(child_groups.get_group(0)['Survived'],child_groups.get_group(1)['Survived'])
We can see that for men and women the t-statistic is a large negative number and the p-value is very small, far smaller than 0.01.
Therefore we can confidently reject the null hypothesis and say that women had a much higher survival rate than men.
For children and non-children, the difference is less pronounced than for sex.
Still, the p-value is below 0.01, so at the 1% significance level we can say that children had a higher survival rate than non-children.
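Because `Survived` is a 0/1 outcome, a chi-square test of independence on the table of counts is a common alternative to the t-test used above. The sketch below is that alternative technique, not the method of this analysis, and the counts in `sample` are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts for illustration, not the real Titanic figures
sample = pd.DataFrame({
    'Sex': ['male'] * 50 + ['female'] * 50,
    'Survived': [1] * 10 + [0] * 40 + [1] * 35 + [0] * 15,
})

# 2x2 table of counts: rows are the groups, columns are the outcomes
table = pd.crosstab(sample['Sex'], sample['Survived'])

# Returns the statistic, the p-value, the degrees of freedom,
# and the expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```

As with the t-test, a small p-value argues against the hypothesis that survival is independent of the group.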
In [8]:
# Next, analyze the class factor
# (factorplot was renamed catplot, and size renamed height, in seaborn 0.9)
sns.catplot(data=titanic_df,x='Pclass',y='Survived',kind="violin",height=4,aspect=3)
# Group by class
class_survi_prec = titanic_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean()
# Compare number and survived rate between three classes
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(15,5))
sns.countplot(data=titanic_df, x='Pclass',ax=axis1)
sns.barplot(data=class_survi_prec,x='Pclass',y='Survived',order=[1,2,3],ax=axis2)
In [9]:
# Statistical hypothesis test: as there are three classes, we use one-way ANOVA
# H0: The three classes have the same survival rate, mean(class1)=mean(class2)=mean(class3)
from scipy.stats import f_oneway
class1_group = titanic_df[['Pclass','Survived']][titanic_df["Pclass"]==1]
class2_group = titanic_df[['Pclass','Survived']][titanic_df["Pclass"]==2]
class3_group = titanic_df[['Pclass','Survived']][titanic_df["Pclass"]==3]
f_oneway(class1_group['Survived'],class2_group['Survived'],class3_group['Survived'])
In [10]:
# T-test between class 1 and class 2
# H0: mean(class1)=mean(class2)
ttest_ind(class1_group['Survived'],class2_group['Survived'])
In [11]:
# T-test between class 2 and class 3
# H0: mean(class2)=mean(class3)
ttest_ind(class2_group['Survived'],class3_group['Survived'])
In [12]:
# T-test between class 1 and class 3
# H0: mean(class1)=mean(class3)
ttest_ind(class1_group['Survived'],class3_group['Survived'])
First, the graphs show that there are indeed differences between the three classes.
1st class has the highest survival rate, followed by 2nd class, and then 3rd class.
In particular, 3rd class differs sharply from the upper two classes:
its survival rate is much lower than theirs.
To confirm this observation, we run some tests.
First, a one-way ANOVA on the three classes gives a very high F-statistic and a very low p-value.
So we can confidently reject H0 and say that Pclass is related to the survival rate.
But how large is the difference between each pair of groups?
I use three t-tests to check. Class 1 and class 2 passengers do have different survival rates.
However, compared with the gap between class 3 and the other two, this difference is relatively small.
We can conclude that class is related to the survival rate, particularly between the upper two classes and class 3.
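Running three pairwise t-tests raises the chance of a false positive, so one common safeguard is a Bonferroni correction, which divides the significance level by the number of comparisons. The sketch below illustrates that idea on invented 0/1 outcomes. The `groups` dictionary and the choice of `alpha = 0.01` are assumptions for illustration, not part of the analysis above.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical 0/1 survival outcomes per class, for illustration only
groups = {
    1: [1] * 30 + [0] * 20,
    2: [1] * 24 + [0] * 26,
    3: [1] * 10 + [0] * 40,
}

alpha = 0.01
pairs = list(combinations(groups, 2))  # (1, 2), (1, 3), (2, 3)

# Bonferroni: divide the significance level by the number of comparisons
adjusted_alpha = alpha / len(pairs)

results = {}
for a, b in pairs:
    stat, p = ttest_ind(groups[a], groups[b])
    # Store the p-value and whether it clears the stricter threshold
    results[(a, b)] = (p, bool(p < adjusted_alpha))
    print(a, b, p, results[(a, b)][1])
```

With the real data, each list would be replaced by the `Survived` column of the corresponding class group.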
In [13]:
# Finally, analyze the fare factor
titanic_df['Fare'].plot(kind='hist', figsize=(15,3),bins=100)
# Exclude passengers with very high fares (200 or more)
normal_people = titanic_df[['Fare','Survived']][titanic_df['Fare']<200]
fare_survi_group = normal_people[['Fare','Survived']].groupby(['Survived'],as_index=False)
fare_survi_perc = fare_survi_group.mean()
plt.figure(2)
# sns.barplot(data=fare_survi_perc,x='Survived',y='Fare',order=[0,1])
# (factorplot was renamed catplot in seaborn 0.9; kind="point" keeps the old default)
sns.catplot(data=normal_people,x='Survived',y='Fare',kind="point",aspect=2)
In [14]:
# Statistical test
# H0: Survivors and non-survivors paid the same fare, mean(survive_fare)=mean(non_survive_fare)
ttest_ind(fare_survi_group.get_group(0)['Fare'],fare_survi_group.get_group(1)['Fare'])
First, we find that a few people paid very high fares, and we exclude them (fares of 200 or more) so that they do not dominate the analysis.
Then, from the chart, we find that survivors paid a higher mean fare than non-survivors.
We can run a t-test to confirm this.
The p-value from the t-test is so small that we can confidently say survivors and non-survivors paid different fares.
Moreover, survivors paid higher fares than non-survivors.
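Since the result depends on excluding the very high fares, it can help to see how sensitive the group means are to that cutoff. The sketch below compares means with and without the cutoff and adds the median as a more robust summary. The fares in `sample` are invented, with one extreme value standing in for the outliers.

```python
import pandas as pd

# Hypothetical fares for illustration; one extreme value mimics the outliers
sample = pd.DataFrame({
    'Survived': [0, 0, 0, 0, 1, 1, 1, 1],
    'Fare':     [7.0, 8.0, 13.0, 26.0, 13.0, 30.0, 80.0, 512.0],
})

# Mean fare per outcome using all rows
mean_all = sample.groupby('Survived')['Fare'].mean()

# Same comparison after dropping fares of 200 or more
trimmed = sample[sample['Fare'] < 200]
mean_trimmed = trimmed.groupby('Survived')['Fare'].mean()

# The median is less sensitive to extreme fares than the mean
median_all = sample.groupby('Survived')['Fare'].median()

print(mean_all, mean_trimmed, median_all, sep='\n')
```

If the ordering of the two groups is the same in all three summaries, the conclusion does not hinge on where the cutoff is placed.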
In [15]:
# To explore in more detail,
# look at the sex distribution within each class
sns.countplot(data=titanic_df,x='Pclass',hue='Sex',order=[1,2,3])
In [16]:
# The plot above shows that class 3 has a large share of men
# So we might guess that the low survival rate of men is driven by class 3 men,
# and that the difference between the sexes in the higher classes is less distinct
# Plot the survival rate by sex within each class
class_sex_group = titanic_df[['Sex','Pclass','Survived']].groupby(['Sex','Pclass'],as_index=False)
class_sex_survive_prec = class_sex_group.mean()
sns.barplot(data=class_sex_survive_prec, x='Sex',y='Survived',hue='Pclass',order=['male','female','child'])
In [17]:
# Class 1 and class 2 women appear to have similar survival rates
# H0: Survived mean(female_class1)=mean(female_class2)
female_class1 = class_sex_group.get_group(('female',1))
female_class2 = class_sex_group.get_group(('female',2))
ttest_ind(female_class1['Survived'],female_class2['Survived'])
In [18]:
# Class 1 and class 2 children also appear to have very similar survival rates
# H0: Survived mean(child_class1)=mean(child_class2)
child_class1 = class_sex_group.get_group(('child',1))
child_class2 = class_sex_group.get_group(('child',2))
ttest_ind(child_class1['Survived'],child_class2['Survived'])
In [19]:
# Class 2 and class 3 men also appear to have similar survival rates
# H0: Survived mean(male_class2)=mean(male_class3)
male_class2 = class_sex_group.get_group(('male',2))
male_class3 = class_sex_group.get_group(('male',3))
ttest_ind(male_class2['Survived'],male_class3['Survived'])
From the chart, we can see that women do have a higher survival rate than men within every class.
Men in 1st class have a higher survival rate than other men, while children and women in 3rd class have lower survival rates than those in the upper classes.
However, when we test class 1 females against class 2 females, class 1 children against class 2 children, and class 2 males against class 3 males,
we cannot reject the null hypothesis at a high significance level.
So we can conclude that although higher classes have higher survival rates overall,
for women and children there is not much difference between class 1 and class 2,
and for men there is not much difference between class 2 and class 3.
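Failing to reject a null hypothesis can also reflect small group sizes, so it may be useful to look at the survival rate and the number of passengers in each subgroup side by side. The sketch below does this on invented rows. The `sample` frame and the threshold of 2 are purely illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'female', 'male', 'male', 'male', 'child', 'child'],
    'Pclass':   [1, 1, 2, 3, 2, 3, 3, 1, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Survival rate (mean) and group size (count) for every Sex x Pclass combination
summary = sample.groupby(['Sex', 'Pclass'])['Survived'].agg(['mean', 'count'])
print(summary)

# Subgroups with very few passengers give unreliable rates and weak tests
small_groups = summary[summary['count'] < 2]
```

On the real data, the same `agg(['mean', 'count'])` call on `titanic_df` would show, for example, how many children each class contains.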
From the violin chart, we can clearly see the survival distribution of males, females, and children.
For children, it is nearly half and half.
The bar charts show more detail.
We can also use statistical hypothesis tests to confirm this.
Using t-tests, we get t-statistic=-19.921, pvalue=7.546e-72 for male and female,
and t-statistic=-3.649, pvalue=0.00028 for children and non-children.
These results reject the null hypotheses and confirm our hypothesis at a high significance level.
As above, we show the violin plot first.
We can see that most of 1st class survived, most of 3rd class died, and nearly half of 2nd class survived.
We analyze the three classes with ANOVA, and then run a t-test for each pair of classes.
The results show that class is related to the survival rate, especially between class 3 and the upper classes.
First, we show the distribution of passengers over fares.
For fairness, we remove the passengers with very high fares, and plot the mean fare for the survivor and non-survivor groups.
The t-test also confirms our idea.
First, we plot a bar chart for each sex and the corresponding class.
Some interesting things emerge.
For females and children, 1st class and 2nd class seem to have similar survival rates.
To confirm this observation, we run t-tests between 1st and 2nd class females, between 1st and 2nd class children, and between 2nd and 3rd class men.
At the 99% confidence level, none of the three null hypotheses can be rejected.
Therefore, the overall conclusion about class does not hold within these groups.