I imported all the necessary libraries and loaded the .csv file. I also made sure to include the inline matplotlib magic and to set the seaborn style to whitegrid. This ensures that all my graphs not only show up in the notebook but also have a white background, which makes them more readable.
In [16]:
##import everything
import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats  ##explicit import so sp.stats is available on every SciPy version
import matplotlib.pyplot as plt
import seaborn as sea
%matplotlib inline
sea.set(style="whitegrid")
titanic_ds = pd.read_csv('titanic-data.csv')
For the purpose of cleaning up the data I determined that the Name and Ticket columns were not necessary for my future analysis.
In [5]:
##don't need name and ticket for what I am going to be tackling
titanic_ds = titanic_ds.drop(["Name", "Ticket"], axis=1)
Using .info and .describe, I am able to get a quick overview of what the data set has to offer and whether anything stands out. In this instance we can see that the Embarked count is less than the number of passengers, but this will not be an issue for my analysis.
In [75]:
##overview of data
titanic_ds.info()
titanic_ds.describe()
Out[75]:
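As a complement to .info(), a per-column tally of missing values makes gaps such as the one in Embarked explicit. This is a minimal sketch on a tiny made-up frame (the name `toy` and its values are hypothetical, not taken from titanic-data.csv) so that it runs on its own; on the real data the same call would presumably be `titanic_ds.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (hypothetical values, not the real passenger data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Embarked": ["S", "C", None, "S"],
    "Survived": [0, 1, 1, 0],
})

# Number of missing entries in each column
missing = toy.isnull().sum()
print(missing)
```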
For my next few blocks of code and graphs I will be looking at two groups of individuals on the boat, Male and Female. To make the analysis a little easier I created two variables that hold all males and all females on the boat.
In [7]:
##defining men and women from data
men_ds = titanic_ds[titanic_ds.Sex == 'male']
women_ds = titanic_ds[titanic_ds.Sex == 'female']
Using those two data sets that I created in the previous block, I printed the counts to understand how many of each sex were on the boat.
In [8]:
# idea of the spread between men and women
print("Males: ")
print(men_ds.count()['Sex'])
print("Females: ")
print(women_ds.count()['Sex'])
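The same counts can also be read off in a single call with `value_counts`, without first splitting the data into two frames. The sketch below uses a made-up Sex column (the frame `toy` is hypothetical) purely to show the shape of the result.

```python
import pandas as pd

# Made-up Sex column (hypothetical values, not the real passenger list)
toy = pd.DataFrame({"Sex": ["male", "female", "male", "male", "female"]})

# One call returns the count for every category
sex_counts = toy["Sex"].value_counts()
print(sex_counts)
```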
For this section I utilized Seaborn's factorplot function to graph the count of males and females in each class.
In [80]:
##Gender distribution by class
gender_class= sea.factorplot('Pclass',order=[1,2,3], data=titanic_ds, hue='Sex', kind='count')
gender_class.set_ylabels("count of passengers")
Out[80]:
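The numbers behind a count plot like this one can be tabulated with `pd.crosstab`. This is only a sketch on made-up rows (the frame `toy` and its values are hypothetical); with the real data the arguments would presumably be `titanic_ds.Pclass` and `titanic_ds.Sex`.

```python
import pandas as pd

# Made-up passengers (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

# Rows are classes, columns are sexes, cells are passenger counts
class_by_sex = pd.crosstab(toy["Pclass"], toy["Sex"])
print(class_by_sex)
```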
To begin answering my first question of who has a higher probability of surviving I created two variables, men_prob and women_prob. For each one I grouped by sex, took the mean of the Survived column, and then printed the result.
In [76]:
##Probability of Survival by Gender
men_prob = men_ds.groupby('Sex').Survived.mean()
women_prob = women_ds.groupby('Sex').Survived.mean()
print("Male probability of survival: ")
##each result holds a single value, so take it by position
print(men_prob.iloc[0])
print("Female probability of survival: ")
print(women_prob.iloc[0])
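Because Survived is coded as 0 and 1, its mean within a group is the share of that group that survived, so both rates can also come from one group-by over the whole data set. The sketch below shows this on made-up rows (the frame `toy` is hypothetical and its rates are not the Titanic rates).

```python
import pandas as pd

# Made-up rows: 1 of 4 men and 3 of 4 women survive (hypothetical values)
toy = pd.DataFrame({
    "Sex": ["male"] * 4 + ["female"] * 4,
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Mean of a 0/1 column per group = survival rate per group
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```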
To visually answer the question of which sex had a higher probability of surviving I utilized seaborn's factorplot function to plot Sex against Survived in the form of a bar graph. I also included a y-axis label for presentation.
In [77]:
sbg = sea.factorplot("Sex", "Survived", data=titanic_ds, kind="bar", ci=None, size=5)
sbg.set_ylabels("survival probability")
Out[77]:
To answer my second question, what the age range of survivors vs. non-survivors was, I first wanted to see the distribution of age across the board. To do this I used the histogram function and also printed the median age.
To validate the finding that Females do have a higher probability of surviving than Males, I will be applying a statistical analysis, the chi-squared test, to gain the necessary understanding. My findings and code are below.
In [15]:
print("Total Count of Males and Females on ship: ")
print(titanic_ds.count()['Sex'])
print("Total Males:")
print(men_ds.count()['Sex'])
print("Males (Survived, Deceased): ")
print(men_ds[men_ds.Survived == 1].count()['Sex'], men_ds[men_ds.Survived == 0].count()['Sex'])
print("Total Women:")
print(women_ds.count()['Sex'])
print("Females (Survived, Deceased): ")
print(women_ds[women_ds.Survived == 1].count()['Sex'], women_ds[women_ds.Survived == 0].count()['Sex'])
In [20]:
# 2x2 contingency table: rows = (men, women), columns = (survived, deceased)
men_women_survival = np.array([
    [men_ds[men_ds.Survived == 1].count()['Sex'], men_ds[men_ds.Survived == 0].count()['Sex']],
    [women_ds[women_ds.Survived == 1].count()['Sex'], women_ds[women_ds.Survived == 0].count()['Sex']],
])
print(men_women_survival)
In [21]:
# Chi-square calculations
sp.stats.chi2_contingency(men_women_survival)
Out[21]:
Chi-square value: 260.71702016732104
p-value: 1.1973570627755645e-58
degrees of freedom: 1
expected frequencies table:
221.47474747, 355.52525253
120.52525253, 193.47474747
Given that the p-value of 1.1973570627755645e-58 is far below the significance level of .05, there is an indication of a relationship between gender and survivability. That means we reject the null hypothesis of independence in favor of the alternative hypothesis that gender and survivability are dependent on each other.
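The reported statistic can be cross-checked by hand. The sketch below assumes observed counts of 109 and 468 for men and 233 and 81 for women (survived, deceased); these are consistent with the totals and survival rates reported in this notebook, but they are an assumption here because the printed table is not shown. It applies Yates' continuity correction, which `chi2_contingency` uses by default when there is one degree of freedom.

```python
# Observed counts: rows = (men, women), columns = (survived, deceased).
# Assumed values, chosen to match the totals (577 and 314) reported in this notebook.
observed = [[109, 468], [233, 81]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = float(sum(row_totals))

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic with Yates' correction (subtract 0.5 from each |O - E|)
chi2 = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(expected)
print(chi2)
```

With these counts the expected table and the statistic agree with the values in Out[21], which suggests the rows of the contingency table are men and women and the columns are survived and deceased.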
In [68]:
##Distribution of age; Median age 28.0
titanic_ds['Age'].hist(bins=100)
print ("Median Age: ")
print(titanic_ds['Age'].median())
To answer my second question I showed the survival data against age in a box plot, which shows the median age as well as its distribution for both the deceased and the survivors.
In [61]:
##Age box plot, survived and did not survive
##fewer people survived as compared to deceased
age_box=sea.boxplot(x="Survived", y="Age", data=titanic_ds)
age_box.set(xlabel = 'Survived', ylabel = 'Age', xticklabels = ['Deceased', 'Survived'])
Out[61]:
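The center line of each box is the group's median age, so a group-by gives the numeric counterpart of the plot. This is a sketch on made-up ages (the frame `toy` is hypothetical); missing ages are skipped by `median`, as they are in the box plot.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value (hypothetical data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, 50.0, 10.0],
    "Survived": [0, 0, 1, 1, 0, 1],
})

# Median age for the deceased (0) and the survivors (1); NaN is ignored
age_median = toy.groupby("Survived")["Age"].median()
print(age_median)
```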
To tackle the third question, what the probability of surviving is and who has the higher probability, those who were Alone or those in a Family, I first created an expression that returns True if the number of relatives reported (SibSp + Parch) is above 0 (Family) and False if it is not (Alone). Below you can see the expression as well as the new column created with the True and False values.
In [16]:
titanic_ds['Family']=(titanic_ds.SibSp + titanic_ds.Parch > 0)
print(titanic_ds.head())
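To make the True/False rule concrete, the sketch below applies the same comparison to four made-up passengers (the frame `toy` and its values are hypothetical): only the passenger with no siblings, spouse, parents, or children aboard is flagged as alone.

```python
import pandas as pd

# Made-up relatives counts (hypothetical values)
toy = pd.DataFrame({
    "SibSp": [0, 1, 0, 2],
    "Parch": [0, 0, 2, 1],
})

# True when at least one relative is aboard, False when travelling alone
toy["Family"] = (toy.SibSp + toy.Parch > 0)
print(toy)
```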
To now show the probability visually as well as its output I created a factorplot and printed out the probabilities of the two groups (Alone = False and Family = True). To get the probabilities I divided the sum of survivors for each family type by the count of passengers of that family type.
In [33]:
fanda = sea.factorplot('Family', "Survived", data=titanic_ds, kind='bar', ci=None, size=5)
fanda.set(xticklabels = ['Alone', 'Family'])
print ((titanic_ds.groupby('Family')['Survived'].sum()/titanic_ds.groupby('Family')['Survived'].count()))
Finally, to answer my last question of whether being in a higher class affected the probability of surviving, I used the same seaborn factorplot. To print the probabilities behind this graph I took the sum of survivors in each class and divided it by the count of passengers in that class.
In [32]:
sea.factorplot('Pclass', "Survived", data=titanic_ds, kind='bar', ci=None, size=5)
PS=(titanic_ds.groupby('Pclass')['Survived'].sum())
PC=(titanic_ds.groupby('Pclass')['Survived'].count())
print ("Class Survivability: ")
print (PS/PC)
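Dividing the sum of a 0/1 column by its count is the same as taking its mean, which is also what the bar heights in the factorplot represent. The sketch below checks this on made-up rows (the frame `toy` is hypothetical and its rates are not the Titanic rates).

```python
import pandas as pd

# Made-up rows (hypothetical values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

grouped = toy.groupby("Pclass")["Survived"]

# Survivors divided by passengers, per class
by_ratio = grouped.sum() / grouped.count()
# The mean of the 0/1 column gives the same rate in one step
by_mean = grouped.mean()
print(by_mean)
```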
1) There were a total of 577 males and 314 females on the boat. From those totals I was able to derive that females had a higher probability of surviving as compared to males. Females had a 74.20% probability of surviving as compared to males, who had 18.89%.
2) The median age on the boat was 28 years old, and that did not change dramatically when I looked at the ages of survivors vs. those who didn't survive. I did observe that a greater number of older individuals, those over 40, did not survive, and while the number of survivors was much less than the number of those who didn't, it is interesting to see the distribution.
3) When looking at whether being in a family meant a higher probability of surviving as compared to being alone, we can see that being in a family meant a 50.56% probability of surviving as compared to 30.35% if the individual was alone.
4) Finally I looked at whether being in a higher class meant a higher probability of living, and as my data showed, being in Class 1 meant a 62.96% probability of surviving as compared to 47.28% and 24.23% for Class 2 and 3.
So what does this all mean? Well, the simplest way to put it is that if you were a Female, with a family, in your 20's and in first class, you probably survived the tragedy and made it out on a lifeboat. Females and families surviving are fairly obvious points, as families were unlikely to be broken up during the evacuation, and Women and Children are predominantly seen as priorities in times of emergency. The conclusion on class having an effect is also a clear indication of the times and the "value" of individuals in that socioeconomic condition. Higher-ranked individuals were able to gain not only quicker access but also increased access to safety, as they potentially had the means to sway decisions.