In [2]:
# standard imports
import pandas as pd
from pandas import Series, DataFrame
In [3]:
# reading the data from csv file
titanic_df = pd.read_csv('train.csv')
# preview of data
titanic_df.head()
Out[3]:
In [4]:
# quick summary of the data: column names, dtypes and non-null counts
titanic_df.info()
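Since info() only reports non-null counts, a per-column tally of missing values can be easier to scan. Here is a minimal sketch on a made-up three-row frame; `toy_df` and its values are hypothetical stand-ins, not rows from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Sex': ['male', 'female', 'female'],
})

# Count the missing entries in each column
missing_counts = toy_df.isnull().sum()
print(missing_counts)
```

On the real data the same call would be `titanic_df.isnull().sum()`.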
All good data analysis projects begin with questions. Now that we know which columns the data contains, let's think of some questions or insights we would like to get out of it. So here's a list of questions we'll try to answer using our new data analysis skills!
First some basic questions:
1.) Who were the passengers on the Titanic? (Ages, Gender, Class, etc.)
2.) What deck were the passengers on and how does that relate to their class?
3.) Where did the passengers come from?
4.) Who was alone and who was with family?
Then we'll dig deeper, with a broader question:
5.) What factors helped someone survive the sinking?
So let's start with the first question: Who were the passengers on the Titanic?
In [4]:
# plotting library
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
In [7]:
# quick look at the sex of people on the Titanic
""" We will use a factor plot for this: it takes a column name and splits the data by the values available in that column. """
sns.factorplot('Sex', data=titanic_df)
Out[7]:
In [10]:
# Male and female in each class
sns.factorplot('Sex', data=titanic_df, hue='Pclass')
Out[10]:
In [11]:
# better way
sns.factorplot('Pclass', data=titanic_df, hue='Sex')
Out[11]:
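The counts behind a plot like this can also be read off as a table. Below is a small sketch using pd.crosstab on made-up rows; `toy_df` is a hypothetical stand-in for titanic_df, and the numbers are for illustration only.

```python
import pandas as pd

# Hypothetical toy data (made up); the real frame is titanic_df
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
})

# Rows are classes, columns are sexes, cells are passenger counts
class_by_sex = pd.crosstab(toy_df['Pclass'], toy_df['Sex'])
print(class_by_sex)
```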
Wow, quite a few more males than females in 3rd class, an interesting find. However, it might be useful to know the split between males, females, and children. How can we go about this?
In [20]:
# We'll treat anyone under 16 as a child, then use apply with a function to create a new column
def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex
# creating a 'Person' column in titanic_df
# the function takes one row at a time, so we apply it along axis=1
titanic_df['Person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
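The same labelling can probably be done without a row-wise function. Here is a sketch of a vectorized alternative with np.where, run on a hypothetical `toy_df` with made-up values rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made up), including one passenger with unknown age
toy_df = pd.DataFrame({
    'Age': [5.0, 30.0, np.nan, 15.0, 16.0],
    'Sex': ['male', 'female', 'male', 'female', 'male'],
})

# NaN < 16 evaluates to False, so passengers with unknown age keep their sex label
toy_df['Person'] = np.where(toy_df['Age'] < 16, 'child', toy_df['Sex'])
print(toy_df)
```

Note that a 16-year-old is not counted as a child here, matching the `age < 16` test in the function above.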
In [21]:
titanic_df[0:10]
Out[21]:
In [22]:
# factor plot for person
sns.factorplot("Pclass", data=titanic_df, hue="Person")
Out[22]:
In [33]:
# hist plot of ages
titanic_df['Age'].hist(bins=70)
Out[33]:
In [34]:
# find the mean age
titanic_df['Age'].mean()
Out[34]:
In [35]:
titanic_df.info()
In [36]:
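# removing a column from the data frame
# (this targets a lowercase 'person' column, not the 'Person' column used below;
# errors='ignore' makes the call a no-op when no such column exists)
titanic_df.drop('person', axis=1, inplace=True, errors='ignore')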
In [38]:
titanic_df.head()
Out[38]:
In [41]:
# to get details of each individual sex in person
titanic_df['Person'].value_counts()
Out[41]:
In [47]:
# Another way to visualize the data is to use FacetGrid to plot multiple kdeplots on one plot
# Set the figure equal to a facetgrid with the pandas
# dataframe as its data source, set the hue, and change the aspect ratio.
fig = sns.FacetGrid(titanic_df, hue='Sex', aspect=4)
# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(1, oldest))
fig.add_legend()
In [48]:
# similarly plotting for person
fig = sns.FacetGrid(titanic_df, hue='Person', aspect=4)
# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(1, oldest))
fig.add_legend()
In [51]:
# similarly plotting for class
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
# Next use map to plot all the possible kdeplots for the 'Age' column by the hue choice
fig.map(sns.kdeplot, 'Age', shade=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(1, oldest))
fig.add_legend()
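To put a number next to the density curves, one option is a per-class summary of age. This is a minimal sketch on a hypothetical `toy_df` with made-up ages; median() skips missing ages by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made up); one age is missing on purpose
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age': [40.0, 50.0, 30.0, np.nan, 20.0, 24.0],
})

# Median age within each passenger class (NaN ages are ignored)
median_age = toy_df.groupby('Pclass')['Age'].median()
print(median_age)
```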
In [53]:
titanic_df.head()
Out[53]:
In [5]:
# First we'll drop the NaN values and create a new object, deck
deck = titanic_df['Cabin'].dropna()
deck.head()
Out[5]:
In [6]:
# We only need the first letter of each cabin (the deck level), so we drop the rest
levels = []
# grabbing the first letter
for i in deck:
    levels.append(i[0])
# Make a cabin dataFrame
cabin_df = DataFrame(levels)
cabin_df.columns = ['Cabins']
sns.factorplot('Cabins', data=cabin_df, palette='winter_d')
Out[6]:
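The loop that grabs the first letter could also be written with the pandas string accessor. Here is a sketch on a hypothetical `cabins` Series, whose values are illustrative and not read from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values (illustrative only), with missing entries as NaN
cabins = pd.Series(['C85', np.nan, 'E46', 'C123', np.nan, 'G6'], name='Cabin')

# Drop missing cabins, then take the first character of each remaining value
deck_letters = cabins.dropna().str[0]

# Number of cabins per deck letter
deck_counts = deck_letters.value_counts()
print(deck_counts)
```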
In [77]:
# Redefine cabin_df as everything but where the row was equal to 'T'
cabin_df = cabin_df[cabin_df.Cabins != 'T']
sns.factorplot('Cabins', data=cabin_df, palette='summer')
Out[77]:
Now that we've analyzed the distribution by decks, let's go ahead and answer our third question:
3.) Where did the passengers come from?
In [87]:
# Factor plot of where people came from.
''' The plot shows how many passengers embarked at each port, split by class. '''
# by using x_order we can leave out the NaN bar
sns.factorplot('Embarked', data=titanic_df, x_order=['C', 'Q', 'S'], hue='Pclass', aspect=2)
Out[87]:
An interesting find here is that almost all the passengers who boarded at Queenstown were 3rd class. It would be interesting to look at the economics of that town in that time period for further investigation.
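One way to check a claim like this numerically is to look at the share of each class per port. The sketch below uses pd.crosstab with normalize='index' on a hypothetical `toy_df`; the rows are made up purely to show the mechanics and say nothing about the real passenger list.

```python
import pandas as pd

# Hypothetical toy data (made up), not the real embarkation records
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q', 'Q', 'Q'],
    'Pclass': [1, 3, 3, 1, 2, 3, 3, 3],
})

# normalize='index' turns each row (port) into fractions that sum to 1
class_share = pd.crosstab(toy_df['Embarked'], toy_df['Pclass'], normalize='index')
print(class_share)
```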
Now let's take a look at the 4th question:
4.) Who was alone and who was with family?
In [92]:
# Adding an 'Alone' column: a passenger is alone when SibSp + Parch is 0
family_size = titanic_df['SibSp'] + titanic_df['Parch']
titanic_df['Alone'] = np.where(family_size > 0, 'With family', 'Alone')
In [93]:
titanic_df[0:10]
Out[93]:
In [96]:
# plotting
sns.factorplot('Alone', data=titanic_df, palette='Blues', hue='Sex')
Out[96]:
Now that we've thoroughly analyzed the data, let's go ahead and take a look at the most interesting (and open-ended) question: What factors helped someone survive the sinking?
In [99]:
# Making a survivor column from the Survived values
titanic_df['survivor'] = titanic_df.Survived.map({0: 'No', 1:"Yes"})
titanic_df[0:10]
Out[99]:
In [101]:
# survival of men and women
sns.factorplot('survivor', data=titanic_df, palette='Reds', hue='Sex')
Out[101]:
In [109]:
# survival based on class
#sns.factorplot('survivor', data=titanic_df, palette='Reds', hue='Pclass')
sns.factorplot('Pclass', 'Survived', data=titanic_df)
Out[109]:
In [108]:
sns.factorplot('Survived', data=titanic_df, palette='Reds', hue='Sex')
Out[108]:
In [110]:
# unfavourable conditions for survival
sns.factorplot('Pclass', 'Survived', data=titanic_df, hue='Person')
Out[110]:
From this data it looks like being male or being in 3rd class were both unfavourable for survival. Regardless of class, being male dramatically decreases your chances of survival.
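Because Survived is coded 0/1, its mean within a group is that group's survival rate, so the plot can be backed by a table. Here is a sketch with pivot_table on a hypothetical `toy_df`; the rows are made up and the resulting rates are not the real ones.

```python
import pandas as pd

# Hypothetical toy data (made up); the real frame is titanic_df
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3, 3],
    'Person': ['male', 'female', 'male', 'male', 'male', 'female', 'child'],
    'Survived': [0, 1, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column = fraction of survivors in each (class, person) cell
survival_rate = toy_df.pivot_table(index='Pclass', columns='Person',
                                   values='Survived', aggfunc='mean')
print(survival_rate)
```

Cells with no passengers (here, children in 1st class) come out as NaN rather than 0.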
But what about age? Did being younger or older have an effect on survival rate?
In [111]:
sns.lmplot('Age', 'Survived', data=titanic_df)
Out[111]:
Looks like there is a general trend that the older the passenger was, the less likely they were to survive. Let's go ahead and use hue to take a look at the effect of class and age.
In [112]:
# Let's use a linear plot of age versus survival, using hue for class separation
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter')
Out[112]:
We can also use the x_bins argument to clean up this figure: it groups the data into age bins and plots each bin's average survival with an error bar attached!
In [114]:
# cleaning up the plot
generations = [10, 20, 40, 60, 80]
sns.lmplot('Age','Survived',hue='Pclass',data=titanic_df,palette='winter', x_bins=generations)
Out[114]:
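A rough tabular analogue of the binned plot is to cut ages into bands and average Survived per band. This is only a sketch: seaborn treats the values passed to x_bins as bin centers, while pd.cut takes bin edges, so the `age_edges` below are an assumption, and `toy_df` is made-up data.

```python
import pandas as pd

# Hypothetical toy data (made up)
toy_df = pd.DataFrame({
    'Age': [4.0, 8.0, 25.0, 35.0, 55.0, 70.0],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Assumed bin edges (not the bin centers used in the plot above)
age_edges = [0, 20, 40, 60, 80]
toy_df['AgeBand'] = pd.cut(toy_df['Age'], bins=age_edges)

# Survival rate within each age band
band_rate = toy_df.groupby('AgeBand', observed=False)['Survived'].mean()
print(band_rate)
```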
What about if we relate gender and age with the survival set?
In [115]:
sns.lmplot('Age','Survived',hue='Sex',data=titanic_df,palette='winter', x_bins=generations)
Out[115]:
1.) Did the deck have an effect on the passengers' survival rate? Did this answer match up with your intuition?
In [168]:
# extracting the deck letter from each cabin value
deck = []
for i in titanic_df['Cabin']:
    # missing cabins are stored as NaN, which pd.isnull detects
    if pd.isnull(i):
        deck.append(0)
    else:
        deck.append(str(i)[0])
titanic_df['Deck'] = deck
sns.factorplot('Deck', x_order=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'], data=titanic_df, hue='Survived')
Out[168]:
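Counts per deck can be hard to compare when the decks differ a lot in size, so a survival rate per deck may help. Below is a minimal sketch on a hypothetical `toy_df` with made-up rows, where unknown decks are stored as missing values; groupby leaves missing keys out by default.

```python
import pandas as pd

# Hypothetical toy data (made up); None marks a passenger with no known deck
toy_df = pd.DataFrame({
    'Deck': ['C', None, 'E', 'C', None, 'B'],
    'Survived': [1, 0, 1, 0, 0, 1],
})

# Fraction of survivors per deck; rows with a missing deck are skipped
deck_rate = toy_df.groupby('Deck')['Survived'].mean()
print(deck_rate)
```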
2.) Did having a family member increase the odds of surviving the sinking?
In [123]:
sns.factorplot('Survived', data=titanic_df, hue='Alone')
Out[123]:
In [125]:
titanic_df['Alone'].value_counts()
Out[125]:
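The group sizes above say nothing about the odds by themselves, so it may be worth comparing survival rates directly. Here is a sketch on a hypothetical `toy_df` with made-up rows; it rebuilds an 'Alone' label from SibSp and Parch and then averages Survived per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made up); the real frame is titanic_df
toy_df = pd.DataFrame({
    'SibSp': [0, 1, 0, 2, 0],
    'Parch': [0, 0, 0, 1, 0],
    'Survived': [0, 1, 1, 1, 0],
})

# A passenger is alone when they have no siblings/spouse and no parents/children aboard
family_size = toy_df['SibSp'] + toy_df['Parch']
toy_df['Alone'] = np.where(family_size > 0, 'With family', 'Alone')

# Survival rate for each group
alone_rate = toy_df.groupby('Alone')['Survived'].mean()
print(alone_rate)
```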
In [130]:
from IPython.display import Image
Image(url='http://i.imgur.com/DGNjT.gif')
Out[130]: