In this lab, we will perform a data analysis on the RMS Titanic passenger list. The RMS Titanic is one of the most famous ocean liners in history. On April 15, 1912 it sank after colliding with an iceberg in the North Atlantic Ocean. To learn more, read here: https://en.wikipedia.org/wiki/RMS_Titanic
Our goal today is to perform a data analysis on a subset of the passenger list. We're looking for insights as to which types of passengers did and didn't survive. Women? Children? 1st Class Passengers? 3rd class? Etc.
I'm sure you've heard the expression often said during emergencies: "Women and children first." Let's explore this data set and find out if that's true!
Before we begin, you should read up on what each of the columns means in the data dictionary. You can find this information on this page: https://www.kaggle.com/c/titanic/data
First we load the dataset into a Pandas DataFrame variable. The sample(10) method takes a random sample of 10 passengers from the data set.
In [ ]:
import pandas as pd
import numpy as np
# this turns off warning messages
import warnings
warnings.filterwarnings('ignore')
passengers = pd.read_csv('CCL-titanic.csv')
passengers.sample(10)
In [ ]:
passengers['Survived'].sample(10)
There are too many passengers to display, so we show just a random sample of 10.
What we really want is to count the number of survivors and deaths. We do this by calling the value_counts() method on the ['Survived'] column, which returns a Series of counts, like this:
In [ ]:
passengers['Survived'].value_counts()
Only 342 passengers survived, and 549 perished. Let's view the same data as percentages of the whole. We do this by adding the normalize=True named argument to the value_counts() method.
In [ ]:
passengers['Survived'].value_counts(normalize=True)
Just 38% of passengers in this dataset survived.
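To see the difference between plain counts and normalized counts in isolation, here is a minimal sketch on a tiny made-up Series. The name `answers` and its values are illustrative only and are not part of the Titanic data.

```python
import pandas as pd

# Made-up answers, not Titanic data
answers = pd.Series(['yes', 'no', 'yes', 'yes'], name='Answer')

counts = answers.value_counts()                 # how many of each value
shares = answers.value_counts(normalize=True)   # fraction of the whole

print(counts)
print(shares)
```

The normalized values are fractions between 0 and 1 that add up to 1, so a value of 0.75 reads as 75%.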
In [ ]:
# todo write code here
NEXT: Write a Pandas expression to display male/female passenger counts as a percentage of the total number of passengers in the data set.
In [ ]:
# todo write code here
If you got things working, you now know that 35% of the passengers were female and 65% were male.
The next thing to think about is how these numbers change when we look only at survivors.
If the ratio is about the same among survivors, then we can conclude that sex did not play a role in survival on the RMS Titanic.
Let's find out.
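Finding out means keeping only the rows where a condition is true. As a rough sketch of how that kind of filtering works, here is a made-up three-row table. The names `pets`, `is_dog`, and `dogs` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up rows, not Titanic data
pets = pd.DataFrame({
    'Name': ['Rex', 'Tom', 'Fido'],
    'Kind': ['dog', 'cat', 'dog'],
})

is_dog = pets['Kind'] == 'dog'   # Series of True / False, one value per row
dogs = pets[is_dog]              # keeps only the rows marked True

print(is_dog.tolist())
print(len(dogs))
```

The comparison produces one True or False per row, and indexing the DataFrame with that Series drops every row marked False.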
In [ ]:
survivors = passengers[passengers['Survived'] == 1]
survivors['PassengerId'].count()
Still 342 like we discovered originally. Now let's check the Sex split among survivors only:
In [ ]:
survivors['Sex'].value_counts()
WOW! That is a huge difference! But you probably can't see it easily. Let's represent it in a DataFrame, so that it's easier to visualize:
In [ ]:
sex_all_series = passengers['Sex'].value_counts()
sex_survivor_series = survivors['Sex'].value_counts()
sex_comparision_df = pd.DataFrame({ 'AllPassengers' : sex_all_series, 'Survivors' : sex_survivor_series })
sex_comparision_df['SexSurvivialRate'] = sex_comparision_df['Survivors'] / sex_comparision_df['AllPassengers']
sex_comparision_df
So, females had a 74% survival rate, much better than the overall rate of 38%.
We should probably briefly explain the code above.
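The code above relies on two ideas: a dictionary of Series becomes a DataFrame with one column per Series, and dividing one column by another works row by row. Here is a minimal sketch of both on made-up counts. The names `all_counts`, `kept_counts`, and `comparison` are illustrative only.

```python
import pandas as pd

# Made-up counts, not Titanic data
all_counts = pd.Series({'red': 10, 'blue': 4})
kept_counts = pd.Series({'blue': 3, 'red': 5})   # same labels, different order

# Each Series becomes a column; rows are matched by index label, not position
comparison = pd.DataFrame({'All': all_counts, 'Kept': kept_counts})

# Dividing two columns works row by row and yields a new column
comparison['KeptRate'] = comparison['Kept'] / comparison['All']
print(comparison)
```

Because rows are matched by label, the two value_counts() results line up correctly even when they list their categories in a different order.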
Sometimes the variable we want to analyze is not readily available, but can be created from existing data. This is commonly referred to as feature engineering. The name comes from machine learning where we use data called features to predict an outcome.
Let's create a new feature called 'AgeCat' as follows: passengers aged 18 or younger are labeled 'Child', and passengers older than 18 are labeled 'Adult'.
This is easy to do in pandas. First we create the column and set all values to np.nan, which means 'Not a Number'. This is Pandas' way of saying no value. Then we set the values based on the rules we set for the feature.
In [ ]:
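passengers['AgeCat'] = np.nan # Not a number: no value yet
passengers['AgeCat'] = passengers['AgeCat'].astype(object) # let the column hold text labels
passengers.loc[passengers['Age'] <= 18, 'AgeCat'] = 'Child' # .loc avoids chained assignment
passengers.loc[passengers['Age'] > 18, 'AgeCat'] = 'Adult'
passengers.sample(5)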
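One detail of these rules is worth a closer look: if a passenger has no recorded Age (an assumption about this file, so check your own output), neither rule matches and AgeCat stays empty. The sketch below shows this on made-up ages. The name `people` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ages, not Titanic data; the third age is missing
people = pd.DataFrame({'Age': [5.0, 40.0, np.nan]})

people['AgeCat'] = pd.Series(np.nan, index=people.index, dtype=object)
people.loc[people['Age'] <= 18, 'AgeCat'] = 'Child'
people.loc[people['Age'] > 18, 'AgeCat'] = 'Adult'

# NaN <= 18 and NaN > 18 are both False, so the third row keeps no label
print(people['AgeCat'].value_counts())               # missing row is left out
print(people['AgeCat'].value_counts(dropna=False))   # missing row is counted
```

By default value_counts() leaves out missing values, so any counts or percentages of AgeCat describe only the rows that received a label.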
Let's get the counts and distribution of Adults and Children on the passenger list.
In [ ]:
passengers['AgeCat'].value_counts()
And here's the percentage as a whole:
In [ ]:
passengers['AgeCat'].value_counts(normalize=True)
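So close to 80% of the passengers were adults. Keep in mind that value_counts() skips missing values, so any passenger without an AgeCat is not part of this percentage. Once again, let's look at the ratio of AgeCat for survivors only. If your age has no bearing on survival, then the rates should be the same.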
Here are the counts of Adult / Children among the survivors only:
In [ ]:
survivors = passengers[passengers['Survived'] == 1]
survivors['AgeCat'].value_counts()
In [ ]:
agecat_all_series = passengers['AgeCat'].value_counts()
agecat_survivor_series = survivors['AgeCat'].value_counts()
# todo make a data frame, add AgeCatSurvivialRate column, display dataframe
So, children had a 50% survival rate, better than the overall rate of 38%.
It looks like the RMS Titanic really did have the motto: "Women and Children First."
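Here are our insights. We know that 38% of all passengers survived, that females had a 74% survival rate, and that children had a 50% survival rate.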
Repeat this process for Pclass, the passenger class variable. Display the survival rates for each passenger class. What does the information tell you about passenger class and survival rates?
I'll give you a hint... "Money Talks"
In [ ]:
# todo: repeat the analysis in the previous cell for Pclass
Please answer the following questions. This should be a personal narrative, in your own voice. Answer the questions by double-clicking on the question and placing your answer next to the Answer: prompt.
Answer:
Answer:
Answer:
1 ==> I can do this on my own and explain how to do it.
2 ==> I can do this on my own without any help.
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.
Answer:
In [ ]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()