Kaggle has a nice dataset with information about passengers on the Titanic. It's meant as an introduction to predictive models -- here, predicting who survived the sinking. Let's explore it using seaborn. This notebook mostly demonstrates features in development for version 0.3. Please get in touch if you have ideas for how they could be improved.
In [11]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="white")
First we load in the data and take a look.
In [3]:
url = "https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv"
titanic = pd.read_csv(url)
titanic.info()
In [4]:
titanic.head()
Out[4]:
Let's do a little bit of processing to create some variables that might be more interesting to plot. Since this notebook is focused on visualization, we're going to do this without much comment.
In [5]:
def woman_child_or_man(passenger):
    age, sex = passenger
    if age < 16:
        return "child"
    else:
        return dict(male="man", female="woman")[sex]
In [6]:
titanic["class"] = titanic.pclass.map({1: "First", 2: "Second", 3: "Third"})
titanic["who"] = titanic[["age", "sex"]].apply(woman_child_or_man, axis=1)
titanic["adult_male"] = titanic.who == "man"
titanic["deck"] = titanic.cabin.str[0].map(lambda s: np.nan if s == "T" else s)
titanic["embark_town"] = titanic.embarked.map({"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"})
titanic["alive"] = titanic.survived.map({0: "no", 1: "yes"})
titanic["alone"] = ~(titanic.parch + titanic.sibsp).astype(bool)
titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
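To make the derived columns concrete, here is a minimal sketch that rebuilds three of them on a made-up three-row frame. The rows are invented for illustration and are not taken from the Kaggle file, and the helper name who is hypothetical. Note in particular what happens to a passenger with a missing age.

```python
import numpy as np
import pandas as pd

# Three invented passengers; the values are illustrative, not real records.
toy = pd.DataFrame({
    "age": [8.0, 35.0, np.nan],
    "sex": ["male", "female", "male"],
    "parch": [1, 0, 0],
    "sibsp": [1, 0, 0],
    "cabin": ["C85", None, "T1"],
})

def who(age, sex):
    # Under-16s are children. A missing age compares False against 16,
    # so that passenger falls through to the adult label.
    return "child" if age < 16 else {"male": "man", "female": "woman"}[sex]

toy["who"] = [who(a, s) for a, s in zip(toy["age"], toy["sex"])]

# Alone means no parents/children and no siblings/spouses on board.
toy["alone"] = (toy["parch"] + toy["sibsp"]) == 0

# Deck is the first letter of the cabin code, with "T" treated as missing.
first_letter = toy["cabin"].str[0]
toy["deck"] = first_letter.where(first_letter != "T")

print(toy[["who", "alone", "deck"]])
```

Because a missing age is never less than 16, passengers without a recorded age end up labeled "man" or "woman", which is worth keeping in mind when reading the plots that use the who variable.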
In [7]:
titanic.head()
Out[7]:
Finally set up a palette dictionary for some of the plots.
In [8]:
pal = dict(man="#4682B4", woman="#CD5C5C", child="#2E8B57", male="#6495ED", female="#F08080")
Before getting to the main question (who survived), let's take a look at the dataset to get a sense for how the observations are distributed into the different levels of our factors of interest.
First let's count the number of males and females, ignoring age.
In [8]:
sns.factorplot("sex", data=titanic, palette=pal);
Then we can look at how this is distributed into the three classes.
In [9]:
sns.factorplot("class", data=titanic, hue="sex", palette=pal);
We also have a separate classification that splits off children (recall, this is going to be relevant because of the "women and children first" policy followed during the evacuation).
In [10]:
sns.factorplot("who", data=titanic, palette=pal);
In [11]:
sns.factorplot("class", data=titanic, hue="who", palette=pal);
Finally, we made a variable that indicates whether a passenger was an adult male.
In [12]:
sns.factorplot("adult_male", data=titanic, palette="Blues");
In [12]:
sns.factorplot("class", data=titanic, hue="adult_male", palette="Blues");
Next let's look at the distribution of ages within the groups we defined above.
In [13]:
fg = sns.FacetGrid(titanic, hue="sex", aspect=3, palette=pal)
fg.map(sns.kdeplot, "age", shade=True)
fg.set(xlim=(0, 80));
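A kernel density estimate places a small smooth bump on each observation and averages the bumps. The following is a minimal NumPy sketch of that idea on made-up ages, assuming a Gaussian kernel and a hand-picked bandwidth of 5 years. The plotting function chooses its own bandwidth, so its curves will not match this one exactly.

```python
import numpy as np

# Invented ages; the bandwidth of 5 years is an arbitrary assumption.
ages = np.array([2.0, 18.0, 22.0, 24.0, 30.0, 36.0, 54.0])
bandwidth = 5.0
grid = np.linspace(-40, 120, 801)

# Average one Gaussian bump per observation, evaluated on the grid.
z = (grid[:, None] - ages[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(ages) * bandwidth * np.sqrt(2 * np.pi))

# Over a wide enough grid the curve should integrate to roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3), grid[density.argmax()])
```

The density spills slightly below zero years because the kernel is symmetric, which is one reason the cells above restrict the x axis with xlim.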
In [14]:
fg = sns.FacetGrid(titanic, hue="who", aspect=3, palette=pal)
fg.map(sns.kdeplot, "age", shade=True)
fg.set(xlim=(0, 80));
Although we have some information about the distribution into classes from the sex plots, let's directly visualize it and then see how the classes break down by age.
In [15]:
sns.factorplot("class", data=titanic, palette="BuPu_d");
In [16]:
fg = sns.FacetGrid(titanic, hue="class", aspect=3, palette="BuPu_d")
fg.map(sns.kdeplot, "age", shade=True)
fg.set(xlim=(0, 80));
Finally let's look at the breakdown by age and sex.
In [17]:
fg = sns.FacetGrid(titanic, col="sex", row="class", hue="sex", size=2.5, aspect=2.5, palette=pal)
fg.map(sns.kdeplot, "age", shade=True)
fg.map(sns.rugplot, "age")
sns.despine(left=True)
fg.set(xlim=(0, 80));
We also have information about what deck each passenger's cabin was on, which may be relevant.
In [19]:
sns.factorplot("deck", data=titanic, palette="PuBu_d");
How did the decks break down by class for the passengers we have data about?
In [20]:
sns.factorplot("deck", hue="class", data=titanic, palette="BuPu_d");
Note that we're missing a lot of deck data for the second and third class passengers, which will be important to keep in mind later.
Since we have data about fares, let's see how those broke down by classes.
In [21]:
from seaborn import linearmodels
reload(linearmodels)
reload(sns)
sns.set(style="nogrid")
In [22]:
sns.factorplot("class", "fare", data=titanic, palette="BuPu_d");
In [23]:
sns.violinplot(titanic["fare"], titanic["class"], color="BuPu_d").set_ylim(0, 600)
sns.despine(left=True);
There are some extreme outliers in the first class distribution; let's winsorize those to get a better sense for how much each class paid.
In [24]:
titanic["fare_winsor"] = titanic.fare.map(lambda f: min(f, 200))
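Strictly speaking, the cell above caps fares at a fixed value of 200, whereas winsorizing usually means capping at a percentile of the data. Here is a minimal sketch of both variants on made-up fares. The 90th-percentile cutoff is an arbitrary choice for illustration.

```python
import numpy as np

# Invented fares with one extreme value; not taken from the dataset.
fares = np.array([7.25, 8.05, 13.0, 26.55, 71.28, 512.33])

# Fixed cap, as in the notebook: everything above 200 becomes 200.
capped = np.minimum(fares, 200)

# Percentile-based cap (90th percentile assumed here, chosen arbitrarily).
upper = np.percentile(fares, 90)
winsorized = np.minimum(fares, upper)

print(capped)
print(winsorized)
```

Either way, only the values above the cap change, so the bulk of the distribution in the violin plot is unaffected.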
In [25]:
sns.violinplot(titanic["fare_winsor"], titanic["class"], color="BuPu_d").set_ylim(0, 250)
sns.despine(left=True);
How did the fares break down by deck? Let's look both at the mean and the distribution.
In [26]:
sns.factorplot("deck", "fare", data=titanic, palette="PuBu_d");
In [27]:
sns.violinplot(titanic["fare_winsor"], titanic["deck"], color="PuBu_d")
sns.despine(left=True);
It might make more sense to plot the median fare, since the distributions aren't normal.
In [28]:
sns.factorplot("deck", "fare", data=titanic, palette="PuBu_d", estimator=np.median);
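To see why the choice of estimator matters for skewed data, consider a handful of made-up fares where a single ticket is far more expensive than the rest. This sketch uses only the standard library.

```python
from statistics import mean, median

# Invented fares for one deck: mostly modest, with one very expensive ticket.
fares = [26.0, 30.0, 30.5, 35.5, 512.0]

# The mean is pulled upward by the single extreme fare.
print(mean(fares))

# The median is the middle value and ignores how extreme the largest fare is.
print(median(fares))
```

Here the mean is larger than four of the five fares, so it describes no typical passenger, while the median sits in the middle of the cluster.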
We can also look at a regression of fare on age to see if older passengers paid more. We'll use robust methods here too, which will account for the skewed distribution of fares.
In [29]:
sns.regplot("age", "fare", data=titanic, robust=True, ci=None, color="seagreen")
sns.despine();
The Titanic passengers embarked at one of three ports before the voyage.
In [30]:
sns.factorplot("class", data=titanic, hue="embark_town", palette="Set2");
We also have some data, although it's not coded very well, about the number of parents/children and the number of siblings/spouses on board for each passenger.
In [31]:
sns.factorplot("class", data=titanic, hue="parch", palette="BuGn");
In [32]:
sns.factorplot("class", data=titanic, hue="sibsp", palette="YlGn");
We defined a variable that just measures whether someone was traveling alone, i.e. without family.
In [33]:
sns.factorplot("alone", data=titanic, palette="Greens");
Now that we have a feel for the characteristics of our sample, let's get down to the main question and ask what factors seem to predict whether our passengers survived. But first, one more count plot just to see how many of our passengers perished in the sinking.
In [34]:
sns.factorplot("alive", data=titanic, palette="OrRd_d");
It's part of popular lore that the third-class (or steerage) passengers fared much more poorly than their wealthier shipmates. Is this borne out in the data?
In [35]:
sns.factorplot("class", "survived", data=titanic).set(ylim=(0, 1))
Out[35]:
We also of course know that women were given high priority during the evacuation, and we saw above that Third class was disproportionately male. Maybe that's driving the class effect?
In [36]:
sns.factorplot("class", "survived", data=titanic, hue="sex", palette=pal).set(ylim=(0, 1));
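Since survived is coded as 0 or 1, the mean within each group is the proportion of that group that survived. The same numbers can be tabulated directly with a groupby. This is a minimal sketch on eight invented passengers, not on the real data.

```python
import pandas as pd

# Eight invented passengers; not real records.
toy = pd.DataFrame({
    "class":    ["First", "First", "First", "First", "Third", "Third", "Third", "Third"],
    "sex":      ["female", "female", "male", "male", "female", "female", "male", "male"],
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the proportion of survivors in each cell.
rates = toy.groupby(["class", "sex"])["survived"].mean().unstack("sex")
print(rates)
```

A table like this is a useful companion to the plot when we want exact proportions rather than a visual comparison.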
Nope, in general it was not good to be a male or to be in steerage.
Were they at least successful in evacuating the children?
In [37]:
fg = sns.factorplot("class", "survived", data=titanic, hue="who", col="who", palette=pal, aspect=.4)
fg.set(ylim=(0, 1))
fg.despine(left=True)
Out[37]:
Pretty good for first and second class, although the precise estimates are unreliable because there weren't that many children traveling in the upper classes. (It's actually the case that every second-class child survived, though.)
We suspect that the best way to predict survival is to look at whether a passenger was an adult male and what class he or she was in.
In [38]:
sns.factorplot("class", "survived", data=titanic, hue="adult_male", palette="Blues").set(ylim=(0, 1))
Out[38]:
Another way to plot the same data emphasizes the different outcomes for men and other passengers even more dramatically.
In [39]:
fg = sns.factorplot("adult_male", "survived", data=titanic, col="class", hue="class",
                    aspect=.33, palette="BuPu_d")
fg.set(ylim=(0, 1))
fg.despine(left=True);
We can also ask whether age as a continuous variable mattered. We'll draw logistic regression plots, first jittering the survival datapoints to get a sense of the distribution.
In [40]:
sns.lmplot("age", "survived", titanic, logistic=True, y_jitter=.05);
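For readers who want to see what a logistic fit of survival on age involves, here is a minimal sketch written with plain NumPy Newton steps on invented data. It illustrates the kind of model being drawn and is not the fitting routine the plotting function itself uses. Rescaling age to decades is just a convenience for this example.

```python
import numpy as np

# Invented data: younger passengers survive more often here by construction.
age = np.array([4, 9, 15, 22, 28, 35, 41, 50, 58, 66], dtype=float)
survived = np.array([1, 1, 1, 0, 1, 0, 1, 0, 0, 0], dtype=float)

# Design matrix with an intercept column; age is expressed in decades.
X = np.column_stack([np.ones_like(age), age / 10])
beta = np.zeros(2)

# Newton-Raphson updates for the logistic log-likelihood.
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (survived - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta = beta + np.linalg.solve(hessian, gradient)

# Fitted survival probability for each passenger, always strictly between 0 and 1.
predicted = 1 / (1 + np.exp(-X @ beta))
print(beta.round(3), predicted.round(2))
```

The fitted curve is bounded between 0 and 1, which is why a logistic model suits a binary outcome better than a straight line would.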
We can also plot the same data with the survival observations grouped into discrete bins.
In [41]:
sns.lmplot("age", "survived", titanic, logistic=True, x_bins=4, truncate=True);
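Binning replaces the cloud of 0/1 points with one survival proportion per age bin. Here is a minimal sketch of one way to do this, assuming bin centers at evenly spaced percentiles and assigning each passenger to the nearest center. The exact rule used by the plotting function may differ, and the data are invented.

```python
import numpy as np

# Invented ages and outcomes for illustration only.
age = np.array([2, 5, 19, 22, 24, 30, 33, 41, 47, 58, 63, 71], dtype=float)
survived = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

n_bins = 4
# Bin centers at evenly spaced interior percentiles (20th, 40th, 60th, 80th for 4 bins).
centers = np.percentile(age, np.linspace(0, 100, n_bins + 2)[1:-1])

# Assign every passenger to the nearest center.
nearest = np.abs(age[:, None] - centers[None, :]).argmin(axis=1)

# One summary point per bin: the proportion of survivors assigned to it.
for i, center in enumerate(centers):
    print(round(center, 1), survived[nearest == i].mean())
```

Passing an explicit list of bin centers, as some of the later cells do, corresponds to replacing the percentile step with fixed values.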
We know that sex is important, though, so we probably want to separate out these predictions for men and women.
In [42]:
age_bins = [15, 30, 45, 60]
sns.lmplot("age", "survived", titanic, hue="sex",
           palette=pal, x_bins=age_bins, logistic=True).set(xlim=(0, 80));
Class is important too, so let's see whether it interacts with the age variable as well.
In [43]:
sns.lmplot("age", "survived", titanic, hue="class",
           palette="BuPu_d", x_bins=age_bins, logistic=True).set(xlim=(0, 80));
Because the above plot is rather busy, it might make sense to split the three classes onto separate facets.
In [44]:
sns.lmplot("age", "survived", titanic, col="class", hue="class",
           palette="BuPu_d", x_bins=4, logistic=True, size=3).set(xlim=(0, 80));
We know that class matters, but we can also use the fare variable as a proxy for a continuous measure of wealth.
In [45]:
sns.lmplot("fare_winsor", "survived", titanic, x_bins=4, logistic=True, truncate=True);
Perhaps it mattered what deck each passenger's cabin was on?
In [46]:
sns.factorplot("deck", "survived", data=titanic, palette="PuBu_d", join=False);
In [47]:
sns.factorplot("deck", "survived", data=titanic, col="class", size=3, palette="PuBu_d", join=False);
Because of the way our data on family members was coded, we don't know for sure what sort of companions these passengers had, but it's worth asking how they influenced survival.
In [48]:
sns.lmplot("parch", "survived", titanic, x_estimator=np.mean, logistic=True);
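Because parch takes only a few distinct values, it helps to collapse the observations at each value into a single summary point, here the mean outcome. This is a minimal standard-library sketch of that collapsing step on invented pairs.

```python
from collections import defaultdict
from statistics import mean

# Invented (parch, survived) pairs for illustration.
records = [(0, 0), (0, 1), (0, 0), (0, 0), (1, 1), (1, 0), (2, 1), (2, 1)]

# Collect the outcomes observed at each distinct parch value.
by_value = defaultdict(list)
for parch, survived in records:
    by_value[parch].append(survived)

# One point per distinct parch value: the mean outcome at that value.
points = {parch: mean(values) for parch, values in sorted(by_value.items())}
print(points)
```

Values with only a couple of observations, like the largest family sizes, produce noisy points, so they deserve less weight when reading the plot.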
In [49]:
sns.lmplot("parch", "survived", titanic, hue="sex", x_estimator=np.mean, logistic=True, palette=pal);
In [50]:
sns.lmplot("sibsp", "survived", titanic, x_estimator=np.mean, logistic=True);
We also have a more interpretable "alone" variable (although it's reasonable to assume that this is going to be confounded with age).
In [51]:
sns.factorplot("alone", "survived", data=titanic).set(ylim=(0, 1));
Did traveling alone have a greater effect depending on what class you were in?
In [52]:
fg = sns.factorplot("alone", "survived", data=titanic, col="class", hue="class",
                    aspect=.33, palette="BuPu_d")
fg.set(ylim=(0, 1))
fg.despine(left=True);
As above, a different presentation of the same data emphasizes different comparisons.
In [53]:
sns.factorplot("class", "survived", data=titanic, hue="alone", palette="Greens").set(ylim=(0, 1));
What about men and women who were traveling alone?
In [54]:
sns.factorplot("alone", "survived", data=titanic, hue="sex", palette=pal).set(ylim=(0, 1));
In [55]:
fg = sns.factorplot("alone", "survived", data=titanic, hue="sex",
                    col="class", palette=pal, aspect=.33)
fg.despine(left=True);