Touring Seaborn with Titanic

In this lab, we will use a familiar dataset to explore the use of visualizations in feature analysis and selection.

The objective of this lab is to work through some of the visualization capabilities available in Seaborn. For a more thorough investigation, you are encouraged to do the full tutorial linked below. Seaborn is a high-level plotting library built on top of matplotlib. It integrates with pandas DataFrames and provides concise functions for common statistical plots, which simplifies the process of visualizing data.

Some of the features that Seaborn offers are:

  • Several built-in themes that improve on the default matplotlib aesthetics
  • Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
  • Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
  • Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
  • Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
  • A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
  • High-level abstractions for structuring grids of plots that let you easily build complex visualizations
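To make the relationship between Seaborn, matplotlib, and pandas concrete, here is a minimal sketch. The toy DataFrame and its column names (`group`, `value`) are invented for illustration and are not part of the Titanic data. It assumes an axes-level function such as `sns.boxplot`, which takes column names as strings and returns a regular matplotlib Axes.

```python
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# Column names are passed as strings; Seaborn looks them up in the DataFrame.
ax = sns.boxplot(x="group", y="value", data=toy)

# The return value is a matplotlib Axes, so the usual matplotlib methods apply.
ax.set_title("Value by group")
ax.set_ylabel("Value")
is_matplotlib_axes = isinstance(ax, matplotlib.axes.Axes)
print(is_matplotlib_axes)
```

Because the result is an ordinary Axes, anything you already know about customizing matplotlib plots should carry over.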

We are going to look at three useful functions in Seaborn: factorplot, pairplot, and jointplot.

Before running the code in this lab, articulate to your partner what you expect the visualization to look like. Look at the code and the Seaborn documentation to figure out what data is being plotted and what the type of plot may look like.

Sources:

Previous Titanic work: https://github.com/rebeccabilbro/titanic

Seaborn Tutorial: https://stanford.edu/~mwaskom/software/seaborn/tutorial.html


In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [ ]:
%matplotlib inline
pd.set_option('display.max_columns', 500)

Like scikit-learn, Seaborn has "toy" datasets available to import for exploration. This includes the Titanic data we have previously looked at. Let's load the Seaborn Titanic dataset and take a look.

(https://github.com/mwaskom/seaborn-data shows the datasets available to load via this method in Seaborn.)


In [ ]:
df = sns.load_dataset('titanic')

In [ ]:
# Write the code to look at the head of the dataframe

As you can see, the data has been cleaned up a bit.

We performed some rudimentary visualization for exploratory data analysis previously. For example, we created a histogram using matplotlib to look at the age distribution of passengers.


In [ ]:
# Create a histogram to examine age distribution of the passengers.

fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(df['age'].dropna(), bins=10)  # drop missing ages before binning
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Count of Passengers')
plt.show()

Factorplot

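Our prior work with the Titanic data focused on the available numeric data. Factorplot gives us an easy method to explore some of the categorical data as well. Factorplots allow us to look at a parameter's distribution in bins defined by another parameter. (Note: in seaborn 0.9 and later, factorplot is named catplot and its size parameter is named height, so adjust the code in this section if you are on a newer version.)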

For example, we can look at the survival rate based on the deck a passenger's cabin was on.
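As a rough illustration of "a parameter's distribution in bins defined by another parameter", the sketch below tabulates a small made-up table with pandas. The column names `section` and `outcome` are hypothetical stand-ins, not Titanic columns. A count-style categorical plot draws numbers like these as bars, one panel or group per bin.

```python
import pandas as pd

# Hypothetical toy data: each row is one observation.
toy = pd.DataFrame({
    "section": ["A", "A", "A", "B", "B", "C"],
    "outcome": ["yes", "no", "yes", "no", "no", "yes"],
})

# Count how many rows fall in each (section, outcome) combination.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(toy["section"], toy["outcome"])
print(counts)
```

Each row of the table corresponds to one bin, and each column to one bar within that bin.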

Remember: take a look at the documentation first (https://stanford.edu/~mwaskom/software/seaborn/index.html) and figure out what the code is doing. Being able to understand documentation will help you a lot in your projects.


In [ ]:
# What is a factorplot? Check the documentation! Which data are we using? What is the count a count of?

g = sns.factorplot(x="alive", col="deck", col_wrap=4,
                   data=df[df.deck.notnull()], kind="count", size=4, aspect=.8)

What other options can you set with a factorplot in Seaborn? Using the code above as a starting point, write a factorplot of the same data in a different configuration. For example, make 2 plots per column, change the colors, add a legend, or change the size.


In [ ]:
# Try your own variation of the factorplot above.

As you saw in the factorplot documentation, you can specify several different types of plots in the parameters. Let's use factorplot to create a nested barplot showing passenger survival based on their class and sex. Fill in the missing pieces of the code below.

The goal is a barplot showing survival probability by class that further shows the sex of the passengers in each class. (Hint: how can you use the hue parameter?)
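It may help to see what the bar heights in such a plot represent. With Seaborn's default estimator (the mean), a bar plot of a 0/1 column shows the fraction of ones in each group, and a hue variable splits each group one level further. The sketch below computes those numbers directly with pandas on made-up data; the column names `team`, `role`, and `passed` are hypothetical and deliberately unrelated to the Titanic features.

```python
import pandas as pd

# Hypothetical toy data with a binary outcome column.
toy = pd.DataFrame({
    "team":   ["x", "x", "x", "x", "y", "y", "y", "y"],
    "role":   ["dev", "dev", "ops", "ops", "dev", "dev", "ops", "ops"],
    "passed": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column = proportion of ones.
# Grouping by two columns mirrors an x variable plus a hue variable.
rates = toy.groupby(["team", "role"])["passed"].mean().unstack("role")
print(rates)
```

In the resulting table, each row plays the part of an x-axis category and each column the part of a hue level.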


In [ ]:
# Draw a nested barplot to show survival for class and sex
g = sns.factorplot(x="CHANGE TO THE CORRECT FEATURE", 
                   y="CHANGE TO THE CORRECT FEATURE", 
                   hue="CHANGE TO THE CORRECT FEATURE", 
                   data=df,
                   size=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("survival probability")

Now let's again plot passenger survival by class and by who the passengers were (man, woman, child), this time drawing a separate plot for each class, as we did above for the deck information. Fill in the missing pieces of the code below.


In [ ]:
g = sns.factorplot(x="CHANGE TO THE CORRECT FEATURE", 
                   y="CHANGE TO THE CORRECT FEATURE", 
                   col="CHANGE TO THE CORRECT FEATURE", 
                   data=df, 
                   saturation=.5, kind="bar", ci=None, aspect=.6)
(g.set_axis_labels("", "Survival Rate")
  .set_xticklabels(["Men", "Women", "Children"])
  .set_titles("{col_name} {col_var}")
  .set(ylim=(0, 1))
  .despine(left=True))

Factorplot supports six different kinds of plots; we explored two of them above. Using the documentation, try out one of the remaining plot types. A suggestion is provided below. You can follow it, create your own visualization, or both.


In [ ]:
# With factorplot, make a violin plot that shows the age of the passengers at each embarkation point 
# based on their class. Use the hue parameter to show the sex of the passengers

Pairplot

In the Wheat Classification notebook, we saw a scatter matrix. A scatter matrix plots each feature against every other feature, and the diagonal shows a density plot of each feature on its own. Seaborn gives us this ability with pairplot. To make a useful pairplot with the Titanic data, let's first fill in the missing ages.


In [ ]:
df.age = df.age.fillna(df.age.mean())
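The cell above replaces missing ages with the mean age. A tiny sketch of what that operation does, using a made-up Series rather than the Titanic column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
ages = pd.Series([20.0, np.nan, 40.0])

# mean() skips missing values by default, so the mean here is 30.0.
mean_age = ages.mean()

# fillna returns a new Series with the gaps replaced by the given value.
filled = ages.fillna(mean_age)
print(filled.tolist())
```

Keep in mind that every imputed row receives the same value, so plots of the filled column may show a cluster of points at the mean.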

In [ ]:
g = sns.pairplot(data=df[['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']], hue='survived', dropna=True)

The Titanic data gives an idea of what we can see with a pairplot, but it might not be the most illustrative example. Using the information provided so far, make a pairplot using the Seaborn car crashes data (available via sns.load_dataset('car_crashes')).


In [ ]:
# Pairplot of the crash data

Jointplot

Like a pairplot, a jointplot shows the relationship between features, but for a single pair: the center panel shows the joint distribution of the two features, and the margins show the individual distribution of each one.


In [ ]:
g = sns.jointplot(x="fare", y="age", data=df)

Using either the Titanic or crash data, create some jointplots.


In [ ]:
# Jointplot, titanic data

In [ ]:
# Jointplot, crash data

Bonus

Use the Titanic data to create a boxplot of the age distribution on each deck by class.

Extra Bonus

Plot the same information using FacetGrid.


In [ ]:
# Boxplot of the age distribution on each deck by class

In [ ]:
# Boxplot of the age distribution on each deck by class using FacetGrid