In this lab, we will use a familiar dataset to explore the use of visualizations in feature analysis and selection.
The objective of this lab is to work through some of the visualization capabilities available in Seaborn. For a more thorough investigation of what Seaborn offers, you are encouraged to do the full tutorial linked below. Seaborn is a high-level API built on top of matplotlib. It integrates with pandas dataframes, which simplifies the process of visualizing data, and it provides simple functions for plotting.
Some of the features that seaborn offers are
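We are going to look at three useful functions in seaborn: factorplot, pairplot, and jointplot. (In seaborn 0.9 and later, factorplot is named catplot and its size parameter is named height.)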
Before running the code in this lab, articulate to your partner what you expect the visualization to look like. Look at the code and the Seaborn documentation to figure out what data is being plotted and what the type of plot may look like.
sources:
Previous Titanic work: https://github.com/rebeccabilbro/titanic
Seaborn Tutorial: https://stanford.edu/~mwaskom/software/seaborn/tutorial.html
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
In [ ]:
%matplotlib inline
pd.set_option('display.max_columns', 500)
Like scikit-learn, Seaborn has "toy" datasets available to import for exploration. This includes the Titanic data we have previously looked at. Let's load the Seaborn Titanic dataset and take a look.
(https://github.com/mwaskom/seaborn-data shows the datasets available to load via this method in Seaborn.)
In [ ]:
df = sns.load_dataset('titanic')
In [ ]:
# Write the code to look at the head of the dataframe
As you can see, the data has been cleaned up a bit.
We previously performed some rudimentary visualization for exploratory data analysis. For example, we created a histogram using matplotlib to look at the age distribution of passengers.
In [ ]:
# Create a histogram to examine age distribution of the passengers.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(df['age'].dropna(), bins=10)  # drop missing ages before binning
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Count of Passengers')
plt.show()
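To see what a histogram actually computes, here is a minimal sketch using `np.histogram` on a handful of made-up ages (hypothetical values, not taken from the Titanic data). It returns the count in each bin and the bin edges, which is the information the bars above display.

```python
import numpy as np

# Hypothetical ages, invented for illustration only.
ages = np.array([2, 15, 22, 24, 31, 38, 45, 52, 67, 80], dtype=float)

# Four equal-width bins covering 0 to 80 years.
counts, edges = np.histogram(ages, bins=4, range=(0, 80))

# Each count is the number of ages that fall between two consecutive edges.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>4.0f} to {right:<4.0f}: {n}")
```

Changing `bins` changes how coarse the picture is, so it is worth trying a few values before drawing conclusions from the shape.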
Our prior work with the Titanic data focused on the available numeric data. Factorplot gives us an easy method to explore some of the categorical data as well. Factorplots allow us to look at a parameter's distribution in bins defined by another parameter.
For example, we can look at the survival rate based on the deck a passenger's cabin was on.
Remember: take a look at the documentation first (https://stanford.edu/~mwaskom/software/seaborn/index.html) and figure out what the code is doing. Being able to understand documentation will help you a lot in your projects.
In [ ]:
# What is a factorplot? Check the documentation! Which data are we using? What is the count a count of?
g = sns.factorplot(x="alive", col="deck", col_wrap=4,
                   data=df[df.deck.notnull()], kind="count", size=4, aspect=.8)
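One way to check your reading of a count plot is to tabulate the same numbers with pandas. The sketch below uses a tiny made-up frame (the values are hypothetical) with the same two column names as the plot above, and assumes that each bar is simply the number of rows in one (deck, alive) group.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame; values are made up.
toy = pd.DataFrame({
    "deck":  ["A", "A", "B", "B", "B", None],
    "alive": ["yes", "no", "yes", "yes", "no", "no"],
})

# Keep only rows with a known deck, mirroring df[df.deck.notnull()].
known_deck = toy[toy["deck"].notnull()]

# One row count per (deck, alive) pair: one bar per pair in the count plot.
counts = known_deck.groupby(["deck", "alive"]).size()
print(counts)
```

Running the same `groupby` on the real dataframe gives numbers you can compare against the bar heights in each panel.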
What other options can you set with a factorplot in Seaborn? Using the code above as a starting point, write a factorplot of the same data in a different configuration. For example, make 2 plots per column, change the colors, add a legend, or change the size.
In [ ]:
# Try your own variation of the factorplot above.
As you saw in the factorplot documentation, you can specify several different types of plots in the parameters. Let's use factorplot to create a nested barplot showing passenger survival based on their class and sex. Fill in the missing pieces of the code below.
The goal is a barplot showing survival probability by class that further shows the sex of the passengers in each class. (Hint: how can you use the hue parameter?)
In [ ]:
# Draw a nested barplot to show survival for class and sex
g = sns.factorplot(x="CHANGE TO THE CORRECT FEATURE",
                   y="CHANGE TO THE CORRECT FEATURE",
                   hue="CHANGE TO THE CORRECT FEATURE",
                   data=df,
                   size=6, kind="bar", palette="muted")
g.despine(left=True)
g.set_ylabels("survival probability")
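With `kind="bar"`, seaborn's default is to draw the mean of the y variable within each group, so a 0/1 outcome column turns into a rate. The sketch below shows that calculation with pandas on a made-up frame; the column names `group`, `kind`, and `outcome` are hypothetical placeholders, not Titanic features.

```python
import pandas as pd

# Hypothetical toy frame; column names and values are invented for illustration.
toy = pd.DataFrame({
    "group":   ["a", "a", "a", "b", "b", "b"],
    "kind":    ["x", "x", "y", "x", "y", "y"],
    "outcome": [1, 0, 1, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of ones in each (group, kind) cell.
# unstack() puts "kind" in the columns, like the hue levels in a nested barplot.
rates = toy.groupby(["group", "kind"])["outcome"].mean().unstack("kind")
print(rates)
```

Each row of `rates` corresponds to one cluster of bars and each column to one hue level.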
Take a look at the code below. Let's again plot passenger survival based on class and who the passengers were (man, woman, child), but this time with a separate plot for each class, as we did above for the deck information.
In [ ]:
g = sns.factorplot(x="CHANGE TO THE CORRECT FEATURE",
                   y="CHANGE TO THE CORRECT FEATURE",
                   col="CHANGE TO THE CORRECT FEATURE",
                   data=df,
                   saturation=.5, kind="bar", ci=None, aspect=.6)
(g.set_axis_labels("", "Survival Rate")
  .set_xticklabels(["Men", "Women", "Children"])
  .set_titles("{col_name} {col_var}")
  .set(ylim=(0, 1))
  .despine(left=True))
Factorplot supports six kinds of plots; we explored two of them above. Using the documentation, try out one of the remaining plot types. A suggestion is provided below. You can follow it, create your own visualization, or both.
In [ ]:
# With factorplot, make a violin plot that shows the age of the passengers at each embarkation point
# based on their class. Use the hue parameter to show the sex of the passengers
In the Wheat Classification notebook, we saw a scatter matrix. A scatter matrix plots each feature against every other feature, and the diagonal shows a density plot of each feature on its own. Seaborn gives us this ability in the pairplot. To make a useful pairplot with the Titanic data, let's first fill in the missing age values.
In [ ]:
df.age = df.age.fillna(df.age.mean())
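Filling missing values with the mean keeps every row, but it also pulls the distribution toward its center. A minimal sketch on a made-up series (hypothetical values) shows both effects: the mean is unchanged and the standard deviation shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([20.0, np.nan, 40.0, np.nan, 30.0])

# pandas skips NaN when computing the mean, so this fills with 30.0.
filled = ages.fillna(ages.mean())

print("missing before:", ages.isnull().sum(), "after:", filled.isnull().sum())
print("mean before:", ages.mean(), "after:", filled.mean())
print("std before:", ages.std(), "after:", filled.std())
```

Keep this in mind when reading the age panels of a pairplot: a spike at the mean age may be an artifact of the imputation rather than a feature of the passengers.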
In [ ]:
g = sns.pairplot(data=df[['survived', 'pclass', 'age', 'sibsp', 'parch', 'fare']], hue='survived', dropna=True)
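A pairplot is a visual summary of pairwise relationships. As a rough numeric companion (not something the pairplot call itself produces), the sketch below computes a pairwise correlation matrix with `DataFrame.corr()` on a made-up frame whose column names `a`, `b`, and `c` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: b rises with a, c falls as a rises.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
})

# Pearson correlation for every pair of columns, one cell per pairplot panel.
corr = toy.corr()
print(corr.round(2))
```

Correlation only captures linear relationships, so the scatter panels can still reveal structure (clusters, curves, outliers) that this table hides.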
The Titanic data gives an idea of what we can see with a pairplot, but it might not be the most illustrative example. Using the information provided so far, make a pairplot using the seaborn car crashes data.
In [ ]:
# Pairplot of the crash data
In [ ]:
g = sns.jointplot(x="fare", y="age", data=df)
Using either the Titanic or crash data, create some jointplots.
In [ ]:
# Jointplot, titanic data
In [ ]:
# Jointplot, crash data
In [ ]:
# boxplot of the age distribution on each deck by class
In [ ]:
# boxplot of the age distribution on each deck by class using FacetGrid