We have previously covered how to do some basic graphics using matplotlib. In this notebook we introduce a package called seaborn. seaborn builds on top of matplotlib by doing two things: it gives us access to more types of plots (every plot in seaborn could be made by matplotlib, but you shouldn't have to worry about doing this), and it sets sensible default plot settings.
Before we start, make sure that you have seaborn installed. If not, then you can install it by
conda install seaborn
This notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
In [1]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sys
%matplotlib inline
As per usual, we begin by listing the versions of each package that is used in this notebook.
In [2]:
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Seaborn version: ', sns.__version__)
There are some classical datasets that get used to demonstrate different types of plots. We will use several of them here.
In [3]:
tips = sns.load_dataset("tips")
ansc = sns.load_dataset("anscombe")
tita = sns.load_dataset("titanic")
Recall that in our previous notebook we used plt.style.use to set styles. We will begin by setting the style to "classic"; this restores matplotlib's classic (pre-2.0) default look.
In [4]:
tips.head()
Out[4]:
In [5]:
tips[tips['sex']=='Female'].head()
Out[5]:
Colors can be given as HTML color names or as hex codes.
In [6]:
fig, ax = plt.subplots(figsize = (8, 4))
tips[tips["sex"] == "Male"].plot(x="total_bill", y="tip", ax=ax, kind="scatter",
color="blue", label ='male')
tips[tips["sex"] == "Female"].plot(x="total_bill", y="tip", ax=ax, kind="scatter",
color="#7CFC00", label ='female')
ax.set_xlim(0, 52)
ax.legend(loc='best')
Out[6]:
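As a small aside on the hex codes used for colors in this notebook: each code packs three two-digit hexadecimal numbers for red, green, and blue. The sketch below decodes them by hand. The helper name hex_to_rgb is our own invention for illustration and is not part of matplotlib or seaborn.

```python
def hex_to_rgb(code):
    """Convert a hex color such as '#7CFC00' to an (R, G, B) tuple of 0-255 ints.

    hex_to_rgb is a hypothetical helper written for this example only.
    """
    digits = code.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hex digits, got %r" % code)
    # Read the string two characters at a time and parse each pair as base 16.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


green = hex_to_rgb("#7CFC00")  # the code used for the 'female' points above
pink = hex_to_rgb("#F52887")   # the code used for 'female' points in later cells
print(green, pink)
```

matplotlib does this decoding for us whenever we pass a hex string as the color argument, so in practice we never need such a helper.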
In [7]:
plt.style.use("classic")
fig, ax = plt.subplots(2, 1)
tips[tips["sex"] == "Male"].plot(x="total_bill", y="tip", ax=ax[0], kind="scatter",
color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="tip", ax=ax[1], kind="scatter",
color="#F52887")
ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 15)
ax[1].set_ylim(0, 15)
ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")
fig.tight_layout()
# fig.savefig("/home/chase/Desktop/foo.png")
Plotting the tip as a percentage of the total bill gives quite a different picture.
In [8]:
tips['percent'] = (tips['tip']/tips['total_bill'])*100
In [9]:
fig, ax = plt.subplots(2, 1)
tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter",
color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter",
color="#F52887")
ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")
fig.tight_layout()
In [10]:
# sns.set() resets default seaborn settings
sns.set()
fig, ax = plt.subplots(2)
tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter",
color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter",
color="#F52887")
ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 45)
ax[1].set_ylim(0, 45)
ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")
fig.tight_layout()
What did you notice about the differences in the settings of the plot?
Which do you like better? We like the second better.
Investigate other styles and create the same plot as above using a style you like. You can choose from the list in the code below.
If you have additional time, visit the seaborn docs and try changing other default settings.
In [11]:
plt.style.available
Out[11]:
In [12]:
sns.set_style('dark')
We could do the same for a different style (like ggplot)
In [13]:
plt.style.use('ggplot')
fig, ax = plt.subplots(2)
tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter")
ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 45)
ax[1].set_ylim(0, 45)
ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")
fig.tight_layout()
Exercise: Find a style you like and recreate the plot above using that style.
In [ ]:
While having seaborn set sensible defaults is convenient, it isn't a particularly large innovation: we could choose sensible defaults ourselves and set them to be our default. The main benefit of seaborn is the types of graphs that it gives you access to. All of them could be made in matplotlib, but what takes 5 lines of seaborn could require possibly hundreds of lines of matplotlib. Trust us, this is a good thing.
We don't have time to cover everything that can be done in seaborn, but we suggest having a look at the gallery of examples.
We will cover:
kdeplot
jointplot
violinplot
pairplot
In [14]:
# Move back to seaborn defaults
sns.set()
In [15]:
tita.head()
Out[15]:
Try a simple histogram
In [16]:
# hist cannot take NaN's
tita_nonan = tita[~tita['age'].isnull()]
fig, ax = plt.subplots()
ax.hist(tita_nonan['age'], bins=25)
ax.set_title("Histogram of age")
plt.show()
Express this in terms of frequencies (~probabilities) instead of number of occurrences
In [17]:
fig, ax = plt.subplots()
# density=True replaces the normed=True argument, which newer matplotlib versions removed
ax.hist(tita_nonan['age'], bins=25, density=True)
ax.set_title("Histogram of age")
plt.show()
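To make the normalization concrete, here is a minimal sketch of the arithmetic behind a density histogram, using only the Python standard library. Each bar height is the bin count divided by (number of observations × bin width), so the bar areas sum to one. The ages list is a handful of made-up values, not the titanic data, and the histogram function is our own illustration rather than matplotlib's implementation.

```python
def histogram(values, bins, lo, hi):
    """Count values into `bins` equal-width bins spanning [lo, hi].

    Assumes every value lies inside [lo, hi].
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so that v == hi is still counted.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, width


ages = [2, 4, 14, 22, 26, 27, 35, 35, 38, 54]  # hypothetical ages
counts, width = histogram(ages, bins=4, lo=0, hi=60)

# Divide each count by (n * bin width) to turn counts into a density.
density = [c / (len(ages) * width) for c in counts]

# The total area of the bars is now (approximately, in floating point) 1.
area = sum(d * width for d in density)
print(counts, density, area)
```

Note that the bar heights themselves are not probabilities; it is the area of each bar that approximates the probability of falling in that bin.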
kdeplot essentially smooths this histogram out.
If you get a TypeError for kdeplot, try
conda uninstall statsmodels --yes
conda install -c taugspurger statsmodels=0.8.0
In [18]:
fig, ax = plt.subplots()
ax.hist(tita_nonan['age'], bins=25, density=True)
sns.kdeplot(tita_nonan['age'], ax=ax, lw = 3)
ax.set_title("Histogram of age")
plt.show()
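For intuition about what the smooth curve is, here is a minimal sketch of a Gaussian kernel density estimate written with only the standard library. It places a small bell curve on every observation and averages them. This is not seaborn's implementation: the function name gaussian_kde, the fixed bandwidth of 8, and the ages list are all assumptions made for illustration (seaborn chooses a bandwidth automatically).

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` at a point x."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, each centered on that observation.
        bumps = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return bumps / norm

    return density


ages = [2.0, 4.0, 14.0, 22.0, 26.0, 27.0, 35.0, 35.0, 38.0, 54.0]  # hypothetical ages
density = gaussian_kde(ages, bandwidth=8.0)

# Evaluate on a wide grid and check that the curve integrates to roughly one.
step = 0.5
grid = [-50 + step * i for i in range(401)]  # covers -50 to 150
area = sum(density(x) for x in grid) * step
print(round(area, 4), density(30.0), density(100.0))
```

A smaller bandwidth gives a wigglier curve that follows the histogram closely, while a larger one gives a smoother curve that hides detail.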
In [23]:
fig, ax = plt.subplots(1, 2, figsize = (12, 5), sharey=True)
ax[0].hist(tita_nonan['age'][tita_nonan['survived']==0], bins=25, density=True, label='hist')
sns.kdeplot(tita_nonan['age'][tita_nonan['survived']==0], ax=ax[0], lw = 3, label='smooth')
ax[1].hist(tita_nonan['age'][tita_nonan['survived']==1], bins=25, density=True, label='hist')
sns.kdeplot(tita_nonan['age'][tita_nonan['survived']==1], ax=ax[1], lw = 3, label='smooth')
ax[0].set_title("Histogram of age (not survived)")
ax[1].set_title("Histogram of age (survived)")
plt.show()
In [20]:
tita_survived = tita_nonan[tita_nonan['survived']==1]
fig, ax = plt.subplots(1, 2, figsize = (12, 5), sharey=True)
ax[0].hist(tita_survived['age'][tita_survived['sex']=='male'], bins=25, density=True)
sns.kdeplot(tita_survived['age'][tita_survived['sex']=='male'], ax=ax[0], lw = 3)
ax[1].hist(tita_survived['age'][tita_survived['sex']=='female'], bins=25, density=True)
sns.kdeplot(tita_survived['age'][tita_survived['sex']=='female'], ax=ax[1], lw = 3)
ax[0].set_title("Histogram of age (male)")
ax[1].set_title("Histogram of age (female)")
plt.show()
In [21]:
fig, ax = plt.subplots()
sns.kdeplot(tips["percent"], ax=ax, label='smooth')
ax.hist(tips["percent"], bins=25, alpha=0.25, density=True, label="tip (percent)")
ax.legend()
fig.suptitle("Kernel Density of tips with Histogram")
plt.show()
In [24]:
tips.head()
Out[24]:
In [25]:
fig, ax = plt.subplots(1, 2, sharey=True, figsize = (12, 4))
sns.kdeplot(tips["percent"][tips['sex']=='Male'], ax=ax[0], label='smooth')
ax[0].hist(tips["percent"][tips['sex']=='Male'], bins=25, alpha=0.25, density=True, label="tip (percent)")
ax[0].set_title('Male waiter')
ax[0].legend()
sns.kdeplot(tips["percent"][tips['sex']=='Female'], ax=ax[1], label='smooth')
ax[1].hist(tips["percent"][tips['sex']=='Female'], bins=25, alpha=0.25, density=True, label="tip (percent)")
ax[1].set_title('Female waiter')
ax[1].legend()
fig.suptitle("Kernel Density of tips with Histogram", y=1.03)
plt.show()
Exercise: Create your own kernel density plot using sns.kdeplot of "total_bill" from the tips dataframe
In [ ]:
In [24]:
sns.jointplot(x="total_bill", y="percent", data=tips)
Out[24]:
We can also plot everything as a kernel density estimate. Notice that the main plot is now a contour map.
In [25]:
sns.jointplot(x="total_bill", y="percent", data=tips, kind="kde")
Out[25]:
In [26]:
sns.set(style="darkgrid", color_codes=True)
# x and y are passed by keyword, and height replaces the older size argument
g = sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg",
xlim=(0, 60), ylim=(0, 12), color="r", height=6)
Exercise: Create your own jointplot. Feel free to choose your own x and y data (if you can't decide then use x=size and y=tip). Interpret the output of the plot.
In [26]:
tips.head()
Out[26]:
If violinplot tends to kill your kernel, try updating your numpy and scipy packages:
conda update numpy
In [27]:
sns.set(style="whitegrid", palette="pastel", color_codes=True)
In [28]:
sns.violinplot(x="day", y="total_bill", data=tips)
Out[28]:
In [29]:
sns.violinplot(x="day", y="total_bill", hue='sex', data=tips)
Out[29]:
In [30]:
sns.violinplot(x="day", y="total_bill", hue='sex', data=tips, split=True, inner='quart')
Out[30]:
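The inner='quart' option marks the quartiles of each distribution inside the violin. As a rough sketch of what those three marks represent, we can compute quartiles for a small list with the standard library. The bills list is made up for illustration, and the "inclusive" interpolation method is our assumption; seaborn's exact computation may differ in detail.

```python
import statistics

bills = [10, 12, 15, 18, 20, 24, 30, 45, 50]  # hypothetical total bills

# n=4 splits the data into four equal-probability groups and returns the
# three cut points: first quartile, median, third quartile.
q1, median, q3 = statistics.quantiles(bills, n=4, method="inclusive")

# The interquartile range is the width of the middle half of the data.
iqr = q3 - q1
print(q1, median, q3, iqr)
```

Half of the observations fall between the first and third quartile, which is why the quartile marks give a quick read on the spread of each group.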
In [31]:
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True,
inner="quart", palette={"Male": "b", "Female": "y"})
Out[31]:
In [32]:
sns.violinplot(x="class", y="age", hue='sex', split=True, data=tita_survived)
Out[32]:
Pair plots show us two things: the histogram of each variable along the diagonal, and the scatter plot of each pair of variables in the off-diagonal panels.
Why might this be useful? It gives us an idea of the correlation between each pair of variables and of the relationships across the variables.
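The number that each off-diagonal panel lets us eyeball is the pairwise correlation. Here is a minimal sketch that computes the Pearson correlation for every pair of columns using only the standard library. The columns dictionary holds a few made-up rows that borrow the tips column names; it is not the real dataset, and the pearson helper is written for this example.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rows, for illustration only.
columns = {
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [2.0, 3.0, 5.0, 6.0, 9.0],
    "size": [2, 2, 3, 4, 4],
}

# One correlation per unordered pair of columns, like the off-diagonal panels.
pairs = {(a, b): pearson(columns[a], columns[b]) for a, b in combinations(columns, 2)}
for (a, b), r in pairs.items():
    print(a, b, round(r, 3))
```

A single correlation number can hide a lot, which is why looking at the scatter plots themselves is still worthwhile.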
In [33]:
sns.set()
In [34]:
# height replaces the older size argument
sns.pairplot(tips[["percent", "total_bill", "size"]], height=3.5)
Out[34]:
Below is the same plot, but slightly different. What is different?
In [35]:
sns.pairplot(tips[["tip", "total_bill", "size"]], height=3.5, diag_kind="kde")
Out[35]:
What's different about this plot?
Different colors for each value of time (Lunch or Dinner).
In [36]:
tips.head()
Out[36]:
In [37]:
sns.pairplot(tips[["tip", "total_bill", "size", "time"]], height=3.5, diag_kind="kde", hue="time")
Out[37]:
In [55]:
fig, ax = plt.subplots()
sns.swarmplot(data=tips, x="day", y="total_bill")
Out[55]:
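We often want to think about running regressions of variables. A statistician named Francis Anscombe came up with four datasets that have nearly identical summary statistics (means, variances, correlation, and fitted regression line) but look very different when plotted.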
Below we show the scatter plot of the datasets to give you an idea of how different they are.
In [56]:
fig, ax = plt.subplots(2, 2, figsize=(10, 9))
ansc[ansc["dataset"] == "I"].plot.scatter(x="x", y="y", ax=ax[0, 0])
ansc[ansc["dataset"] == "II"].plot.scatter(x="x", y="y", ax=ax[0, 1])
ansc[ansc["dataset"] == "III"].plot.scatter(x="x", y="y", ax=ax[1, 0])
ansc[ansc["dataset"] == "IV"].plot.scatter(x="x", y="y", ax=ax[1, 1])
ax[0, 0].set_title("Dataset I")
ax[0, 1].set_title("Dataset II")
ax[1, 0].set_title("Dataset III")
ax[1, 1].set_title("Dataset IV")
fig.suptitle("Anscombe's Quartet")
Out[56]:
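To see why the quartet is so striking, we can fit a least-squares line to each dataset by hand. The sketch below types in Anscombe's published values directly, instead of using the ansc dataframe, so that it runs with only the standard library. The fit_line helper is our own and is not how seaborn computes its regression lines.

```python
import math

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I, II and III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus the correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / math.sqrt(sxx * syy)


results = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (slope, intercept, r) in results.items():
    print(name, round(slope, 3), round(intercept, 3), round(r, 3))
```

All four fits come out with a slope close to 0.5, an intercept close to 3, and a correlation close to 0.82, even though the scatter plots look nothing alike.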
lmplot plots the data with the fitted regression line through it.
In [57]:
sns.lmplot(x="x", y="y", data=ansc, col="dataset", hue="dataset",
col_wrap=2, ci=None)
Out[57]:
regplot also shows the regression line through data points
In [67]:
fig, ax = plt.subplots(2, 2, figsize = (10, 10))
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "I"], ax=ax[0,0])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "II"], ax=ax[0,1])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "III"], ax=ax[1,0])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "IV"], ax=ax[1,1])
Out[67]:
In [ ]: