Graphics using Seaborn

We previously covered how to make some basic graphics using matplotlib. In this notebook we introduce a package called seaborn. seaborn builds on top of matplotlib by doing two things:

It gives us access to more types of plots (Note: every plot created in seaborn could be made with matplotlib alone, but you shouldn't have to worry about doing this).
It sets better defaults for how plots look right away.

Before we start, make sure that you have seaborn installed. If not, you can install it by running

conda install seaborn

This notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.



In [1]:

    
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import sys


%matplotlib inline

As usual, we begin by listing the version of each package used in this notebook.



In [2]:

    
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Seaborn version: ', sns.__version__)









    



Python version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.19.2
Matplotlib version:  1.5.3
Seaborn version:  0.7.1

Datasets

There are some classical datasets that get used to demonstrate different types of plots. We will use several of them here.

tips: This dataset has information on restaurant tips. It includes the total amount of the bill, the tip amount, the sex of the person paying the bill, the day of the week, the meal, and the party size.
anscombe: This dataset is a contrived example. It contains 4 datasets that look drastically different when plotted, but have (nearly) the same correlation, regression coefficient, and $R^2$.
titanic: This dataset has information on the passengers who were on the Titanic. It includes sex, age, ticket class, fare paid, whether they were alone, and more.



In [3]:

    
tips = sns.load_dataset("tips")
ansc = sns.load_dataset("anscombe")
tita = sns.load_dataset("titanic")

Better Defaults

Recall that in our previous notebook we used plt.style.use to set styles. We will begin by setting the style to "classic"; this sets all of our settings back to matplotlib's default values.



In [4]:

    
tips.head()









    Out[4]:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4



In [5]:

    
tips[tips['sex']=='Female'].head()









    Out[5]:

    total_bill   tip     sex smoker  day    time  size
0        16.99  1.01  Female     No  Sun  Dinner     2
4        24.59  3.61  Female     No  Sun  Dinner     4
11       35.26  5.00  Female     No  Sun  Dinner     4
14       14.83  3.02  Female     No  Sun  Dinner     2
16       10.33  1.67  Female     No  Sun  Dinner     3

We can pick colors using HTML color names (such as "blue") or hex codes (such as "#7CFC00").



In [6]:

    
fig, ax = plt.subplots(figsize = (8, 4))

tips[tips["sex"] == "Male"].plot(x="total_bill", y="tip", ax=ax, kind="scatter",
                                 color="blue", label ='male')
tips[tips["sex"] == "Female"].plot(x="total_bill", y="tip", ax=ax, kind="scatter",
                                   color="#7CFC00", label ='female')

ax.set_xlim(0, 52)
ax.legend(loc='best')









    Out[6]:





<matplotlib.legend.Legend at 0x174cc37d7f0>



In [7]:

    
plt.style.use("classic")

fig, ax = plt.subplots(2, 1)

tips[tips["sex"] == "Male"].plot(x="total_bill", y="tip", ax=ax[0], kind="scatter",
                                 color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="tip", ax=ax[1], kind="scatter",
                                   color="#F52887")

ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 15)
ax[1].set_ylim(0, 15)

ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")

fig.tight_layout()
# fig.savefig("/home/chase/Desktop/foo.png")

Looking at the tip as a percentage of the total bill gives quite a different picture.



In [8]:

    
tips['percent'] = (tips['tip']/tips['total_bill'])*100



In [9]:

    
fig, ax = plt.subplots(2, 1)

tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter",
                                 color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter",
                                   color="#F52887")

ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)

ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")

fig.tight_layout()



In [10]:

    
# sns.set() resets default seaborn settings
sns.set()

fig, ax = plt.subplots(2)

tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter",
                                 color="blue")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter",
                                   color="#F52887")

ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 45)
ax[1].set_ylim(0, 45)


ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")

fig.tight_layout()

What did you notice about the differences in the settings of the plot?

Which do you like better? We like the second better.

Investigate other styles and create the same plot as above using a style you like. You can choose from the list in the code below.

If you have additional time, visit the seaborn docs and try changing other default settings.



In [11]:

    
plt.style.available









    Out[11]:





['seaborn-poster',
 'ggplot',
 'seaborn-notebook',
 'seaborn-pastel',
 'grayscale',
 'bmh',
 'dark_background',
 'seaborn-dark-palette',
 'seaborn-paper',
 'classic',
 'seaborn-deep',
 'seaborn-dark',
 'seaborn-muted',
 'seaborn-bright',
 'fivethirtyeight',
 'seaborn-darkgrid',
 'seaborn-whitegrid',
 'seaborn-talk',
 'seaborn-white',
 'seaborn-colorblind',
 'seaborn-ticks']



In [12]:

    
sns.set_style('dark')

We could do the same with a different style (like ggplot).



In [13]:

    
plt.style.use('ggplot')

fig, ax = plt.subplots(2)

tips[tips["sex"] == "Male"].plot(x="total_bill", y="percent", ax=ax[0], kind="scatter")
tips[tips["sex"] == "Female"].plot(x="total_bill", y="percent", ax=ax[1], kind="scatter")

ax[0].set_xlim(0, 52)
ax[1].set_xlim(0, 52)
ax[0].set_ylim(0, 45)
ax[1].set_ylim(0, 45)


ax[0].set_title("Male Tips")
ax[1].set_title("Female Tips")

fig.tight_layout()

Exercise: Find a style you like and recreate the plot above using that style.



In [ ]:

The Juicy Stuff

While having seaborn set sensible defaults is convenient, it isn't a particularly large innovation: we could choose sensible settings ourselves and make them our defaults. The main benefit of seaborn is the types of graphs it gives you access to. All of them could be made in matplotlib, but instead of 5 lines of code it might take hundreds. Trust us, this is a good thing.
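
To make the point about choosing our own defaults concrete, here is a minimal sketch that applies a few settings temporarily with matplotlib's rc_context. The values in my_defaults are our own illustrative choices, not the settings seaborn actually uses.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# Illustrative settings only (an assumption, not what seaborn actually sets)
my_defaults = {
    "axes.grid": True,
    "axes.facecolor": "#EAEAF2",
    "lines.linewidth": 2.0,
}

grid_before = mpl.rcParams["axes.grid"]

# Inside the with-block our settings are active; afterwards they are restored
with mpl.rc_context(my_defaults):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])
    grid_inside = mpl.rcParams["axes.grid"]

grid_after = mpl.rcParams["axes.grid"]
print(grid_before, grid_inside, grid_after)
```

If we wanted the settings to stick for the rest of the session, mpl.rcParams.update(my_defaults) would apply them without restoring the old values.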

We don't have time to cover everything that can be done in seaborn, but we suggest having a look at the gallery of examples.

We will cover:

kdeplot
jointplot
violinplot
pairplot
...



In [14]:

    
# Move back to seaborn defaults
sns.set()

kdeplot

What does kde stand for?

kde stands for "kernel density estimation." The details are (far far far) beyond the scope of this class, but the basic idea is that a kernel density estimate is a smoothed histogram. When we want to learn about a distribution, it sometimes looks nicer than a histogram does.
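
For the curious, the "smoothed histogram" idea can be sketched in a few lines of numpy: place a small Gaussian bump on every observation and average the bumps. This is only an illustration under our own assumptions. The function name gaussian_kde_by_hand, the made-up ages, and the fixed bandwidth of 5 are ours; sns.kdeplot chooses its bandwidth automatically.

```python
import numpy as np

def gaussian_kde_by_hand(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up ages for illustration (not the full titanic data)
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0, 14.0, 4.0])
grid = np.linspace(-40, 100, 1401)

density = gaussian_kde_by_hand(ages, grid, bandwidth=5.0)

# A density should integrate to (approximately) one
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.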



In [15]:

    
tita.head()









    Out[15]:

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Let's try a simple histogram first.



In [16]:

    
# hist cannot take NaN's
tita_nonan = tita[~tita['age'].isnull()]

fig, ax = plt.subplots()
ax.hist(tita_nonan['age'], bins=25)

ax.set_title("Histogram of age")
plt.show()

We can express this in terms of frequencies (~probabilities) instead of the number of occurrences.



In [17]:

    
fig, ax = plt.subplots()
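# Note: normed was deprecated in Matplotlib 2.1 and removed in 3.1;
# on newer versions use density=True instead.
ax.hist(tita_nonan['age'], bins=25, normed=True)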
ax.set_title("Histogram of age")
plt.show()
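
The normalization used in a histogram like this one can be checked by hand. In the sketch below (using numpy directly and a few made-up ages rather than the titanic data), each bar's height is its count divided by the number of observations times the bin width, so the bar areas add up to one.

```python
import numpy as np

# Made-up ages for illustration (not the titanic data)
ages = np.array([2.0, 4.0, 14.0, 22.0, 26.0, 27.0, 35.0, 35.0, 38.0, 54.0])

counts, edges = np.histogram(ages, bins=5)
density, _ = np.histogram(ages, bins=5, density=True)

widths = np.diff(edges)

# The same normalization done by hand: count / (n * bin width)
manual = counts / (counts.sum() * widths)

total_area = (density * widths).sum()
print(counts, total_area)
```

Note that the bar heights are densities, not probabilities: a height can exceed one when the bins are narrow, but the total area is always one.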

kdeplot essentially smooths this histogram out.

If you get a TypeError for kdeplot, try
 conda uninstall statsmodels --yes
 conda install -c taugspurger statsmodels=0.8.0



In [18]:

    
fig, ax = plt.subplots()
ax.hist(tita_nonan['age'], bins=25, normed=True)
sns.kdeplot(tita_nonan['age'], ax=ax, lw = 3)
ax.set_title("Histogram of age")
plt.show()



In [23]:

    
fig, ax = plt.subplots(1, 2, figsize = (12, 5), sharey=True)
ax[0].hist(tita_nonan['age'][tita_nonan['survived']==0], bins=25, normed=True, label='hist')
sns.kdeplot(tita_nonan['age'][tita_nonan['survived']==0], ax=ax[0], lw = 3, label='smooth')
ax[1].hist(tita_nonan['age'][tita_nonan['survived']==1], bins=25, normed=True, label='hist')
sns.kdeplot(tita_nonan['age'][tita_nonan['survived']==1], ax=ax[1], lw = 3, label='smooth')
ax[0].set_title("Histogram of age (not survived)")
ax[1].set_title("Histogram of age (survived)")
plt.show()



In [20]:

    
tita_survived = tita_nonan[tita_nonan['survived']==1]

fig, ax = plt.subplots(1, 2, figsize = (12, 5), sharey=True)
ax[0].hist(tita_survived['age'][tita_survived['sex']=='male'], bins=25, normed=True)
sns.kdeplot(tita_survived['age'][tita_survived['sex']=='male'], ax=ax[0], lw = 3)
ax[1].hist(tita_survived['age'][tita_survived['sex']=='female'], bins=25, normed=True)
sns.kdeplot(tita_survived['age'][tita_survived['sex']=='female'], ax=ax[1], lw = 3)
ax[0].set_title("Histogram of age (male)")
ax[1].set_title("Histogram of age (female)")
plt.show()



In [21]:

    
fig, ax = plt.subplots()

sns.kdeplot(tips["percent"], ax=ax, label='smooth')
ax.hist(tips["percent"], bins=25, alpha=0.25, normed=True, label="tip (percent)")
ax.legend()
fig.suptitle("Kernel Density of tips with Histogram")
plt.show()



In [24]:

    
tips.head()









    Out[24]:

   total_bill   tip     sex smoker  day    time  size    percent
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765



In [25]:

    
fig, ax = plt.subplots(1, 2, sharey=True, figsize = (12, 4))

sns.kdeplot(tips["percent"][tips['sex']=='Male'], ax=ax[0], label='smooth')
ax[0].hist(tips["percent"][tips['sex']=='Male'], bins=25, alpha=0.25, normed=True, label="tip (percent)")
ax[0].set_title('Male bill payer')
ax[0].legend()
sns.kdeplot(tips["percent"][tips['sex']=='Female'], ax=ax[1], label='smooth')
ax[1].hist(tips["percent"][tips['sex']=='Female'], bins=25, alpha=0.25, normed=True, label="tip (percent)")
ax[1].set_title('Female bill payer')
ax[1].legend()

fig.suptitle("Kernel Density of tips with Histogram", y=1.03)
plt.show()

Exercise: Create your own kernel density plot of "total_bill" from the tips dataframe using sns.kdeplot.



In [ ]:

jointplot

We now show what jointplot does. It draws a scatter plot of two variables and places a histogram of each variable along the corresponding edge of the scatter plot. This gives you information about not only the joint distribution, but also the marginal distributions.
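
The relationship between a joint distribution and its marginals can be sketched with numpy alone: if we bin two variables together, summing the two-dimensional counts across one variable recovers the ordinary histogram of the other. The data below are random stand-ins for total_bill and percent (an assumption for illustration, not the tips data), and the bin edges are our own choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for total_bill and percent (not the real tips data)
bill = rng.uniform(5, 50, size=200)
percent = rng.uniform(5, 30, size=200)

x_edges = np.linspace(0, 60, 7)   # 6 bins for the bill
y_edges = np.linspace(0, 40, 5)   # 4 bins for the percent

# Joint counts: joint[i, j] = observations in bill bin i and percent bin j
joint, _, _ = np.histogram2d(bill, percent, bins=[x_edges, y_edges])

# Marginal counts: the usual one-dimensional histograms
marg_bill, _ = np.histogram(bill, bins=x_edges)
marg_percent, _ = np.histogram(percent, bins=y_edges)

# Summing the joint counts over one variable gives the other's marginal
print(joint.sum(axis=1), marg_bill)
print(joint.sum(axis=0), marg_percent)
```

This is the information that the histograms on the edges of a joint plot summarize.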



In [24]:

    
sns.jointplot(x="total_bill", y="percent", data=tips)









    Out[24]:





<seaborn.axisgrid.JointGrid at 0x7ff8d6bb6d68>

We can also plot everything as a kernel density estimate. Notice that the main plot is now a contour map.



In [25]:

    
sns.jointplot(x="total_bill", y="percent", data=tips, kind="kde")









    Out[25]:





<seaborn.axisgrid.JointGrid at 0x7ff8d6b31668>



In [26]:

    
sns.set(style="darkgrid", color_codes=True)

# Note: in seaborn >= 0.9 the size argument of jointplot and pairplot is named height
g = sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg",
                  xlim=(0, 60), ylim=(0, 12), color="r", size=6)

Exercise: Create your own jointplot. Feel free to choose your own x and y data (if you can't decide then use x=size and y=tip). Interpret the output of the plot.

violinplot

Part of the story of this notebook is that distributions matter, and so does how we show them. A violin plot is similar to a kernel density estimate turned on its side (and mirrored), and it lets us see how a distribution changes across some aspect of the data.
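
One simple numerical counterpart to comparing distributions across groups is to compute the quartiles of each group. The sketch below uses made-up bills for two days (hypothetical numbers, not the tips data) and numpy's percentile function.

```python
import numpy as np

# Made-up bills grouped by day (hypothetical numbers, not the tips data)
bills_by_day = {
    "Thur": np.array([10.0, 20.0, 30.0, 40.0, 50.0]),
    "Fri": np.array([5.0, 15.0, 25.0]),
}

# 25th, 50th (median), and 75th percentiles for every group
quartiles = {
    day: np.percentile(bills, [25, 50, 75])
    for day, bills in bills_by_day.items()
}

for day, (q1, median, q3) in quartiles.items():
    print(day, q1, median, q3)
```

A violin plot shows the whole shape of each group's distribution rather than just these three numbers.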



In [26]:

    
tips.head()









    Out[26]:

   total_bill   tip     sex smoker  day    time  size    percent
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765

If violinplot tends to kill your kernel, try updating your numpy and scipy packages, for example:
   conda update numpy 



In [27]:

    
sns.set(style="whitegrid", palette="pastel", color_codes=True)



In [28]:

    
sns.violinplot(x="day", y="total_bill", data=tips)









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x174ccc9e5c0>



In [29]:

    
sns.violinplot(x="day", y="total_bill", hue='sex', data=tips)









    Out[29]:





<matplotlib.axes._subplots.AxesSubplot at 0x174cd0f6710>



In [30]:

    
sns.violinplot(x="day", y="total_bill", hue='sex', data=tips, split=True, inner='quart')









    Out[30]:





<matplotlib.axes._subplots.AxesSubplot at 0x174ccee5470>



In [31]:

    
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True,
               inner="quart", palette={"Male": "b", "Female": "y"})









    Out[31]:





<matplotlib.axes._subplots.AxesSubplot at 0x174ccc3f828>



In [32]:

    
sns.violinplot(x="class", y="age", hue='sex', split=True, data=tita_survived)









    Out[32]:





<matplotlib.axes._subplots.AxesSubplot at 0x174ccc56438>

pairplot

Pair plots show us two things: the histogram of each variable along the diagonal, and a scatter plot of each pair of variables in the off-diagonal panels.

Why might this be useful? It gives us an idea of the correlation between each pair of variables and of how the variables relate to one another.
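
The numerical counterpart of a pair plot is a correlation matrix, with one entry for each pair of variables. Here is a minimal sketch with numpy; the three variables are randomly generated stand-ins (an assumption for illustration), with tip constructed to depend on total_bill.

```python
import numpy as np

rng = np.random.default_rng(1)

# Randomly generated stand-ins, not the real tips data
total_bill = rng.uniform(5, 50, size=100)
tip = 0.15 * total_bill + rng.normal(0, 1, size=100)   # tip rises with the bill
size = rng.integers(1, 7, size=100).astype(float)       # party size, unrelated

# np.corrcoef expects one row per variable
data = np.vstack([tip, total_bill, size])
corr = np.corrcoef(data)

print(np.round(corr, 2))
```

The diagonal is always one (each variable is perfectly correlated with itself), which is why a pair plot shows each variable's own distribution on the diagonal instead.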



In [33]:

    
sns.set()



In [34]:

    
sns.pairplot(tips[["percent", "total_bill", "size"]], size=3.5)









    Out[34]:





<seaborn.axisgrid.PairGrid at 0x174ccd10128>

Below is a similar plot with a couple of changes. What is different?



In [35]:

    
sns.pairplot(tips[["tip", "total_bill", "size"]], size=3.5, diag_kind="kde")









    Out[35]:





<seaborn.axisgrid.PairGrid at 0x174cd891ac8>

What's different about this plot?

Different colors for each meal time (Lunch or Dinner).



In [36]:

    
tips.head()









    Out[36]:

   total_bill   tip     sex smoker  day    time  size    percent
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765



In [37]:

    
sns.pairplot(tips[["tip", "total_bill", "size", "time"]], size=3.5, diag_kind="kde", hue="time")









    Out[37]:





<seaborn.axisgrid.PairGrid at 0x174ce1de7b8>

swarmplot

Sometimes we simply have too much data. One approach to visualizing all the data is to adjust features like the point size or transparency.

An alternative is to use a swarm plot. This is best understood by example, so let's dive in!



In [55]:

    
fig, ax = plt.subplots()
sns.swarmplot(data=tips, x="day", y="total_bill", ax=ax)









    Out[55]:





<matplotlib.axes._subplots.AxesSubplot at 0x7ff8bfe02160>

lmplot

We often want to think about running regressions of variables. A statistician named Francis Anscombe came up with four datasets that all share (nearly) the same:

Mean of $x$ and mean of $y$
Variance of $x$ and variance of $y$
Correlation between $x$ and $y$
Regression coefficient of $y$ on $x$
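
These shared summary statistics can be checked directly. The sketch below types the quartet in by hand instead of loading it with sns.load_dataset, so it assumes the values match the published data; the dictionary names quartet and summary are ours.

```python
import numpy as np

# Anscombe's quartet typed in by hand (assumed to match the published values)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Least-squares line y = intercept + slope * x
    slope, intercept = np.polyfit(x, y, 1)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
    }

for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of output are nearly identical, even though the four scatter plots look nothing alike.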

Below we show the scatter plot of the datasets to give you an idea of how different they are.



In [56]:

    
fig, ax = plt.subplots(2, 2, figsize=(10, 9))

ansc[ansc["dataset"] == "I"].plot.scatter(x="x", y="y", ax=ax[0, 0])
ansc[ansc["dataset"] == "II"].plot.scatter(x="x", y="y", ax=ax[0, 1])
ansc[ansc["dataset"] == "III"].plot.scatter(x="x", y="y", ax=ax[1, 0])
ansc[ansc["dataset"] == "IV"].plot.scatter(x="x", y="y", ax=ax[1, 1])

ax[0, 0].set_title("Dataset I")
ax[0, 1].set_title("Dataset II")
ax[1, 0].set_title("Dataset III")
ax[1, 1].set_title("Dataset IV")

fig.suptitle("Anscombe's Quartet")









    Out[56]:





<matplotlib.text.Text at 0x7ff8bea5ac50>

lmplot plots the data with the regression line through it.



In [57]:

    
sns.lmplot(x="x", y="y", data=ansc, col="dataset", hue="dataset",
           col_wrap=2, ci=None)









    Out[57]:





<seaborn.axisgrid.FacetGrid at 0x7ff8c8236f60>

regplot

regplot also shows the regression line through the data points, but it draws on a single axes that we pass in with ax.



In [67]:

    
fig, ax = plt.subplots(2, 2, figsize = (10, 10))
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "I"], ax=ax[0,0])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "II"], ax=ax[0,1])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "III"], ax=ax[1,0])
sns.regplot(x="x", y="y", data=ansc[ansc["dataset"] == "IV"], ax=ax[1,1])









    Out[67]:





<matplotlib.axes._subplots.AxesSubplot at 0x7ff8bdb48d68>



In [ ]:
