Talk slides: http://bit.ly/pycon2014_dataviz
Talk git repo: http://github.com/olgabot/pycon2014_dataviz
In [1]:
from IPython.display import HTML
HTML('<iframe src=https://olgabot.github.io/prettyplotlib width=1024 height=350></iframe>')
Out[1]:
This is where the seaborn plotting library comes in.
Written primarily by Michael Waskom, PhD student in Cognitive Neuroscience at Stanford
In [2]:
from IPython.display import HTML
HTML('<iframe src=http://stanford.edu/~mwaskom/software/seaborn/examples/index.html width=1024 height=350></iframe>')
Out[2]:
So let's make some data to plot
In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
data = pd.DataFrame(np.vstack([np.random.normal(loc=1, scale=10, size=100),
                               np.random.negative_binomial(20, 0.7, size=100)]).T,
                    columns=('apple', 'banana'))
data.head()
Out[3]:
In [21]:
data.plot()
Out[21]:
In [5]:
import seaborn as sns
data.plot()
Out[5]:
Being a fan of Edward Tufte, I prefer less "chartjunk" and a white background over the grey backgrounds modeled on R's ggplot2 (by Hadley Wickham).
So we'll move away from the default style. We'll set the style to 'white' so there's a white background, and the context to 'talk' so everyone here can see the axis labels.
In [6]:
sns.set(style='white', context='talk')
data.plot()
Out[6]:
In [7]:
import matplotlib.pyplot as plt
# the two ways you can access a column in a pandas Dataframe:
# 1. data['columnname']
# 2. data.columnname
plt.scatter(data['apple'], data.banana)
Out[7]:
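The two column-access styles in the comment above return the same Series. Here is a small sketch on a throwaway frame (the `toy` frame and its column names are made up for illustration) showing the equivalence, plus one caveat: attribute access silently picks up a DataFrame method when a column shares its name, so brackets are the safer spelling.

```python
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({'apple': [1.0, 2.0, 3.0], 'count': [4, 5, 6]})

# Both spellings return the same Series
same = toy['apple'].equals(toy.apple)

# `toy.count` is the DataFrame.count method, not the 'count' column
shadowed = callable(toy.count)

# Bracket access always reaches the column
by_bracket = toy['count'].tolist()
```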
Well, what if we want to look at the histogram of each variable in addition to the scatterplot?
Like this:
In [8]:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
ax1 = plt.subplot2grid((4,4), (1,0), colspan=3, rowspan=3)
ax2 = plt.subplot2grid((4,4), (0,0), colspan=3)
ax3 = plt.subplot2grid((4,4), (1, 3), rowspan=3)
plt.tight_layout()
ax1.scatter(data.apple, data.banana)
ax2.hist(data.apple)
# Read the top histogram's axis limits (swapping the y-limits would turn it upside-down)
ax2_limits = ax2.axis()
ax3.hist(data.banana, orientation='horizontal')
Out[8]:
We could add some more code to compute the Pearson correlation and put its text somewhere on the plot...
pearsonr = np.corrcoef(data.apple, data.banana)[0, 1]
# Get axis limits of the scatterplot
xmin, xmax, ymin, ymax = ax1.axis()
dx = xmax - xmin
dy = ymax - ymin
ax1.text(x=xmin + .9*dx, y=ymin + .9*dy,
         s='pearsonr = {:.3f}'.format(pearsonr))
In [9]:
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
ax1 = plt.subplot2grid((4,4), (1,0), colspan=3, rowspan=3)
ax2 = plt.subplot2grid((4,4), (0,0), colspan=3)
ax3 = plt.subplot2grid((4,4), (1, 3), rowspan=3)
plt.tight_layout()
ax1.scatter(data.apple, data.banana)
ax2.hist(data.apple)
ax3.hist(data.banana, orientation='horizontal')
# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
pearsonr = np.corrcoef(data.apple, data.banana)[0, 1]
# Get axis limits of the scatterplot
xmin, xmax, ymin, ymax = ax1.axis()
dx = xmax - xmin
dy = ymax - ymin
ax1.text(x=xmin + .9*dx, y=ymin + .9*dy,
s='pearsonr = {:.3f}'.format(pearsonr))
Out[9]:
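Pearson's r is the covariance of the two variables divided by the product of their standard deviations. As a minimal sketch on made-up arrays (`x` and `y` below are illustrative, not the talk's data), `np.corrcoef` returns this value in the off-diagonal of a 2x2 matrix. `np.correlate` in its default mode returns the plain dot product of equal-length inputs instead, which is not normalized to the range [-1, 1].

```python
import numpy as np

# Illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson's r written out from its definition
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# The same value from the correlation matrix
r_corrcoef = np.corrcoef(x, y)[0, 1]

# Not a correlation coefficient: the unnormalized dot product of x and y
dot_product = np.correlate(x, y)[0]
```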
Thus, enter seaborn.
Seaborn's jointplot does exactly this, sanely, in a single line:
In [10]:
sns.jointplot(x='apple', y='banana', data=data)
Out[10]:
In [11]:
sns.jointplot(x='apple', y='banana', data=data, kind='reg')
Out[11]:
In [12]:
sns.jointplot(x='apple', y='banana', data=data, kind='kde')
Out[12]:
In [13]:
sns.jointplot(x='apple', y='banana', data=data, kind='hex')
Out[13]:
Quick tip: Some seaborn functions do all kinds of crazy things in the background, but you can't pass a fig object to them. To save the figure, grab it with matplotlib's plt.gcf() ("get current figure"):
In [14]:
sns.jointplot(x='apple', y='banana', data=data, kind='hex')
fig = plt.gcf()
fig.savefig('jointplot_hex.pdf')
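The same `plt.gcf()` pattern works for any plotting helper that creates its own figure and doesn't hand it back. Here is a minimal sketch with a stand-in helper (`draw_without_returning` is hypothetical, standing in for such a function) that saves to an in-memory buffer instead of a file. The `Agg` backend is an assumption, picked only so the sketch runs as a plain script without a display.

```python
import io

import matplotlib
matplotlib.use('Agg')  # assumption: headless script, not a notebook
import matplotlib.pyplot as plt


def draw_without_returning():
    """Hypothetical helper: makes its own figure and returns nothing."""
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])


draw_without_returning()

# Grab whatever figure was drawn most recently
fig = plt.gcf()

# Save into a buffer; a filename such as 'plot.png' works the same way
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
png_bytes = buffer.getvalue()
plt.close(fig)
```

In a notebook with `%matplotlib inline`, as far as I know the figure is closed when the cell finishes, so call `plt.gcf()` in the same cell as the plotting call.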
In [15]:
plt.boxplot(data.values);
Bleh. Very ugly. Plus nothing is labeled!
In [16]:
sns.boxplot(data)
Out[16]:
Very nice! But something I use a lot in my own research is violinplot.
In [17]:
sns.violinplot(data)
# remove the top and right axes
sns.despine()
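For reference, hiding the top and right spines can also be done by hand on a plain matplotlib axes. This is only a rough sketch of the visual effect, not seaborn's implementation, and the plotted values are made up. The `Agg` backend is an assumption so it runs as a script without a display.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless script, not a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1, 3, 2])  # illustrative values only

# Hide the top and right borders of the plot
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

# Keep tick marks only on the remaining bottom and left axes
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

# Record which spines are still drawn
visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
```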
As a side note, this is how you'd make a violin plot in pure matplotlib (which is how I used to do it before I discovered seaborn):
In [18]:
from scipy.stats import gaussian_kde
def violinplot(ax, x, ys, bp=False, cut=False, bw_method=.5, width=None):
    """Make a violin plot of each dataset in the `ys` sequence. `ys` is a
    list of numpy arrays, and `x` holds the position of each violin.
    Adapted by: Olga Botvinnik
    # Original Author: Teemu Ikonen <tpikonen@gmail.com>
    # Based on code by Flavio Codeco Coelho,
    # http://pyinsci.blogspot.com/2009/09/violin-plot-with-matplotlib.html
    """
    dist = np.max(x) - np.min(x)
    if width is None:
        width = min(0.15 * max(dist, 1.0), 0.4)
    for d, p in zip(ys, x):
        k = gaussian_kde(d, bw_method=bw_method)  # calculates the kernel density
        # k.covariance_factor = 0.1
        s = 0.0
        if not cut:
            s = 1 * np.std(d)  # FIXME: magic constant 1
        m = k.dataset.min() - s  # lower bound of violin
        M = k.dataset.max() + s  # upper bound of violin
        # Use a separate name so the positions in `x` are not overwritten
        support = np.linspace(m, M, 100)  # support for violin
        v = k.evaluate(support)  # violin profile (density curve)
        v = width * v / v.max()  # scaling the violin to the available space
        ax.fill_betweenx(support, -v + p, v + p)
    if bp:
        # Overlay notched box plots at the same positions
        ax.boxplot(list(ys), notch=True, positions=list(x))
fig, ax = plt.subplots()
# Transpose so each row of `ys` is one column of `data`, i.e. one violin
violinplot(ax, range(data.shape[1]), data.values.T)
But this is not even close to what seaborn does and it's already more code than I want to write.
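Newer matplotlib releases (1.4 and later, as far as I know) also ship a built-in `Axes.violinplot`, which shortens the pure-matplotlib route a lot. Here is a minimal sketch on synthetic data: the `samples` arrays are made up and seeded for reproducibility, and only mimic the shape of the data used here. The `Agg` backend is an assumption so it runs as a script without a display.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: headless script, not a notebook
import matplotlib.pyplot as plt

# Synthetic stand-ins for the two columns, seeded so reruns match
rng = np.random.RandomState(0)
samples = [rng.normal(loc=1, scale=10, size=100),
           rng.negative_binomial(20, 0.7, size=100)]

fig, ax = plt.subplots()

# One violin per array, with a line marking each median
parts = ax.violinplot(samples, positions=[0, 1], showmedians=True)

# Unlike seaborn, matplotlib needs the tick labels set by hand
ax.set_xticks([0, 1])
ax.set_xticklabels(['apple', 'banana'])

n_bodies = len(parts['bodies'])
```

The returned `parts` dictionary holds the drawn artists, so colors and line widths can be changed afterwards.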
In [19]:
sns.violinplot(data, inner='points')
Out[19]:
We have a lot of data points so it almost looks like a line. But you can see the outliers quite nicely.