Visualizing distributions of data

This notebook demonstrates different approaches to graphically representing distributions of data, specifically focusing on the tools provided by the seaborn package.


In [1]:
%matplotlib inline

In [2]:
import numpy as np
from numpy.random import randn
import pandas as pd
from scipy import stats
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8, 4)})
np.random.seed(9221999)

Basic visualization with histograms

The most basic and common way of representing a distribution is with a histogram. We can do this directly through the hist function that is part of matplotlib.


In [4]:
data = randn(75)
plt.hist(data);


By default, hist separates the data into 10 bins of equal width and plots the number of observations in each bin. Thus, the main parameter is the number of bins, which we can change.

The more bins you have, the more sensitive you will be to high-frequency patterns in the distribution. But sometimes those high-frequency patterns will be noise. Often you want to try different values until you think you have best captured what you see in the data.
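Since the best bin count depends on the data, it can help to draw a few candidates side by side. A minimal sketch (the bin counts here are arbitrary):

f, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, n_bins in zip(axes, [5, 10, 30]):
    # Same data, increasingly fine binning
    ax.hist(data, n_bins)
    ax.set_title("%d bins" % n_bins)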


In [5]:
plt.hist(data, 6, color=sns.desaturate("indianred", .75));


The normed argument can also be useful if you want to compare two distributions that do not have the same number of observations. Note also that bins can be a sequence giving the bin edges.


In [6]:
data1 = stats.poisson(2).rvs(100)
data2 = stats.poisson(5).rvs(120)
max_data = np.r_[data1, data2].max()
bins = np.linspace(0, max_data, max_data + 1)
plt.hist(data1, bins, normed=True, color="#6495ED", alpha=.5)
plt.hist(data2, bins, normed=True, color="#F08080", alpha=.5);


The hist function has quite a few other options, which you can explore in its docstring. Here we'll just highlight one more that can be useful when plotting many observations (such as following a resampling procedure).


In [7]:
x = stats.gamma(3).rvs(5000)
plt.hist(x, 70, histtype="stepfilled", alpha=.7);


You can also represent a joint distribution with the histogram method, using a hexbin plot. This is similar to a histogram, except that instead of encoding the number of observations in each bin as a position along one of the axes, it uses a color mapping to give the plot three quantitative dimensions.
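Matplotlib exposes this kind of plot directly through plt.hexbin; a minimal sketch, generating a second variable just for this example:

# A plain matplotlib hexbin of two (illustrative) gamma samples
y_demo = stats.gamma(5).rvs(5000)
plt.hexbin(x, y_demo, gridsize=30, cmap="Purples")
cb = plt.colorbar()
cb.set_label("count in bin")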

In seaborn, you can draw a hexbin plot using the jointplot function and setting kind to "hex". This will also plot the marginal distribution of each variable on the sides of the plot using a histogram:


In [8]:
y = stats.gamma(5).rvs(5000)
with sns.axes_style("white"):
    sns.jointplot(x, y, kind="hex");


Estimating the density of the observations: kdeplot and rugplot

A superior, if more computationally intensive, approach to estimating a distribution is known as a kernel density estimate, or KDE. To motivate the KDE, let's first think about rug plots. A rug plot is a very simple, but also perfectly legitimate, way of representing a distribution. To create one, simply draw a vertical line at each observed data point. Here, the height is totally arbitrary.


In [9]:
sns.set_palette("hls", 1)
data = randn(30)
sns.rugplot(data)
plt.ylim(0, 1);


You can see where the density of the distribution is by how dense the tick marks are. Before talking about kernel density plots, let's connect the rug plot to the histogram. The connection here is very direct: a histogram just creates bins along the range of the data and then draws a bar with height equal to the number of ticks in each bin.
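You can verify this correspondence directly with np.histogram, which returns the per-bin counts that hist draws:

# Number of tick marks falling into each of 10 equal-width bins
counts, bin_edges = np.histogram(data, bins=10)
print(counts)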


In [10]:
plt.hist(data, alpha=.3)
sns.rugplot(data);


A kernel density plot is also a transformation from the tick marks to a height-encoded measure of density. However, the transformation is a bit more complicated. Instead of binning each tick mark, we represent each tick with a gaussian basis function.


In [11]:
# Draw the rug and set up the x-axis space
sns.rugplot(data);
xx = np.linspace(-4, 4, 100)

# Compute the bandwidth of the kernel using a rule-of-thumb
bandwidth = ((4 * data.std() ** 5) / (3 * len(data))) ** .2

# We'll save the basis functions for the next step
kernels = []

# Plot each basis function
for d in data:
    
    # Make the basis function as a gaussian PDF
    kernel = stats.norm(d, bandwidth).pdf(xx)
    kernels.append(kernel)
    
    # Scale a copy for plotting; an in-place operation here
    # would also rescale the kernel we just saved
    scaled = .4 * kernel / kernel.max()
    plt.plot(xx, scaled, "#888888", alpha=.5)
plt.ylim(0, 1);


We then estimate the distribution that our samples came from by summing these basis functions (and normalizing so that, as a proper density, the function integrates to 1).
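As a minimal sketch of that normalization, using the kernels saved above and np.trapz for the numerical integral:

summed = np.sum(kernels, axis=0)
density = summed / np.trapz(summed, xx)  # now integrates to 1 over xx
plt.plot(xx, density);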

There is also a function in the scipy.stats module that will perform a kernel density estimate (it actually returns an object that can be called on some values to return the density). We see that plotting the values from this object gives us basically the same result as summing the gaussian basis functions.


In [12]:
# Set up the plots
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
c1, c2 = sns.color_palette("husl", 3)[:2]

# Plot the summed basis functions
summed_kde = np.sum(kernels, axis=0)
ax1.plot(xx, summed_kde, c=c1)
sns.rugplot(data, c=c1, ax=ax1)
ax1.set_yticks([])
ax1.set_title("summed basis functions")

# Use scipy to get the density estimate
scipy_kde = stats.gaussian_kde(data)(xx)
ax2.plot(xx, scipy_kde, c=c2)
sns.rugplot(data, c=c2, ax=ax2)
ax2.set_yticks([])
ax2.set_title("scipy gaussian_kde")
f.tight_layout()


The seaborn package has a high-level function for plotting a kernel density estimate in one quick step, along with some additional nice features, such as shading in the density.


In [13]:
sns.kdeplot(data, shade=True);


Much like how the bins parameter controls the fit of the histogram to the data, you can adjust the bandwidth (bw) of the kernel to make the density estimate more or less sensitive to high-frequency structure.


In [14]:
pal = sns.blend_palette([sns.desaturate("royalblue", 0), "royalblue"], 5)
bws = [.1, .25, .5, 1, 2]

for bw, c in zip(bws, pal):
    sns.kdeplot(data, bw=bw, color=c, lw=1.8, label=bw)

plt.legend(title="kernel bandwidth")
sns.rugplot(data, color="#333333");


Although the gaussian kernel is the most common, and probably the most useful, there are a variety of kernels you can use to fit the density estimate. The kernel you choose generally has less influence on the resulting estimate than the bandwidth size, but there may be situations where you want to experiment with different choices.


In [15]:
kernels = ["biw", "cos", "epa", "gau", "tri", "triw"]
pal = sns.color_palette("hls", len(kernels))
for k, c in zip(kernels, pal):
    sns.kdeplot(data, kernel=k, color=c, label=k)
plt.legend();


The cut and clip parameters allow you to control how far outside the range of the data the estimate extends: cut influences only the range of the support, while clip affects the fitting of the KDE as well.


In [16]:
with sns.color_palette("Set2"):
    f, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
    for cut in [4, 3, 2]:
        sns.kdeplot(data, cut=cut, label=cut, lw=cut * 1.5, ax=ax1)

    for clip in [1, 2, 3]:
        sns.kdeplot(data, clip=(-clip, clip), label=clip, ax=ax2);


As in the case of the histogram, plotting shaded density plots on top of each other can be a good way to ask whether two samples are from the same distribution. The comparison also amounts to a simple classification rule: for a given x value with an unknown label, you should conclude it was drawn from the distribution with the greater density at that value.
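As a sketch of that classification rule, you could fit a KDE to each sample and evaluate both at the new point (the samples and x_new here are illustrative, not the ones drawn in the cells below):

# Two labeled samples and one unlabeled observation
d1 = stats.norm(0, 1).rvs(100)
d2 = stats.norm(2, 1).rvs(100)
x_new = 1.5

# Conclude the label whose estimated density is greater at x_new
kde1, kde2 = stats.gaussian_kde(d1), stats.gaussian_kde(d2)
print("sample 1" if kde1(x_new)[0] > kde2(x_new)[0] else "sample 2")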

A legend can be helpful when overlaying several densities. You get one either by providing a label keyword to kdeplot, which explicitly assigns a label to the data, or by passing an object with a name attribute (like a pandas Series), in which case the name is inferred.


In [17]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
c1, c2, c3 = sns.color_palette("Set1", 3)

dist1, dist2, dist3 = stats.norm(0, 1).rvs((3, 100))
dist3 = pd.Series(dist3 + 2, name="dist3")

sns.kdeplot(dist1, shade=True, color=c1, ax=ax1)
sns.kdeplot(dist2, shade=True, color=c2, label="dist2", ax=ax1)

sns.kdeplot(dist1, shade=True, color=c2, ax=ax2)
sns.kdeplot(dist3, shade=True, color=c3, ax=ax2);


If you want to plot the density along the y axis, use the vertical keyword.


In [18]:
plt.figure(figsize=(4, 7))
data = stats.norm(0, 1).rvs((3, 100)) + np.arange(3)[:, None]

with sns.color_palette("Set2"):
    for d, label in zip(data, list("ABC")):
        sns.kdeplot(d, vertical=True, shade=True, label=label)


You can also use kdeplot to estimate the cumulative distribution function (CDF) of the population from your data. This plot will tell you what proportion of the distribution falls at smaller values than a given point on the x axis:


In [19]:
with sns.color_palette("Set1"):
    for d, label in zip(data, list("ABC")):
        sns.kdeplot(d, cumulative=True, label=label)
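A quick empirical sanity check of these curves: the proportion of the sample below a threshold should roughly match the curve's height at that point (the threshold 0 is arbitrary):

# Proportion of observations in the first sample that fall below 0;
# compare with the cumulative curve for "A" at x = 0
print((data[0] < 0).mean())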


Multivariate density estimation with kdeplot

You can also use the kernel density method with multidimensional data. For visualization, we are mostly concerned with plotting joint bivariate distributions. As with the hexbin plot, we will color-encode the density estimate over a 2D space. The kdeplot function tries to infer whether it should draw a univariate or bivariate plot based on the type and shape of the data argument. If you pass a 2D array or a DataFrame, it is assumed to be shaped (n_units, n_variables).


In [20]:
data = np.random.multivariate_normal([0, 0], [[1, 2], [2, 20]], size=1000)
data = pd.DataFrame(data, columns=["X", "Y"])
mpl.rc("figure", figsize=(6, 6))

In [21]:
sns.kdeplot(data);


You can also pass in two vectors as the first positional arguments, which will also draw a bivariate density.


In [22]:
sns.kdeplot(data.X, data.Y, shade=True);


As in the univariate case, you can also set the kernel bandwidth and choose different points at which to clip the data or cut the estimate. However, only the gaussian kernel is available for the multidimensional KDE.

When specifying the colormap, the multivariate kdeplot accepts a special token: colormap names ending in _d are drawn in a way that maintains the overall color palette but uses darker colors at one extreme, so the contour lines remain fully visible.


In [23]:
sns.kdeplot(data, bw="silverman", gridsize=50, cut=2, clip=(-11, 11), cmap="BuGn_d");



In [24]:
sns.kdeplot(data, bw=1, clip=[(-4, 4), (-11, 11)], cmap="PuRd_d");



In [25]:
sns.kdeplot(data.values, shade=True, bw=(.5, 1), cmap="Purples");


Bivariate and univariate plots using jointplot

The jointplot function allows you to simultaneously visualize the joint distribution of two variables and the marginal distribution of each. Above, we showed how to use it to draw a hexbin plot. You can also use it to draw a kernel density estimate:


In [26]:
with sns.axes_style("white"):
    sns.jointplot("X", "Y", data, kind="kde");


Combining plot styles: distplot

Each of these styles has advantages and disadvantages. Fortunately, it is easy to combine multiple styles using the distplot function in seaborn. distplot provides one interface for plotting histograms, kernel density plots, rug plots, and plotting fitted probability distributions.

By default, you'll get a kernel density over a histogram. Unlike the default matplotlib hist function, distplot tries to choose a reasonable number of bins for the dataset you have (one such rule of thumb is sketched below the next cell), although all of the options for specifying bins in hist can be used.


In [27]:
sns.set_palette("hls")
mpl.rc("figure", figsize=(8, 4))
data = randn(200)
sns.distplot(data);
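If you're curious what a "good number" of bins might look like, here is a Freedman-Diaconis-style computation for this dataset; this is an illustrative rule of thumb, not necessarily the exact rule seaborn applies:

# Freedman-Diaconis-style rule: bin width ~ 2 * IQR / n^(1/3)
iqr = np.subtract(*np.percentile(data, [75, 25]))
width = 2 * iqr / len(data) ** (1. / 3)
print(int(np.ceil((data.max() - data.min()) / width)))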


hist, kde, and rug are boolean arguments to turn those features on and off.


In [28]:
sns.distplot(data, rug=True, hist=False);


You can also pass a distribution family from scipy.stats, and distplot will fit the parameters using maximum likelihood and plot the resulting function.


In [29]:
sns.distplot(data, kde=False, fit=stats.norm);
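The same maximum likelihood fit is available directly from scipy, if you want to inspect the estimated parameters:

# MLE estimates of the normal location and scale for this sample
mu, sigma = stats.norm.fit(data)
print(mu, sigma)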


To control any of the underlying plots, pass a dictionary of keyword arguments to the corresponding [plot]_kws argument (e.g. hist_kws or kde_kws).


In [30]:
sns.distplot(data,
             kde_kws={"color": "seagreen", "lw": 3, "label": "KDE"},
             hist_kws={"histtype": "stepfilled", "color": "slategray"});


You can also draw the distribution vertically, if, for example, you want to plot marginal distributions on a scatterplot (as in the jointplot function):


In [31]:
plt.figure(figsize=(4, 7))
sns.distplot(data, color="dodgerblue", vertical=True);


If the data has a name attribute (e.g. it is a pandas Series), the name will become the label for the dimension on which the distribution is plotted, unless you use axlabel=False. You can also provide a string as axlabel, which will override the inferred name and let you label nameless data.


In [32]:
sns.distplot(pd.Series(data, name="score"), color="mediumpurple");