In [1]:
import numpy as np
import matplotlib.pyplot as plt
In [2]:
%matplotlib notebook
Starting with 2MASS and SDSS and extending through LSST, we are firmly in an era where data and large statistical samples are cheap. With this explosion in data volume comes a problem: we do not know the underlying probability density function (PDF) of the random variables measured via our observations. Hence - density estimation: an attempt to recover the unknown PDF from observations. In some cases theory can guide us to a parametric form for the PDF, but more often than not such guidance is not available.
There is a common, simple, and very familiar tool for density estimation: histograms.
But there is also a problem:
HISTOGRAMS LIE!
We will "prove" this to be the case in a series of examples. For this exercise, we will load the famous Linnerud data set, which tested 20 middle aged men by measuring the number of chinups, situps, and jumps they could do in order to compare these numbers to their weight, pulse, and waist size. To load the data (just chinups for now) we will run the following:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
chinups = linnerud.data[:,0]
In [10]:
from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
chinups = linnerud.data[:,0]
Problem 1a
Plot the histogram for the number of chinups using the default settings in pyplot.
In [12]:
fig, ax = plt.subplots()
ax.hist(chinups, histtype = "step", lw = 3)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('N', fontsize=14)
fig.tight_layout()
Already with this simple plot we see a problem - the choice of bin centers and number of bins suggests that there is a 0% probability that middle-aged men can do 10 chinups. Intuitively this seems incorrect, so let's examine how the histogram changes if we change the number of bins or the bin centers.
Problem 1b
Using the same data make 2 new histograms: (i) one with 5 bins (bins = 5), and (ii) one with the bars centered on the left bin edges (align = "left").
Hint - if overplotting the results, you may find it helpful to use the histtype = "step" option.
In [5]:
fig, ax = plt.subplots()
ax.hist(chinups, bins = 5, histtype="step", lw = 3)
ax.hist(chinups, align = "left", histtype="step", lw = 3)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('N', fontsize=14)
fig.tight_layout()
These small changes significantly change the output PDF. With fewer bins we get something closer to a continuous distribution, while shifting the bin centers reduces the probability to zero at 9 chinups.
What if we instead allow the bin width to vary and require the same number of points in each bin? You can determine the bin edges for bins with 5 sources using the following command:
bins = np.append(np.sort(chinups)[::5], np.max(chinups))
Problem 1c
Plot a histogram with variable width bins, each with the same number of points.
Hint - setting density = True will normalize the bin heights so that the PDF integrates to 1.
In [13]:
bins = np.append(np.sort(chinups)[::5], np.max(chinups))
fig, ax = plt.subplots()
ax.hist(chinups, bins = bins, histtype = "step",
density = True, lw = 3)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('N', fontsize=14)
fig.tight_layout()
Ending the lie
Earlier I stated that histograms lie. One simple way to combat this lie: show all the data. Displaying the original data points allows viewers to somewhat intuit the effects of the particular bin choices that have been made (though this can also be cumbersome for very large data sets, which these days is essentially all data sets). The standard for showing individual observations relative to a histogram is a "rug plot," which shows a vertical tick (or other symbol) at the location of each source used to estimate the PDF.
Problem 1d Execute the cell below to see an example of a rug plot.
In [14]:
fig, ax = plt.subplots()
ax.hist(chinups, histtype = 'step')
# this is the code for the rug plot
ax.plot(chinups, np.zeros_like(chinups), '|', color='k', ms = 25, mew = 4)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('N', fontsize=14)
fig.tight_layout()
Of course, even rug plots are not a perfect solution. Many of the chinup measurements are repeated, and those instances cannot be easily isolated above. One (slightly) better solution is to vary the transparency of the rug "whiskers" by setting alpha = 0.3 in the rug plot call. But this too is far from perfect.
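For example, a minimal variation on the cell above (simply adding alpha = 0.3 to the same rug call; this is not one of the original notebook cells) would look like:
fig, ax = plt.subplots()
ax.hist(chinups, histtype='step')
# semi-transparent rug: repeated measurements appear as darker ticks
ax.plot(chinups, np.zeros_like(chinups), '|', color='k', ms=25, mew=4, alpha=0.3)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('N', fontsize=14)
fig.tight_layout()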
To recap: histograms are not ideal for density estimation because, as demonstrated above, the resulting PDF depends strongly on the number of bins and the placement of the bin edges, and the exact locations of the individual data points are lost.
Histograms are useful for generating a quick representation of univariate data, but for these reasons they should never be used for analysis. Most especially, functions should not be fit to histograms given how greatly the number of bins and bin centering affect the output histogram.
Okay - so if we are going to rail on histograms this much, there must be a better option. There is: Kernel Density Estimation (KDE), a nonparametric form of density estimation whereby a normalized kernel function is convolved with the discrete data to obtain a continuous estimate of the underlying PDF. As a rule, the kernel must integrate to 1 over the interval $-\infty$ to $\infty$ and be symmetric. There are many possible kernels (the gaussian kernel is highly popular, though the Epanechnikov kernel, an inverted parabola, minimizes the mean integrated square error).
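To make the convolution idea concrete, here is a minimal "by hand" sketch (not from the original notebook) that builds a gaussian KDE as an explicit sum of normalized kernels, one centered on each data point; the bandwidth of 1 is an arbitrary choice for illustration:
grid_demo = np.arange(0, 20, 0.01)
bw = 1.0   # arbitrary bandwidth, for illustration only
# one normalized gaussian per observation, evaluated on the grid via broadcasting
kernels = np.exp(-0.5 * ((grid_demo[:, np.newaxis] - chinups) / bw)**2) / (bw * np.sqrt(2 * np.pi))
# average the kernels so the resulting estimate integrates to 1
pdf_demo = kernels.sum(axis=1) / len(chinups)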
KDE is not completely free of the problems we illustrated for histograms above (in particular, both a kernel and the width of the kernel need to be selected), but it does manage to correct a number of the ills. We will now demonstrate this via a few examples using the scikit-learn implementation of KDE: KernelDensity, which is part of the sklearn.neighbors module.
Note - there are many implementations of KDE in Python, and Jake VanderPlas has put together an excellent description of the strengths and weaknesses of each. We will use the scikit-learn version as it is in many cases the fastest implementation.
To demonstrate the basic idea behind KDE, we will begin by representing each point in the dataset as a block (i.e. we will adopt the tophat kernel). Borrowing some code from Jake, we can estimate the KDE using the following code:
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, bandwidth = 1.0, **kwargs):
    kde_skl = KernelDensity(bandwidth = bandwidth, **kwargs)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(grid[:, np.newaxis]) # sklearn returns log(density)
    return np.exp(log_pdf)
The two main options to set are the bandwidth and the kernel.
In [16]:
# execute this cell
from sklearn.neighbors import KernelDensity

def kde_sklearn(data, grid, bandwidth = 1.0, **kwargs):
    kde_skl = KernelDensity(bandwidth = bandwidth, **kwargs)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(grid[:, np.newaxis]) # sklearn returns log(density)
    return np.exp(log_pdf)
Problem 1e
Plot the KDE of the PDF for the number of chinups middle-aged men can do using a bandwidth of 0.1 and a tophat kernel.
Hint - as a general rule, the grid spacing should be smaller than the bandwidth when plotting the PDF.
In [17]:
grid = np.arange(0 + 1e-4,20,0.01)
PDFtophat = kde_sklearn(chinups, grid,
bandwidth = 0.1, kernel = 'tophat')
fig, ax = plt.subplots()
ax.plot(grid, PDFtophat)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('PDF', fontsize=14)
fig.tight_layout()
In this representation, each "block" has a height of 0.25: each of the 20 points contributes an area of 1/20 = 0.05, spread over a tophat of total width 2 × 0.1 = 0.2. The bandwidth is too narrow to provide any overlap between the blocks. This choice of kernel and bandwidth produces an estimate that is essentially a histogram with a large number of bins. It gives no sense of continuity for the distribution. Now, we examine the difference (relative to histograms) upon changing the width (i.e., the bandwidth) of the blocks.
Problem 1f
Plot the KDE of the PDF for the number of chinups middle-aged men can do using bandwidths of 1 and 5 and a tophat kernel. How do the results differ from the histogram plots above?
In [18]:
PDFtophat1 = kde_sklearn(chinups, grid,
bandwidth = 1, kernel = 'tophat')
PDFtophat5 = kde_sklearn(chinups, grid,
bandwidth = 5, kernel = 'tophat')
fig, ax = plt.subplots()
ax.plot(grid, PDFtophat1, 'MediumAquaMarine', lw = 3, label = "bw = 1")
ax.plot(grid, PDFtophat5, 'Tomato', lw = 3, label = "bw = 5")
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('PDF', fontsize=14)
fig.tight_layout()
ax.legend()
Out[18]:
It turns out blocks are not an ideal representation for continuous data (see discussion on histograms above). Now we will explore the resulting PDF from other kernels.
Problem 1g Plot the KDE of the PDF for the number of chinups middle-aged men can do using gaussian and Epanechnikov kernels. How do the results differ from the histogram plots above?
Hint - you will need to select the bandwidth. The examples above should provide insight into the useful range for bandwidth selection. You may need to adjust the values to get an answer you "like."
In [19]:
PDFgaussian = kde_sklearn(chinups, grid,
bandwidth = 1, kernel = 'gaussian')
PDFepanechnikov = kde_sklearn(chinups, grid,
bandwidth = 2, kernel = 'epanechnikov')
fig, ax = plt.subplots()
ax.plot(grid, PDFgaussian, 'DarkOrange',
lw = 3, label = "gaussian")
ax.plot(grid, PDFepanechnikov, 'SlateGrey',
lw = 3, label = "epanechnikov")
ax.legend(loc = 2)
ax.set_xlabel('chinups', fontsize=14)
ax.set_ylabel('PDF', fontsize=14)
fig.tight_layout()
So, what is the optimal choice of bandwidth and kernel? Unfortunately, there is no hard and fast rule, as every problem will likely require a different optimization. Typically, the choice of bandwidth is far more important than the choice of kernel. If the PDF is likely to be gaussian (or close to gaussian), then Silverman's rule of thumb can be used:
$$h = 1.059 \sigma n^{-1/5}$$ where $h$ is the bandwidth, $\sigma$ is the standard deviation of the samples, and $n$ is the total number of samples. Note - in situations with bimodal or more complicated distributions, this rule of thumb can lead to woefully inaccurate PDF estimates. The most general way to choose the bandwidth is via cross-validation (we will cover cross-validation later today).
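As a quick illustration (this calculation is not in the original notebook), Silverman's rule applied to the chinups sample would be computed as:
# Silverman's rule of thumb: h = 1.059 * sigma * n**(-1/5)
sigma = np.std(chinups, ddof=1)
n = len(chinups)
h_silverman = 1.059 * sigma * n**(-1/5)
print(h_silverman)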
What about multidimensional PDFs? It is possible using many of the Python implementations of KDE to estimate multidimensional PDFs, though it is very important to beware of the curse of dimensionality in these circumstances.
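As a hedged sketch of what this looks like in practice (not an original notebook cell; the choice of the chinups and situps columns and bandwidth = 5 are illustrative assumptions), a 2D estimate with the same KernelDensity class might be:
# 2D KDE over the chinups and situps columns of the Linnerud data
data_2d = linnerud.data[:, :2]
kde_2d = KernelDensity(bandwidth=5, kernel='gaussian')  # bandwidth chosen arbitrarily
kde_2d.fit(data_2d)
# evaluate the density on a coarse 2D grid
xgrid, ygrid = np.meshgrid(np.linspace(0, 20, 50), np.linspace(0, 300, 50))
log_dens = kde_2d.score_samples(np.column_stack([xgrid.ravel(), ygrid.ravel()]))
dens_2d = np.exp(log_dens).reshape(xgrid.shape)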
Now a more open-ended topic: data exploration. In brief, data exploration encompasses a large suite of tools (including those discussed above) to examine data that live in high-dimensional spaces. There is no single best method or optimal direction for data exploration. Instead, today we will introduce some of the tools available via Python.
As an example we will start with a basic line plot - and examine tools beyond matplotlib.
In [16]:
x = np.arange(0, 6*np.pi, 0.1)
y = np.cos(x)
fig, ax=plt.subplots()
ax.plot(x,y, lw = 2)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(0, 6*np.pi)
fig.tight_layout()
Seaborn is a plotting package that enables many useful features for exploration. In fact, a lot of the functionality that we developed above can readily be handled with seaborn.
To begin, we will make the same plot that we created in matplotlib.
In [17]:
import seaborn as sns
fig, ax = plt.subplots()
ax.plot(x,y, lw = 2)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(0, 6*np.pi)
fig.tight_layout()
These plots look identical, but it is possible to change the style with seaborn.
seaborn has 5 style presets: darkgrid, whitegrid, dark, white, and ticks. You can change the preset using the following:
sns.set_style("whitegrid")
which will change the output for all subsequent plots. Note - if you want to change the style for only a single plot, that can be accomplished with the following:
with sns.axes_style("dark"):
with all plotting commands inside the with statement.
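For instance (a minimal sketch reusing the x and y arrays from above; this is not one of the original cells):
with sns.axes_style("dark"):
    # only figures created inside this block use the "dark" style
    fig, ax = plt.subplots()
    ax.plot(x, y, lw = 2)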
Problem 3a
Re-plot the sine curve using each seaborn preset to see which you like best - then adopt this for the remainder of the notebook.
In [19]:
sns.set_style("darkgrid")
fig, ax = plt.subplots()
ax.plot(x,y, lw = 2)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(0, 6*np.pi)
fig.tight_layout()
The folks behind seaborn have thought a lot about color palettes, which is a good thing. Remember - the choice of color for plots is one of the most essential aspects of visualization. A poor choice of colors can easily mask interesting patterns or suggest structure that is not real. To learn more about what is available, see the seaborn color tutorial.
Here we load the default:
In [20]:
# default color palette
current_palette = sns.color_palette()
sns.palplot(current_palette)
which we will now change to colorblind, a palette that is clearer for viewers who are colorblind.
In [21]:
# set palette to colorblind
sns.set_palette("colorblind")
current_palette = sns.color_palette()
sns.palplot(current_palette)
Now that we have covered the basics of seaborn (and the above examples truly only scratch the surface of what is possible), we will explore the power of seaborn for higher-dimensional data sets. We will load the famous Iris data set, which measures 4 different features of 3 different types of Iris flowers. There are 150 different flowers in the data set.
Note - for those familiar with pandas, seaborn is designed to integrate easily and directly with pandas DataFrame objects. In the example below the Iris data are loaded into a DataFrame. IPython notebooks also display the DataFrame data in a nice readable format.
In [22]:
iris = sns.load_dataset("iris")
iris
Out[22]:
Now that we have a sense of the data structure, it is useful to examine the distribution of features. Above, we went to great pains to produce histograms, KDEs, and rug plots. seaborn handles all of that effortlessly with the distplot function.
Problem 3b
Plot the distribution of petal lengths for the Iris data set.
In [24]:
# note - hist, kde, and rug all set to True, set to False to turn them off
with sns.axes_style("dark"):
sns.distplot(iris['petal_length'], bins=20,
hist=True, kde=True, rug=True)
plt.tight_layout()
Of course, this data set lives in a 4D space, so plotting more than univariate distributions is important (and as we will see tomorrow this is particularly useful for visualizing classification results). Fortunately, seaborn makes it very easy to produce handy summary plots.
At this point, we are familiar with basic scatter plots in matplotlib.
Problem 3c
Make a matplotlib scatter plot showing the Iris petal length against the Iris petal width.
In [26]:
fig, ax = plt.subplots()
ax.scatter(iris['petal_length'], iris['petal_width'])
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
fig.tight_layout()
Of course, when there are many, many data points, scatter plots become difficult to interpret, as in the example below:
In [27]:
xexample = np.random.normal(loc = 0.2, scale = 1.1, size = 10000)
yexample = np.random.normal(loc = -0.1, scale = 0.9, size = 10000)
fig, ax = plt.subplots()
ax.scatter(xexample, yexample)
ax.set_xlabel('X', fontsize=14)
ax.set_ylabel('Y', fontsize=14)
fig.tight_layout()
Here, we see that there are many points, clustered about the origin, but we have no sense of the underlying density of the distribution. 2D histograms, such as plt.hist2d(), can alleviate this problem. I prefer to use plt.hexbin(), which is a little easier on the eyes (though note - these 2D histograms are subject to the same issues discussed above).
In [31]:
# hexbin w/ bins = "log" returns the log of counts/bin
# mincnt = 1 displays only hexpix with at least 1 source present
fig, ax = plt.subplots()
cax = ax.hexbin(xexample, yexample, bins = "log", cmap = "viridis", mincnt = 1)
ax.set_xlabel('X', fontsize=14)
ax.set_ylabel('Y', fontsize=14)
fig.tight_layout()
plt.colorbar(cax)
Out[31]:
While the above plot is a significant improvement over the scatter plot, giving a better sense of the density near the center of the distribution, the bin-edge effects are clearly present. An even better solution, like before, is a density estimate, which is easily built into seaborn via the kdeplot function.
In [32]:
fig, ax = plt.subplots()
sns.kdeplot(xexample, yexample, shade=False)
ax.set_xlabel('X', fontsize=14)
ax.set_ylabel('Y', fontsize=14)
fig.tight_layout()
This plot is much more appealing (and informative) than the previous two. For the first time we can clearly see that the distribution is not actually centered on the origin. Now we will move back to the Iris data set.
Suppose we want to see univariate distributions in addition to the scatter plot? This is certainly possible with matplotlib, and you can find examples on the web; however, with seaborn this is really easy.
In [34]:
sns.jointplot(x=iris['petal_length'], y=iris['petal_width'])
plt.tight_layout()
But! Histograms and scatter plots can be problematic as we have discussed many times before.
Problem 3d
Re-create the plot above but set kind='kde'
to produce density estimates of the distributions.
In [35]:
sns.jointplot(x=iris['petal_length'], y=iris['petal_width'], kind = 'kde', shade = False)
plt.tight_layout()
That is much nicer than what was presented above. However - we still have a problem in that our data live in 4D, but we are (mostly) limited to 2D projections of that data. One way around this is via the seaborn version of a pairplot, which plots the distribution of every variable in the data set against each other. (Here is where the integration with pandas DataFrames becomes so powerful.)
In [36]:
sns.pairplot(iris[["sepal_length", "sepal_width",
"petal_length", "petal_width"]])
plt.tight_layout()
For data sets where we have classification labels, we can even color the various points using the hue option, and produce KDEs along the diagonal with diag_kind = 'kde'.
In [40]:
sns.pairplot(iris, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"],
hue = "species", diag_kind = 'kde')
plt.tight_layout()
Even better - there is an option to create a PairGrid, which allows fine-tuned control of what is displayed above, below, and along the diagonal. In this way it becomes possible to avoid having symmetric redundancy, which is not all that informative. In the example below, we will show scatter plots and contour plots simultaneously.
In [39]:
g = sns.PairGrid(iris, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"],
hue = "species", diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(plt.scatter, edgecolor='white')
g.map_diag(sns.kdeplot, lw=3)
plt.tight_layout()