In [ ]:
from __future__ import division, print_function, absolute_import

Introduction to Visualization:

Density Estimation and Data Exploration

Version 0.1

There are many flavors of data analysis that fall under the "visualization" umbrella in astronomy. Today, by way of example, we will focus on 2 basic problems.


By AA Miller 16 September 2017


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Problem 1) Density Estimation

Starting with 2MASS and SDSS and extending through LSST, we are firmly in an era where data and large statistical samples are cheap. With this explosion in data volume comes a problem: we do not know the underlying probability density function (PDF) of the random variables measured via our observations. Hence - density estimation: an attempt to recover the unknown PDF from observations. In some cases theory can guide us to a parametric form for the PDF, but more often than not such guidance is not available.

There is a common, simple, and very familiar tool for density estimation: histograms.

But there is also a problem:

HISTOGRAMS LIE!

We will "prove" this to be the case in a series of examples. For this exercise, we will load the famous Linnerud data set, which tested 20 middle aged men by measuring the number of chinups, situps, and jumps they could do in order to compare these numbers to their weight, pulse, and waist size. To load the data (just chinups for now) we will run the following:

from sklearn.datasets import load_linnerud
linnerud = load_linnerud()
chinups = linnerud.data[:,0]

In [ ]:
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()
chinups = linnerud.data[:,0]

Problem 1a

Plot the histogram for the number of chinups using the default settings in pyplot.


In [ ]:
plt.hist( # complete
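
A possible completion of the cell above (a minimal sketch using the pyplot defaults, as the problem specifies):

In [ ]:
plt.hist(chinups)
plt.xlabel('chinups')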

Already with this simple plot we see a problem - the choice of bin centers and number of bins suggests that there is a 0% probability that middle-aged men can do 10 chinups. Intuitively this seems incorrect, so let's examine how the histogram changes if we change the number of bins or the bin centers.

Problem 1b

Using the same data, make 2 new histograms: (i) one with 5 bins (bins = 5), and (ii) one with the bars centered on the left bin edges (align = "left").

Hint - if overplotting the results, you may find it helpful to use the histtype = "step" option


In [ ]:
plt.hist( # complete
# complete
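
One possible completion (overplotting both histograms with histtype = "step", per the hint):

In [ ]:
plt.hist(chinups, bins=5, histtype="step")
plt.hist(chinups, align="left", histtype="step")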

These small changes significantly change the output PDF. With fewer bins we get something closer to a continuous distribution, while shifting the bin centers reduces the probability to zero at 9 chinups.

What if we instead allow the bin width to vary and require the same number of points in each bin? You can determine the bin edges for bins with 5 sources using the following command:

bins = np.append(np.sort(chinups)[::5], np.max(chinups))

Problem 1c

Plot a histogram with variable width bins, each with the same number of points.

Hint - setting density = True (normed = True in older versions of matplotlib) will normalize the bin heights so that the PDF integrates to 1.


In [ ]:
# complete
plt.hist(# complete
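
A possible completion (the density keyword normalizes the histogram; in older matplotlib releases the equivalent keyword is normed):

In [ ]:
bins = np.append(np.sort(chinups)[::5], np.max(chinups))
plt.hist(chinups, bins=bins, histtype="step", density=True)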

Ending the lie

Earlier I stated that histograms lie. One simple way to combat this lie: show all the data. Displaying the original data points allows viewers to somewhat intuit the effects of the particular bin choices that have been made (though this can also be cumbersome for very large data sets, which these days is essentially all data sets). The standard for showing individual observations relative to a histogram is a "rug plot," which shows a vertical tick (or other symbol) at the location of each source used to estimate the PDF.

Problem 1d Execute the cell below to see an example of a rug plot.


In [ ]:
plt.hist(chinups, histtype = 'step')

# this is the code for the rug plot
plt.plot(chinups, np.zeros_like(chinups), '|', color='k', ms = 25, mew = 4)

Of course, even rug plots are not a perfect solution. Many of the chinup measurements are repeated, and those instances cannot be easily isolated above. One (slightly) better solution is to vary the transparency of the rug "whiskers" using alpha = 0.3 in the plt.plot() call that draws the whiskers. But this too is far from perfect.
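
As a quick illustration of that suggestion, the rug plot cell could be modified as follows (only the alpha keyword changes):

In [ ]:
plt.hist(chinups, histtype='step')

# semi-transparent whiskers make repeated measurements appear darker
plt.plot(chinups, np.zeros_like(chinups), '|', color='k', ms=25, mew=4, alpha=0.3)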

To recap, histograms are not ideal for density estimation for the following reasons:

  • They introduce discontinuities that are not present in the data
  • They are strongly sensitive to user choices ($N_\mathrm{bins}$, bin centering, bin grouping), without any mathematical guidance to what these choices should be
  • They are difficult to visualize in higher dimensions

Histograms are useful for generating a quick representation of univariate data, but for the reasons listed above they should never be used for analysis. Most especially, functions should not be fit to histograms given how greatly the number of bins and the bin centering affect the output histogram.

Okay - so if we are going to rail on histograms this much, there must be a better option. There is: Kernel Density Estimation (KDE), a nonparametric form of density estimation whereby a normalized kernel function is convolved with the discrete data to obtain a continuous estimate of the underlying PDF. As a rule, the kernel must integrate to 1 over the interval $-\infty$ to $\infty$ and be symmetric. There are many possible kernels (gaussian is highly popular, though Epanechnikov, an inverted parabola, produces the minimal mean square error).

KDE is not completely free of the problems we illustrated for histograms above (in particular, both a kernel and the width of the kernel need to be selected), but it does manage to correct a number of the ills. We will now demonstrate this via a few examples using the scikit-learn implementation of KDE: KernelDensity, which is part of the sklearn.neighbors module.

Note There are many implementations of KDE in Python, and Jake VanderPlas has put together an excellent description of the strengths and weaknesses of each. We will use the scikit-learn version as it is in many cases the fastest implementation.

To demonstrate the basic idea behind KDE, we will begin by representing each point in the dataset as a block (i.e. we will adopt the tophat kernel). Borrowing some code from Jake, we can estimate the KDE using the following code:

from sklearn.neighbors import KernelDensity
def kde_sklearn(data, grid, bandwidth = 1.0, **kwargs):
    kde_skl = KernelDensity(bandwidth = bandwidth, **kwargs)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(grid[:, np.newaxis]) # sklearn returns log(density)

    return np.exp(log_pdf)

The two main options to set are the bandwidth and the kernel.


In [ ]:
# execute this cell
from sklearn.neighbors import KernelDensity
def kde_sklearn(data, grid, bandwidth = 1.0, **kwargs):
    kde_skl = KernelDensity(bandwidth = bandwidth, **kwargs)
    kde_skl.fit(data[:, np.newaxis])
    log_pdf = kde_skl.score_samples(grid[:, np.newaxis]) # sklearn returns log(density)

    return np.exp(log_pdf)

Problem 1e

Plot the KDE of the PDF for the number of chinups middle-aged men can do using a bandwidth of 0.1 and a tophat kernel.

Hint - as a general rule, the grid spacing should be smaller than the bandwidth when plotting the PDF.


In [ ]:
grid = # complete
PDFtophat = kde_sklearn( # complete
plt.plot( # complete
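
One possible completion (the grid limits are a choice; anything spanning the chinup values with spacing finer than the bandwidth works):

In [ ]:
grid = np.arange(0, 20 + 0.01, 0.01)
PDFtophat = kde_sklearn(chinups, grid, bandwidth=0.1, kernel='tophat')
plt.plot(grid, PDFtophat)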

In this representation, each "block" has a height of 0.25. The bandwidth is too narrow to provide any overlap between the blocks. This choice of kernel and bandwidth produces an estimate that is essentially a histogram with a large number of bins; it gives no sense of continuity for the distribution. Now, we examine the difference (relative to histograms) upon changing the width (i.e., the bandwidth) of the blocks.

Problem 1f

Plot the KDE of the PDF for the number of chinups middle-aged men can do using bandwidths of 1 and 5 and a tophat kernel. How do the results differ from the histogram plots above?


In [ ]:
PDFtophat1 = # complete

# complete
# complete
# complete
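
A possible completion (reusing the grid defined above and labeling each bandwidth):

In [ ]:
PDFtophat1 = kde_sklearn(chinups, grid, bandwidth=1, kernel='tophat')
PDFtophat5 = kde_sklearn(chinups, grid, bandwidth=5, kernel='tophat')

plt.plot(grid, PDFtophat1, label='bw = 1')
plt.plot(grid, PDFtophat5, label='bw = 5')
plt.legend()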

It turns out blocks are not an ideal representation for continuous data (see discussion on histograms above). Now we will explore the resulting PDF from other kernels.

Problem 1g Plot the KDE of the PDF for the number of chinups middle-aged men can do using a gaussian and an Epanechnikov kernel. How do the results differ from the histogram plots above?

Hint - you will need to select the bandwidth. The examples above should provide insight into the useful range for bandwidth selection. You may need to adjust the values to get an answer you "like."


In [ ]:
PDFgaussian = # complete
PDFepanechnikov = # complete
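
One possible completion (the bandwidths here are simply values that "look" reasonable; yours may differ):

In [ ]:
PDFgaussian = kde_sklearn(chinups, grid, bandwidth=1, kernel='gaussian')
PDFepanechnikov = kde_sklearn(chinups, grid, bandwidth=2, kernel='epanechnikov')

plt.plot(grid, PDFgaussian, label='gaussian')
plt.plot(grid, PDFepanechnikov, label='epanechnikov')
plt.legend(loc=2)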

So, what is the optimal choice of bandwidth and kernel? Unfortunately, there is no hard and fast rule, as every problem will likely have a different optimization. Typically, the choice of bandwidth is far more important than the choice of kernel. In the case where the PDF is likely to be gaussian (or close to gaussian), then Silverman's rule of thumb can be used:

$$h = 1.059 \sigma n^{-1/5}$$

where $h$ is the bandwidth, $\sigma$ is the standard deviation of the samples, and $n$ is the total number of samples. Note - in situations with bimodal or more complicated distributions, this rule of thumb can lead to woefully inaccurate PDF estimates. The most general way to estimate the choice of bandwidth is via cross validation (we will cover cross-validation later today).
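
As a quick illustration, Silverman's rule applied to the chinups sample looks like this (illustrative only - the chinups distribution is not obviously gaussian):

In [ ]:
# Silverman's rule of thumb: h = 1.059 * sigma * n**(-1/5)
h_silverman = 1.059 * np.std(chinups) * len(chinups)**(-1/5)
print(h_silverman)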

What about multidimensional PDFs? It is possible using many of the Python implementations of KDE to estimate multidimensional PDFs, though it is very very important to beware the curse of dimensionality in these circumstances.

Problem 2) Data Exploration

Now a more open ended topic: data exploration. In brief, data exploration encompasses a large suite of tools (including those discussed above) for examining data that live in high-dimensional spaces. There is no single best method or optimal direction for data exploration. Instead, today we will introduce some of the tools available via Python.

As an example we will start with a basic line plot - and examine tools beyond matplotlib.


In [ ]:
x = np.arange(0, 6*np.pi, 0.1)
y = np.cos(x)

plt.plot(x,y, lw = 2)
plt.xlabel('X')
plt.ylabel('Y')
plt.xlim(0, 6*np.pi)

Seaborn

Seaborn is a plotting package that enables many useful features for exploration. In fact, a lot of the functionality that we developed above can readily be handled with seaborn.

Aside - aesthetics vary greatly from person to person. The mere act of importing seaborn drastically changes the appearance of all plots made thereafter. Some really like these changes, others do not. I've heard it said that seaborn is no good as it forces an import MakeUgly under the hood. The choice is up to you.

To begin, we will make the same plot that we created in matplotlib.


In [ ]:
import seaborn as sns

fig = plt.figure()
ax = fig.add_subplot(111)

ax.plot(x,y, lw = 2)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_xlim(0, 6*np.pi)

We can see that the plot has significantly changed styles.

seaborn has 5 style presets: darkgrid, whitegrid, dark, white, and ticks. You can change the preset using the following:

sns.set_style("whitegrid")

which will change the output for all subsequent plots. Note - if you want to change the style for only a single plot, that can be accomplished with the following:

with sns.axes_style("dark"):

with all plotting commands inside the with statement.
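
For example (a minimal sketch of the with pattern; any of the five presets can be substituted):

In [ ]:
with sns.axes_style("dark"):
    plt.plot(x, y, lw=2)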

Problem 2a

Re-plot the sine curve using each seaborn preset to see which you like best - then adopt this for the remainder of the notebook.


In [ ]:
sns.set_style( # complete
# complete
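
A possible completion (the preset shown here is just one choice - pick whichever you prefer):

In [ ]:
sns.set_style("whitegrid")

plt.plot(x, y, lw=2)
plt.xlabel('X')
plt.ylabel('Y')
plt.xlim(0, 6*np.pi)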

The folks behind seaborn have thought a lot about color palettes, which is a good thing. Remember - the choice of color for plots is one of the most essential aspects of visualization. A poor choice of colors can easily mask interesting patterns or suggest structure that is not real. To learn more about what is available, see the seaborn color tutorial.

Here we load the default:


In [ ]:
# default color palette

current_palette = sns.color_palette()
sns.palplot(current_palette)

which we will now change to colorblind, which is clearer for those who are colorblind.


In [ ]:
# set palette to colorblind
sns.set_palette("colorblind")

current_palette = sns.color_palette()
sns.palplot(current_palette)

Now that we have covered the basics of seaborn (and the above examples truly only scratch the surface of what is possible), we will explore the power of seaborn for higher dimension data sets. We will load the famous Iris data set, which measures 4 different features of 3 different types of Iris flowers. There are 150 different flowers in the data set.

Note - for those familiar with pandas, seaborn is designed to integrate easily and directly with pandas DataFrame objects. In the example below, the Iris data are loaded into a DataFrame. Jupyter notebooks also display the DataFrame data in a nice readable format.


In [ ]:
iris = sns.load_dataset("iris")
iris

Now that we have a sense of the data structure, it is useful to examine the distribution of features. Above, we went to great pains to produce histograms, KDEs, and rug plots. seaborn handles all of that effortlessly with the distplot function.

Problem 2b

Plot the distribution of petal lengths for the Iris data set.


In [ ]:
# note - hist, kde, and rug all set to True, set to False to turn them off 
with sns.axes_style("dark"):
    sns.distplot(iris['petal_length'], bins=20, hist=True, kde=True, rug=True)

Of course, this data set lives in a 4D space, so plotting more than univariate distributions is important (and as we will see tomorrow this is particularly useful for visualizing classification results). Fortunately, seaborn makes it very easy to produce handy summary plots.

At this point, we are familiar with basic scatter plots in matplotlib.

Problem 2c

Make a matplotlib scatter plot showing the Iris petal length against the Iris petal width.


In [ ]:
plt.scatter( # complete
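
One possible completion (axis labels are optional but helpful):

In [ ]:
plt.scatter(iris['petal_length'], iris['petal_width'])
plt.xlabel('petal length')
plt.ylabel('petal width')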

Of course, when there are very many data points, scatter plots become difficult to interpret, as in the example below:


In [ ]:
with sns.axes_style("darkgrid"):
    xexample = np.random.normal(loc = 0.2, scale = 1.1, size = 10000)
    yexample = np.random.normal(loc = -0.1, scale = 0.9, size = 10000)

    plt.scatter(xexample, yexample)

Here, we see that there are many points clustered about the origin, but we have no sense of the underlying density of the distribution. 2D histograms, such as plt.hist2d(), can alleviate this problem. I prefer plt.hexbin(), which is a little easier on the eyes (though note - these 2D histograms are subject to the same issues discussed above).


In [ ]:
# hexbin w/ bins = "log" returns the log of counts/bin
# mincnt = 1 displays only hexpix with at least 1 source present
with sns.axes_style("darkgrid"):
    plt.hexbin(xexample, yexample, bins = "log", cmap = "viridis", mincnt = 1)
    plt.colorbar()

While the above plot provides a significant improvement over the scatter plot by giving a better sense of the density near the center of the distribution, the bin-edge effects are clearly present. An even better solution, as before, is a density estimate, which is easily built into seaborn via the kdeplot function.


In [ ]:
with sns.axes_style("darkgrid"):
    sns.kdeplot(xexample, yexample, shade=False)

This plot is much more appealing (and informative) than the previous two. For the first time we can clearly see that the distribution is not actually centered on the origin. Now we will move back to the Iris data set.

Suppose we want to see univariate distributions in addition to the scatter plot? This is certainly possible with matplotlib (you can find examples on the web); however, with seaborn this is really easy.


In [ ]:
sns.jointplot(x=iris['petal_length'], y=iris['petal_width'])

But! Histograms and scatter plots can be problematic as we have discussed many times before.

Problem 2d

Re-create the plot above but set kind='kde' to produce density estimates of the distributions.


In [ ]:
sns.jointplot( # complete
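
A possible completion:

In [ ]:
sns.jointplot(x=iris['petal_length'], y=iris['petal_width'], kind='kde')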

That is much nicer than what was presented above. However - we still have a problem in that our data live in 4D, but we are (mostly) limited to 2D projections of that data. One way around this is via the seaborn version of a pairplot, which plots the distribution of every variable in the data set against each other. (Here is where the integration with pandas DataFrames becomes so powerful.)


In [ ]:
sns.pairplot(iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]])

For data sets where we have classification labels, we can even color the various points using the hue option, and produce KDEs along the diagonal with diag_kind = 'kde'.


In [ ]:
sns.pairplot(iris, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"],
             hue = "species", diag_kind = 'kde')

Even better - there is an option to create a PairGrid, which allows fine-tuned control of the data as displayed above, below, and along the diagonal. In this way it becomes possible to avoid the symmetric redundancy of the full pairplot, which is not all that informative. In the example below, we will show scatter plots and contour plots simultaneously.


In [ ]:
g = sns.PairGrid(iris, vars = ["sepal_length", "sepal_width", "petal_length", "petal_width"],
                 hue = "species", diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(plt.scatter, edgecolor='white')
g.map_diag(sns.kdeplot, lw=3)

Note - one disadvantage to the plot above is that the contours do not share the same color scheme as the KDE estimates and the scatter plot. I have not been able to figure out how to change this in a satisfactory way. (One potential solution is detailed here, however, it is worth noting that this solution restricts your color choices to a maximum of ~5 unless you are a colormaps wizard, and I am not.)