Nearest-Neighbor Density Estimation

G. Richards (2016), based on materials from Ivezic, VanderPlas, and Leighly

Another very simple way to estimate the density of a $D$-dimensional distribution is to find the nearest object (or the $K$ nearest objects) to each point of interest and compute the distance to it, $d_K$. This is the $K$-Nearest Neighbor algorithm. We'll see later that this is also a good method to use for classification.

In this prescription, the density at a given point, $x$, is estimated as

$$\hat{f}_K(x) = \frac{K}{V_D(d_K)}$$

where $V_D(d) = \frac{2d^D\pi^{D/2}}{D\,\Gamma(D/2)}$ is the volume of a $D$-dimensional hypersphere of radius $d$, with $\Gamma$ the complete gamma function (Ivezic, Equation 3.58). This formula reduces to the usual expressions for area ($\pi d^2$) and volume ($\frac{4}{3}\pi d^3$) in 2 and 3 dimensions, respectively.
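
As a quick numerical sanity check (a sketch; the helper name hypersphere_volume is just for illustration), we can verify that the general formula does reduce to $\pi d^2$ and $\frac{4}{3}\pi d^3$:

In [ ]:
# Check the generic hypersphere volume against the familiar 2-D and 3-D results
import numpy as np
from scipy.special import gamma

def hypersphere_volume(d, D):
    """Volume of a D-dimensional hypersphere of radius d."""
    return 2 * d**D * np.pi**(D / 2) / (D * gamma(D / 2))

d = 2.0
print(hypersphere_volume(d, 2), np.pi * d**2)              # area of a circle
print(hypersphere_volume(d, 3), 4.0 / 3.0 * np.pi * d**3)  # volume of a sphere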

We can simplify this to $$\hat{f}_K(x) = \frac{C}{d_K^D}$$ since the constant, $C$, can be evaluated at the end (e.g., by normalizing the estimate to integrate to unity).

This estimator is biased, so in practice we don't actually want the single nearest neighbor, but rather something like the 5th nearest neighbor (or larger $K$).

In fact, the error in the estimator can be reduced by considering all $K$ nearest neighbors: $$\hat{f}_K(x) = \frac{C}{\sum_{i=1}^K d_i^D}$$
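
To make this concrete, here is a minimal sketch of the estimator in 1-D, using Scikit-Learn's NearestNeighbors to get the distances (the helper knn_density and the toy Gaussian data are my own choices for illustration, not the astroML implementation):

In [ ]:
# Sketch: K-nearest-neighbor density estimate in 1-D (D=1)
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(x_train, x_eval, K=5):
    nbrs = NearestNeighbors(n_neighbors=K).fit(x_train[:, None])
    dists, _ = nbrs.kneighbors(x_eval[:, None])   # distances, shape (n_eval, K)
    dens = 1.0 / dists.sum(axis=1)                # f-hat ~ C / sum_i d_i^D, with D=1
    dx = x_eval[1] - x_eval[0]
    return dens / (dens.sum() * dx)               # crude normalization to unit integral

np.random.seed(0)
x_train = np.random.normal(0, 1, 1000)
x_eval = np.linspace(-4, 4, 200)
print(knn_density(x_train, x_eval, K=10)[:5])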

See the Scikit-Learn neighbors documentation for more information.

Ivezic, Figure 6.5 compares a nearest-neighbors estimate ($k=10$) with a KDE estimate. Run the cell below and see what happens as you increase the number of neighbors used.


In [ ]:
# Based on Ivezic, Figure 6.5
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy import stats

from astroML.density_estimation import KNeighborsDensity
from astroML.plotting import hist

from sklearn.neighbors import KernelDensity

#------------------------------------------------------------
# Generate our data: a mix of several Cauchy distributions
#  this is the same data used in the Bayesian Blocks figure
np.random.seed(0)
N = 10000
mu_gamma_f = [(5, 1.0, 0.1),
              (7, 0.5, 0.5),
              (9, 0.1, 0.1),
              (12, 0.5, 0.2),
              (14, 1.0, 0.1)]
true_pdf = lambda x: sum([f * stats.cauchy(mu, gamma).pdf(x) for (mu, gamma, f) in mu_gamma_f])
x = np.concatenate([stats.cauchy(mu, gamma).rvs(int(f * N)) for (mu, gamma, f) in mu_gamma_f])
np.random.shuffle(x)
x = x[x > -10]
x = x[x < 30]

#------------------------------------------------------------
# plot the results
fig = plt.figure(figsize=(10, 10))
N = 5000
k = 10

xN = x[:N]
t = np.linspace(-10, 30, 1000)

# Compute density with KDE
kde = KernelDensity(bandwidth=0.1, kernel='gaussian')
kde.fit(xN[:, None])
dens_kde = np.exp(kde.score_samples(t[:, None]))

# Compute density with Bayesian nearest neighbors
nbrs = KNeighborsDensity('bayesian', n_neighbors=k)
nbrs.fit(xN[:, None])
dens_nbrs = nbrs.eval(t[:, None]) / N

# plot the results
plt.plot(t, true_pdf(t), ':', color='black', zorder=3, label="Generating Distribution")
plt.plot(xN, -0.005 * np.ones(len(xN)), '|k')
plt.plot(t, dens_nbrs, '-', lw=1.5, color='gray', zorder=2, label="Nearest Neighbors (k=%i)" % k)
plt.plot(t, dens_kde, '-', color='black', zorder=3, label="Kernel Density (h=0.1)")

# label the plot
#plt.text(0.02, 0.95, "%i points" % N, ha='left', va='top', transform=ax.transAxes)
plt.ylabel('$p(x)$')
plt.legend(loc='upper right')
plt.xlim(0, 20)
plt.ylim(-0.01, 0.4001)

plt.show()

Nearest-neighbor methods are both pretty simple and pretty powerful. But you can imagine that they could also be really slow if you have a lot of points (or want to consider a lot of neighbors), since you have to compute all of the pairwise distances! You can certainly do this "brute force" computation, but the use of trees speeds things up considerably.

We haven't talked about order notation yet, but now is a good time to introduce it. If we say that something "goes as $\mathscr{O}(N)$", that means the number of operations needed scales linearly with $N$. If it is $\mathscr{O}(N^2)$, the work scales as $N\times N$: doubling $N$ quadruples the number of operations.

If we have $N$ samples with $D$ features, then a brute-force nearest-neighbor search goes as $\mathscr{O}(DN)$ per query. That can be a very large number of operations and make our code run slowly. So, can we do it faster?
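
One rough way to see this scaling (a sketch; the sample sizes are arbitrary and the timings will vary from machine to machine) is to time brute-force queries as $N$ grows:

In [ ]:
# Brute-force query time grows roughly linearly with N, consistent with O(DN) per query
import numpy as np
from time import perf_counter
from sklearn.neighbors import NearestNeighbors

np.random.seed(1)
for N in [20000, 40000, 80000]:
    X = np.random.random((N, 3))
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='brute').fit(X)
    t0 = perf_counter()
    nbrs.kneighbors(X[:500])          # 500 queries against N points
    print("N = %6d: %.3f s" % (N, perf_counter() - t0))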

Trees

It seems like you would be stuck computing all of the distances, but consider this:

if point A is very distant from point B, and point B is very close to point C, then we know that points A and C are very distant.

So, we just have to compute the A-B and B-C distances; we don't actually need to compute A-C.
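
Formally, this is just the triangle inequality, which brackets the distance we have not computed using the two that we have:

$$\big|\,d(A,B) - d(B,C)\,\big| \;\le\; d(A,C) \;\le\; d(A,B) + d(B,C)$$

If $d(A,B)$ is large and $d(B,C)$ is small, the lower bound is already large, so $A$ and $C$ must be far apart without ever computing $d(A,C)$.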

We can take advantage of this using trees. In 2-D we use a quad-tree, which is illustrated in Ivezic, Figure 2.3.

What is happening is that the data space gets broken down into smaller and smaller bins. Then instead of computing the distances between each of the objects, we compute the distances between each of the bins. We only compute the distances between objects if the bin distances are small enough to be considered further. This process can speed up nearest neighbor finding considerably.

For a quad-tree we are specifically dividing the 2-D space into 4 equal "nodes" (children) until each box is either empty or has no more than some number of points. The terminal nodes are called "leaves", thus the name "tree". In 3-D we instead have oct-trees.

We can generalize this to $D$ dimensions, but each node then has $2^D$ children, a scaling that quickly gets out of control (an instance of the curse of dimensionality). The so-called $k$D-tree instead uses a binary tree, where each node has 2 children. Moreover, rather than splitting the box in half, a $k$D-tree splits the data in half (typically at the median along one coordinate). A $k$D-tree is illustrated in Ivezic, Figure 2.4.

No $D$-dimensional distances need to be computed to build a $k$D-tree, so once the tree is constructed, the nearest-neighbor determination is only $\mathscr{O}(D \log N)$ per query.

In Scikit-Learn, $k$D-trees are implemented as sklearn.neighbors.KDTree: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree
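
Here is a minimal usage sketch (with made-up data):

In [ ]:
# Build a kD-tree once, then query it for nearest neighbors
import numpy as np
from sklearn.neighbors import KDTree

np.random.seed(0)
X = np.random.random((1000, 3))       # N=1000 points in D=3

tree = KDTree(X, leaf_size=40)        # construct the tree
dist, ind = tree.query(X[:5], k=10)   # 10 nearest neighbors of the first 5 points
print(dist.shape, ind.shape)          # (5, 10) (5, 10)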

As long as $D$ isn't too large ($\lesssim 20$), this works well, but $k$D-trees also suffer from the curse of dimensionality: for large $D$ the query cost approaches $\mathscr{O}(DN)$. In that case, we can use ball-trees instead.

Instead of splitting along Cartesian axes, ball-trees partition the data into nested hyperspheres. This makes the tree more "expensive" to build, but it keeps finding the nearest neighbors fast, $\mathscr{O}(D \log N)$, even when $D$ is fairly large.

For more information on ball-trees in Scikit-Learn, see http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.BallTree.html#sklearn.neighbors.BallTree
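
In practice you rarely use these trees directly; sklearn.neighbors.NearestNeighbors lets you select the underlying algorithm and handles the rest. Here is a rough comparison sketch (toy data; the sizes are arbitrary and the timings will vary):

In [ ]:
# Compare query times for the brute-force, kD-tree, and ball-tree backends
import numpy as np
from time import perf_counter
from sklearn.neighbors import NearestNeighbors

np.random.seed(42)
X = np.random.random((20000, 3))      # N=20000 points, D=3

for algorithm in ['brute', 'kd_tree', 'ball_tree']:
    nbrs = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    t0 = perf_counter()
    nbrs.kneighbors(X[:2000])         # query 2000 points
    print("%9s: %.3f s" % (algorithm, perf_counter() - t0))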


Creating an evaluation grid

Now let's look at nearest neighbors in 2-D. Note that for 1-D we had to define the "grid" over which we evaluated the pdf and we did something like

t = np.linspace(-10, 30, 1000)

We need to do the same in 2-D. That seems easy enough, right? If $x$ runs from 0 to 10 and $y$ runs from 0 to 10 and we want to evaluate in steps of 1 in each direction, then we just need to build a grid that includes the points, $(0,0), (1,0), (0,1) \dots (10,9), (9,10), (10,10)$.

As far as I know there is no simple tool that does this. But we can use np.meshgrid() to help us as follows


In [ ]:
x = np.linspace(0,10,11)
y = np.linspace(0,10,11)
print(x, y)

In [ ]:
xv,yv = np.meshgrid(x,y)
print(xv)
print(yv)

In [ ]:
print(xv.ravel())
print(yv.ravel())
# Equivalent to flatten(), except for making a copy (or not) of the array

In [ ]:
xystack = np.vstack([xv.ravel(),yv.ravel()])
print(xystack)

In [ ]:
Xgrid = xystack.T
print(Xgrid)

Note that, while I said that there is no built-in function for this, there are any number of ways to accomplish it! For example, using map(), np.mgrid() (see the sketch below), transposes, etc.
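
For instance, here is a sketch of the np.mgrid route, which produces the same array of coordinate pairs:

In [ ]:
# np.mgrid builds both coordinate arrays at once (integer steps from 0 to 10)
import numpy as np

yv, xv = np.mgrid[0:11, 0:11]          # first output varies along rows, second along columns
Xgrid = np.vstack([xv.ravel(), yv.ravel()]).T
print(Xgrid.shape)                     # (121, 2)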

The code in the next cell accomplishes it in just one line (using map and np.meshgrid), but I thought that it would make more sense if we broke it down as we did above.


In [ ]:
# Comparison of KDE and K-Nearest Neighbors "smoothing"
# Based on Ivezic, Figure 6.4
# Author: Jake VanderPlas
# License: BSD
#   The figure produced by this code is published in the textbook
#   "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)
#   For more information, see http://astroML.github.com
#   To report a bug or issue, use the following forum:
#    https://groups.google.com/forum/#!forum/astroml-general

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm

from scipy.spatial import cKDTree

from astroML.datasets import fetch_great_wall
from astroML.density_estimation import KDE, KNeighborsDensity

#------------------------------------------------------------
# Fetch the great wall data
X = fetch_great_wall()

#------------------------------------------------------------
# Create  the grid on which to evaluate the results
Nx = 50
Ny = 125
xmin, xmax = (-375, -175)
ymin, ymax = (-300, 200)

#------------------------------------------------------------
# Evaluate for several models
# list(...) is needed in Python 3, where map returns an iterator rather than a list
Xgrid = np.vstack(list(map(np.ravel, np.meshgrid(np.linspace(xmin, xmax, Nx), np.linspace(ymin, ymax, Ny))))).T

#print Xgrid

kde = KDE(metric='gaussian', h=5)
dens_KDE = kde.fit(X).eval(Xgrid).reshape((Ny, Nx))

knn5 = KNeighborsDensity('bayesian', 5)
dens_k5 = knn5.fit(X).eval(Xgrid).reshape((Ny, Nx))

knn40 = KNeighborsDensity('bayesian', 40)
dens_k40 = knn40.fit(X).eval(Xgrid).reshape((Ny, Nx))

#------------------------------------------------------------
# Plot the results
fig = plt.figure(figsize=(10, 5))
fig.subplots_adjust(left=0.12, right=0.95, bottom=0.2, top=0.9,
                    hspace=0.01, wspace=0.01)

# First plot: scatter the points
ax1 = plt.subplot(221, aspect='equal')
ax1.scatter(X[:, 1], X[:, 0], s=1, lw=0, c='k')
ax1.text(0.95, 0.9, "input", ha='right', va='top',
         transform=ax1.transAxes,
         bbox=dict(boxstyle='round', ec='k', fc='w'))

# Second plot: KDE
ax2 = plt.subplot(222, aspect='equal')
ax2.imshow(dens_KDE.T, origin='lower', norm=LogNorm(),
           extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)
ax2.text(0.95, 0.9, "KDE: Gaussian $(h=5)$", ha='right', va='top',
         transform=ax2.transAxes,
         bbox=dict(boxstyle='round', ec='k', fc='w'))

# Third plot: KNN, k=5
ax3 = plt.subplot(223, aspect='equal')
ax3.imshow(dens_k5.T, origin='lower', norm=LogNorm(),
           extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)
ax3.text(0.95, 0.9, "$k$-neighbors $(k=5)$", ha='right', va='top',
         transform=ax3.transAxes,
         bbox=dict(boxstyle='round', ec='k', fc='w'))

# Fourth plot: KNN, k=40
ax4 = plt.subplot(224, aspect='equal')
ax4.imshow(dens_k40.T, origin='lower', norm=LogNorm(),
           extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)
ax4.text(0.95, 0.9, "$k$-neighbors $(k=40)$", ha='right', va='top',
         transform=ax4.transAxes,
         bbox=dict(boxstyle='round', ec='k', fc='w'))

for ax in [ax1, ax2, ax3, ax4]:
    ax.set_xlim(ymin, ymax - 0.01)
    ax.set_ylim(xmin, xmax)

for ax in [ax1, ax2]:
    ax.xaxis.set_major_formatter(plt.NullFormatter())

for ax in [ax3, ax4]:
    ax.set_xlabel('$y$ (Mpc)')

for ax in [ax2, ax4]:
    ax.yaxis.set_major_formatter(plt.NullFormatter())

for ax in [ax1, ax3]:
    ax.set_ylabel('$x$ (Mpc)')

plt.show()

Which of these is the "right" answer depends on what you want to do with the data.

Next time we'll talk about Gaussian Mixture Models and K-Means Clustering.