Kernel Density Estimator

I am trying to use the Python kernel density estimator in scikit-learn, but something is not working properly ... what have I done wrong this time??

The first two cells load the libraries and read the data from a file that is online. Then we create the KDE and use it to compute a pdf. Finally, we plot the data.

However, if you look at the plot, the pdf is the inverse of what we'd expect: the probability is high where it should be low, and low where it should be high. What we basically want is 1 - pdf, but it shouldn't come out that way, and I can't figure out what I did wrong.


In [1]:
# load the libraries
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
from urllib.request import urlopen

In [2]:
# read the data. This is just x/y data in a tab-separated file

x = []
y = []

data = urlopen("https://edwards.sdsu.edu/~redwards/coverage.tsv")

for l in data:
    # the response yields bytes, so decode before splitting
    p = l.decode().strip().split("\t")
    x.append(float(p[0]))
    y.append(float(p[1]))

# create numpy arrays for the KDE work
ny = np.array(y)
nx = np.array(x)
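
As an aside, the same two arrays could probably be built in one call with numpy; this is just a sketch (not what I actually ran), assuming the URL still serves a plain two-column TSV:

# hypothetical one-liner alternative with numpy
from urllib.request import urlopen
import numpy as np

nx, ny = np.loadtxt(urlopen("https://edwards.sdsu.edu/~redwards/coverage.tsv"),
                    delimiter="\t", unpack=True)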

In [9]:
print("{} : {}\n{} : {}".format(min(x), max(x), min(y), max(y)))


0.0 : 1545.0
0.0 : 4901.0

Kernel Density Estimator

This is where we run the Kernel Density Estimator. Most of this comes from Pythonic Perambulations.

We use GridSearchCV to run the cross-validation for us. Here, we choose 100 candidate bandwidths between 0.1 and 2000 and use 20-fold cross-validation to find the best bandwidth for the KDE.
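
As I understand it, the grid search is roughly equivalent to this loop (a sketch I have not run; with no scoring argument, cross_val_score falls back on KernelDensity's built-in score method, which is the total log-likelihood of the held-out data):

# rough sketch of what GridSearchCV is doing under the hood
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KernelDensity

def best_bandwidth(samples, bandwidths, folds=20):
    # mean cross-validated log-likelihood for each candidate bandwidth
    scores = [cross_val_score(KernelDensity(kernel='gaussian', bandwidth=b),
                              samples, cv=folds).mean()
              for b in bandwidths]
    return bandwidths[int(np.argmax(scores))]

# e.g. best_bandwidth(ny[:, None], np.linspace(0.1, 2000, 100))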

We initialize the Kernel Density Estimator using the Gaussian kernel. We could also use 'tophat', 'epanechnikov', 'exponential', 'linear', or 'cosine'.

We fit the data that we have just read, and once we have done that, the grid search tells us the best bandwidth to use for our KDE.

Then, finally, we score our original samples. We are scoring the same data that we fit to the KDE, so it should give us the smoothed results. That's what everyone else does!

kde.score_samples returns the log of the density, so we use numpy.exp to convert it back to a density (as is done in the worked examples I have been following).


In [3]:
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 2000, 100)},
                    cv=20)  # 20-fold cross-validation

grid.fit(ny[:, np.newaxis])

# what is the best model we found in cross validation?
kde = grid.best_estimator_
print("Best bandwith: {}".format(kde.bandwidth))

# score the samples and find the antilog
pdf = np.exp(kde.score_samples(ny[:, None]))


Best bandwidth: 242.512121212
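
As a sanity check on the log/exp handling (a sketch I have not actually run, but the calls are standard numpy/sklearn), the density evaluated on an even grid should integrate to roughly 1:

# evaluate the fitted KDE on an even grid of coverage values
grid = np.linspace(ny.min(), ny.max(), 1000)
dens = np.exp(kde.score_samples(grid[:, None]))

# simple Riemann sum; should be close to 1 (some mass lies outside the range)
print("integral of the pdf: {}".format((dens * (grid[1] - grid[0])).sum()))

If that comes out near 1, the exponentiation at least is fine and the problem is somewhere else.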

Plot the data

We just make the figure!


In [7]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax2 = ax.twinx()

# plot the data with a red line
ax.plot(x, y, color='r')
ax.set_xlabel("Position")
ax.set_ylabel("Coverage")

# plot the fitted data on the second axis
ax2.plot(x, pdf, color='blue')
ax2.set_ylabel("likelihood score")


ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['left'].set_visible(False)

ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax2.get_yaxis().tick_right()

fig.set_facecolor('white')

plt.show()
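
For what it's worth, the estimated density could also be drawn against coverage itself, evaluated on an even grid rather than at the raw samples; this is a hypothetical extra plot (reusing the kde and ny objects from the cells above), not something I ran:

# draw the estimated density against coverage, evaluated on an even grid
grid = np.linspace(ny.min(), ny.max(), 1000)
fig2 = plt.figure()
ax3 = fig2.add_subplot(111)
ax3.plot(grid, np.exp(kde.score_samples(grid[:, None])), color='blue')
ax3.set_xlabel("Coverage")
ax3.set_ylabel("Estimated density")
plt.show()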


So what did I screw up?

Why is the PDF high when the counts are low, and vice versa?