Bivariate Kernel Density Estimation

Bivariate Kernel Density Estimation using the Seaborn, Statsmodels, and Scikit-learn Python libraries

This notebook briefly demonstrates basic usage of three functions, from three different Python libraries, for bivariate kernel density estimation:

  • kdeplot from seaborn
  • KDEMultivariate from statsmodels
  • KernelDensity from sklearn

Furthermore, triangulation with mesh refinement (from matplotlib) is employed, instead of the meshgrid approach, for evaluating the kernel density. This triangulation can be used with the KDEMultivariate and KernelDensity functions, as demonstrated hereafter, and ought to be faster and more adaptable (thanks to the refinement) than the more classical meshgrid approach. Finally, grid search with cross-validation is used with KernelDensity from sklearn for optimal bandwidth selection.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as tri
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
import seaborn as sns

# Seaborn style
sns.set(style='darkgrid', font_scale=1.2)

# Inline figures
%matplotlib inline

Input data (random) for kernel density estimation


In [2]:
# Generate some input data from N(0,1)
npts = 600
x = np.random.randn(npts)
y = np.random.randn(npts)
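
Note that the generator is not seeded, so the outputs shown below will differ from run to run. For a reproducible notebook one could fix the seed before drawing the samples, e.g. (42 is an arbitrary choice):

np.random.seed(42)  # any fixed seed makes the draws below reproducible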

Kernel Density Estimation using Seaborn


In [3]:
# Seaborn kdeplot
fig, ax = plt.subplots(figsize=(7,7))
sns.kdeplot(x=x, y=y, fill=True, cmap=plt.cm.GnBu, ax=ax)  # Gaussian kernel with Scott's-rule bandwidth (seaborn defaults); lowest contour left unfilled
ax.scatter(x, y, s=10, c='darkgrey', marker='o', alpha=0.5)
ax.set_title('Bivariate Standardized Normal Distribution')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
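
Seaborn can also display the two marginal densities alongside the joint estimate via jointplot, which forwards extra keyword arguments to kdeplot. A minimal sketch:

g = sns.jointplot(x=x, y=y, kind='kde', fill=True, cmap=plt.cm.GnBu)
g.set_axis_labels('x', 'y')
plt.show()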


Kernel Density Estimation using Statsmodels


In [4]:
# Kernel density estimation (statsmodels)
model = KDEMultivariate(data=[x, y], var_type='cc', bw='cv_ml')  # cross-validation maximum-likelihood bandwidth selection
print('Bandwidth:', model.bw)
zi = model.pdf()  # evaluate the estimated density at the sample points


Bandwidth: [ 0.46000813  0.34701119]
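 
Cross-validation can become slow for larger samples; as a quicker (though less data-driven) alternative, KDEMultivariate also accepts the normal-reference plug-in rule. A minimal sketch for comparison:

model_ref = KDEMultivariate(data=[x, y], var_type='cc', bw='normal_reference')
print('Plug-in (normal reference) bandwidth:', model_ref.bw)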

In [5]:
# Triangulation with refinement
triang = tri.Triangulation(x, y)
refiner = tri.UniformTriRefiner(triang)
tri_refi, z_refi = refiner.refine_field(zi, subdiv=3)  # three levels of subdivisions
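
Delaunay triangulations of scattered points often contain very flat triangles near the convex-hull border, which can distort the contours. If that happens, matplotlib's TriAnalyzer can mask them before the refiner is built; a sketch (min_circle_ratio is a tunable flatness threshold):

triang = tri.Triangulation(x, y)
mask = tri.TriAnalyzer(triang).get_flat_tri_mask(min_circle_ratio=0.01)  # 0.01 is the matplotlib default
triang.set_mask(mask)  # mask flat border triangles
refiner = tri.UniformTriRefiner(triang)  # then refine_field as above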

In [6]:
# Graphical visualization
fig, ax = plt.subplots(figsize=(7,7))
ax.set_title('Bivariate Standardized Normal Distribution')
ax.triplot(triang, color='white', lw=0.5)
cax = ax.tricontourf(tri_refi, z_refi, 16, cmap=plt.cm.GnBu)
ax.tricontour(tri_refi, z_refi, 16, colors=['0.25', '0.5', '0.5', '0.5'], 
              linewidths=[1.0, 0.5, 1.0, 0.5])
cbar = fig.colorbar(cax)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.tight_layout()
plt.show()


Kernel Density Estimation using scikit-learn


In [7]:
# Kernel density estimation (scikit-learn)
data = np.vstack((x, y)).T

# Using grid search cross-validation to optimize the bandwidth
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
bw = grid.best_estimator_.bandwidth
print("Best bandwidth: {:g}".format(bw))

# Model 
model = KernelDensity(bandwidth=bw, algorithm='auto', kernel='gaussian')
model.fit(data)

# Evaluate the density model on the data
logdata = model.score_samples(data)
zdata = np.exp(logdata)


Best bandwidth: 0.428133
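 
Since the data were drawn from a standard bivariate normal, the estimate can be sanity-checked against the analytic density exp(-(x² + y²)/2) / (2π) at the sample points:

# Compare the KDE against the true N(0, I) density at the samples
true_pdf = np.exp(-(x**2 + y**2) / 2.) / (2. * np.pi)
print('Mean absolute error vs. true density:', np.mean(np.abs(zdata - true_pdf)))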

In [8]:
# Triangulation
triang = tri.Triangulation(x, y)
# Triangulation refinement
refiner = tri.UniformTriRefiner(triang)
tri_refi, z_refi = refiner.refine_field(zdata, subdiv=3)

In [9]:
# Graphical visualization
fig, ax = plt.subplots(figsize=(7,7))
ax.set_title('Bivariate Standardized Normal Distribution')
cax = ax.tricontourf(tri_refi, z_refi, 16, cmap=plt.cm.GnBu)
ax.tricontour(tri_refi, z_refi, 16, colors=['0.25', '0.5', '0.5', '0.5'], 
              linewidths=[1.0, 0.5, 1.0, 0.5])
ax.scatter(x, y, s=10, c='darkgrey', marker='o', alpha=0.5)
cbar = fig.colorbar(cax)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.tight_layout()
plt.show()
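
For contrast with the triangulation approach advertised above, the same fitted model can also be evaluated on a classical regular meshgrid (the grid extent and resolution here are arbitrary choices):

# Evaluate the fitted KDE on a regular 100 x 100 grid
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
zz = np.exp(model.score_samples(np.c_[xx.ravel(), yy.ravel()])).reshape(xx.shape)
fig, ax = plt.subplots(figsize=(7, 7))
ax.contourf(xx, yy, zz, 16, cmap=plt.cm.GnBu)
plt.show()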



In [10]:
# Generating random samples from the Kernel model (scikit-learn)
samples = model.sample(10)
print('Random samples:\n', samples)


Random samples:
 [[-1.6289274   1.40016955]
 [ 2.77335275 -1.03678107]
 [-0.13370906 -1.66372754]
 [-0.50445188  1.57188034]
 [ 1.7567684   0.61192732]
 [ 0.08484377 -1.18525784]
 [-0.8224073   0.04602298]
 [ 0.59912065  0.35647967]
 [ 0.43950839  0.40743642]
 [-0.46963738 -0.08988496]]
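 
sample also accepts a random_state argument, so the draw above can be made reproducible:

samples = model.sample(10, random_state=0)  # 0 is an arbitrary fixed seed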
