This notebook briefly demonstrates basic usage of three functions, from three different Python libraries, for bivariate kernel density estimation: kdeplot from seaborn, KDEMultivariate from statsmodels, and KernelDensity from scikit-learn.
Furthermore, triangulation with mesh refinement (from matplotlib) is employed, instead of the meshgrid approach, to evaluate the kernel density. This triangulation can be used with both the KDEMultivariate and KernelDensity functions, as demonstrated hereafter. Triangulation ought to be faster and more adaptable (through refinement) than the more classical meshgrid approach. Finally, grid search with cross-validation is used with KernelDensity from sklearn for optimal bandwidth selection.
In [1]:
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.tri as tri
from statsmodels.nonparametric.kernel_density import KDEMultivariate
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in newer scikit-learn
import seaborn as sns
# Seaborn style
sns.set(style='darkgrid', font_scale=1.2)
# Inline figures
%matplotlib inline
Input data (random) for kernel density estimation
In [2]:
# Generate some input data from N(0,1)
npts = 600
x = np.random.randn(npts)
y = np.random.randn(npts)
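Seaborn's default bandwidth (`bw='scott'` below) follows Scott's rule of thumb, which scales the per-dimension sample standard deviation by the factor n**(-1/(d+4)). A minimal sketch of that factor, assuming the standard form of Scott's rule:

```python
import numpy as np

def scott_factor(n, d):
    """Scott's rule-of-thumb bandwidth factor: n**(-1/(d+4))."""
    return n ** (-1.0 / (d + 4))

# For the 600 bivariate samples above (d=2), the per-dimension
# bandwidth is this factor times the sample standard deviation.
n, d = 600, 2
factor = scott_factor(n, d)
print(factor)  # ~ 0.344
```

Larger samples thus get narrower kernels, at a rate that slows as the dimension grows.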
In [3]:
# Seaborn kdeplot
fig, ax = plt.subplots(figsize=(7,7))
# Gaussian kernel with Scott's rule bandwidth is the seaborn default;
# 'fill' and 'thresh' replace the older 'shade'/'shade_lowest' arguments
sns.kdeplot(x=x, y=y, fill=True, thresh=0.05, cmap=plt.cm.GnBu, ax=ax)
ax.scatter(x, y, s=10, c='darkgrey', marker='o', alpha=0.5)
ax.set_title('Bivariate Standardized Normal Distribution')
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
In [4]:
# Kernel density estimation (statsmodels)
# var_type='cc' marks both variables as continuous;
# bw='cv_ml' selects the bandwidth by cross-validation maximum likelihood
model = KDEMultivariate(data=[x, y], var_type='cc', bw='cv_ml')
print('Bandwidth:', model.bw)
# pdf() without arguments evaluates the density at the training points
zi = model.pdf()
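For continuous variables, KDEMultivariate uses a Gaussian product kernel. A numpy sketch of that estimator (an illustration of the formula, not statsmodels' internal code):

```python
import numpy as np

def gaussian_product_kde(train, points, bw):
    """Bivariate Gaussian product-kernel density estimate.

    train  : (n, 2) training data
    points : (m, 2) evaluation points
    bw     : (2,)   per-dimension bandwidths
    """
    train, points, bw = map(np.asarray, (train, points, bw))
    # (m, n, 2) standardized distances between points and training data
    u = (points[:, None, :] - train[None, :, :]) / bw
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # Product over dimensions, mean over training points, bandwidth scaling
    return kernel.prod(axis=2).mean(axis=1) / bw.prod()

# Sanity check: a single training point at the origin with unit
# bandwidths gives the standard bivariate normal density there.
dens = gaussian_product_kde([[0.0, 0.0]], [[0.0, 0.0]], [1.0, 1.0])
print(dens)  # ~ 1 / (2*pi) ≈ 0.1592
```

Evaluating this at the training points with `model.bw` should closely track the `zi` values returned by `pdf()` above.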
In [5]:
# Triangulation with refinement
triang = tri.Triangulation(x, y)
refiner = tri.UniformTriRefiner(triang)
tri_refi, z_refi = refiner.refine_field(zi, subdiv=3) # three levels of subdivisions
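Each subdivision level splits every triangle into four, so `subdiv=3` multiplies the triangle count by 4**3 = 64. A small self-contained sketch of that growth on a two-triangle mesh:

```python
import numpy as np
import matplotlib.tri as tri

# A minimal two-triangle mesh covering the unit square
pts_x = np.array([0.0, 1.0, 1.0, 0.0])
pts_y = np.array([0.0, 0.0, 1.0, 1.0])
coarse = tri.Triangulation(pts_x, pts_y,
                           triangles=[[0, 1, 2], [0, 2, 3]])

# Each subdivision level splits every triangle into 4,
# so subdiv=3 yields 4**3 = 64 times as many triangles
refiner = tri.UniformTriRefiner(coarse)
fine = refiner.refine_triangulation(subdiv=3)
print(len(coarse.triangles), len(fine.triangles))  # 2 128
```

`refine_field` performs the same subdivision and additionally interpolates the field values (here the density) onto the new nodes.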
In [6]:
# Graphical visualization
fig, ax = plt.subplots(figsize=(7,7))
ax.set_title('Bivariate Standardized Normal Distribution')
ax.triplot(triang, color='white', lw=0.5)
cax = ax.tricontourf(tri_refi, z_refi, 16, cmap=plt.cm.GnBu)
ax.tricontour(tri_refi, z_refi, 16, colors=['0.25', '0.5', '0.5', '0.5'],
linewidths=[1.0, 0.5, 1.0, 0.5])
cbar = fig.colorbar(cax)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.tight_layout()
plt.show()
In [7]:
# Kernel density estimation (scikit-learn)
data = np.vstack((x, y)).T
# Using grid search cross-validation to optimize the bandwidth
params = {'bandwidth': np.logspace(-1, 1, 20)}
grid = GridSearchCV(KernelDensity(), params)
grid.fit(data)
bw = grid.best_estimator_.bandwidth
print("Best bandwidth: {:g}".format(bw))
# Model
model = KernelDensity(bandwidth=bw, algorithm='auto', kernel='gaussian')
model.fit(data)
# Evaluate the density model on the data
logdata = model.score_samples(data)
zdata = np.exp(logdata)
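Since `score_samples` returns the log-density, exponentiating it recovers a proper probability density, which should integrate to roughly one over the support. A quick 1-D sanity check of that normalization (a sketch, separate from the bivariate model above):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = rng.randn(500, 1)  # 1-D standard normal sample

kde = KernelDensity(bandwidth=0.5, kernel='gaussian').fit(X)

# score_samples returns log-density; exponentiate to get the pdf
grid = np.linspace(-6, 6, 400)[:, None]
pdf = np.exp(kde.score_samples(grid))

# Riemann sum over the uniform grid: a proper density integrates to ~1
dx = grid[1, 0] - grid[0, 0]
area = pdf.sum() * dx
print(area)  # ≈ 1.0
```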
In [8]:
# Triangulation
triang = tri.Triangulation(x, y)
# Triangulation refinement
refiner = tri.UniformTriRefiner(triang)
tri_refi, z_refi = refiner.refine_field(zdata, subdiv=3)
In [9]:
# Graphical visualization
fig, ax = plt.subplots(figsize=(7,7))
ax.set_title('Bivariate Standardized Normal Distribution')
cax = ax.tricontourf(tri_refi, z_refi, 16, cmap=plt.cm.GnBu)
ax.tricontour(tri_refi, z_refi, 16, colors=['0.25', '0.5', '0.5', '0.5'],
linewidths=[1.0, 0.5, 1.0, 0.5])
ax.scatter(x, y, s=10, c='darkgrey', marker='o', alpha=0.5)
cbar = fig.colorbar(cax)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.tight_layout()
plt.show()
In [10]:
# Generating random samples from the Kernel model (scikit-learn)
samples = model.sample(10)
print('Random samples:\n', samples)
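Conceptually, sampling from a Gaussian-kernel KDE amounts to picking a training point uniformly at random and perturbing it with N(0, bandwidth**2) noise in each dimension. A hedged numpy sketch of that scheme (an illustration of the idea, not scikit-learn's internal implementation):

```python
import numpy as np

def sample_gaussian_kde(data, bandwidth, n_samples, seed=None):
    """Draw samples from a Gaussian KDE: pick a training point at
    random, then add N(0, bandwidth**2) noise per dimension."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    idx = rng.integers(0, len(data), size=n_samples)
    noise = rng.normal(scale=bandwidth, size=(n_samples, data.shape[1]))
    return data[idx] + noise

data = np.random.randn(600, 2)
samples = sample_gaussian_kde(data, bandwidth=0.3, n_samples=10)
print(samples.shape)  # (10, 2)
```

Because each draw is a training point plus small noise, samples concentrate where the data are dense, reproducing the estimated distribution.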
In [ ]: