$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{G} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

Tutorial on modern kernel methods




Ingmar Schuster

(Zalando Research)

Overview

  • Introduction
  • Feature engineering and two simple classification algorithms
  • Kernels and feature space
  • Applications
    • mean embedding, conditional mean embedding and operators
    • Training GANs
    • Regression with uncertainty estimates
  • Conclusion and outlook

Introduction

Kernel methods

  • Support Vector Machines (SVMs) are a staple for classification
  • Gaussian Processes (GPs) very popular for regression
  • kernel mean embedding of distributions/conditional distributions
  • recent techniques for representation learning/unsupervised learning
    • Deep Gaussian Processes
    • Compositional Kernel Machines
  • very elegant mathematics

In [14]:
### FIRST SOME CODE ####

from __future__ import division, print_function, absolute_import
from IPython.display import SVG, display, Image, HTML

import numpy as np, scipy as sp, pylab as pl, matplotlib.pyplot as plt, scipy.stats as stats, sklearn, sklearn.datasets
from scipy.spatial.distance import squareform, pdist, cdist

import rkhs_operators as ro

import distributions as dist #commit 480cf98 of https://github.com/ingmarschuster/distributions

pl.style.use(u'seaborn-talk')

Gaussian Processes

  • model for functions/continuous output
  • for new input returns predicted output and uncertainty

In [15]:
display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html


Support Vector Machines

  • model for classification
  • map data nonlinearly to a higher dimensional space
  • separate points of different classes using a plane (i.e. linearly)

In [16]:
display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine


Feature engineering and two classification algorithms

Feature engineering in Machine Learning

  • feature engineering: map data to features with function $\FM:\IS\to \RKHS$
    • handle nonlinear relations with linear methods ($\FM$ nonlinear)
    • handle non-numerical data (e.g. text)

In [17]:
display(Image(filename="monomials_small.jpg", width=800)) #source: Berhard Schölkopf


Working in Feature Space

  • want the Feature Space $\RKHS$ (the codomain of $\FM$) to be a vector space to get nice mathematical structure
  • defining an inner product induces a norm and the possibility to measure angles
  • can use linear algebra in $\RKHS$ to solve ML problems
    • inner products
    • angles
    • norms
    • distances
  • linear operations in $\RKHS$ induce nonlinear operations on the Input Space (domain of $\FM$)

Two simple classification algorithms

  • given data points from mixture of two distributions with densities $p_0,p_1$: $$x_i \sim 0.5 p_0 + 0.5 p_1$$ and label $l_i = 0$ if $x_i$ generated by $p_0$, $l_i = 1$ otherwise

In [18]:
figkw = {"figsize":(4,4), "dpi":150}
np.random.seed(5)
samps_per_distr = 20
data = np.vstack([stats.multivariate_normal(np.array([-2,0]), np.eye(2)*1.5).rvs(samps_per_distr),
                  stats.multivariate_normal(np.array([2,0]), np.eye(2)*1.5).rvs(samps_per_distr)])
distr_idx = np.r_[[0]*samps_per_distr, [1]*samps_per_distr]

In [19]:
f = pl.figure(**figkw);
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    plt.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker, alpha = 0.4)
pl.show()


Classification using inner products in Feature Space

  • compute mean feature space embedding $$\mu_{0} = \frac{1}{N_0} \sum_{l_i = 0} \FM(x_i) ~~~~~~~~ \mu_{1} = \frac{1}{N_1} \sum_{l_i = 1} \FM(x_i)$$
  • assign test point to most similar class in terms of inner product between point and mean embedding $\prodDot{\FM(x)}{\mu_c}$ $$f_d(x) = \argmax_{c\in\{0,~1\}} \prodDot{\FM(x)}{\mu_c}$$ (remember in $\Reals^2$ canonically: $\prodDot{a}{b} = a_1 b_1+a_2 b_2 $)
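A minimal NumPy sketch of this classifier with the identity feature map $\FM(x)=x$ (the function name and toy data are illustrative, not part of the accompanying code):

import numpy as np

def mean_inner_product_classify(X_train, labels, X_test):
    # mean feature space embedding per class, here with the identity feature map Phi(x) = x
    mu = np.array([X_train[labels == c].mean(axis=0) for c in (0, 1)])  # shape (2, d)
    # inner product of each test point with each class mean, assign to the larger one
    scores = X_test @ mu.T                                              # shape (n_test, 2)
    return scores.argmax(axis=1)

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [-2, 0], rng.randn(20, 2) + [2, 0]])
y = np.r_[[0] * 20, [1] * 20]
print(mean_inner_product_classify(X, y, np.array([[-1.5, 0.], [1.5, 0.]])))  # expected: [0 1]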

In [20]:
pl.figure(**figkw)
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker, alpha=0.2)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.3, width=0.05, head_length=0.3, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$");



In [21]:
pl.figure(**figkw)
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker, alpha=0.2)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.3, width=0.05, head_length=0.3, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$");
pl.scatter(np.ones(1), np.ones(1), c='k', marker='D', alpha=0.8);


Classification using density estimation

  • estimate the density for each class by centering a Gaussian on each data point and taking the mixture as the estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \widehat{p}_1 = \frac{1}{N_1} \sum_{l_i = 1} \mathcal{N}(\cdot; x_i,\Sigma)$$

In [22]:
# Some plotting code

def apply_to_mg(func, *mg):
    #apply a function to points on a meshgrid
    x = np.vstack([e.flat for e in mg]).T
    return np.array([func(i.reshape((1,2))) for i in x]).reshape(mg[0].shape)

def plot_with_contour(samps, data_idx, cont_func, method_name = None, delta = 0.025, pl = pl, colormesh_cmap = pl.cm.Pastel2, contour_classif = True):
    x = np.arange(samps.T[0].min()-delta, samps.T[0].max()+delta, delta)
    y = np.arange(samps.T[1].min()-delta, samps.T[1].max()+delta, delta)
    X, Y = np.meshgrid(x, y)
    Z = apply_to_mg(cont_func, X,Y)
    Z = Z.reshape(X.shape)
    

    fig = pl.figure(**figkw)
    if colormesh_cmap is not None:
        bound = np.abs(Z).max()
        pl.pcolor(X, Y, Z , cmap=colormesh_cmap, alpha=0.5, edgecolors=None, vmin=-bound, vmax=bound)
    if contour_classif is True:
        c = pl.contour(X, Y, Z, colors=['k', ],
                       alpha = 0.5,
                       linestyles=[ '--'],
                       levels=[0],
                       linewidths=0.7)
    else:
        pl.contour(X, Y, Z, linewidths=0.7)
    if method_name is not None:
        pl.title(method_name)
    for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
        # plot the samples passed as arguments rather than the global variables
        pl.scatter(*samps[data_idx==idx,:].T, c=c, marker=marker, alpha = 0.4)

    pl.show()
    pl.close()

In [23]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
                  lambda x: np.exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)



In [24]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
                  lambda x: np.exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)


Classification using density estimation

  • estimate the density for each class by centering a Gaussian on each data point and taking the mixture as the estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \widehat{p}_1 = \frac{1}{N_1} \sum_{l_i = 1} \mathcal{N}(\cdot; x_i,\Sigma)$$
  • assign a test point $x$ to the class $c$ that gives the highest value of $\widehat{p}_c(x)$
  • $\widehat{p}_c$ is known as a kernel density estimate (KDE)
  • this is a different but overlapping notion of 'kernel' (a smoothing kernel, which here is also psd)
  • this classification algorithm is known as Parzen windows classification (see the sketch below)
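A small sketch of Parzen windows classification, assuming a shared isotropic bandwidth $\Sigma = \sigma^2 I$ (function names and the bandwidth are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def parzen_classify(X_train, labels, X_test, sigma2=0.5):
    # per-class kernel density estimate: average of Gaussians centered at the training points
    def kde(points, x):
        return np.mean([multivariate_normal(p, sigma2 * np.eye(len(p))).pdf(x) for p in points], axis=0)
    scores = np.column_stack([kde(X_train[labels == c], X_test) for c in (0, 1)])
    return scores.argmax(axis=1)  # assign to the class with the higher estimated density

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [-2, 0], rng.randn(20, 2) + [2, 0]])
y = np.r_[[0] * 20, [1] * 20]
print(parzen_classify(X, y, np.array([[-1.5, 0.], [1.5, 0.]])))  # expected: [0 1]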

For a certain feature map and inner product, both algorithms are the same!

Let's construct this feature map and inner product.

Kernels and feature space

Positive definite functions and feature spaces

  • let $\PDK:\IS\times\IS \to \Reals$, called a kernel
    • if $\PDK$ is symmetric and positive semidefinite (psd)
    • then there exists a map $\FM: \IS \to \RKHS$ into a Hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$ i.e. $\PDK$ computes the inner product after mapping to some $\RKHS$
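For intuition, this relation can be checked directly for the quadratic kernel $\PDK(x,x') = \prodDot{x}{x'}^2$ on $\Reals^2$, whose explicit feature map consists of the degree-2 monomials (a small sketch, not part of the accompanying code):

import numpy as np

def quad_kernel(x, y):
    # k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

def quad_features(x):
    # explicit feature map: Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, y = np.array([1., 2.]), np.array([3., -1.])
print(quad_kernel(x, y), np.dot(quad_features(x), quad_features(y)))  # both evaluate to 1.0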

Gram matrix (1)

  • If for all $N$ and all choices of points $x_1, \dots, x_N \in \IS$ the matrix $$\Gram_{X}=\begin{bmatrix} \PDK(x_1, x_1) & \dots & \PDK(x_1, x_N)\\ \vdots & \ddots & \vdots\\ \PDK(x_N, x_1) & \dots & \PDK(x_N, x_N) \end{bmatrix}$$ is symmetric positive semidefinite, then $\PDK$ is a psd kernel
  • $\Gram_X$ is called a Gram matrix

Gram matrix (2)

  • sometimes mixed Gram matrices are needed $$\Gram_{XY} = \begin{bmatrix} \PDK(x_1, y_1) & \dots & \PDK(x_1, y_M)\\ \vdots & \ddots & \vdots\\ \PDK(x_N, y_1) & \dots & \PDK(x_N, y_M) \end{bmatrix}$$
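With the imports from the first code cell, both kinds of Gram matrices take a few lines for the Gaussian kernel (a sketch assuming an isotropic bandwidth; variable names are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def gauss_gram(X, Y, sigma2=1.0):
    # k(x, y) = exp(-0.5 ||x - y||^2 / sigma2), evaluated for all pairs of rows
    return np.exp(-0.5 * cdist(X, Y, 'sqeuclidean') / sigma2)

X = np.random.randn(5, 2)
Y = np.random.randn(3, 2)
G_X = gauss_gram(X, X)   # N x N Gram matrix G_X
G_XY = gauss_gram(X, Y)  # N x M mixed Gram matrix G_XY
# symmetry and (numerical) positive semidefiniteness of G_X
print(np.allclose(G_X, G_X.T), np.linalg.eigvalsh(G_X).min() >= -1e-10)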

Examples of psd kernels

  • Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
  • Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$

PSD kernels

  • easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
  • constructing $\FM$ from a given $\PDK$ is less trivial but still elementary
  • given $\FM$ and inner product in feature space we
    • can endow the space with the norm induced by the inner product $$\|g\|_\RKHS = \sqrt{\prodDot{g}{g}_\RKHS}~~~\textrm{for}~g \in \RKHS$$
    • can measure angles in the new space $$\prodDot{g}{f}_\RKHS = \cos(\angle[g,f])~\|g\|_\RKHS ~\|f\|_\RKHS$$
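Since $\|\FM(x)-\FM(x')\|_\RKHS^2 = \PDK(x,x) - 2\PDK(x,x') + \PDK(x',x')$, norms, distances and angles in feature space reduce to kernel evaluations; a minimal sketch for the Gaussian kernel (bandwidth and names are illustrative):

import numpy as np

def gauss_k(x, y, sigma2=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma2)

x, y = np.array([0., 0.]), np.array([1., 1.])
norm_x = np.sqrt(gauss_k(x, x))                                        # ||Phi(x)||_H
dist_xy = np.sqrt(gauss_k(x, x) - 2 * gauss_k(x, y) + gauss_k(y, y))   # ||Phi(x) - Phi(y)||_H
cos_angle = gauss_k(x, y) / (norm_x * np.sqrt(gauss_k(y, y)))          # cosine of the angle between features
print(norm_x, dist_xy, cos_angle)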

Construction of the canonical feature map (Aronszajn map)

Plan

  • construction of $\FM$ from $\PDK$
  • definition of an inner product in the new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x')}$
  • feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$ $$\FM:\IS \to \Reals^\IS$$

Canonical feature map (Aronszajn map)

  • pick $\FM(x) = \PDK(\cdot, x)$

    • Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
    • Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
  • let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$

  • $\RKHS$ is a vector space over $\Reals$: if $f(\cdot)$ and $g(\cdot)$ are functions from $\IS$ to $\Reals$, then so are $a\,f(\cdot)$ for $a \in \Reals$ and $f(\cdot)+g(\cdot)$

Canonical inner product (1)

  • for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product $$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$
  • In particular $\prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (reproducing property of kernel in its $\RKHS$)
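A quick numerical check of the reproducing property $\prodDot{f}{\PDK(\cdot,x^*)} = f(x^*)$ for $f(\cdot) = \sum_i a_i \PDK(\cdot, x_i)$, using a Gaussian kernel (the coefficients, points and bandwidth are illustrative):

import numpy as np

def k(x, y, sigma2=1.0):
    return np.exp(-0.5 * np.sum((x - y) ** 2) / sigma2)

xs = [np.array([0., 0.]), np.array([1., 0.]), np.array([0., 2.])]
a = [0.5, -1.0, 2.0]

def f(x):
    # f(.) = sum_i a_i k(., x_i), evaluated at x
    return sum(a_i * k(x, x_i) for a_i, x_i in zip(a, xs))

x_star = np.array([0.3, 0.7])
# canonical inner product <f, k(., x_star)> = sum_i a_i k(x_star, x_i)
inner = sum(a_i * k(x_star, x_i) for a_i, x_i in zip(a, xs))
print(np.isclose(inner, f(x_star)))  # True: the inner product with k(., x*) evaluates f at x*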

Canonical inner product (2)

  • $\RKHS$ is a Hilbert space with this inner product, as the inner product is
    • positive definite
    • linear in its first argument
    • symmetric
    • and $\RKHS$ itself is complete (with respect to the induced norm)
  • $\RKHS$ is called a Reproducing Kernel Hilbert Space (RKHS).

Equivalence of classification algorithms

  • Recall mean canonical feature and kernel density estimate $$\widehat{p}_0 = \frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(\cdot; x_i,\Sigma) ~~~~~~~~ \mu_{0} = \frac{1}{N_0} \sum_{l_i = 0} \PDK(\cdot, x_i)$$
  • observe $$\frac{1}{N_0} \sum_{l_i = 0} \mathcal{N}(x^*; x_i,\Sigma) = \prodDot{\frac{1}{N_0} \sum_{l_i = 0} \PDK(\cdot, x_i)}{\PDK(\cdot, x^*)}$$ if $\PDK$ is the Gaussian density with covariance $\Sigma$, i.e. $\PDK(x, x') = \mathcal{N}(x; x', \Sigma)$
  • kernel mean and Parzen windows classification are equivalent!
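A numerical illustration of the equivalence, using a Gaussian density as the kernel, $\PDK(x,x') = \mathcal{N}(x; x', \Sigma)$ (a sketch with illustrative names, separate from the `ro` classes used below):

import numpy as np
from scipy.stats import multivariate_normal

Sigma = 0.5 * np.eye(2)
k = lambda x, y: multivariate_normal(y, Sigma).pdf(x)   # k(x, y) = N(x; y, Sigma)

rng = np.random.RandomState(1)
X0 = rng.randn(10, 2)             # training points of class 0
x_star = np.array([0.2, -0.3])    # test point

# Parzen / kernel density estimate at x_star
kde_value = np.mean([multivariate_normal(xi, Sigma).pdf(x_star) for xi in X0])
# <mu_0, k(., x_star)>: by the reproducing property this is (1/N) sum_i k(x_i, x_star)
mean_emb_value = np.mean([k(x_star, xi) for xi in X0])
print(np.isclose(kde_value, mean_emb_value))  # True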

Let's look at some example classification output


In [25]:
class KMEclassification(object):
    def __init__(self, samps1, samps2, kernel):
        self.de1 = ro.RKHSDensityEstimator(samps1, kernel, 0.1)
        self.de2 = ro.RKHSDensityEstimator(samps2, kernel, 0.1)
    
    def classification_score(self, test):
        return (self.de1.eval_kme(test) - self.de2.eval_kme(test))

In [26]:
data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

kc = KMEclassification(data[distr_idx==0,:], data[distr_idx==1,:],  ro.LinearKernel())
plot_with_contour(data, distr_idx, kc.classification_score, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)



In [27]:
kc = KMEclassification(data[distr_idx==0,:], data[distr_idx==1,:],  ro.GaussianKernel(0.3))
plot_with_contour(data, distr_idx, kc.classification_score, 'Inner Product classif. '+"Gaussian", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)


Applications

Kernel mean embedding

  • mean feature with canonical feature map $\frac{1}{N} \sum_{i = 1}^N \FM(x_i) = \frac{1}{N} \sum_{i = 1}^N \PDK(x_i, \cdot)$
  • this is an estimate of the kernel mean embedding of the distribution/density $\rho$ of the $x_i$ $$\mu_\rho(\cdot) = \int \PDK(x,\cdot) \mathrm{d}\rho(x)$$
  • using this we can define a distance between distributions $\rho, q$ as $$\mathrm{MMD}(\rho, q)^2 = \|\mu_\rho - \mu_q \|^2_\RKHS$$ called the maximum mean discrepancy (MMD); an empirical estimate is sketched below
  • has been used as the critic in generative adversarial networks (i.e. generative network as usual, MMD as a drop-in for the discriminator)
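A biased (V-statistic) estimate of $\mathrm{MMD}^2$ from samples reduces to averages of Gram matrix entries; a sketch with a Gaussian kernel (bandwidth and names are illustrative):

import numpy as np
from scipy.spatial.distance import cdist

def gauss_gram(X, Y, sigma2=1.0):
    return np.exp(-0.5 * cdist(X, Y, 'sqeuclidean') / sigma2)

def mmd2_biased(X, Y, sigma2=1.0):
    # ||mu_X - mu_Y||_H^2 = mean k(x, x') - 2 mean k(x, y) + mean k(y, y')
    return (gauss_gram(X, X, sigma2).mean()
            - 2 * gauss_gram(X, Y, sigma2).mean()
            + gauss_gram(Y, Y, sigma2).mean())

rng = np.random.RandomState(0)
print(mmd2_biased(rng.randn(200, 2), rng.randn(200, 2)))        # same distribution: close to 0
print(mmd2_biased(rng.randn(200, 2), rng.randn(200, 2) + 2.0))  # shifted distribution: clearly positive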

Conditional mean embedding (1)

  • we can define operators on RKHSs
  • these are maps from RKHS elements to RKHS elements (i.e. maps between functions)
  • one such operator is the conditional mean embedding $\mathcal{D}_{Y|X}$
    • given the embedding of the input variables distribution, returns the embedding of the output variables distribution
  • this enables regression with uncertainty estimates

Conditional mean embedding (2)

An example


In [28]:
out_samps = data[distr_idx==0,:1] + 1
inp_samps = data[distr_idx==0,1:] + 1
def plot_mean_embedding(cme, inp_samps, out_samps, p1 = 0., p2 = 1., offset = 0.5):
    x = np.linspace(inp_samps.min()-offset,inp_samps.max()+offset,200)
    fig = pl.figure(figsize=(10, 5))
    ax = [pl.subplot2grid((2, 2), (0, 1)),
          pl.subplot2grid((2, 2), (0, 0), rowspan=2),
          pl.subplot2grid((2, 2), (1, 1))]

    ax[1].scatter(out_samps, inp_samps, alpha=0.3, color = 'r')
    ax[1].set_xlabel('Output')
    ax[1].set_ylabel('Input')
    ax[1].axhline(p1, 0, 8, color='g', linestyle='--')
    ax[1].axhline(p2, 0, 8, color='b', linestyle='--')
    ax[1].set_title("%d input-output pairs"%len(out_samps))
    ax[1].set_yticks((p1, p2))
    e = cme.lhood(np.array([[p1], [p2]]), x[:, None]).T
    
    #ax[0].plot(x, d[0], '-', label='cond. density')
    ax[2].plot(x, e[0], 'g--', label='cond. mean emb.')
    ax[2].set_title(r"p(outp | inp=%.1f)"%p1)
    ax[0].plot(x, e[1], 'b--', label='cond. mean emb.')
    ax[0].set_title(r"p(outp | inp=%.1f)"%p2)
    #ax[2].legend(loc='best')
    fig.tight_layout()

In [29]:
cme = ro.ConditionMeanEmbedding(inp_samps, out_samps, ro.GaussianKernel(0.3), ro.GaussianKernel(0.3), 5)
plot_mean_embedding(cme, inp_samps, out_samps,  0.3, 2.,)


Conditional mean embedding (3)

  • closed form estimate given samples from input and output $$\mathcal{D}_{Y|X} = \begin{bmatrix}\PDK_Y(y_1, \cdot) & \dots & \PDK_Y(y_N, \cdot)\end{bmatrix} \Gram_X^{-1} \begin{bmatrix}\PDK_X(x_1, \cdot)\\ \vdots \\ \PDK_X(x_N, \cdot)\end{bmatrix}$$

  • closed form estimate of output embedding for new input $x^*$

$$\prodDot{\mathcal{D}_{Y|X}}{\PDK_X(x^*,\cdot)} = \begin{bmatrix}\PDK_Y(y_1, \cdot) & \dots & \PDK_Y(y_N, \cdot)\end{bmatrix} \Gram_X^{-1} \begin{bmatrix}\PDK_X(x_1, x^*)\\ \vdots \\ \PDK_X(x_N, x^*)\end{bmatrix}$$
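A sketch of this estimate in NumPy with Gaussian kernels on both input and output; the ridge term `reg` stands in for the regularization that in practice replaces the plain inverse $\Gram_X^{-1}$, and all names and parameter values are illustrative:

import numpy as np
from scipy.spatial.distance import cdist

def gauss_gram(X, Y, sigma2=1.0):
    return np.exp(-0.5 * cdist(X, Y, 'sqeuclidean') / sigma2)

def cond_mean_embedding(X, Y, x_star, y_grid, sigma2=0.09, reg=1e-3):
    # weights beta = (G_X + reg I)^{-1} [k_X(x_1, x*), ..., k_X(x_N, x*)]^T
    G_X = gauss_gram(X, X, sigma2)
    beta = np.linalg.solve(G_X + reg * np.eye(len(X)), gauss_gram(X, x_star, sigma2))
    # estimated embedding of p(Y | X = x*), evaluated on a grid of outputs:
    # mu_{Y|x*}(y) = sum_i beta_i k_Y(y_i, y)
    return gauss_gram(y_grid, Y, sigma2) @ beta

# toy usage: Y = X^2 + noise, query the conditional embedding at x* = 0.5
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(100, 1))
Y = X ** 2 + 0.05 * rng.randn(100, 1)
grid = np.linspace(-0.5, 1.5, 50)[:, None]
emb = cond_mean_embedding(X, Y, np.array([[0.5]]), grid)
print(grid[emb.argmax()])  # the embedding should peak near y = 0.5^2 = 0.25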

Conditional mean embedding (4)

  • Similar to Gaussian processes, but output distribution more flexible
    • mixture of Gaussians, Laplace, distributions on discrete objects
    • multimodality can be represented
    • multidimensional output
    • output can be combination of e.g. string and reals
  • Conditional mean embedding was used to construct Kernel Bayes Rule, enabling closed form Bayesian inference
  • other types of operators have been derived (see Stefan Klus' talk next week)

In [30]:
HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/MpzaCCbX-z4?rel=0&amp;showinfo=0&amp;start=148" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')



In [31]:
display(Image(filename="Pendulum_eigenfunctions.png", width=700))



In [32]:
display(Image(filename="KeywordClustering.png", width=700))


Conclusion and Outlook

Conclusion and Outlook

  • kernels implicitly compute inner products in well-behaved vector spaces
  • use angles, norms, distances in these for ML algorithms
  • successful models in ML build on PSD kernels
    • kernel induced covariance structure (Gaussian Processes)
    • kernel embedding of probability distributions
      • measuring distance between distributions
    • kernel embedding of conditional probability distributions

Get slides at ingmarschuster.com