$\newcommand{\Reals}{\mathbb{R}} \newcommand{\Nats}{\mathbb{N}} \newcommand{\PDK}{\mathbf{k}} \newcommand{\IS}{\mathcal{X}} \newcommand{\FM}{\Phi} \newcommand{\Gram}{K} \newcommand{\RKHS}{\mathcal{H}} \newcommand{\prodDot}[2]{\left\langle#1,#2\right\rangle} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$

Kernel Methods in Machine Learning

Ingmar Schuster (FU Berlin)

Kernel methods

  • Support Vector Machines (SVMs) are a machine learning staple for classification
  • Gaussian Processes (GPs) very popular for regression
  • recent techniques for representation learning/unsupervised learning
    • Deep Gaussian Processes
    • Compositional Kernel Machines
  • very elegant mathematics

In [1]:
### FIRST SOME CODE ####

from __future__ import division, print_function, absolute_import
from IPython.display import SVG, display, Image

import numpy as np, scipy as sp, pylab as pl, matplotlib.pyplot as plt, scipy.stats as stats, sklearn, sklearn.datasets
from scipy.spatial.distance import squareform, pdist, cdist

import distributions as dist #commit 480cf98 of https://github.com/ingmarschuster/distributions

pl.style.use(u'seaborn-talk')



Gaussian Processes

  • model for functions/continuous output
  • for new input returns predicted output and uncertainty

In [3]:
display(Image(filename="GP_uq.png", width=630)) #source: http://scikit-learn.org/0.17/modules/gaussian_process.html


Support Vector Machines

  • model for classification
  • map data nonlinearly to higher dimensional space
  • separate points of different classes using a plane (i.e. linearly)

In [118]:
display(Image(filename="SVM.png", width=700)) #source: https://en.wikipedia.org/wiki/Support_vector_machine


Feature engineering & two simple classification algorithms

Motivation for kernel methods: Feature engineering in Machine Learning

  • feature engineering: map data to features with function $\FM:\IS\to \RKHS$
    • handle nonlinear relations with linear methods ($\FM$ nonlinear)
    • handle non-numerical data (e.g. text)
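
A minimal illustration of the first point (not part of the original slides; the data and threshold are arbitrary): 1D data whose label depends on $|x|$ has no linear decision boundary, but after the nonlinear map $\FM(x) = (x, x^2)$ a horizontal line separates the classes.

In [ ]:
# Illustrative sketch: a nonlinear feature map makes 1D data linearly separable.
x_1d = np.random.uniform(-3, 3, 200)
labels = (np.abs(x_1d) > 1.5).astype(int)      # label depends on |x|: not linearly separable in 1D

feat = np.column_stack([x_1d, x_1d**2])        # feature map Phi(x) = (x, x^2)
for (idx, c, marker) in [(0, 'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*feat[labels==idx].T, c=c, marker=marker)
pl.axhline(1.5**2, color='k', linestyle='--')  # in feature space the line x^2 = 2.25 separates the classes
pl.show()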

In [119]:
display(Image(filename="monomials.jpg", width=900)) #source: Berhard Schölkopf


Working in Feature Space

  • want Feature Space $\RKHS$ (the codomain of $\FM$) to be a vector space to get nice mathematical structure
  • an inner product on $\RKHS$ induces a norm and the possibility to measure angles
  • can use linear algebra in $\RKHS$ to solve ML problems
    • inner products
    • angles
    • norms
    • distances
  • induces nonlinear operations on the Input Space (domain of $\FM$)
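
For example, distances in $\RKHS$ can be computed from inner products alone: $$\|\FM(x)-\FM(x')\|_\RKHS^2 = \prodDot{\FM(x)}{\FM(x)} - 2\prodDot{\FM(x)}{\FM(x')} + \prodDot{\FM(x')}{\FM(x')}$$ so any algorithm that only needs these quantities can be run entirely in feature space.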

Two simple classification algorithms

  • given data points from mixture of two distributions with densities $p_0,p_1$: $$x_i \sim 0.5 p_0 + 0.5 p_1$$ and label $l_i = 0$ if $x_i$ generated by $p_0$, $l_i = 1$ otherwise

In [120]:
np.random.seed(5)

samps_per_distr = 20
data = np.vstack([stats.multivariate_normal(np.array([-2,0]), np.eye(2)*1.5).rvs(samps_per_distr),
                  stats.multivariate_normal(np.array([2,0]), np.eye(2)*1.5).rvs(samps_per_distr)])
distr_idx = np.r_[[0]*samps_per_distr, [1]*samps_per_distr]

In [121]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
pl.show(); pl.close()


Classification using inner products in Feature Space

  • compute mean feature space embedding $\mu_c = \frac{1}{N_c} \sum_{l_i = c} \FM(x_i)$ for $c \in \{0,1\}$
  • assign a test point to the most similar class, where similarity is measured by the inner product between the point's embedding and the class mean embedding, $\prodDot{\FM(x)}{\mu_c}$: $$f_d(x) = \argmax_{c\in\{0,1\}} \prodDot{\FM(x)}{\mu_c}$$ (recall that canonically in $\Reals^2$: $\prodDot{a}{b} = a_1 b_1+a_2 b_2$); a minimal sketch of this rule follows below
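
A minimal sketch of this decision rule for the identity feature map $\FM(x)=x$ (the generic kernel version appears later in the notebook):

In [ ]:
# Sketch: mean-embedding classifier with the identity feature map Phi(x) = x.
mu = [data[distr_idx==c].mean(0) for c in (0, 1)]    # class mean embeddings

def f_d(x):
    # assign x to the class whose mean embedding has the larger inner product with x
    return np.argmax([x.dot(mu_c) for mu_c in mu])

print(f_d(np.array([-2., 0.])), f_d(np.array([2., 0.])))   # points near the class means get their class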

In [122]:
for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
    pl.scatter(*data[distr_idx==idx,:].T, c=c, marker=marker)
    pl.arrow(0, 0, *data[distr_idx==idx,:].mean(0), head_width=0.2, head_length=0.2, fc=c, ec=c)
pl.title(r"Mean embeddings for $\Phi(x)=x$")
pl.show()


Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each data point and taking the mixture as the estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$ (a scipy-only sketch follows below)
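
A scipy-only sketch of this estimate (the cells below use the external distributions package instead; the bandwidth $\Sigma = 0.1\,I$ is an arbitrary choice):

In [ ]:
# Sketch: class-conditional kernel density estimate as an equal-weight mixture of
# Gaussians centered at the class samples (bandwidth 0.1*I chosen arbitrarily).
def kde(class_samples, Sigma=np.eye(2)*0.1):
    return lambda x: np.mean([stats.multivariate_normal.pdf(x, mean=xi, cov=Sigma)
                              for xi in class_samples])

p_hat = [kde(data[distr_idx==c]) for c in (0, 1)]
print(p_hat[0](np.array([-2., 0.])), p_hat[1](np.array([-2., 0.])))   # class 0 density dominates near (-2, 0)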

In [123]:
# Some plotting code

def apply_to_mg(func, *mg):
    #apply a function to points on a meshgrid
    x = np.vstack([e.flat for e in mg]).T
    return np.array([func(i.reshape((1,2))) for i in x]).reshape(mg[0].shape)

def plot_with_contour(samps, data_idx, cont_func, method_name = None, delta = 0.025, pl = pl, colormesh_cmap = pl.cm.Pastel2, contour_classif = True):
    x = np.arange(samps.T[0].min()-delta, samps.T[0].max()+delta, delta)
    y = np.arange(samps.T[1].min()-delta, samps.T[1].max()+delta, delta)
    X, Y = np.meshgrid(x, y)
    Z = apply_to_mg(cont_func, X,Y)
    Z = Z.reshape(X.shape)

    fig = pl.figure()
    if colormesh_cmap is not None:
        pl.pcolor(X, Y, Z , cmap=colormesh_cmap, alpha=0.5, edgecolors=None)
    if contour_classif is True:
        pl.contour(X, Y, Z, colors=['k', 'k', 'k'],
                  linestyles=['--', '-', '--'],
                  levels=[-.5, 0, .5])
    else:
        pl.contour(X, Y, Z)
    if method_name is not None:
        pl.title('Decision for '+method_name)
    for (idx, c, marker) in [(0,'r', (0,3,0)), (1, "b", "x")]:
        pl.scatter(*samps[data_idx==idx,:].T, c=c, alpha=0.7, marker=marker)

    pl.show()

In [124]:
plot_with_contour(data, distr_idx,
                  lambda x: stats.multivariate_normal.pdf(x, data[0,:], np.eye(2)*0.1),
                  colormesh_cmap=None, contour_classif=False)



In [125]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1) for x in data[:4]], [1./4]*4)
plot_with_contour(data, distr_idx,
                  lambda x: np.exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)



In [126]:
est_dens_1 = dist.mixt(2, [dist.mvnorm(x, np.eye(2)*0.1,10) for x in data[:samps_per_distr]], [1./samps_per_distr]*samps_per_distr)
plot_with_contour(data, distr_idx,
                  lambda x: np.exp(est_dens_1.logpdf(x)),
                  colormesh_cmap=None, contour_classif=False)


Classification using density estimation

  • estimate the density for each class by centering a Gaussian at each data point and taking the mixture as the estimate $$\widehat{p}_c = \frac{1}{N_c} \sum_{l_i = c} \mathcal{N}(\cdot; x_i,\Sigma)$$ for $c \in \{0,1\}$
  • assign test point $x$ to the class $c$ that gives the highest value $\widehat{p}_c(x)$
  • $\widehat{p}_c$ is known as a kernel density estimate (KDE)
  • algorithm sometimes known as Parzen windows classification

For a certain feature map and inner product, both algorithms are the same!

Kernels and feature space

Positive definite functions and feature spaces

  • let $\PDK:\IS\times\IS \to \Reals$, called a kernel
    • if $\PDK$ is symmetric and positive semi-definite (psd), i.e. for any finite set of points $x_1,\dots,x_n \in \IS$ the matrix $\Gram$ defined by $\Gram_{i,j} = \PDK(x_i, x_j)$ has only nonnegative eigenvalues
    • then there exists $\FM: \IS \to \RKHS$ to a Hilbert space $\RKHS$ such that $$\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}_\RKHS$$ i.e. $\PDK$ computes the inner product after mapping to some $\RKHS$

Examples of psd kernels

  • Linear $\PDK_L(x,x') = \prodDot{x}{x'} = x_1 x'_1 + x_2 x'_2+ \dots$
  • Gaussian $\PDK_G(x,x') = \exp({-{ 0.5}(x-x' )^{\top }\Sigma ^{-1}(x-x' )})$
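
A quick numerical sanity check (not a proof; the bandwidth and sample size are arbitrary) that a Gaussian-kernel Gram matrix has only nonnegative eigenvalues:

In [ ]:
# Sanity check: eigenvalues of a Gaussian-kernel Gram matrix on random points are
# nonnegative up to floating point error (sigma = 1, 50 random points).
X_check = np.random.randn(50, 2)
K_check = np.exp(-0.5 * squareform(pdist(X_check, 'sqeuclidean')))
print(np.linalg.eigvalsh(K_check).min())   # close to 0, never substantially negative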

PSD kernels

  • easy to construct $\PDK$ given $\FM$: $\PDK(x_i, x_j) = \prodDot{\FM(x_i)}{\FM(x_j)}$
  • construction for $\FM$ given $\PDK$ not trivial but still elementary
  • can endow $\RKHS$ with the norm induced by the inner product, $\|g\|_\PDK = \sqrt{\prodDot{g}{g}}$ for $g \in \RKHS$, and measure angles via $\prodDot{f}{g} = \cos(\angle[f,g])~\|f\|_\PDK~\|g\|_\PDK$
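
As an example of constructing $\PDK$ from an explicit $\FM$ (a sketch not from the original slides, using the degree-2 monomials from the earlier figure): the map $\FM(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ satisfies $\prodDot{\FM(x)}{\FM(x')} = \prodDot{x}{x'}^2$, the degree-2 polynomial kernel.

In [ ]:
# Sketch: an explicit feature map whose inner product equals a kernel evaluation,
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)  =>  <Phi(x), Phi(x')> = <x, x'>^2
def phi(x):
    return np.array([x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

a_pt, b_pt = np.random.randn(2), np.random.randn(2)
print(phi(a_pt).dot(phi(b_pt)), a_pt.dot(b_pt)**2)   # the two values agree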

Construction of the canonical feature map (Aronszajn map)

Plan

  • construction of $\FM$ from $\PDK$
  • definition of inner product in new space $\RKHS$ such that in fact $\PDK(x,x') = \prodDot{\FM(x)}{\FM(x')}$
  • feature for each $x \in \IS$ will be a function from $\IS$ to $\Reals$ $$\FM:\IS \to \Reals^\IS$$

Canonical feature map (Aronszajn map)

  • pick $\FM(x) = \PDK(\cdot, x)$

    • Linear kernel: $\FM_L(x) = \prodDot{\cdot}{x}$
    • Gaussian kernel: $\FM_G(x) = \exp\left(-0.5{\|\cdot -x \|^2}/{\sigma^2}\right)$.
  • let linear combinations of features also be in $\RKHS$ $$f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$$ for $a_i \in \Reals$

  • $\RKHS$ is a vector space over $\Reals$: if $f(\cdot)$ and $g(\cdot)$ are functions from $\IS$ to $\Reals$, then so are $a~f(\cdot)$ for $a \in \Reals$ and $f(\cdot)+g(\cdot)$

Canonical inner product

  • for $f(\cdot)=\sum_{i=1}^m a_i \PDK(\cdot, x_i) \in \RKHS$ and $g(\cdot)=\sum_{j=1}^{m'} b_j \PDK(\cdot, x'_j) \in \RKHS$ define inner product $$\prodDot{f}{g} = \sum_{i=1}^m \sum_{j=1}^{m'} b_j a_i \PDK(x'_j, x_i)$$
  • In particular $\prodDot{ \PDK(\cdot,x)}{ \PDK(\cdot,x')}=\PDK(x,x')$ (reproducing property of the kernel in its $\RKHS$; see the sketch after this list)
  • $\RKHS$ is a Hilbert space with this inner product, as it is
    • positive definite
    • linear in its first argument
    • symmetric
    • complete
  • $\RKHS$ called Reproducing Kernel Hilbert Space (RKHS).
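
A small numerical sketch (Gaussian kernel with arbitrarily chosen bandwidth) representing RKHS elements by their coefficients and centers, and checking the reproducing property:

In [ ]:
# Sketch: canonical inner product for f = sum_i a_i k(., x_i) and g = sum_j b_j k(., x'_j),
# each represented as a (coefficients, centers) pair; Gaussian kernel with sigma = 1.
def k_gauss(x, y, sigma=1.0):
    return np.exp(-0.5 * np.sum((x - y)**2) / sigma**2)

def rkhs_inner(f, g):
    (a, Xf), (b, Xg) = f, g
    return sum(a_i * b_j * k_gauss(x_i, x_j)
               for a_i, x_i in zip(a, Xf) for b_j, x_j in zip(b, Xg))

u, v = np.random.randn(2), np.random.randn(2)
f = ([1.0], [u])    # f = k(., u)
g = ([1.0], [v])    # g = k(., v)
print(rkhs_inner(f, g), k_gauss(u, v))   # reproducing property: both values agree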

Equivalence of classification algorithms

  • taking the inner product between the mean RKHS embedding and a test point embedding is equivalent to evaluating a kernel density estimate, as the derivation below shows
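
Concretely, by the reproducing property, $$\prodDot{\FM(x)}{\mu_c} = \frac{1}{N_c} \sum_{l_i = c} \prodDot{\PDK(\cdot,x)}{\PDK(\cdot,x_i)} = \frac{1}{N_c} \sum_{l_i = c} \PDK(x, x_i),$$ which for a Gaussian kernel (up to the normalizing constant of the density) is exactly the kernel density estimate $\widehat{p}_c(x)$.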

Let's look at example classification output


In [127]:
class Kernel(object):
    def mean_emb(self, samps):
        return lambda Y: self.k(samps, Y).sum()/len(samps)
    
    def mean_emb_len(self, samps):
        # squared RKHS norm of the mean embedding: (1/N^2) * sum_ij k(x_i, x_j)
        return self.k(samps, samps).sum()/(len(samps)**2)
    
    def k(self, X, Y):
        raise NotImplementedError()

class FeatMapKernel(Kernel):
    def __init__(self, feat_map):
        self.features = feat_map
        
    def features_mean(self, samps):
        return self.features(samps).mean(0)
    
    def mean_emb_len(self, samps):
        feature_space_mean = self.features_mean(samps)
        return feature_space_mean.dot(feature_space_mean)
    
    def mean_emb(self, samps):
        feature_space_mean = self.features(samps).mean(0)
        return lambda Y: self.features(Y).dot(feature_space_mean)
    
    def k(self, X, Y):
        gram = self.features(X).dot(self.features(Y).T)
        return gram

class LinearKernel(FeatMapKernel):
    def __init__(self):
        FeatMapKernel.__init__(self, lambda x: x)

class GaussianKernel(Kernel):
    def __init__(self, sigma):
        self.width = sigma
    
    def k(self, X, Y=None):
        assert(len(np.shape(X))==2)
        
        # if X=Y, use more efficient pdist call which exploits symmetry
        if Y is None:
            sq_dists = squareform(pdist(X, 'sqeuclidean'))
        else:
            assert(len(np.shape(Y))==2)
            assert(np.shape(X)[1]==np.shape(Y)[1])
            sq_dists = cdist(X, Y, 'sqeuclidean')
    
        K = np.exp(-0.5 * (sq_dists) / self.width ** 2)
        return K

class StudentKernel(Kernel):
    def __init__(self, s2, df):
        self.dens = dist.mvt(0,s2,df)
        
    def k(self, X,Y=None):
        if Y is None:
            sq_dists = squareform(pdist(X, 'sqeuclidean'))
        else:
            assert(len(np.shape(Y))==2)
            assert(np.shape(X)[1]==np.shape(Y)[1])
            sq_dists = cdist(X, Y, 'sqeuclidean')
        dists = np.sqrt(sq_dists)
        return np.exp(self.dens.logpdf(dists.flatten())).reshape(dists.shape)


def kernel_mean_inner_prod_classification(samps1, samps2, kernel):
    mean1 = kernel.mean_emb(samps1)
    norm_mean1 = kernel.mean_emb_len(samps1)
    mean2 = kernel.mean_emb(samps2)
    norm_mean2 = kernel.mean_emb_len(samps2)
    
    def sim(test):
        return (mean1(test) - mean2(test))
    
    def decision(test):
        if sim(test) >= 0:
            return 1
        else:
            return 0
    
    return sim, decision

In [128]:
data, distr_idx = sklearn.datasets.make_circles(n_samples=400, factor=.3, noise=.05)

(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  LinearKernel())
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Linear", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)



In [129]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  GaussianKernel(0.1))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Gauss", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)



In [130]:
(sim, dec) = kernel_mean_inner_prod_classification(data[distr_idx==0,:], data[distr_idx==1,:],  StudentKernel(0.1,10))
plot_with_contour(data, distr_idx, sim, 'Inner Product classif. '+"Student-t", pl = plt, contour_classif = True, colormesh_cmap = pl.cm.bwr)


Conclusion & Outlook

Conclusion & Outlook

  • kernels implicitly compute inner products in well-behaved vector spaces
  • use angles, norms, distances in these for ML algorithms
  • successful models in ML build on PSD kernels (Support Vector Machines, Gaussian Processes)
  • many more kernel-based methods exist
    • kernel embedding of probability distributions
    • kernel induced covariance structure (GPs)

Get slides at ingmarschuster.wordpress.com