Kernel Methods in Machine Learning

Some Definitions

  1. Inner product space
    • Inner product
  2. Hilbert space
    • Complete
    • Separable
  3. Reproducing kernel Hilbert space
    • Reproducing property

Mercer's Theorem (Mercer 1909)

Let $ \mathcal{X} $ be a compact space equipped with a strictly positive finite Borel measure $ \mu $ and $ K: \mathcal{X} \times \mathcal{X} \to \mathbb{R} $ a continuous, symmetric, and positive definite function. Define the integral operator $ T_K: L^2(\mathcal{X}) \rightarrow L^2(\mathcal{X}) $ as

$$ [T_K f](\cdot) =\int_\mathcal{X} K(\cdot,t) f(t)\, d\mu(t) $$

where $ L^2(\mathcal{X}) $ is the space of square integrable functions with respect to $ \mu $.

Then $ K $ can be written in terms of the eigenvalues $ \lambda_j $ and continuous eigenfunctions $ \phi_j $ of $ T_K $ as

$$ K(x,y) = \sum_{j=1}^\infty \lambda_j \, \phi_j(x) \, \phi_j(y) $$
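
To see the expansion concretely, the integral operator can be discretized on a grid (a Nyström-type approximation): eigenpairs of the scaled Gram matrix approximate $\lambda_j$ and $\phi_j$, and the truncated sum recovers $K$. The sketch below is illustrative only; the Gaussian kernel, the interval $[0,1]$ with uniform measure, and NumPy are assumptions, not part of the theorem.

```python
import numpy as np

# Nystrom-style check of the Mercer expansion for a Gaussian kernel on [0, 1]
# with the uniform measure (illustrative sketch; kernel and grid are assumptions).
n = 200
t = np.linspace(0.0, 1.0, n)
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 0.1 ** 2))  # K(t_i, t_j)

# Eigenpairs of the discretized operator (1/n) K approximate the lambda_j,
# and sqrt(n) times the eigenvectors approximates the phi_j on the grid.
evals, evecs = np.linalg.eigh(K / n)
evals, evecs = evals[::-1], evecs[:, ::-1]         # sort descending
phi = np.sqrt(n) * evecs                           # phi_j(t_i) ~ phi[i, j]

for m in (5, 10, 20):
    K_m = (phi[:, :m] * evals[:m]) @ phi[:, :m].T  # truncated Mercer sum
    print(m, np.abs(K - K_m).max())                # error decreases with m
```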

Moore-Aronszajn Theorem (Aronszajn 1950)

Suppose $K$ is a symmetric, positive definite kernel on a set $\mathcal{X}$. Then there is a unique Hilbert space of functions on $\mathcal{X}$ for which $K$ is a reproducing kernel.

Representer Theorem (Kimeldorf and Wahba 1971)

Let $\mathcal{X}$ be a nonempty set and $k$ a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ with corresponding reproducing kernel Hilbert space $H_k$. Given a training sample $(x_1, y_1), \dotsc, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$, a strictly monotonically increasing real-valued function $g \colon [0, \infty) \to \mathbb{R}$, and an arbitrary empirical risk function $E \colon (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \lbrace \infty \rbrace$, any $f^{*} \in H_k$ satisfying

$$ f^{*} = \operatorname{arg min}_{f \in H_k} \left\lbrace E\left( (x_1, y_1, f(x_1)), \dotsc, (x_n, y_n, f(x_n)) \right) + g\left( \lVert f \rVert \right) \right \rbrace, $$

admits a representation of the form:

$$ f^{*}(\cdot) = \sum_{i = 1}^n \alpha_i k(\cdot, x_i), $$

where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$.
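
One practical consequence: minimizing a regularized empirical risk over $H_k$ reduces to optimizing the $n$ coefficients $\alpha_i$. The sketch below assumes a logistic loss, a Gaussian kernel, and synthetic data, all chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary labels (an assumption made only for this sketch).
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

K = rbf(X, X)
lam, alpha = 0.1, np.zeros(len(X))

# By the representer theorem the minimizer has the form f = sum_i alpha_i k(., x_i),
# so gradient descent runs over the n coefficients alpha (logistic loss + RKHS norm).
for _ in range(2000):
    m = K @ alpha                                    # f(x_j) at the training points
    grad = K @ (-y / (1.0 + np.exp(y * m))) + lam * (K @ alpha)
    alpha -= 0.1 * grad / len(X)

print("train accuracy:", np.mean(np.sign(K @ alpha) == y))
```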

Advantages of RKHS in Machine Learning

  • Powerful and flexible models can be defined.

  • Many results and algorithms for linear models in Euclidean spaces can be generalized to RKHS.

  • Learning theory guarantees that effective learning in RKHS is possible, for instance by means of regularization.

Example - Support Vector Machine
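
As an illustration (assuming scikit-learn is available; the data and hyperparameters are arbitrary), an RBF-kernel SVM learns a decision function that is a kernel expansion over its support vectors, consistent with the representer theorem:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)   # circular class boundary

# RBF-kernel SVM; the decision function is sum_i alpha_i y_i k(x_i, .) + b,
# supported on the support vectors only.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("train accuracy:", clf.score(X, y))
```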

Example - Ridge Regression

Objective : Minimize $\|A\mathbf{x}-\mathbf{b}\|^2_2+ \lambda\| \mathbf{x}\|^2_2$

Explicit solution : $\hat{x} = (A^{T}A+ \lambda I )^{-1}A^{T}\mathbf{b}$

Kernelization $\implies$ $\hat{\alpha} = (K+ \lambda I )^{-1}\mathbf{b}$, where $K_{ij} = k(a_i, a_j)$ for the rows $a_i$ of $A$; the fitted predictor is $\hat{f}(a) = \sum_{i=1}^n \hat{\alpha}_i\, k(a_i, a)$ (for the linear kernel, $\hat{x} = A^{T}\hat{\alpha}$ recovers the explicit solution).
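
A quick numerical check with NumPy, using the linear kernel $k(a_i, a_j) = a_i^{T} a_j$ so that $K = AA^{T}$ (the data here are synthetic), confirms that the primal and kernelized solutions coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))           # n = 50 samples, d = 5 features
b = rng.normal(size=50)
lam = 0.5

# Primal solution from the explicit formula above.
x_primal = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# Dual / kernelized solution with the linear kernel K = A A^T:
# alpha = (K + lam I)^{-1} b and x = A^T alpha give the same predictor.
K = A @ A.T
alpha = np.linalg.solve(K + lam * np.eye(50), b)
x_dual = A.T @ alpha

print(np.allclose(x_primal, x_dual))   # True: both yield the same ridge solution
```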

Example - Principal Component Analysis

Objective : Compute the eigendecomposition of $X^T X = W \Sigma W^T$, where $\Sigma$ is a diagonal matrix containing the eigenvalues.

Kernelization $\implies$ Replace the matrix of inner products $X^T X$ by the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$.
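
A minimal kernel PCA sketch with NumPy, assuming rows are samples and a Gaussian kernel (both choices are for illustration only): center the Gram matrix in feature space, eigendecompose it, and scale the eigenvectors to obtain the projections.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # rows are samples (an assumption of this sketch)

def rbf(A, B, gamma=0.5):
    # Gaussian (RBF) kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel PCA: replace pairwise inner products by the Gram matrix K,
# center it in feature space, and eigendecompose.
K = rbf(X, X)
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
Kc = H @ K @ H
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending

# Projections of the training points onto the first two kernel principal components.
Z = evecs[:, :2] * np.sqrt(np.clip(evals[:2], 0, None))
print(Z.shape)                         # (100, 2)
```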