Chapter 4: Linear Classification

The goal of linear classification is to take an input vector $\mathbf{x}$ and assign it to one of $K$ discrete classes $C_k$, where $k = 1, \ldots, K$. The decision surfaces are linear functions of the input vector $\mathbf{x}$. Data sets whose classes can be separated exactly by linear decision surfaces are said to be 'linearly separable'.


In [5]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Linearly separable... (hyperplane divides up the classes)


In [60]:
np.random.seed(0)
mu1 = [1, 1]
cov1 = 0.9 * np.eye(2)
mu2 = [5, 3]
cov2 = np.eye(2) * np.array([0.4, 0.1])
X = np.concatenate([np.random.multivariate_normal(mu1, cov1, 100),
                    np.random.multivariate_normal(mu2, cov2, 100)])
y = np.zeros(200)
y[100:] = 1
fig = plt.figure(figsize=(5, 3.75))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.binary, zorder=2)


Out[60]:
<matplotlib.collections.PathCollection at 0x10bb76450>

Not linearly separable...


In [61]:
from IPython.display import YouTubeVideo
YouTubeVideo('9NrALgHFwTo')


Out[61]:

In general, there are three different approaches to the classification problem:

1. Construct a discriminant function that directly assigns each $x$ to a class.

Or model the conditional probability distribution $p(C_k|x)$ and use it to make decisions, by either

2. modelling the conditional probabilities directly (optimising the parameters of a parametric model on the training data), or
3. adopting a generative approach: model the class-conditional densities $p(x|C_k)$ together with the priors $p(C_k)$, and use Bayes' theorem to compute the posteriors.

For classification the general linear model can be written

$$y(x) = f(w^Tx + w_0),$$

where $f(\cdot)$ is the activation function. The decision surfaces are linear functions of $x$ even if $f(\cdot)$ is nonlinear (although, unlike the regression case, the model is then no longer linear in the parameters).

Note: the likelihood is the probability of the labels given the parameters, whereas a class-conditional density is the probability of the data given the class.

4.1 Discriminant Functions

4.1.1 Two classes

A discriminant is a function that takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, denoted $C_k$. Consider first the case of two classes. The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, so that

$$y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0,$$

where $\mathbf{w}$ is the weight vector and $w_0$ is the bias. In this simplest case, an input vector $\mathbf{x}$ is assigned to class $C_1$ if $y(\mathbf{x}) \geq 0$ and to $C_2$ otherwise, so the decision boundary is defined by $y(\mathbf{x}) = 0$. The vector $\mathbf{w}$ determines the orientation of the decision surface: for any two points $\mathbf{x}_A$ and $\mathbf{x}_B$ lying on the surface,

$$\mathbf{w}^T(\mathbf{x}_A-\mathbf{x}_B) = 0,$$

so $\mathbf{w}$ is orthogonal to every vector lying within the surface. The bias $w_0$ determines its location: for any point $\mathbf{x}$ on the surface,

$$\frac{\mathbf{w}^T\mathbf{x}}{||\mathbf{w}||} = - \frac{w_0}{||\mathbf{w}||},$$

i.e. the normal distance from the origin to the decision surface is $-w_0/||\mathbf{w}||$.
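
As a quick illustration (not from the book), here is a minimal sketch using the synthetic X, y from the cell above; the weight vector and bias are hand-picked rather than fitted, purely to show the geometry of the boundary $y(\mathbf{x}) = 0$.

In [ ]:
# Hand-picked (not fitted) parameters, chosen only to illustrate the geometry.
w = np.array([1.0, 0.5])
w0 = -4.0

xx = np.linspace(-2, 8, 100)
# Points on the decision boundary satisfy w[0]*x1 + w[1]*x2 + w0 = 0,
# i.e. x2 = -(w[0]*x1 + w0) / w[1].
boundary = -(w[0] * xx + w0) / w[1]

fig = plt.figure(figsize=(5, 3.75))
ax = fig.add_subplot(111)
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.binary, zorder=2)
ax.plot(xx, boundary, 'r-')
# w is orthogonal to the boundary; the signed normal distance of the
# boundary from the origin is -w0 / ||w||.
print(-w0 / np.linalg.norm(w))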

In [79]:
import IPython.display as d
d.Image(filename="fig4.1.png")


Out[79]:

4.1.2 Multiple classes


In [81]:
d.Image(filename="fig4.2.png")


Out[81]:

Figure 4.2 shows why you can't naively construct a multi-class classifier out of two-class classifiers: both the one-versus-the-rest and one-versus-one constructions leave regions of input space whose classification is ambiguous.

Consider a K-class discriminant comprising K linear functions of the form

$$y_k(\mathbf{x}) = \mathbf{w_k}^T\mathbf{x} + w_{k0}$$

A point $\mathbf{x}$ is assigned to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \ne k$. The decision boundary between class $C_k$ and class $C_j$ is therefore given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, and hence corresponds to a $(D-1)$-dimensional hyperplane defined by

$$(\mathbf{w_k} - \mathbf{w_j})^T\mathbf{x} + (w_{k0} - w_{j0}) = 0 $$
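
A minimal sketch of this decision rule, with made-up weights (bias plus weight vector per class) and an arbitrary test point:

In [ ]:
# Hypothetical weights for K = 3 classes in D = 2 dimensions: row k holds
# (w_k0, w_k1, w_k2), i.e. the bias followed by the weight vector.
W_tilde = np.array([[ 0.0,  1.0, -0.5],
                    [-1.0, -0.2,  1.0],
                    [ 0.5,  0.3,  0.3]])

x_new = np.array([2.0, 1.0])              # an arbitrary test point
x_tilde = np.concatenate([[1.0], x_new])  # augment with a leading 1 for the bias

y_k = W_tilde @ x_tilde                   # y_k(x) = w_k^T x + w_k0 for each k
print(y_k, np.argmax(y_k))                # assign x to the class with largest y_k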

4.1.3 Least squares for classification

Group the $K$ linear discriminants together using augmented vector notation, with $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$ and the biases absorbed into $\tilde{\mathbf{W}}$:

$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}$$

A new input $\mathbf{x}$ will be assigned to the class for which $y_k = \mathbf{\tilde{w}}_k^T\mathbf{\tilde{x}}$ is largest. $\mathbf{\tilde{W}}$ is determined by minimising a sum-of-squares error function. Consider a training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$, $n = 1,\ldots,N$, where the targets $\mathbf{t}_n$ use a 1-of-$K$ coding, and define a matrix $\mathbf{T}$ whose $n^{th}$ row is the vector $\mathbf{t}^T_n$ and a matrix $\tilde{\mathbf{X}}$ whose $n^{th}$ row is $\tilde{\mathbf{x}}^T_n$. The sum-of-squares error function can be written

$$E_D(W) = \frac{1}{2}Tr\left\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\right\}$$

Setting the derivative with respect to $\tilde{\mathbf{W}}$ to zero and rearranging gives the least-squares solution

$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} = \tilde{\mathbf{X}}^{\dagger}\mathbf{T},$$

where $\tilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of $\tilde{\mathbf{X}}$, and so the discriminant function can be written

$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}} = \mathbf{T}^T\left(\tilde{\mathbf{X}}^{\dagger}\right)^T\tilde{\mathbf{x}}$$
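
A minimal sketch of the least-squares solution applied to the synthetic two-class data generated above, using a 1-of-K target matrix and numpy's pseudo-inverse:

In [ ]:
# 1-of-K target matrix for the synthetic two-class data generated earlier.
N, K = X.shape[0], 2
T = np.zeros((N, K))
T[np.arange(N), y.astype(int)] = 1

# Augmented design matrix: a column of ones absorbs the biases.
X_tilde = np.hstack([np.ones((N, 1)), X])

# Least-squares solution via the pseudo-inverse.
W_tilde = np.linalg.pinv(X_tilde) @ T

# Classify by taking the largest of the K discriminant values.
y_pred = np.argmax(X_tilde @ W_tilde, axis=1)
print(np.mean(y_pred == y))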

4.1.4 Fisher's Linear Discriminant

One way to view linear classification is as dimensionality reduction: project the data down to one dimension, $z = \mathbf{w}^T\mathbf{x}$, and choose the projection that minimises the overlap between classes. Fisher's criterion is to maximise the separation of the projected class means while minimising the within-class variance, i.e. maximise

$$J(\mathbf{w}) = \frac{(m_1-m_2)^2}{s_1^2 + s_2^2}$$

or

$$J(\mathbf{w}) = \frac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}} $$

where $\mathbf{S}_b$ is the between-class covariance matrix and $\mathbf{S}_w$ is the within-class covariance matrix. The projection direction is given by 'Fisher's linear discriminant':

$$\mathbf{w} \propto \mathbf{S}_w^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
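
A minimal sketch of Fisher's discriminant on the synthetic data above (assumes the X, y arrays from the earlier cell):

In [ ]:
# Class means and within-class scatter for the synthetic data.
X1, X2 = X[y == 0], X[y == 1]
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's linear discriminant direction (defined only up to scale).
w_fisher = np.linalg.solve(S_w, m2 - m1)
w_fisher /= np.linalg.norm(w_fisher)

# Project the data onto w; the projected class means should be well separated.
z = X @ w_fisher
print(z[y == 0].mean(), z[y == 1].mean(), z[y == 0].std(), z[y == 1].std())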

In [59]:
import IPython.display as d
d.Image(filename="fig4.6.png")


Out[59]:

Easily generalisable to D dimensions (see book for details...)

4.1.7 The Perceptron algorithm

Another example of a linear discriminant model (and one of historical importance, apparently). It is a two-class model in which the input is first transformed by a fixed nonlinear transformation into a feature vector $\phi(x)$, which is used to construct a generalised linear model of the form

$$y(x) = f(w^T\phi(x))$$

where the nonlinear activation function $f(\cdot)$ is a step function. The parameters $w$ of the discriminant are found by minimising an error function. We could choose to minimise the number of misclassified points, but that error is piecewise constant in $w$, with discontinuities wherever a change in $w$ moves the decision boundary across a data point, so its gradient cannot be used.

The alternative: the perceptron criterion.

Using the target coding $t_n \in \{-1, +1\}$, a correctly classified data point contributes zero error, whereas for a misclassified point we try to minimise the (positive) quantity $-w^T\phi(x_n)t_n$. So the perceptron criterion is

$$E_P(w) = - \sum_{n\in\mathcal{M}} w^T \phi_n t_n$$

where $\mathcal{M}$ is the set of misclassified data.

The contribution to the error from a particular misclassified point is now a linear function of $w$ in regions of $w$ space where that point is misclassified, and zero in regions where it is correctly classified.

Use stochastic gradient descent to find the weights: cycle through the training points and, for each misclassified one, update $w \leftarrow w + \eta\,\phi_n t_n$. The perceptron convergence theorem guarantees an exact solution in a finite number of steps, provided the training data are linearly separable.
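
A minimal sketch of these perceptron updates on the synthetic data above, using the feature vector $\phi(x) = (1, x)$ and targets $t_n \in \{-1, +1\}$ (illustrative choices, not from the book's example):

In [ ]:
# Perceptron on the synthetic data, with phi(x) = (1, x) and targets t in {-1, +1}.
Phi = np.hstack([np.ones((X.shape[0], 1)), X])
t = np.where(y == 1, 1.0, -1.0)

w_p = np.zeros(Phi.shape[1])
eta = 1.0   # the learning rate can be set to 1 without loss of generality here

for epoch in range(100):
    n_errors = 0
    for phi_n, t_n in zip(Phi, t):
        if (w_p @ phi_n) * t_n <= 0:      # point is misclassified
            w_p += eta * phi_n * t_n      # stochastic gradient step on E_P
            n_errors += 1
    if n_errors == 0:                     # converged (requires separable data)
        break
print(w_p, n_errors)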


In [72]:
d.Image(filename="fig4.7.png")


Out[72]:

4.2 Probabilistic Generative Models

Models with linear decision boundaries arise from simple assumptions about the data. In this section the class-conditional densities $p(x|C_k)$ are modelled, together with the class priors $p(C_k)$, and Bayes' theorem is used to compute the posteriors $p(C_k|x)$.

Consider 2 classes. The posterior for class 1 can be written

$$p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}$$
$$= \frac{1}{1+\exp(-a)} = \sigma(a)$$

where

$$a = \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} \qquad (*)$$

and $\sigma(a)$ is the logistic sigmoid function:


In [73]:
d.Image(filename="fig4.9.png")


Out[73]:

Generalising to K > 2 classes:

$$p(C_k|x) = \frac{p(x|C_k)p(C_k)}{\sum_jp(x|C_j)p(C_j)}$$
$$= \frac{\exp(a_k)}{\sum_j \exp(a_j)},$$

where $a_k = \ln p(x|C_k)p(C_k).$ The above is a 'normalised exponential', a multiclass generalisation of the logistic sigmoid, also known as the 'softmax function'.

If $a_k \gg a_j$ for all $j \neq k$, then $p(C_k|x) \simeq 1$ and $p(C_j|x) \simeq 0$, so the softmax acts as a smoothed version of the max function.
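
A minimal numpy sketch of the logistic sigmoid and the softmax; the max-subtraction is only for numerical stability and does not change the result:

In [ ]:
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())   # subtracting the max is for numerical stability only
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5])   # arbitrary activations a_k = ln p(x|C_k)p(C_k)
print(softmax(a))                # posterior probabilities p(C_k|x), summing to 1

# For K = 2 the softmax reduces to the logistic sigmoid of a_1 - a_2.
print(softmax([2.0, -1.0])[0], sigmoid(2.0 - (-1.0)))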

Now explore different forms for the class-conditional densities, first for continuous and then for discrete inputs.

4.2.1 Continuous inputs

Assume class-conditional densities are Gaussian and share the same covariance matrix:

$$p(x|C_k) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}} \exp \left\{-\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)\right\}$$

Consider two classes and substitute the Gaussian class-conditional densities into $(*)$:

$$p(C_1|x) = \sigma(w^Tx + w_0)$$

The quadratic terms in $x$ in the exponents cancel because the covariances are shared, leaving a linear function of $x$ with

$$w = \Sigma^{-1}(\mu_1 - \mu_2),$$
$$w_0 = -\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}.$$
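
A minimal sketch on the synthetic data above: estimate class means and a shared covariance, then form $w$ and $w_0$ as above. (The two clusters were actually drawn with different covariances, so the shared-$\Sigma$ assumption only holds approximately here.)

In [ ]:
# Fit class means and a shared (pooled) covariance to the synthetic data.
# Note: the two clusters above were generated with different covariances,
# so the shared-Sigma assumption is only approximate here.
X1, X2 = X[y == 0], X[y == 1]
mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
N1, N2 = len(X1), len(X2)
Sigma = ((X1 - mu1_hat).T @ (X1 - mu1_hat)
         + (X2 - mu2_hat).T @ (X2 - mu2_hat)) / (N1 + N2)

pi = N1 / (N1 + N2)          # prior p(C_1), with C_1 taken as the y == 0 cluster
Sigma_inv = np.linalg.inv(Sigma)
w_gen = Sigma_inv @ (mu1_hat - mu2_hat)
w0_gen = (-0.5 * mu1_hat @ Sigma_inv @ mu1_hat
          + 0.5 * mu2_hat @ Sigma_inv @ mu2_hat
          + np.log(pi / (1 - pi)))

# The posterior p(C_1|x) = sigma(w^T x + w0) is the sigmoid of a linear function of x.
p_c1 = 1.0 / (1.0 + np.exp(-(X @ w_gen + w0_gen)))
print(p_c1[:3], p_c1[-3:])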


In [74]:
d.Image(filename="fig4.10.png")


Out[74]:

Decision boundaries correspond to surfaces along which the posterior probabilities $p(C_k|x)$ are constant, and so are given by linear functions of $x$; therefore the decision boundaries are linear in input space. The prior probabilities $p(C_k)$ enter only through the bias parameter $w_0$, so changes in the priors have the effect of making parallel shifts of the decision boundary and, more generally, of the parallel contours of constant posterior probability.

This is trivially generalisable to more than 2 classes by using the normalised exponential instead of the logistic sigmoid.

If you relax the assumption of shared covariances you get quadratic functions of x, giving rise to a quadratic discriminant.


In [75]:
d.Image(filename="fig4.11.png")


Out[75]:

4.2.2 Maximum likelihood solution

Once you have a parametric functional form for the class-conditional densities, you can find the parameter values using maximum likelihood. For two classes with a shared covariance matrix, priors $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$, and targets $t_n = 1$ denoting class $C_1$, the likelihood is given by

$$p(\mathbf{t}|\pi, \mu_1,\mu_2,\Sigma) = \prod_{n=1}^{N}\left[\pi\,\mathcal{N}(x_n|\mu_1,\Sigma)\right]^{t_n}\left[(1-\pi)\,\mathcal{N}(x_n|\mu_2,\Sigma)\right]^{1-t_n}$$

Maximising with respect to each parameter gives $\pi = N_1/N$ (the fraction of points in class $C_1$), the class sample means for $\mu_1$ and $\mu_2$, and a weighted average of the two class covariance matrices for $\Sigma$.
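
A minimal sketch of these maximum-likelihood estimates on the synthetic data above, identifying class $C_1$ with the first cluster (y == 0) purely for illustration:

In [ ]:
# ML estimates for the two-class shared-covariance model, with t_n = 1 for class C_1.
t = (y == 0).astype(float)
N = len(t)
N1 = t.sum()
N2 = N - N1

pi_hat = N1 / N                                     # p(C_1): fraction of points in C_1
mu1_hat = (t[:, None] * X).sum(axis=0) / N1         # sample mean of class C_1
mu2_hat = ((1 - t)[:, None] * X).sum(axis=0) / N2   # sample mean of class C_2

# Shared covariance: weighted average of the per-class scatter matrices.
S1 = (X[t == 1] - mu1_hat).T @ (X[t == 1] - mu1_hat) / N1
S2 = (X[t == 0] - mu2_hat).T @ (X[t == 0] - mu2_hat) / N2
Sigma_hat = (N1 / N) * S1 + (N2 / N) * S2
print(pi_hat, mu1_hat, mu2_hat)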

4.2.3 Discrete features

Feature values are treated as independent given the class (the naive Bayes assumption), so the class-conditional densities factorise over the features. For binary features $x_i \in \{0, 1\}$ this gives

$$p(x|C_k) = \prod_{i=1}^{D}\mu_{ki}^{x_i}(1-\mu_{ki})^{1-x_i},$$

and the corresponding $a_k = \ln p(x|C_k)p(C_k)$ are again linear functions of the inputs $x_i$.
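
A minimal sketch with made-up Bernoulli parameters $\mu_{ki}$ and priors, just to show that $a_k$ is linear in the binary features:

In [ ]:
# Made-up Bernoulli parameters mu[k, i] = p(x_i = 1 | C_k) and priors p(C_k),
# purely for illustration.
mu = np.array([[0.8, 0.1, 0.6],
               [0.3, 0.7, 0.4]])
priors = np.array([0.5, 0.5])

x_bin = np.array([1, 0, 1])   # an arbitrary binary feature vector

# a_k(x) = sum_i { x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) } + ln p(C_k):
# linear in the features x_i.
a = (x_bin * np.log(mu) + (1 - x_bin) * np.log(1 - mu)).sum(axis=1) + np.log(priors)

# Posteriors via the normalised exponential (softmax).
post = np.exp(a - a.max())
post /= post.sum()
print(post)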