Assignment 3 - basic classifiers

Math practice and coding application for main classifiers introduced in Chapter 3 of the Python machine learning book.

Weighting

Note that this assignment is more difficult than the previous ones, and thus has a higher weighting (3) and a longer duration (3 weeks). Each of the previous two assignments has a weighting of 1.

Specifically, the first 3 assignments contribute to your continuous assessment as follows:

Assignment weights: $w_1 = 1, w_2 = 1, w_3 = 3$

Assignment grades: $g_1, g_2, g_3$

Weighted average: $\frac{1}{\sum_i w_i} \times \sum_i \left(w_i \times g_i \right)$

Future assignments will be added analogously.
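For illustration, a minimal sketch of this weighted average in Python (the grade values below are made-up placeholders, not actual scores):

In [ ]:
weights = [1, 1, 3]          # w1, w2, w3
grades = [85, 90, 78]        # g1, g2, g3 (hypothetical values)

weighted_average = sum(w * g for w, g in zip(weights, grades)) / sum(weights)
print(weighted_average)      # (1*85 + 1*90 + 3*78) / 5 = 81.8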

RBF kernel (20 points)

Show that a Gaussian RBF kernel can be expressed as a dot product: $$ K(\mathbf{x}, \mathbf{y}) = e^\frac{-|\mathbf{x} - \mathbf{y}|^2}{2} = \phi(\mathbf{x})^T \phi(\mathbf{y}) $$ by spelling out the mapping function $\phi$.

For simplicity

  • you can assume both $\mathbf{x}$ and $\mathbf{y}$ are 2D vectors $ \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} , \; \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} $
  • we use a scalar unit variance here

even though the proof can be extended to higher-dimensional vectors $\mathbf{x}$, $\mathbf{y}$ and general covariance matrices.

Hint: use Taylor series expansion of the exponential function

Answer

$$ e^\frac{-|\mathbf{x} - \mathbf{y}|^2}{2} = e^{-\frac{|\mathbf{x}|^2}{2} - \frac{|\mathbf{y}|^2}{2} + \mathbf{x}^T \mathbf{y}} = e^{-\frac{|\mathbf{x}|^2}{2}} e^{-\frac{|\mathbf{y}|^2}{2}} e^{\mathbf{x}^T \mathbf{y}} $$

Recall (from calculus) that the Taylor series expansion of the exp function is: $$ e^z = \sum_{k=0}^\infty \frac{z^k}{k!} $$

Thus $$ \begin{align} e^\frac{-|\mathbf{x} - \mathbf{y}|^2}{2} &= \sum_{k=0}^\infty \frac{\left( \mathbf{x}^T \mathbf{y} \right)^k}{k!} e^{-\frac{|\mathbf{x}|^2}{2}} e^{-\frac{|\mathbf{y}|^2}{2}} \\ &= \sum_{k=0}^\infty \frac{\left( x_1 y_1 + x_2 y_2 \right)^k}{k!} e^{-\frac{|\mathbf{x}|^2}{2}} e^{-\frac{|\mathbf{y}|^2}{2}} \\ &= \sum_{k=0}^\infty \frac{\sum_{\ell = 0}^{k} C_\ell^k (x_1 y_1)^\ell (x_2 y_2)^{k-\ell}}{k!} e^{-\frac{|\mathbf{x}|^2}{2}} e^{-\frac{|\mathbf{y}|^2}{2}} \end{align} $$

From this we can see: $$ \phi(\mathbf{x}) = e^{-\frac{|\mathbf{x}|^2}{2}} \biguplus_{k=0}^\infty \biguplus_{\ell = 0}^{k} \frac{\sqrt{C_\ell^k} x_1^\ell x_2^{k-\ell}}{\sqrt{k!}} $$ , where $\biguplus$ means ordered enumeration of vector components, e.g. $$ \biguplus_{k=0}^2 x_k = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \end{pmatrix} $$ and $$ \biguplus_{k=0}^2 \biguplus_{\ell=0}^k x_{k\ell} = \begin{pmatrix} x_{00} \\ x_{10} \\ x_{11} \\ x_{20} \\ x_{21} \\ x_{22} \end{pmatrix} $$
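As an optional numerical sanity check (not part of the required proof), the sketch below truncates the feature map at a finite order `k_max` and verifies that $\phi(\mathbf{x})^T \phi(\mathbf{y})$ closely matches the kernel evaluated directly; the truncation order and test vectors are arbitrary choices, and the coefficients use the simplification $\sqrt{C_\ell^k / k!} = 1/\sqrt{\ell!\,(k-\ell)!}$.

In [ ]:
import numpy as np
from math import factorial

def phi(v, k_max=20):
    # Truncated feature map: component (k, l) is
    # exp(-|v|^2 / 2) * sqrt(C(k, l) / k!) * v1^l * v2^(k-l),
    # where sqrt(C(k, l) / k!) simplifies to 1 / sqrt(l! * (k-l)!).
    v1, v2 = v
    comps = [v1**l * v2**(k - l) / np.sqrt(factorial(l) * factorial(k - l))
             for k in range(k_max + 1) for l in range(k + 1)]
    return np.exp(-np.dot(v, v) / 2) * np.array(comps)

x = np.array([0.3, -0.7])
y = np.array([1.2, 0.5])

print(np.exp(-np.sum((x - y)**2) / 2))  # kernel evaluated directly
print(phi(x).dot(phi(y)))               # dot product of truncated feature maps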

Kernel SVM complexity (10 points)

How would the complexity (in terms of number of parameters) of a trained kernel SVM change with the amount of training data, and why? Note that the answer may depend on the specific kernel used as well as the amount of training data. Consider specifically the following types of kernels $K(\mathbf{x}, \mathbf{y})$.

  • linear: $$ K\left(\mathbf{x}, \mathbf{y}\right) = \mathbf{x}^T \mathbf{y} $$
  • polynomial with degree $q$: $$ K\left(\mathbf{x}, \mathbf{y}\right) = (\mathbf{x}^T\mathbf{y} + 1)^q $$
  • RBF with distance function $D$: $$ K\left(\mathbf{x}, \mathbf{y} \right) = e^{-\frac{D\left(\mathbf{x}, \mathbf{y} \right)}{2s^2}} $$

Answer

Parametric versus non-parametric

As a quick recap of the lecture slides, machine learning models lie on a spectrum from parametric to non-parametric. Two extreme examples are below. Where does an SVM with each of the kernels above fall?

Extreme end of being parametric

Linear classifiers such as the perceptron have a fixed number of parameters $\mathbf{w}$ regardless of the size of the training data.

Extreme end of being non-parametric

KNN has no explicitly trained parameters and relies entirely on the training data. One way to look at this is to treat the entire training set as the parameters of KNN.

The RBF kernel projects to an infinite-dimensional output space, so the trained model cannot be summarized by a fixed-length weight vector; its parameter complexity depends on the amount of training data, as reflected in the Gram matrix (see IML Chapter 13) with dimension $N \times N$, where $N$ is the number of training samples: $$ K_{ts} = K(\mathbf{x}^{(t)}, \mathbf{x}^{(s)}) $$

Both linear and polynomial kernels project to a finite-dimensional output/mapped space. In particular, the linear kernel remains in the original space. Thus, the number of parameters is determined by the dimensionality of the output space.

So, a kernel SVM is more non-parametric with an RBF kernel but more parametric with a linear or polynomial kernel.
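As an optional empirical sketch (not required for the answer): train `SVC` with a linear and an RBF kernel on growing subsets of a dataset and inspect the number of support vectors, which is what a trained kernel SVM has to store; the dataset and subset sizes below are arbitrary choices. For the linear kernel the dual representation can be collapsed into a single weight vector per class (`coef_`), so its effective parameter count stays fixed even if the number of support vectors grows.

In [ ]:
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits_demo = load_digits()
X_demo, y_demo = digits_demo.data, digits_demo.target

for n in [200, 500, 1000]:
    for kernel in ['linear', 'rbf']:
        svm = SVC(kernel=kernel, C=1.0, random_state=0)
        svm.fit(X_demo[:n], y_demo[:n])
        print('%6s kernel, %4d samples: %d support vectors'
              % (kernel, n, len(svm.support_)))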

Gaussian density Bayes (30 points)

$$ p\left(\Theta | \mathbf{X}\right) = \frac{p\left(\mathbf{X} | \Theta\right) p\left(\Theta\right)}{p\left(\mathbf{X}\right)} $$

Assume both the likelihood and prior have Gaussian distributions:

$$ \begin{align} p(\mathbf{X} | \Theta) &= \frac{1}{(2\pi)^{N/2}\sigma^N} \exp\left(-\frac{\sum_{t=1}^N (\mathbf{x}^{(t)} - \Theta)^2}{2\sigma^2}\right) \\ p(\Theta) &= \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left( -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \end{align} $$

Derive $\Theta$ from the dataset $\mathbf{X}$ via the following methods:

ML (maximum likelihood) estimation

$$ \Theta_{ML} = argmax_{\Theta} p(\mathbf{X} | \Theta) $$

MAP estimation

$$ \begin{align} \Theta_{MAP} &= argmax_{\Theta} p(\Theta | \mathbf{X}) \\ &= argmax_{\Theta} p(\mathbf{X} | \Theta) p(\Theta) \end{align} $$

Bayes estimation

$$ \begin{align} \Theta_{Bayes} &= E(\Theta | \mathbf{X}) \\ &= \int \Theta p(\Theta | \mathbf{X}) d\Theta \end{align} $$

Answer

As a common trick, to maximize $\exp(f)\exp(g)$, we can maximize $\log\left(\exp(f)\exp(g)\right) = \log\left(\exp(f+g)\right) = f+g$.

ML

$$ \begin{align} \Theta_{ML} &= argmax_{\Theta} p(\mathbf{X} | \Theta) \\ &= argmax_{\Theta} \log(p(\mathbf{X} | \Theta)) \\ &= argmax_{\Theta} \left( -\log\left((2\pi)^{N/2} \sigma^N\right) - \frac{1}{2\sigma^2} \sum_{t=1}^N \left( \mathbf{x}^{(t)} - \Theta \right)^2 \right) \\ &= argmin_{\Theta} \sum_{t=1}^N \left( \mathbf{x}^{(t)} - \Theta \right)^2 \end{align} $$

Take the derivative and set it to zero: $$ 0 = \frac{\partial \sum_{t=1}^N \left( \mathbf{x}^{(t)} - \Theta \right)^2 }{\partial \Theta} = 2 \sum_{t=1}^N \left( \Theta - \mathbf{x}^{(t)} \right) $$

We obtain $$ \Theta_{ML} = \frac{\sum_{t=1}^N \mathbf{x}^{(t)}}{N} $$
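As a quick numerical sanity check (a sketch with synthetic data; the true mean, noise scale and sample size are arbitrary), minimizing the negative log-likelihood directly should recover the sample mean:

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

np.random.seed(0)
x_data = np.random.normal(loc=2.0, scale=1.5, size=100)  # synthetic 1D dataset

def neg_log_likelihood(theta):
    # sum of squared deviations; the other terms are constant in theta
    return np.sum((x_data - theta)**2)

print(minimize_scalar(neg_log_likelihood).x)  # numerical Theta_ML
print(x_data.mean())                          # closed-form Theta_ML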

MAP

$$ \begin{align} \Theta_{MAP} &= argmax_{\Theta} p(\Theta | \mathbf{X}) \\ &= argmax_{\Theta} p(\mathbf{X} | \Theta) p(\Theta) \\ &= argmax_{\Theta} \left( \log\left(p(\mathbf{X} | \Theta)\right) + \log\left(p(\Theta)\right) \right) \\ &= argmax_{\Theta} \left( -\log\left((2\pi)^{N/2}\sigma^N \right) -\frac{\sum_{t=1}^N \left(\mathbf{x}^{(t)} - \Theta\right)^2}{2\sigma^2} -\log\left(\sqrt{2\pi}\sigma_0\right) -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \\ &= argmin_{\Theta} \left( \frac{1}{2\sigma^2} \sum_{t=1}^N \left( \mathbf{x}^{(t)} - \Theta \right)^2 + \frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \end{align} $$

Take the derivative and set it to zero: $$ 0 = \frac{\partial \left( \frac{1}{2\sigma^2} \sum_{t=1}^N \left( \mathbf{x}^{(t)} - \Theta \right)^2 + \frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right)}{\partial \Theta} = \frac{1}{\sigma^2} \sum_{t=1}^N \left( \Theta - \mathbf{x}^{(t)} \right) + \frac{1}{\sigma_0^2} \left( \Theta - \mu_0 \right) $$

We have $$ \Theta_{MAP} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2} \mathbf{m} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2} \mu_0 $$ , where $$ \mathbf{m} = \frac{1}{N} \sum_{t=1}^N \mathbf{x}^{(t)} $$
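Similarly, a hedged numerical check of the MAP closed form (synthetic data; the noise scale, prior mean and prior variance below are arbitrary choices): the formula above should agree with a direct numerical maximization of the log posterior.

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

np.random.seed(0)
sigma, sigma0, mu0 = 1.5, 1.0, 0.0                    # assumed noise scale and prior
x_data = np.random.normal(loc=2.0, scale=sigma, size=20)
N, m = len(x_data), x_data.mean()

# closed form derived above
theta_map = (N / sigma**2 * m + mu0 / sigma0**2) / (N / sigma**2 + 1 / sigma0**2)

def neg_log_posterior(theta):
    # negative log of likelihood * prior, up to additive constants
    return (np.sum((x_data - theta)**2) / (2 * sigma**2)
            + (theta - mu0)**2 / (2 * sigma0**2))

print(theta_map)
print(minimize_scalar(neg_log_posterior).x)           # the two should agree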

Alternatively

Note that the product of two Gaussians is still a Gaussian: $$ \begin{align} p(\Theta | \mathbf{X}) & = \frac{p(\mathbf{X} | \Theta) p(\Theta)}{p(\mathbf{X})} \\ &= \frac{1}{(2\pi)^{N/2}\sigma^N \sqrt{2\pi}\sigma_0 p(\mathbf{X})} \exp\left(-\frac{\sum_{t=1}^N (\mathbf{x}^{(t)} - \Theta)^2}{2\sigma^2}\right) \exp\left( -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \\ &= \frac{1}{(2\pi)^{N/2}\sigma^N \sqrt{2\pi}\sigma_0 p(\mathbf{X})} \exp\left(-\frac{\sum_{t=1}^N (\mathbf{x}^{(t)} - \Theta)^2}{2\sigma^2} -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \\ &= \frac{1}{(2\pi)^{N/2}\sigma^N \sqrt{2\pi}\sigma_0 p(\mathbf{X})} \exp\left(-\frac{\sigma_0^2 \sum_{t=1}^N (\Theta - \mathbf{x}^{(t)})^2 + \sigma^2 (\Theta - \mu_0)^2}{2(\sigma\sigma_0)^2} \right) \\ &= \frac{1}{(2\pi)^{N/2}\sigma^N \sqrt{2\pi}\sigma_0 p(\mathbf{X})} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \left( \left(\Theta - \Theta_{MAP}\right)^2 - \Theta_{MAP}^2 + \frac{\sigma_0^2 \sum_{t=1}^N \left(\mathbf{x}^{(t)}\right)^2 + \sigma^2 \mu_0^2}{N\sigma_0^2 + \sigma^2} \right) \right) \\ &= \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \left(\Theta - \Theta_{MAP}\right)^2 \right) \\ \frac{1}{\bigtriangleup} &= \frac{1}{(2\pi)^{N/2}\sigma^N \sqrt{2\pi}\sigma_0 p(\mathbf{X})} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \left(-\Theta_{MAP}^2 + \frac{\sigma_0^2 \sum_{t=1}^N \left(\mathbf{x}^{(t)}\right)^2 + \sigma^2 \mu_0^2}{N\sigma_0^2 + \sigma^2}\right) \right) \end{align} $$

$\bigtriangleup$ is constant with respect to $\Theta$, so $p(\Theta | \mathbf{X})$ is a Gaussian centered at $\Theta_{MAP}$, and this mean $\Theta_{MAP}$ maximizes $p(\Theta | \mathbf{X})$.

Bayes

From the single-Gaussian interpretation of $p(\Theta | \mathbf{X})$ above (the fourth step below substitutes $\Theta \rightarrow \Theta + \Theta_{MAP}$) we know: $$ \begin{align} \Theta_{Bayes} &= E(\Theta | \mathbf{X}) \\ &= \int \Theta p(\Theta | \mathbf{X}) d\Theta \\ &= \int \Theta \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \left(\Theta - \Theta_{MAP}\right)^2 \right) d\Theta \\ &= \int \left( \Theta + \Theta_{MAP} \right) \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \Theta^2 \right) d\Theta \\ &= \int \Theta \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \Theta^2 \right) d\Theta + \Theta_{MAP} \int \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \Theta^2 \right) d\Theta \end{align} $$

The integrand of the first term above is an odd function of $\Theta$, so the integral is zero: $$ 0 = \int \Theta \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \Theta^2 \right) d\Theta $$

And the integral in the second term above is $1$ since the density function is normalized: $$ 1 = \int \frac{1}{\bigtriangleup} \exp \left( -\frac{N\sigma_0^2 + \sigma^2}{2(\sigma\sigma_0)^2} \Theta^2 \right) d\Theta $$

Thus $\Theta_{Bayes} = \Theta_{MAP}$
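As one more optional sketch, the posterior mean can also be computed by numerical integration of the (unnormalized) posterior on a grid; it should coincide with $\Theta_{MAP}$, consistent with the single-Gaussian argument above. The synthetic data, prior parameters and grid are arbitrary choices.

In [ ]:
import numpy as np

np.random.seed(0)
sigma, sigma0, mu0 = 1.5, 1.0, 0.0
x_data = np.random.normal(loc=2.0, scale=sigma, size=20)
N, m = len(x_data), x_data.mean()

theta = np.linspace(-5.0, 5.0, 20001)                 # integration grid
log_post = (-np.sum((x_data[:, None] - theta)**2, axis=0) / (2 * sigma**2)
            - (theta - mu0)**2 / (2 * sigma0**2))
post = np.exp(log_post - log_post.max())              # unnormalized posterior
post /= np.trapz(post, theta)                         # normalize numerically

theta_bayes = np.trapz(theta * post, theta)           # posterior mean
theta_map = (N / sigma**2 * m + mu0 / sigma0**2) / (N / sigma**2 + 1 / sigma0**2)
print(theta_bayes)
print(theta_map)                                      # the two should agree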

Hand-written digit classification (40 points)

In the textbook sample code we applied different scikit-learn classifiers to the Iris data set.

In this exercise, we will apply the same set of classifiers to a different data set: hand-written digits. Please write down the code for the different classifiers, choose their hyper-parameters, and compare their performance via the accuracy score, as with the Iris data set. Which classifier(s) perform(s) the best and worst, and why?

The classifiers include:

  • perceptron
  • logistic regression
  • SVM
  • decision tree
  • random forest
  • KNN
  • naive Bayes

The dataset is available as part of scikit-learn, as follows.


In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,scipy,sklearn
%matplotlib inline


last updated: 2016-10-05 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
pandas 0.18.1
matplotlib 1.5.1
scipy 0.17.1
sklearn 0.18

In [2]:
# Version check to handle the scikit-learn 0.18 module reorganization
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version

Load data


In [3]:
from sklearn.datasets import load_digits
digits = load_digits()

X = digits.data # feature matrix: one 64-dimensional row per 8x8 image
y = digits.target # class labels (digits 0 to 9)

print(X.shape)
print(y.shape)


(1797, 64)
(1797,)

Visualize data


In [4]:
import matplotlib.pyplot as plt

num_rows = 4
num_cols = 5

fig, ax = plt.subplots(nrows=num_rows, ncols=num_cols, sharex=True, sharey=True)
ax = ax.flatten()
for index in range(num_rows*num_cols):
    img = digits.images[index]
    label = digits.target[index]
    ax[index].imshow(img, cmap='Greys', interpolation='nearest')
    ax[index].set_title('digit ' + str(label))

ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()


Answer

Data Preprocessing

Hint: How do you divide the data into training and test sets? Apply other techniques we have learned if needed. You could take a look at the Iris data set case in the textbook.


In [5]:
# Your code goes here

Data sets: training versus test


In [6]:
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split
    
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

num_training = y_train.shape[0]
num_test = y_test.shape[0]
print('training: ' + str(num_training) + ', test: ' + str(num_test))


training: 1257, test: 540

Data scaling


In [7]:
from sklearn.preprocessing import StandardScaler

scaling_option = 'standard'

if scaling_option == 'standard':
    sc = StandardScaler()
    sc.fit(X_train)
    X_train_std = sc.transform(X_train)
    X_test_std = sc.transform(X_test)
else:
    X_train_std = X_train
    X_test_std = X_test

Accuracy score


In [8]:
from sklearn.metrics import accuracy_score

Classifier #1 Perceptron

Perceptron


In [10]:
from sklearn.linear_model import Perceptron

# training
ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
_ = ppn.fit(X_train_std, y_train)

In [11]:
# prediction
y_pred = ppn.predict(X_test_std)
# print('Misclassified samples: %d out of %d' % ((y_test != y_pred).sum(), y_test.shape[0]))
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.93

Classifier #2 Logistic Regression

Logistic regression


In [13]:
# training
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, random_state=0)
_ = lr.fit(X_train_std, y_train)

In [14]:
# prediction
y_pred = lr.predict(X_test_std)
print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.94

Classifier #3 SVM

SVM


In [16]:
from sklearn.svm import SVC

kernels = ['linear', 'rbf']

for kernel in kernels:
    svm = SVC(kernel=kernel, C=1.0, random_state=0)
    
    # training
    svm.fit(X_train_std, y_train)
    
    # testing
    y_pred = svm.predict(X_test_std)
    print('Accuracy for ' + kernel + ' kernel: %.2f' % accuracy_score(y_test, y_pred))


Accuracy for linear kernel: 0.97
Accuracy for rbf kernel: 0.99

Classifier #4 Decision Tree

Decision tree


In [18]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', random_state=1)

# training
tree.fit(X_train, y_train)

# testing
y_pred = tree.predict(X_test)
print('Accuracy: %2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.853704

Classifier #5 Random Forest

Random forest


In [20]:
from sklearn.ensemble import RandomForestClassifier

num_estimators = [1, 5, 10, 20, 40]

for n_estimators in num_estimators:
    forest = RandomForestClassifier(criterion='entropy',
                                    n_estimators=n_estimators, 
                                    random_state=1)
    
    # training
    forest.fit(X_train, y_train)
    
    # testing
    y_pred = forest.predict(X_test)
    print('Accuracy for %d estimators: %2f' % (n_estimators, accuracy_score(y_test, y_pred)))


Accuracy for 1 estimators: 0.755556
Accuracy for 5 estimators: 0.903704
Accuracy for 10 estimators: 0.940741
Accuracy for 20 estimators: 0.964815
Accuracy for 40 estimators: 0.979630

Classifier #6 KNN

KNN


In [22]:
from sklearn.neighbors import KNeighborsClassifier

num_neighbors = [1, 5, 10, 20, 100]

for n_neighbors in num_neighbors:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, p=2, metric='minkowski')
    
    # training
    knn.fit(X_train_std, y_train)
    
    # testing
    y_pred = knn.predict(X_test_std)
    print('Accuracy for %d neighbors: %2f' % (n_neighbors, accuracy_score(y_test, y_pred)))


Accuracy for 1 neighbors: 0.974074
Accuracy for 5 neighbors: 0.970370
Accuracy for 10 neighbors: 0.962963
Accuracy for 20 neighbors: 0.944444
Accuracy for 100 neighbors: 0.888889

Classifier #7 Naive Bayes

Naive Bayes


In [24]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

_ = gnb.fit(X_train_std, y_train)

In [25]:
y_pred = gnb.predict(X_test_std)

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))


Accuracy: 0.77

Comparing classifiers

Naive Bayes has the worst performance as its feature independence assumption does not hold for digit-image pixels.

The decision tree has the second-worst performance for a similar reason: splitting on one feature dimension at a time is not well suited to correlated features such as neighboring pixels.

Other classifiers achieve decent performance (at least 90% accuracy with properly chosen hyper-parameters). Perceptron and logistic regression have slightly lower accuracy, likely because they are simple linear classifiers. Random forest performs better than a single decision tree, as expected. KNN works well as long as k is not so large that it over-smooths the decision boundary.

Bonus question: an SVM with the RBF kernel doesn't work well with un-normalized data. Why?
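One way to explore the bonus question (a sketch reusing the variables `X_train`, `X_train_std`, `X_test`, `X_test_std`, `y_train`, `y_test` defined in the cells above): fit the same RBF SVM on the raw and on the standardized features and compare test accuracies. The RBF kernel's response depends on the scale of squared distances relative to its bandwidth, so unscaled features can push the kernel values toward 0 or 1 and degrade accuracy.

In [ ]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

for name, (Xtr, Xte) in [('raw', (X_train, X_test)),
                         ('standardized', (X_train_std, X_test_std))]:
    svm = SVC(kernel='rbf', C=1.0, random_state=0)
    svm.fit(Xtr, y_train)
    y_pred = svm.predict(Xte)
    print('%s features, accuracy: %.2f' % (name, accuracy_score(y_test, y_pred)))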

Note that we haven't touched upon over-fitting and training/test data split yet, which will be covered in the next chapter.