Note that this assignment is more difficult than the previous ones, and thus carries a higher weight (3) and a longer duration (3 weeks). Each of the previous two assignments has a weight of 1.
Specifically, the first 3 assignments contribute to your continuous assessment as follows:
Assignment weights: $w_1 = 1, w_2 = 1, w_3 = 3$
Assignment grades: $g_1, g_2, g_3$
Weighted average: $\frac{1}{\sum_i w_i} \times \sum_i \left(w_i \times g_i \right)$
Future assignments will be added analogously.
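For example, with hypothetical grades $g_1 = 80$, $g_2 = 90$, $g_3 = 70$, the weighted average would be $$ \frac{1 \times 80 + 1 \times 90 + 3 \times 70}{1 + 1 + 3} = \frac{380}{5} = 76. $$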
Show that a Gaussian RBF kernel can be expressed as a dot product: $$ K(\mathbf{x}, \mathbf{y}) = e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2}} = \phi(\mathbf{x})^T \phi(\mathbf{y}) $$ by spelling out the mapping function $\phi$.
For simplicity, you may assume 2-D vectors $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (y_1, y_2)$ and unit variance, even though the proof can be extended to general vectors $\mathbf{x}$, $\mathbf{y}$ and general covariance matrices.
Hint: use the Taylor series expansion of the exponential function.
Expanding $\|\mathbf{x}-\mathbf{y}\|^2 = \|\mathbf{x}\|^2 - 2\,\mathbf{x}^T\mathbf{y} + \|\mathbf{y}\|^2$ and applying the Taylor series of the exponential function, $$ K(\mathbf{x}, \mathbf{y}) = e^{-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2}} = e^{-\frac{\|\mathbf{x}\|^2}{2}}\cdot e^{-\frac{\|\mathbf{y}\|^2}{2}}\cdot e^{\mathbf{x}^{T} \mathbf{y}} = e^{-\frac{\|\mathbf{x}\|^2}{2}}\cdot e^{-\frac{\|\mathbf{y}\|^2}{2}} \cdot \sum_{n=0}^{\infty}\frac{(\mathbf{x}^T \mathbf{y})^n}{n!} $$
Let $$K'(\mathbf{x}, \mathbf{y}) = \sum_{n=0}^{\infty}\frac{(\mathbf{x}^T \mathbf{y})^n}{n!}=1+(x_1y_1+x_2y_2)+\frac{(x_1y_1+x_2y_2)^2}{2}+\cdots, $$ which is a weighted sum of polynomial kernels of every degree $n$.
Thus, $K'(\mathbf{x}, \mathbf{y})$ corresponds to the mapping function $\phi^*(\mathbf{x}) = [1, x_1, x_2, \frac{x_1^{2}}{\sqrt{2}}, x_1x_2, \frac{x_2^{2}}{\sqrt{2}}, \ldots]$, and
$$ \phi(\mathbf{x}) = e^{-\frac{\|\mathbf{x}\|^2}{2}} \cdot \phi^*(\mathbf{x})= e^{-\frac{\|\mathbf{x}\|^2}{2}} \cdot [1, x_1, x_2, \frac{x_1^{2}}{\sqrt{2}}, x_1x_2, \frac{x_2^{2}}{\sqrt{2}}, \ldots] $$ More precisely, the $\left(\frac{n(n+1)}{2}+i+1\right)^{\text{th}}$ entry of $\phi^*(\mathbf{x})$ is
$$ \phi_{\frac{n(n+1)}{2}+i+1}^*(\mathbf{x}) = \frac{x_1^{n-i} x_2^{i}}{\sqrt{i!\,(n-i)!}},\quad \forall n=0,1,2,\ldots \quad \textrm{and} \quad i=0,1,2,\ldots,n $$
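As a quick numerical sanity check (not required by the proof), the sketch below, which assumes 2-D inputs and a hand-picked truncation degree n_max, compares the truncated $\phi(\mathbf{x})^T\phi(\mathbf{y})$ with the closed-form RBF kernel value:

import numpy as np
from math import factorial

def phi(v, n_max=15):
    # truncated feature map: entries x1^(n-i) * x2^i / sqrt(i! (n-i)!) for n = 0..n_max
    feats = [v[0]**(n - i) * v[1]**i / np.sqrt(factorial(i) * factorial(n - i))
             for n in range(n_max + 1) for i in range(n + 1)]
    return np.exp(-np.dot(v, v) / 2) * np.array(feats)

x, y = np.array([0.3, -0.7]), np.array([1.1, 0.4])
print(np.exp(-np.linalg.norm(x - y)**2 / 2))  # exact RBF kernel value
print(phi(x).dot(phi(y)))                     # truncated dot product, should agree closely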
How would the complexity (in terms of the number of parameters) of a trained kernel SVM change with the amount of training data, and why? Note that the answer may depend on the specific kernel used as well as the amount of training data. Consider specifically the following types of kernels $K(\mathbf{x}, \mathbf{y})$.

Let $n$ denote the dimensionality of the input vectors (the number of features) and $N$ the number of training samples.
For the linear kernel $K\left(\mathbf{x}, \mathbf{y}\right) = \mathbf{x}^T \mathbf{y}$, the mapping function is $\phi(\mathbf{x})=[x_1, x_2, x_3, \ldots, x_n]^T$, so the trained SVM can be stored as a single weight vector with $O(n)$ parameters. Since this depends only on the feature dimension $n$, the number of parameters stays essentially constant as the amount of training data $N$ grows.
For the polynomial kernel $K(\mathbf{x},\mathbf{y})=\left(\mathbf{x}^{T}\mathbf{y}+1\right)^{q}=\left(1+x_1 y_1+x_2 y_2+\ldots+x_n y_n\right)^{q}=\phi(\mathbf{x})^{T}\phi(\mathbf{y})$, the mapping $\phi(\mathbf{x})$ has $\binom{n+q}{n}$ elements, so the complexity is $O\left(n^q\right)$. Again this depends on the feature dimension $n$ and the degree $q$ rather than on $N$, so it does not grow with the amount of training data.
For the RBF kernel $K\left(\mathbf{x}, \mathbf{y} \right) = e^{-\frac{D\left(\mathbf{x}, \mathbf{y} \right)}{2s^2}}$, as shown in Q1, $K\left(\mathbf{x}, \mathbf{y} \right)$ can be written as $\phi(\mathbf{x})^T \phi(\mathbf{y})$, where $\phi(\mathbf{x})$ is infinite-dimensional. The explicit weight vector therefore cannot be stored; the trained SVM is instead represented by its support vectors, so its complexity grows with the number of support vectors, which typically increases with the amount of training data $N$.
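To make this concrete, here is a small illustrative sketch (assuming scikit-learn's SVC on the digits data used later in this assignment; the subset sizes and gamma value are arbitrary): the linear SVM is summarized by a fixed-size weight matrix, while the RBF SVM keeps its support vectors, whose count tends to grow with the training set.

from sklearn.datasets import load_digits
from sklearn.svm import SVC

X_all, y_all = load_digits(return_X_y=True)
for m in (200, 500, 1000):
    lin = SVC(kernel='linear', C=1.0).fit(X_all[:m], y_all[:m])
    rbf = SVC(kernel='rbf', gamma=0.001, C=1.0).fit(X_all[:m], y_all[:m])
    # lin.coef_ has one row per class pair and one column per feature, independent of m;
    # the number of stored support vectors of the RBF model grows with m
    print(m, lin.coef_.shape, rbf.support_vectors_.shape[0])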
Assume both the likelihood and prior have Gaussian distributions:
$$ \begin{align} p(\mathbf{X} | \Theta) &= \frac{1}{(2\pi)^{N/2}\sigma^N} \exp\left(-\frac{\sum_{t=1}^N (\mathbf{x}^{(t)} - \Theta)^2}{2\sigma^2}\right) \\ p(\Theta) &= \frac{1}{\sqrt{2\pi}\sigma_0} \exp\left( -\frac{(\Theta - \mu_0)^2}{2\sigma_0^2} \right) \end{align} $$Derive $\Theta$ from the dataset $\mathbf{X}$ via the following methods:
$\therefore \Theta|\mathbf{X} \sim N\left(\mu_0',\sigma'\right)$, and $\Theta_{Bayes} = E(\Theta|\mathbf{X}) = \mu_0' = \frac{\frac{N\bar{x}}{\sigma^2}+\frac{\mu_0}{\sigma_0^2}}{\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2}}$
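A brief numerical check of this formula (a sketch with made-up values for $\mu_0$, $\sigma_0$, $\sigma$, and the sample size; the maximum-likelihood estimate $\bar{x}$ is shown for comparison):

import numpy as np

np.random.seed(0)
mu_true, sigma, mu0, sigma0, N = 2.0, 1.0, 0.0, 3.0, 50   # illustrative values
x = np.random.normal(mu_true, sigma, N)

theta_mle = x.mean()  # maximum-likelihood estimate (the sample mean)
theta_bayes = (N * x.mean() / sigma**2 + mu0 / sigma0**2) / (N / sigma**2 + 1 / sigma0**2)
print(theta_mle, theta_bayes)  # the Bayes estimate is pulled slightly toward mu0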
In the textbook sample code we applied different scikit-learn classifiers to the Iris data set.
In this exercise, we will apply the same set of classifiers over a different data set: hand-written digits. Please write down the code for different classifiers, choose their hyper-parameters, and compare their performance via the accuracy score as in the Iris dataset. Which classifier(s) perform(s) the best and worst, and why?
The classifiers include the perceptron, logistic regression, SVM (with linear and RBF kernels), decision tree, random forest, k-nearest neighbors, and naive Bayes.
The dataset is available as part of scikit-learn, as follows.
In [1]:
%load_ext watermark
%watermark -a '' -u -d -v -p numpy,pandas,matplotlib,scipy,sklearn
%matplotlib inline
In [2]:
# Version check: train_test_split moved from cross_validation to model_selection in scikit-learn 0.18
from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
In [3]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data # feature matrix: one 64-dimensional (8x8 pixel) vector per image
y = digits.target # class labels: the digits 0-9
print(X.shape)
print(y.shape)
In [4]:
import matplotlib.pyplot as plt
num_rows = 4
num_cols = 5
fig, ax = plt.subplots(nrows=num_rows, ncols=num_cols, sharex=True, sharey=True)
ax = ax.flatten()
for index in range(num_rows*num_cols):
    img = digits.images[index]
    label = digits.target[index]
    ax[index].imshow(img, cmap='Greys', interpolation='nearest')
    ax[index].set_title('digit ' + str(label))
ax[0].set_xticks([])
ax[0].set_yticks([])
plt.tight_layout()
plt.show()
In [5]:
#Your code comes here
import numpy as np
from sklearn.metrics import accuracy_score
if Version(sklearn_version) < '0.18':
    from sklearn.cross_validation import train_test_split
else:
    from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
num_training = y_train.shape[0]
num_test = y_test.shape[0]
print('training: ' + str(num_training) + ', test: ' + str(num_test))
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
In [6]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.linear_model import Perceptron
ppn = Perceptron(n_iter = 40, eta0 = 0.1, random_state = 0)
ppn.fit(X_train_std, y_train)
y_pred = ppn.predict(X_test_std)
print('[Perceptron] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [7]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(C = 1000.0, random_state = 0)
lr.fit(X_train_std, y_train)
y_pred = lr.predict(X_test_std)
print('[Logistic Regression] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [8]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.svm import SVC
#Linear SVM
svm0 = SVC(kernel='linear', C = 1.0, random_state = 0)
svm0.fit(X_train_std, y_train)
y_pred0 = svm0.predict(X_test_std)
#RBF SVM
svm1 = SVC(kernel='rbf', random_state = 0, gamma = 0.1, C = 1.0)
svm1.fit(X_train_std, y_train)
y_pred1 = svm1.predict(X_test_std)
print('[Linear SVM] Accuracy: %.2f' % accuracy_score(y_test, y_pred0))
print('[RBF SVM] Accuracy: %.2f' % accuracy_score(y_test, y_pred1))
In [9]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=0)
dt.fit(X_train_std, y_train)
y_pred = dt.predict(X_test_std)
print('[Decision Tree] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [10]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1, n_jobs=2)
rf.fit(X_train_std, y_train)
y_pred = rf.predict(X_test_std)
print('[Random Forest] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [11]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, p = 2, metric = 'minkowski')
knn.fit(X_train_std, y_train)
y_pred = knn.predict(X_test_std)
print('[KNN] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
In [12]:
# Your code, including training and testing, to observe the accuracies.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_std, y_train)
y_pred = gnb.predict(X_test_std)
print('[Naive Bayes] Accuracy: %.2f' % accuracy_score(y_test, y_pred))
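Optionally, the classifiers above can be re-fitted in one loop so the accuracies appear side by side (a sketch that assumes the imports, hyper-parameters, and standardized train/test split from the cells above):

classifiers = {
    'Perceptron': Perceptron(n_iter=40, eta0=0.1, random_state=0),
    'Logistic Regression': LogisticRegression(C=1000.0, random_state=0),
    'Linear SVM': SVC(kernel='linear', C=1.0, random_state=0),
    'RBF SVM': SVC(kernel='rbf', gamma=0.1, C=1.0, random_state=0),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy', max_depth=10, random_state=0),
    'Random Forest': RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1, n_jobs=2),
    'KNN': KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski'),
    'Naive Bayes': GaussianNB(),
}
for name, clf in classifiers.items():
    clf.fit(X_train_std, y_train)
    print('%-20s %.2f' % (name, accuracy_score(y_test, clf.predict(X_test_std))))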
Best classifiers: Linear SVM, Random Forest, KNN
Worst classifier: Naive Bayes (with accuracy score 0.77), because Gaussian Naive Bayes assumes the pixel features are conditionally independent given the class and roughly Gaussian, assumptions that are strongly violated for handwritten-digit images, where neighboring pixels are highly correlated.