[65 points] Consider the dataset available here: project_data.txt.
It consists of two-dimensional patterns, $\pmb{x} = [x_1, x_2]^t$ , pertaining to
3 classes $(\omega_1,\omega_2,\omega_3)$. The feature values are indicated in the first two
columns while the class labels are specified in the last column. The priors
of all 3 classes are the same. Randomly partition this dataset into a training set
(70% of each class) and a test set (30% of each class).
The dataset consists of 1500 rows and 3 columns.
Excerpt from the dataset:
1.8569 -3.4702 1
-0.2096 -2.8342 1
-1.0265 2.1614 1
[...]
9.3851 4.0336 2
10.1375 1.1495 2
11.7569 0.8005 2
[...]
3.9854 5.1360 3
2.7592 5.9536 3
4.1379 4.3258 3
[...]
In [2]:
import numpy as np
np.random.seed(1234568)
In [3]:
all_projdata = np.genfromtxt('./data/project_data.txt', delimiter=' ')
# Check that the data was read in with the correct dimensions:
assert(all_projdata.shape == (1500, 3))
for c_label in range(1,4): # class labels 1-3 are in column 3
assert(all_projdata[all_projdata[:,2] == c_label].shape == (500,3))
# Print basic statistics:
for column in range(2):
print(50 * '-')
print("range of x_{}: ({}, {})".format(column+1,
          min(all_projdata[:,column]), max(all_projdata[:,column])))
print("mean of x_{}: {}".format(column+1,
np.mean(all_projdata[:,column])))
print("standard_deviation of x_{}: {}".format(
column+1, np.std(all_projdata[:,column])))
In [4]:
# Initialize with a placeholder row (deleted below) so np.append can be used in the loop
test_set = np.zeros(shape=(1,3))
train_set = np.zeros(shape=(1,3))
for i in range(0,3):
cl_shuffle = np.copy(all_projdata[all_projdata[:,2] == i+1])
np.random.shuffle(cl_shuffle)
test_set = np.append(test_set,
cl_shuffle[0:150,:], axis=0)
train_set = np.append(train_set,
cl_shuffle[150:,:], axis=0)
# delete the first placeholder
# row used for initializing the array
test_set = np.delete(test_set, 0, axis=0)
train_set = np.delete(train_set, 0, axis=0)
assert(test_set.shape == (1500*0.3,3))
assert(train_set.shape == (1500*0.7,3))
In [8]:
#%pylab inline
import numpy as np
from matplotlib import pyplot as plt
# Plotting training and test datasets
for dset,title in zip((test_set, train_set), ['Test', 'Training']):
f, ax = plt.subplots(figsize=(7, 7))
ax.scatter(dset[dset[:,2] == 1][:,0],
dset[dset[:,2] == 1][:,1],
marker='o', color='green', s=40, alpha=0.5, label='$\omega_1$')
ax.scatter(dset[dset[:,2] == 2][:,0],
dset[dset[:,2] == 2][:,1],
marker='^', color='red', s=40, alpha=0.5, label='$\omega_2$')
ax.scatter(dset[dset[:,2] == 3][:,0],
dset[dset[:,2] == 3][:,1],
marker='s', color='blue', s=40, alpha=0.5, label='$\omega_3$')
plt.legend(loc='upper right')
plt.title('{} Dataset'.format(title), size=20)
plt.ylabel('$x_2$', size=20)
plt.xlabel('$x_1$', size=20)
plt.show()
Let
$p([x_1, x_2]^t |\omega_1) ∼ N([0,0]^t,4I), \\ p([x_1, x_2]^t |\omega_2) ∼ N([10,0]^t,4I), \\ p([x_1, x_2]^t |\omega_3) ∼ N([5,5]^t,5I),$
where $I$ is the $2 \times 2$ identity matrix.
What is the error rate on the test set when the
Bayesian decision rule is employed for classification?
The objective is to maximize the discriminant functions, which we define here as the
posterior probabilities, in order to perform
minimum-error classification (Bayes classifier).
$ g_1(\pmb x) = P(\omega_1 | \; \pmb{x}), \quad g_2(\pmb{x}) = P(\omega_2 | \; \pmb{x}), \quad g_3(\pmb{x}) = P(\omega_3 | \; \pmb{x})$
$ \Rightarrow g_1(\pmb{x}) = P(\pmb{x}\;|\;\omega_1) \;\cdot\; P(\omega_1) \quad | \; ln \\ \quad g_2(\pmb x) = P(\pmb{x}\;|\;\omega_2) \;\cdot\; P(\omega_2) \quad | \; ln \\ \quad g_3(\pmb x) = P(\pmb{x}\;|\;\omega_3) \;\cdot\; P(\omega_3) \quad | \; ln$
We can drop the prior probabilities (since we have equal priors in this case):
$ \Rightarrow g_1(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_1)) \\ \quad g_2(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_2)) \\ \quad g_3(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_3))$
The discriminant functions for the general multivariate normal case
with arbitrary covariance matrices
(i.e., a different covariance matrix for each class) are:
$ \Rightarrow g_1(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_1^{-1} \bigg) \pmb{x} + \bigg( \Sigma_1^{-1} \pmb{\mu}_{\,1}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,1}^{\,t} \Sigma_{1}^{-1} \pmb{\mu}_{\,1} -\frac{1}{2} ln(|\Sigma_1|)\bigg)$
$g_2(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_2^{-1} \bigg) \pmb{x} + \bigg( \Sigma_2^{-1} \pmb{\mu}_{\,2}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,2}^{\,t} \Sigma_{2}^{-1} \pmb{\mu}_{\,2} -\frac{1}{2} ln(|\Sigma_2|)\bigg)$
$g_3(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_3^{-1} \bigg) \pmb{x} + \bigg( \Sigma_3^{-1} \pmb{\mu}_{\,3}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,3}^{\,t} \Sigma_{3}^{-1} \pmb{\mu}_{\,3} -\frac{1}{2} ln(|\Sigma_3|)\bigg)$
$\pmb{W}_{i} = - \frac{1}{2} \Sigma_i^{-1}\\ \pmb{w}_i = \Sigma_i^{-1} \pmb{\mu}_{\,i}\\ \omega_{i0} = \bigg( -\frac{1}{2} \pmb{\mu}_{\,i}^{\,t}\; \Sigma_{i}^{-1} \pmb{\mu}_{\,i} -\frac{1}{2} ln(|\Sigma_i|)\bigg)$
$ \pmb{W}_{1} = \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] $
$ \pmb{w}_{1} = \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg]$
$ \omega_{10} = -\frac{1}{2} [0 \quad 0 ] \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg] - \frac{1}{2} ln(16) = -2\,ln(2)$
with $ \quad g_1(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_1^{-1} \bigg) \pmb{x} + \bigg( \Sigma_1^{-1} \pmb{\mu}_{\,1}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,1}^{\,t} \Sigma_{1}^{-1} \pmb{\mu}_{\,1} -\frac{1}{2} ln(|\Sigma_1|)\bigg) $
$ \Rightarrow g_1(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] \pmb{x} + [0 \quad 0 ] \; \pmb x - ln(4) = \pmb{x}^{t} (-\frac{1}{8} \; \pmb{x}) - 2ln(2) $
$ \Rightarrow g_1(\pmb{x}) = - \frac{1}{8} (x^{2}_1 + x^{2}_2 ) - 2ln(2)$
$ \pmb{W}_{2} = \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] $
$ \pmb{w}_{2} = \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 10 \\ 0 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 2.5 \\ 0 \\ \end{array} \bigg]$
$ \omega_{20} = -\frac{1}{2} [10 \quad 0 ] \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 10 \\ 0 \\ \end{array} \bigg] - \frac{1}{2} ln(16) = -12.5 - 2\,ln(2)$
with $ \quad g_2(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_2^{-1} \bigg) \pmb{x} + \bigg( \Sigma_2^{-1} \pmb{\mu}_{\,2}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,2}^{\,t} \Sigma_{2}^{-1} \pmb{\mu}_{\,2} -\frac{1}{2} ln(|\Sigma_2|)\bigg) $
$ \Rightarrow g_2(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] \pmb{x} + [2.5 \quad 0 ]\;\pmb x - 12.5 - ln(4) = \pmb{x}^{t} \cdot( -\frac{1}{8} \; \pmb{x}) + 2.5 x_1 - 12.5 - 2ln(2) $
$ \Rightarrow g_2(\pmb{x}) = - \frac{1}{8} (x^{2}_1 + x^{2}_2) +2.5 x_1 - 12.5 - 2ln(2)$
$ \pmb{W}_{3} = \bigg[ \begin{array}{cc} -(1/10) & 0\\ 0 & -(1/10) \\ \end{array} \bigg] $
$ \pmb{w}_{3} = \bigg[ \begin{array}{cc} (1/5) & 0\\ 0 & (1/5) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 5 \\ 5 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 1 \\ 1 \\ \end{array} \bigg]$
$ \omega_{30} = -\frac{1}{2} [5 \quad 5 ] \bigg[ \begin{array}{cc} (1/5) & 0\\ 0 & (1/5) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 5 \\ 5 \\ \end{array} \bigg] - \frac{1}{2} ln(25) = -5 - ln(5)$
with $ \quad g_3(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_3^{-1} \bigg) \pmb{x} + \bigg( \Sigma_3^{-1} \pmb{\mu}_{\,3}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,3}^{\,t} \Sigma_{3}^{-1} \pmb{\mu}_{\,3} -\frac{1}{2} ln(|\Sigma_3|)\bigg)$
$ \Rightarrow g_3(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/10) & 0\\ 0 & -(1/10) \\ \end{array} \bigg] \pmb{x} + [1 \quad 1 ]\; \pmb x - 5 - ln(5) = \pmb{x}^{t} \cdot \big( - \frac{1}{10}\; \pmb x \big) + \; {x_1} + {x_2} - 5 - ln(5) $
$ \Rightarrow g_3(\pmb{x}) = - \frac{1}{10} (x^{2}_1 + x^{2}_2)+ \; {x_1} + {x_2} - 5 - ln(5)$
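As a quick numerical sanity check of these closed-form expressions (an illustration only, not part of the original cells; the helper name is hypothetical), we can compare them against the general quadratic form $\pmb{x}^{t} \pmb{W}_i \pmb{x} + \pmb{w}_i^{t} \pmb{x} + \omega_{i0}$ at an arbitrary point:
In [ ]:
import numpy as np
# Sanity check (illustration only): compare the closed-form g_1 and g_3 above
# with the general quadratic form x^t W_i x + w_i^t x + w_i0.
def g_quadratic(x, cov, mu):
    inv = np.linalg.inv(cov)
    W = -0.5 * inv
    w = inv.dot(mu)
    w0 = -0.5 * mu.T.dot(inv).dot(mu) - 0.5 * np.log(np.linalg.det(cov))
    return float(x.T.dot(W).dot(x) + w.T.dot(x) + w0)
x = np.array([[1.0], [2.0]])  # arbitrary test point
g1_closed = -(x[0, 0]**2 + x[1, 0]**2) / 8 - 2 * np.log(2)
g3_closed = -(x[0, 0]**2 + x[1, 0]**2) / 10 + x[0, 0] + x[1, 0] - 5 - np.log(5)
assert np.isclose(g1_closed, g_quadratic(x, 4 * np.eye(2), np.array([[0.], [0.]])))
assert np.isclose(g3_closed, g_quadratic(x, 5 * np.eye(2), np.array([[5.], [5.]])))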
In [10]:
def discriminant_function(x_vec, cov_mat, mu_vec):
"""
Calculates the value of the discriminant function for a dx1 dimensional
sample given covariance matrix and mean vector.
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
cov_mat: numpy array of the covariance matrix.
mu_vec: dx1 dimensional numpy array of the sample mean.
Returns a float value as result of the discriminant function.
"""
W_i = (-1/2) * np.linalg.inv(cov_mat)
assert(W_i.shape[0] > 1 and W_i.shape[1] > 1),\
'W_i must be a matrix'
w_i = np.linalg.inv(cov_mat).dot(mu_vec)
assert(w_i.shape[0] > 1 and w_i.shape[1] == 1),\
'w_i must be a column vector'
omega_i_p1 = (((-1/2) * (mu_vec).T).dot(np.linalg.inv(cov_mat))).dot(mu_vec)
omega_i_p2 = (-1/2) * np.log(np.linalg.det(cov_mat))
    omega_i = omega_i_p1 + omega_i_p2  # w_i0 = -1/2 mu^t Sigma^-1 mu - 1/2 ln|Sigma|
assert(omega_i.shape == (1, 1)), 'omega_i must be a scalar'
g = ((x_vec.T).dot(W_i)).dot(x_vec) + (w_i.T).dot(x_vec) + omega_i
return float(g)
In [11]:
import operator
def classify_data(x_vec, g, mu_vecs, cov_mats):
"""
Classifies an input sample into 1 out of 3 classes determined by
maximizing the discriminant function g_i().
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
g: The discriminant function.
mu_vecs: A list of mean vectors as input for g.
cov_mats: A list of covariance matrices as input for g.
Returns a tuple (g_i()_value, class label).
"""
assert(len(mu_vecs) == len(cov_mats)),\
'Number of mu_vecs and cov_mats must be equal.'
g_vals = []
for m,c in zip(mu_vecs, cov_mats):
g_vals.append(g(x_vec, mu_vec=m, cov_mat=c))
max_index, max_value = max(enumerate(g_vals),
key=operator.itemgetter(1))
return (max_value, max_index + 1)
In [18]:
# Known parameters
cov_mat_1 = 4 * np.eye(2,2)
mu_vec_1 = np.array([[0],[0]])
cov_mat_2 = 4 * np.eye(2,2)
mu_vec_2 = np.array([[10],[0]])
cov_mat_3 = 5 * np.eye(2,2)
mu_vec_3 = np.array([[5],[5]])
In [19]:
def empirical_error(data_set, classes, classifier_func, classifier_func_args):
"""
Keyword arguments:
data_set: 'n x d'- dimensional numpy array,
class label in the last column.
classes: List of the class labels.
classifier_func: Function that returns the
max argument from the discriminant function.
evaluation and the class label as a tuple.
classifier_func_args: List of arguments for the 'classifier_func'.
Returns a tuple, consisting of a dictionary
with the classif. counts and the error.
e.g., ( {1: {1: 321, 2: 5}, 2: {1: 0, 2: 317}}, 0.05)
where keys are class labels, and values are
sub-dicts counting for which class (key)
    how many samples were classified as such.
"""
class_dict = {i:{j:0 for j in classes} for i in classes}
for cl in classes:
for row in data_set[data_set[:,-1] == cl][:,:-1]:
g = classifier_func(row, *classifier_func_args)
class_dict[cl][g[1]] += 1
correct = 0
for i in classes:
correct += class_dict[i][i]
misclass = data_set.shape[0] - correct
return (class_dict, misclass / data_set.shape[0])
In [231]:
import prettytable
classification_dict, error = empirical_error(train_set, [1,2,3],
classify_data, [discriminant_function,
[mu_vec_1, mu_vec_2, mu_vec_3],
[cov_mat_1, cov_mat_2, cov_mat_3]])
labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'training dataset')
train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
a, b, c = [classification_dict[i][j] for j in [1,2,3]]
# workaround to unpack (since Python does not support just '*a')
train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))
In [20]:
import prettytable
classification_dict, error = empirical_error(test_set, [1,2,3],
classify_data, [discriminant_function,
[mu_vec_1, mu_vec_2, mu_vec_3],
[cov_mat_1, cov_mat_2, cov_mat_3]])
labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')
train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
a, b, c = [classification_dict[i][j] for j in [1,2,3]]
# workaround to unpack (since Python does not support just '*a')
train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))
Suppose $p([x_1,x_2]^t\;|\;\omega_i) \;∼ \; N(\pmb \mu_i,\pmb \Sigma_i), \; i = 1,2,3$, where the $\pmb \mu_i$’s and $\pmb \Sigma_i$’s are unknown.
Use the training set to compute the MLE of the $\pmb \mu_i$’s and the $\pmb \Sigma_i$’s.
What is the error rate on the test set when the Bayes decision rule using
the estimated parameters is employed for classification?
In contrast to the previous task, we now only know the parametric form of the class-conditional densities
$p (\; \pmb x \; | \; \omega_i)$, and we want to use Maximum Likelihood Estimation (MLE)
to estimate the values of their parameters from the training data.
Given the form of the model (the data is normally distributed),
the 2 parameters to be estimated are $\pmb \mu_i$ and $\pmb \Sigma_i$,
which are summarized by the parameter vector
$\pmb \theta_i = \bigg[ \begin{array}{c}
\ \theta_{i1} \\
\ \theta_{i2} \\
\end{array} \bigg]=
\bigg[ \begin{array}{c}
\pmb \mu_i \\
\pmb \Sigma_i \\
\end{array} \bigg]$
For the Maximum Likelihood Estimate (MLE), we assume that we have a set of samples
$D = \left\{ \pmb x_1, \pmb x_2,..., \pmb x_n \right\} $ that are i.i.d.
(independent and identically distributed, drawn from the density
$p(\pmb x \; | \; \omega_i, \; \pmb \theta_i) $).
This lets us work with each class separately and omit the class labels,
so that we write the probability density as $p(\pmb x \; | \; \pmb \theta)$.
Thus, the probability of observing $D = \left\{ \pmb x_1, \pmb x_2,..., \pmb x_n \right\} $ is:
$p(D\; | \; \pmb \theta\;) = p(\pmb x_1 \; | \; \pmb \theta\;)\; \cdot \; p(\pmb x_2 \; | \;\pmb \theta\;) \; \cdot \;... \; p(\pmb x_n \; | \; \pmb \theta\;) = \prod_{k=1}^{n} \; p(\pmb x_k \pmb \; | \; \pmb \theta \;)$
where $p(D\; | \; \pmb \theta\;)$ is also called the likelihood of $\pmb \theta$.
We are given the information that $p([x_1,x_2]^t) \;∼ \; N(\pmb \mu,\pmb \Sigma) $
(remember that we dropped the class labels,
since we are working with every class separately).
The multivariate normal density is given as:
$\quad \quad p(\pmb x) = \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]$
So that
$p(D\; | \; \pmb \theta\;) = \prod_{k=1}^{n} \; p(\pmb x_k \; | \; \pmb \theta \;) = \prod_{k=1}^{n} \; \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x_k - \pmb \mu)^t \Sigma^{-1}(\pmb x_k - \pmb \mu) \bigg]$
Since it is easier to work with the logarithm, we define the log-likelihood function $l(\pmb \theta)$ as
$l(\pmb \theta) \equiv ln\; p(D\; | \; \pmb \theta\;) = \sum\limits_{k=1}^{n} \;ln\; p(\pmb x_k \; | \; \pmb \theta \;)$
For the multivariate normal density, the log-likelihood becomes
$ l(\pmb\theta) = \sum\limits_{k=1}^{n} \bigg( - \frac{1}{2}(\pmb x_k - \pmb \mu)^t \pmb \Sigma^{-1} \; (\pmb x_k - \pmb \mu) - \frac{d}{2} \; ln \; 2\pi - \frac{1}{2} \;ln \; |\pmb\Sigma| \bigg)$
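To make the log-likelihood concrete, the following minimal sketch (not part of the original analysis; it assumes scipy is available, as it is elsewhere in this notebook) evaluates $l(\pmb\theta)$ for a candidate $(\pmb\mu, \pmb\Sigma)$ on an n x d array of samples:
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(samples, mu, cov):
    # l(theta) = sum_k ln p(x_k | mu, Sigma) for an n x d sample array
    return np.sum(multivariate_normal.logpdf(samples, mean=mu, cov=cov))

# Example (hypothetical usage): log-likelihood of the true class-1 parameters
# on the class-1 training samples:
# log_likelihood(train_set[train_set[:, 2] == 1][:, 0:2], mu=[0, 0], cov=4 * np.eye(2))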
In order to obtain the MLE $\boldsymbol{\hat{\theta}}$, we maximize $l (\pmb \theta)$, which can be done via differentiation:
with $\nabla_{\pmb \theta} \equiv \begin{bmatrix} \frac{\partial \; }{\partial \; \theta_1} \\ \frac{\partial \; }{\partial \; \theta_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial \; }{\partial \; \pmb \mu} \\ \frac{\partial \; }{\partial \; \pmb \Sigma} \end{bmatrix}$
$\nabla_{\pmb \theta} l = \sum\limits_{k=1}^n \nabla_{\pmb \theta} \;ln\; p(\pmb x_k \; | \; \pmb \theta) = 0 $
The MLE of the parameter $\pmb\mu$ is given by the equation:
${\hat{\pmb\mu}} = \frac{1}{n} \sum\limits_{k=1}^{n} \pmb x_k$
In [21]:
import prettytable
mu_est_1 = np.sum(train_set[train_set[:,2] == 1],
axis=0)/len(train_set[train_set[:,2] == 1])
mu_est_1 = mu_est_1[0:2].reshape(2,1)
mu_est_2 = np.sum(train_set[train_set[:,2] == 2],
axis=0)/len(train_set[train_set[:,2] == 2])
mu_est_2 = mu_est_2[0:2].reshape(2,1)
mu_est_3 = np.sum(train_set[train_set[:,2] == 3],
axis=0)/len(train_set[train_set[:,2] == 3])
mu_est_3 = mu_est_3[0:2].reshape(2,1)
print(mu_est_1)
mu_mle = prettytable.PrettyTable(["", "mu_1", "mu_2", "mu_3"])
mu_mle.add_row(["MLE",mu_est_1, mu_est_2, mu_est_3])
mu_mle.add_row(["actual",mu_vec_1, mu_vec_2, mu_vec_3])
print(mu_mle)
${\hat{\pmb\Sigma}} = \frac{1}{n} \sum\limits_{k=1}^{n} (\pmb x_k - \hat{\mu})(\pmb x_k - \hat{\mu})^t$
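As a side note, both MLEs can also be obtained directly with numpy built-ins; the sketch below (an equivalence check only, not used for the results) assumes `samples` is an n x 2 array holding the training points of one class. With `bias=True`, `np.cov` normalizes by n, which matches the MLE above rather than the unbiased 1/(n-1) estimate.
In [ ]:
import numpy as np

def mle_mu_cov(samples):
    # MLE of the mean and covariance via numpy built-ins (normalization by n)
    mu_hat = samples.mean(axis=0).reshape(-1, 1)
    cov_hat = np.cov(samples, rowvar=False, bias=True)
    return mu_hat, cov_hat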
In [22]:
def mle_est_cov(x_samples, mu_est):
"""
Calculates the Maximum Likelihood Estimate for the covariance matrix.
Keyword Arguments:
x_samples: np.array of the samples for 1 class, n x d dimensional
mu_est: np.array of the mean MLE, d x 1 dimensional
Returns the MLE for the covariance matrix as d x d numpy array.
"""
cov_est = np.zeros((2,2))
for x_vec in x_samples:
x_vec = x_vec.reshape(2,1)
assert(x_vec.shape == mu_est.shape),\
        'mean and x vector must be of equal shape'
cov_est += (x_vec - mu_est).dot((x_vec - mu_est).T)
return cov_est / len(x_samples)
In [23]:
import prettytable
class1_train_samples = train_set[train_set[:,2] == 1][:,0:2]
cov_est_1 = mle_est_cov(class1_train_samples, mu_est_1)
class2_train_samples = train_set[train_set[:,2] == 2][:,0:2]
cov_est_2 = mle_est_cov(class2_train_samples, mu_est_2)
class3_train_samples = train_set[train_set[:,2] == 3][:,0:2]
cov_est_3 = mle_est_cov(class3_train_samples, mu_est_3)
cov_mle = prettytable.PrettyTable(["", "covariance_matrix_1",\
"covariance_matrix_2", "covariance_matrix_3"])
cov_mle.add_row(["MLE", cov_est_1, cov_est_2, cov_est_3])
cov_mle.add_row(['','','',''])
cov_mle.add_row(["actual", cov_mat_1, cov_mat_2, cov_mat_3])
print(cov_mle)
In [24]:
import prettytable
classification_dict, error = empirical_error(test_set,
[1,2,3], classify_data, [discriminant_function,
[mu_est_1, mu_est_2, mu_est_3],
[cov_est_1, cov_est_2, cov_est_3]])
labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')
train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
a, b, c = [classification_dict[i][j] for j in [1,2,3]]
# workaround to unpack (since Python does not support just '*a')
train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))
Suppose the form of the distributions of $p([x_1, x_2]^t\;|\;\omega_i),\; i = 1,2,3$ is unknown.
Assume that the training dataset can be used to estimate the density at a
point using the Parzen window technique (a spherical Gaussian kernel with h = 1).
What is the error rate on the test set when the Bayes decision rule is employed for classification?
The density for a point x is defined as
$p(\pmb x) = \frac{k / n}{V}$
where k is the number of points inside a specified region, n is the
number of total points, and V is the volume of that region.
Starting with a hypercube window, the number of samples $k_n$ that fall inside
a hypercube of edge length $h_n$ centered at $\pmb x$ is
$k_n = \sum\limits_{i=1}^{n} \phi \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$
where $\phi(\pmb u)$ is the window function, here with $\pmb u = \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$.
Based on this window, we can now formulate the Parzen-window
estimation with a hypercube kernel as follows:
$P_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h_n^d} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg]$
where
$h_n^d = V_n\quad$ and $\quad \sum\limits_{i=1}^{n} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg] = k_n$
For a Gaussian kernel (instead of the hypercube), we simply replace the scaled window function $\frac{1}{h_n^d}\,\phi[\,\cdot\,]$ by
$\frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg]$
so that the Parzen-window estimate becomes
$P_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg]$
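The cells below use scipy's gaussian_kde for convenience; note that gaussian_kde derives its kernel covariance from the data covariance scaled by the bandwidth factor, so bw_method=1.0 is only an approximation of a spherical Gaussian kernel with h = 1. A literal implementation of the estimate above would look roughly like this sketch (hypothetical helper, not used for the reported results):
In [ ]:
import numpy as np

def parzen_gaussian_density(x, samples, h=1.0):
    # Parzen-window estimate p_n(x) with a spherical Gaussian kernel of width h.
    # x: length-d query point, samples: n x d array.
    n, d = samples.shape
    u = (x - samples) / h
    norm = 1.0 / ((np.sqrt(2.0 * np.pi) ** d) * h ** d)
    return np.mean(norm * np.exp(-0.5 * np.sum(u ** 2, axis=1)))

# e.g., parzen_gaussian_density(np.array([0.0, 0.0]),
#                               train_set[train_set[:, 2] == 1][:, 0:2])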
In [25]:
from scipy.stats import kde
class1_kde = kde.gaussian_kde(
train_set[train_set[:,2] == 1].T[0:2], bw_method=1.0)
class2_kde = kde.gaussian_kde(
train_set[train_set[:,2] == 2].T[0:2], bw_method=1.0)
class3_kde = kde.gaussian_kde(
train_set[train_set[:,2] == 3].T[0:2], bw_method=1.0)
In [26]:
%pylab inline
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.mlab import bivariate_normal
from mpl_toolkits.mplot3d import Axes3D
X = np.linspace(-10, 20, 100)
Y = np.linspace(-10, 20, 100)
X,Y = np.meshgrid(X,Y)
for bwmethod,title in zip([class1_kde, class2_kde, class3_kde],
['class1', 'class2', 'class3']):
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
Z = bwmethod(np.array([X.ravel(),Y.ravel()]))
Z = Z.reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1,
cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.02)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
    plt.title('Gaussian kernel density estimate, %s' %title)
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()
Decide $\omega_j$ for which $\; P(\omega_j\;|\;\pmb x) = \frac{p(\pmb x\;|\;\omega_j) \cdot P(\omega_j)}{p(\pmb x)} \;$ is maximal, $\; j = 1, 2, 3$.
We can drop the prior probabilities (equal priors) and the scale factor $p(\pmb x)$:
Decide $\omega_j$ for which $\; p(\pmb x\;|\;\omega_j) \;$ is maximal, $\; j = 1, 2, 3$.
In [27]:
import operator
def bayes_classifier(x_vec, kdes):
"""
Classifies an input sample into class w_j determined by
maximizing the class conditional probability for p(x|w_j).
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
    kdes: List of the gaussian_kde (kernel density) estimates
Returns a tuple ( p(x|w_j)_value, class label ).
"""
p_vals = []
for kde in kdes:
p_vals.append(kde.evaluate(x_vec))
max_index, max_value = max(enumerate(p_vals), key=operator.itemgetter(1))
return (max_value, max_index + 1)
In [28]:
import prettytable
classification_dict, error = empirical_error(test_set, [1,2,3],
bayes_classifier, [[class1_kde, class2_kde, class3_kde]])
labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')
train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
a, b, c = [classification_dict[i][j] for j in [1,2,3]]
# workaround to unpack (since Python does not support just '*a')
train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))
As mentioned previously for the Parzen-window approach,
the density at a point x is defined as
$p(\pmb x) = \frac{k / n}{V}$
where k is the number of points inside a specified region, n is the
number of total points, and V is the volume of that region.
However, in contrast to the Parzen-window technique, we don't fix the volume
parameter, but instead control k (the number of points inside a region of
varying volume). In other words, we set the number of points that we want to
capture and compute the volume of the region that contains them. The same convergence
assumptions as for the Parzen-window technique apply to the k-nearest-neighbor technique.
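For illustration, a k-nearest-neighbor density estimate in two dimensions can be sketched as follows, with $V = \pi r_k^2$ the area of the disk that just reaches the k-th nearest neighbor (hypothetical helper; the cells below use the 1-NN decision rule directly instead of explicit density estimates):
In [ ]:
import numpy as np

def knn_density_2d(x, samples, k=5):
    # k-NN density estimate p(x) = (k/n) / V in 2 dimensions, where V is the
    # area of the disk reaching out to the k-th nearest neighbor of x.
    n = samples.shape[0]
    dists = np.sort(np.linalg.norm(samples - x, axis=1))
    r_k = dists[k - 1]
    volume = np.pi * r_k ** 2
    return (k / n) / volume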
In [ ]:
def find_1nearest_neighbor(p, dataset):
"""
Finds the nearest neighbor of a point in a dataset.
Keyword arguments:
    p: Query point as a 1-D numpy array of length d.
dataset: Dataset as nxd-dimensional numpy array
with class label in the last column.
Returns the nearest neighbor as 1xd dimensional numpy
array from the dataset.
"""
p_index = np.argmin([np.inner(p-x, p-x) for x in dataset[:,0:-1]])
return dataset[p_index]
p = np.array([1, 1])
find_1nearest_neighbor(p, test_set)
In [29]:
def one_nn_classify(train_set, data_set):
"""
Classifies points in a dataset based on the
nearest neighbor in the training dataset
(1-nearest neighbor technique).
Keyword arguments:
train_set: training points as nx3-dimensional numpy array.
with class label in the last column
    data_set: points to classify as an n x 2 dimensional numpy array.
Returns a list of the predicted class labels.
"""
class_labels = [np.argmin([np.inner(p-x, p-x)
for x in train_set[:,0:-1]])for p in data_set]
return [int(train_set[cl,-1]) for cl in class_labels]
true_class = test_set[:,-1]
predicted_class = one_nn_classify(train_set, test_set[:,0:-1])
error = sum([1 for pred,true in zip(predicted_class,true_class)
if pred!=true])/len(test_set)
print(error)
In [30]:
import prettytable
c1_pred = [0, 0, 0]
c2_pred = [0, 0, 0]
c3_pred = [0, 0, 0]
for pred,true in zip(predicted_class,true_class):
if true == 1:
c1_pred[pred-1] += 1
    elif true == 2:
        c2_pred[pred-1] += 1
    else:
        c3_pred[pred-1] += 1
labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')
train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = [c1_pred, c2_pred, c3_pred][i-1]  # counts for actual class i
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))
In [31]:
# This workaround is necessary to make sure that every subset
# contains same number of
# samples for each of the three classes
train_subsets_c1 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 1]
train_subsets_c1[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
train_subsets_c2 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 2]
train_subsets_c2[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
train_subsets_c3 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 3]
train_subsets_c3[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
# combine separate-class subsets for convenience
train_subsets = {}
for set_size in sorted(train_subsets_c1.keys()):
t_set = np.zeros(shape=(1,3))
for subset in [train_subsets_c1, train_subsets_c2, train_subsets_c3]:
t_set = np.append(t_set, subset[set_size], axis=0)
# delete the first placeholder row used for initializing the array
t_set = np.delete(t_set, 0, axis=0)
train_subsets[set_size] = t_set
for i in sorted(train_subsets):
print(i, train_subsets[i].shape)
In [168]:
#Plotting one subset to confirm that the reduction worked error-free
f, ax = plt.subplots(figsize=(7, 7))
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 1][:,0],
train_subsets[30][train_subsets[30][:,2] == 1][:,1], \
marker='o', color='green', s=40, alpha=0.5, label='$\omega_1$')
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 2][:,0],
train_subsets[30][train_subsets[30][:,2] == 2][:,1], \
marker='^', color='red', s=40, alpha=0.5, label='$\omega_2$')
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 3][:,0],
train_subsets[30][train_subsets[30][:,2] == 3][:,1], \
marker='s', color='blue', s=40, alpha=0.5, label='$\omega_3$')
plt.legend(loc='upper right')
plt.title('Training dataset after reducing the size to 30%', size=20)
plt.ylabel('$x_2$', size=20)
plt.xlabel('$x_1$', size=20)
plt.show()
The MLE of the parameter $\pmb\mu$ is given by the equation:
${\hat{\pmb\mu}} = \frac{1}{n} \sum\limits_{k=1}^{n} \pmb x_k$
In [32]:
mu_estimates = {}
for set_size in sorted(train_subsets):
mu1_est = np.sum(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 1], axis=0)/\
len(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 1])
mu1_est = mu1_est[0:2].reshape(2,1)
mu2_est = np.sum(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 2], axis=0)/\
len(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 2])
mu2_est = mu2_est[0:2].reshape(2,1)
mu3_est = np.sum(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 3], axis=0)/\
len(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 3])
mu3_est = mu3_est[0:2].reshape(2,1)
mu_estimates[set_size] = [mu1_est, mu2_est, mu3_est]
The MLE of the parameter $\pmb\Sigma$ is given by the equation:
${\hat{\pmb\Sigma}} = \frac{1}{n} \sum\limits_{k=1}^{n} (\pmb x_k - \hat{\mu})(\pmb x_k - \hat{\mu})^t$
In [33]:
def mle_est_cov(x_samples, mu_est):
"""
Calculates the Maximum Likelihood Estimate for the covariance matrix.
Keyword Arguments:
x_samples: np.array of the samples for 1 class, n x d dimensional
mu_est: np.array of the mean MLE, d x 1 dimensional
Returns the MLE for the covariance matrix as d x d numpy array.
"""
cov_est = np.zeros((2,2))
for x_vec in x_samples:
x_vec = x_vec.reshape(2,1)
assert(x_vec.shape == mu_est.shape),\
        'mean and x vector must be of equal shape'
cov_est += (x_vec - mu_est).dot((x_vec - mu_est).T)
return cov_est / len(x_samples)
cov_estimates = {}
for set_size in sorted(train_subsets):
cov1_est = mle_est_cov(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 1][:,0:2],
mu_estimates[set_size][0])
cov2_est = mle_est_cov(train_subsets[set_size]\
[train_subsets[set_size]\
[:,2] == 2][:,0:2], mu_estimates[set_size][1])
cov3_est = mle_est_cov(train_subsets[set_size]\
[train_subsets[set_size][:,2] == 3][:,0:2],
mu_estimates[set_size][2])
cov_estimates[set_size] = [cov1_est, cov2_est, cov3_est]
In [175]:
def discriminant_function(x_vec, cov_mat, mu_vec):
"""
Calculates the value of the discriminant function for a dx1 dimensional
sample given the covariance matrix and mean vector.
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
cov_mat: numpy array of the covariance matrix.
mu_vec: dx1 dimensional numpy array of the sample mean.
Returns a float value as result of the discriminant function.
"""
W_i = (-1/2) * np.linalg.inv(cov_mat)
assert(W_i.shape[0] > 1 and W_i.shape[1] > 1), 'W_i must be a matrix'
w_i = np.linalg.inv(cov_mat).dot(mu_vec)
assert(w_i.shape[0] > 1 and w_i.shape[1] == 1), 'w_i must be a column vector'
omega_i_p1 = (((-1/2) *\
(mu_vec).T).dot(np.linalg.inv(cov_mat))).dot(mu_vec)
omega_i_p2 = (-1/2) *\
np.log(np.linalg.det(cov_mat))
    omega_i = omega_i_p1 + omega_i_p2  # w_i0 = -1/2 mu^t Sigma^-1 mu - 1/2 ln|Sigma|
assert(omega_i.shape == (1, 1)),\
'omega_i must be a scalar'
g = ((x_vec.T).dot(W_i)).dot(x_vec) + (w_i.T).dot(x_vec) + omega_i
return float(g)
def classify_data(x_vec, g, mu_vecs, cov_mats):
"""
Classifies an input sample into 1 out of 3 classes determined by
maximizing the discriminant function g_i().
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
g: The discriminant function.
mu_vecs: A list of mean vectors as input for g.
cov_mats: A list of covariance matrices as input for g.
Returns a tuple (g_i()_value, class label).
"""
assert(len(mu_vecs) == len(cov_mats)),\
'Number of mu_vecs and cov_mats must be equal.'
g_vals = []
for m,c in zip(mu_vecs, cov_mats):
g_vals.append(g(x_vec, mu_vec=m, cov_mat=c))
max_index, max_value = max(enumerate(g_vals),
key=operator.itemgetter(1))
return (max_value, max_index + 1)
In [34]:
mle_errors = []
for i in range(10,101,10):
classification_dict, error =\
empirical_error(test_set, [1,2,3], classify_data,
[discriminant_function,
mu_estimates[i],
cov_estimates[i]])
mle_errors.append(error)
In [36]:
import operator
def bayes_kde_class(x_vec, kdes):
"""
Classifies an input sample into class w_j determined by
maximizing the class conditional probability for p(x|w_j).
Keyword arguments:
x_vec: A dx1 dimensional numpy array representing the sample.
    kdes: List of the gaussian_kde (kernel density) estimates
Returns a tuple ( p(x|w_j)_value, class label ).
"""
p_vals = []
for kde in kdes:
p_vals.append(kde.evaluate(x_vec))
max_index, max_value = max(enumerate(p_vals), key=operator.itemgetter(1))
return (max_value, max_index + 1)
In [45]:
from scipy.stats import kde
kde_errors = []
for i in range(10,101,10):
class1_kde = kde.gaussian_kde(train_subsets[i]\
[train_subsets[i][:,2] == 1].T[0:2], bw_method=1.0)
class2_kde = kde.gaussian_kde(train_subsets[i]\
[train_subsets[i][:,2] == 2].T[0:2], bw_method=1.0)
class3_kde = kde.gaussian_kde(train_subsets[i]\
[train_subsets[i][:,2] == 3].T[0:2], bw_method=1.0)
classification_dict, error = empirical_error(test_set,
[1,2,3], bayes_kde_class, [[class1_kde, class2_kde, class3_kde]])
kde_errors.append(error)
In [40]:
def one_nn_classify(train_set, data_set):
"""
Classifies points in a dataset based on the
nearest neighbor in the training dataset
(1-nearest neighbor technique).
Keyword arguments:
train_set: training points as nx3-dimensional numpy array.
with class label in the last column
    data_set: points to classify as an n x 2 dimensional numpy array.
Returns a list of the predicted class labels.
"""
class_labels = [np.argmin([np.inner(p-x, p-x)
for x in train_set[:,0:-1]])for p in data_set]
return [int(train_set[cl,-1]) for cl in class_labels]
In [41]:
one_nn_errors = []
for i in train_subsets.keys():
true_class = test_set[:,-1]
predicted_class = one_nn_classify(train_subsets[i], test_set[:,0:-1])
error = sum([1 for pred,true in zip(predicted_class,true_class)
if pred!=true])/len(test_set)
one_nn_errors.append(error)
In [72]:
labels = ['MLE', 'Parzen-window (h=1)', '1-Nearest Neighbor']
f, ax = plt.subplots(figsize=(10, 8))
for err in [mle_errors, kde_errors, one_nn_errors]:
plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)], err, lw=2)
#plt.ylim([0,0.2])
plt.xlabel('training set size')
plt.ylabel('Error rate on test data set')
plt.title('Error rate for different techniques')
plt.ylim([0, 0.12])
plt.legend(labels, loc='lower right')
plt.show()
Discuss your results along the following lines:
i. Which classifier results in the best performance on the test set?
Both parametric methods (here: MLE) and non-parametric methods
(here: Parzen window and 1-nearest neighbor) have good convergence properties.
In practice, the choice between a parametric and a non-parametric approach
depends on whether we know the underlying model of the data distribution.
If the assumed model of the distribution is incorrect, the parametric approach
yields inaccurate parameter estimates.
The non-parametric approaches do not require knowledge about the underlying
distributions, since they estimate the probability density in local regions
rather than over the whole range of the distribution.
Assuming that the model for the Maximum Likelihood Estimate is correct
(here: a multivariate Gaussian distribution), I would expect to observe the
largest error for the 1-nearest-neighbor technique, since it is a suboptimal
technique whose asymptotic error rate is only bounded above by twice the Bayes error,
and because the "nearest neighbor" can be far away in sparse regions
(especially in smaller datasets).
The performance of the Parzen-window approach largely depends on the chosen window size.
In order to improve this approach, we would test the classifier performance for
different window widths, which we have not done here.
Indeed, we observe the worst performance on the test dataset for the 1-nearest-neighbor
technique. The relatively low error rates for the MLE and the Parzen-window kernel
density estimation suggest that the assumed form of the distribution for the
parametric approach (MLE) is accurate and that the chosen window width for the Parzen-window
approach is appropriate (which does not imply that the classifier cannot be improved
by choosing a different window width).
ii. Does the performance of a classifier change significantly
depending upon the patterns used in the training set?
Substantiate your answer with an actual experiment on this dataset.
I would assume that the performance of the classifiers should not change significantly
for different patterns in large training datasets, because we assume that the samples are i.i.d.
However, for small training set sizes, the performance of a classifier can change
significantly because the estimates become less accurate (by the same convergence
argument used for large datasets), as we have seen in exercise 1 e). Reasons for such
significant changes in performance, especially for small datasets, include local sparsity
of data points, so that the MLE becomes biased (because of a skew in the sample distribution)
and the non-parametric techniques cannot accurately estimate the local densities at certain points.
An experiment to explore this hypothesis is to draw different random subsets of the same training
set multiple times and compare the error rates of the
different classifiers across those draws.
As in exercise 1 d), we will again draw random training subsets from the
training dataset, with sizes relative to the
original training dataset of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%
of each class.
Then we will compute the MLE and the non-parametric estimates on
those new subsets and compare the error rates on the test dataset to the error rates
obtained from the old training subsets.
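A more systematic version of this comparison would repeat the random draw several times and look at the spread of the error rates. A rough sketch for the 1-NN classifier, reusing the helpers defined above (the number of repetitions and the subset fraction are arbitrary choices):
In [ ]:
# Sketch: repeat the random subsampling several times and summarize the spread
# of the 1-NN error rate (the same idea applies to the MLE and Parzen classifiers).
n_repeats = 5            # arbitrary number of repetitions
subset_fraction = 0.3    # e.g., 30% of each class
errors = []
for _ in range(n_repeats):
    sub = np.zeros(shape=(1, 3))
    for c in [1, 2, 3]:
        t = train_set[train_set[:, 2] == c]
        idx = np.random.choice(t.shape[0], int(t.shape[0] * subset_fraction),
                               replace=False)
        sub = np.append(sub, t[idx, :], axis=0)
    sub = np.delete(sub, 0, axis=0)  # drop the placeholder row
    pred = one_nn_classify(sub, test_set[:, 0:-1])
    errors.append(np.mean(pred != test_set[:, -1]))
print('mean error: {:.3f}, std: {:.3f}'.format(np.mean(errors), np.std(errors)))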
In [6]:
# This workaround is necessary to make sure that every subset
# contains same number of
# samples for each of the three classes
train_subsets2_c1 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 1]
train_subsets2_c1[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
train_subsets2_c2 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 2]
train_subsets2_c2[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
train_subsets2_c3 = {}
for p in reversed([i/10 for i in range(1,11)]):
t = train_set[train_set[:,2] == 3]
train_subsets2_c3[int(p*100)] =\
        t[np.random.choice(t.shape[0], int(t.shape[0] * p), replace=False),:]
# combine separate-class subsets for convenience
train_subsets2 = {}
for set_size in sorted(train_subsets2_c1.keys()):
t_set = np.zeros(shape=(1,3))
for subset in [train_subsets2_c1, train_subsets2_c2, train_subsets2_c3]:
t_set = np.append(t_set, subset[set_size], axis=0)
# delete the first placeholder row used for initializing the array
t_set = np.delete(t_set, 0, axis=0)
train_subsets2[set_size] = t_set
for i in sorted(train_subsets2):
print(i, train_subsets2[i].shape)
In [47]:
one_nn_errors2 = []
for i in train_subsets.keys():
true_class = test_set[:,-1]
predicted_class = one_nn_classify(train_subsets2[i], test_set[:,0:-1])
error = sum([1 for pred,true in zip(predicted_class,true_class)
if pred!=true])/len(test_set)
one_nn_errors2.append(error)
In [48]:
mu_estimates2 = {}
for set_size in sorted(train_subsets2):
mu1_est = np.sum(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 1], axis=0)/\
len(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 1])
mu1_est = mu1_est[0:2].reshape(2,1)
mu2_est = np.sum(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 2], axis=0)/\
len(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 2])
mu2_est = mu2_est[0:2].reshape(2,1)
mu3_est = np.sum(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 3], axis=0)/\
len(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 3])
mu3_est = mu3_est[0:2].reshape(2,1)
mu_estimates2[set_size] = [mu1_est, mu2_est, mu3_est]
cov_estimates2 = {}
for set_size in sorted(train_subsets2):
cov1_est = mle_est_cov(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 1][:,0:2],
mu_estimates2[set_size][0])
cov2_est = mle_est_cov(train_subsets2[set_size]\
[train_subsets2[set_size]\
[:,2] == 2][:,0:2], mu_estimates2[set_size][1])
cov3_est = mle_est_cov(train_subsets2[set_size]\
[train_subsets2[set_size][:,2] == 3][:,0:2],
mu_estimates2[set_size][2])
cov_estimates2[set_size] = [cov1_est, cov2_est, cov3_est]
mle_errors2 = []
for i in range(10,101,10):
classification_dict, error =\
empirical_error(test_set, [1,2,3], classify_data,
[discriminant_function,
mu_estimates2[i],
                         cov_estimates2[i]])
mle_errors2.append(error)
In [49]:
from scipy.stats import kde
kde_errors2 = []
for i in range(10,101,10):
    class1_kde = kde.gaussian_kde(train_subsets2[i]\
        [train_subsets2[i][:,2] == 1].T[0:2], bw_method=1.0)
    class2_kde = kde.gaussian_kde(train_subsets2[i]\
        [train_subsets2[i][:,2] == 2].T[0:2], bw_method=1.0)
    class3_kde = kde.gaussian_kde(train_subsets2[i]\
        [train_subsets2[i][:,2] == 3].T[0:2], bw_method=1.0)
classification_dict, error = empirical_error(test_set,
[1,2,3], bayes_kde_class, [[class1_kde, class2_kde, class3_kde]])
kde_errors2.append(error)
In [69]:
def give_color():
for i in ['red', 'blue', 'green']:
yield i
colors = give_color()
def give_labels():
for i in ['MLE tr_set1', 'MLE tr_set2',
'Parzen-wind. tr_set1', 'Parzen-wind. tr_set2',
'1-nn tr_set1', '1-nn tr_set2']:
yield i
colors = give_color()
labels = give_labels()
f, ax = plt.subplots(figsize=(10, 8))
for err1, err2 in zip(
[mle_errors, kde_errors, one_nn_errors],
[mle_errors2, kde_errors2, one_nn_errors2]):
col = next(colors)
plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)],
err1, color=col, lw=2, label=next(labels))
plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)],
err2, color=col, linestyle='--', lw=2, label=next(labels))
plt.xlabel('training set size')
plt.ylabel('Error rate on test data set')
plt.title('Error rate for different techniques')
plt.ylim([0, 0.12])
plt.legend(loc='lower right')
plt.show()
Significant changes in the performance of the classifiers can be observed across
the different sample sizes. As mentioned, this is most likely to occur for small
training dataset sizes, here for example n = 315 (30% of the original training dataset).
It is expected that the performance of all 3 techniques
(MLE, Parzen-window kernel density estimation, 1-nearest neighbor) improves with
increasing training set size, since all of the underlying estimates are expected to converge:
$\lim\limits_{n \rightarrow \infty}{p_n(\pmb x)} = p(\pmb x)$
However, this effect is not very pronounced for our dataset:
the error rates of all 3 approaches fluctuate around an average value.
A wider range of sample sizes would probably be necessary
to make this effect more obvious.
iv. Do you think it is necessary for the number of training patterns per class to be the same?
It is not necessary. However, if the class priors are not equal,
we would have to weight the class-conditional density estimates by the priors
in the Parzen-window technique and the 1-nearest-neighbor technique.
For the MLE-based classifier, we would no longer be able to drop the prior probabilities
from Bayes' rule, which would lead to different discriminant functions.
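If the class priors were unequal, one way to account for them in the parametric classifier is to add the log-prior back into each discriminant. A minimal sketch, assuming the priors are estimated from the class frequencies in the training set (hypothetical helper, reusing discriminant_function from above):
In [ ]:
import numpy as np

# Estimated priors from the (possibly unbalanced) training set
priors = [np.mean(train_set[:, 2] == c) for c in [1, 2, 3]]

def classify_with_priors(x_vec, mu_vecs, cov_mats, priors):
    # argmax_i [ g_i(x) + ln P(w_i) ], reusing discriminant_function defined above
    g_vals = [discriminant_function(x_vec, cov_mat=c, mu_vec=m) + np.log(p)
              for m, c, p in zip(mu_vecs, cov_mats, priors)]
    return int(np.argmax(g_vals)) + 1
For the Parzen-window classifier, the analogous change would be to maximize $p(\pmb x\;|\;\omega_j)\,P(\omega_j)$ instead of $p(\pmb x\;|\;\omega_j)$ alone.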
[10 points] In a New York Times article from 2003
(see http://writing.upenn.edu/~afilreis/88v/kurzweil.html), Ray Kurzweil
is quoted as saying “The real power of human thinking is
based on recognizing patterns. The better computers get at pattern recognition,
the more humanlike they will become.” What are your thoughts on this statement?
To assess this statement, one needs to know what exactly Ray Kurzweil
defines as "humanlike". However, if he only refers to the ability to recognize
patterns, I would agree that computers will become more "humanlike"
in terms of making "appropriate" decisions (or performing a specific action)
for a particular situation.
Limitations of pattern recognition for computers are the amount of available data
and the complexity of the data. The more powerful computers become, the more data
can be processed (given an infinite number of possible training samples) for learning
to distinguish more patterns. Also, more complex input data can be processed, and more
complex features can be incorporated into the classifier design.
However, computers are still limited to distinguishing patterns based on
numerical data: the input has to be converted into numbers. This may not be
practical for all sorts of patterns. For example, emotions on a human face
may be challenging to break down into numerical data.
Also, humans have the ability to make good decisions based on intuition,
where "data" or "experience" may not be available (i.e., for novel patterns
that are encountered for the first time). Since computers have to be pre-programmed
and rely on data for the learning task, recognizing such patterns and choosing an
appropriate response might be very challenging (even if clever scientists develop recursive ways
for a computer to learn novel patterns, this still requires previous data).
Overall, computers are improving at pattern recognition tasks through recursive
self-learning algorithms; however, the complexity of humanlike behavior includes
more aspects than just the recognition of patterns. So I agree that computers may become
"more" humanlike, but without becoming "entirely" humanlike.
In [ ]: