Project

CSE 802 - Pattern Recognition and Analysis Instructor: Dr. Arun Ross
Due Date: May 2, 5:00pm

Sebastian Raschka
Last updated: 04/24/2014

Sections

1)
- Description of task 1
- About the dataset
- Reading in and analyzing the dataset
- Dividing the dataset radomly in training (70%) and test (30%) sets
- Visualizing the 3 classes in a scatter plot
- 1 a)
  - Description of task 1 a)
  - Discriminant Functions
  - Decision Boundaries
  - Implementing the Discriminant Function for arbitrary covariance matrices
  - Implementing the descision rule
  - Classifying data and calculating the empirical error
    - Empirical error of the training dataset
    - Empirical error of the test dataset
- 1 b)
  - Description of task 1 b)
  - About the Maximum Likelihood Estimate (MLE)
  - MLE of the mean vector $\pmb\mu$
  - MLE of the covariance matrix $\pmb\Sigma$
  - Error on the test set using the estimated parameters
- 1 c)
  - Description of task 1 c)
  - Implementing the classifier using Bayes' decision rule
  - Error on the test set using the likelihoods from the Gaussian kernel estimate of the Parzen-window technique](#error_parzen)
  - Using the Parzen-window technique for Gaussian kernel desnity estimation
- 1 d)
  - Description of task 1 d)
  - About the k-Nearest Neighbor Technique
  - Implementing the code for finding the 1-Nearest Neighbor
- 1 e)
  - Description of task 1 e)
  - Creating training sets with variable size
  - MLE error rate for varying training set sizes
  - Error rate of kernel density estimaton via Parzen-window for varying training sizes
  - Error rate of the k-nearest-neighbor technique for varying training sizes
  - Plotting the error rate as a function of the size of the training set for each of the 3 cases
- 1 f) - Discussion
  - 1 f) i. - Best Classifier
  - 1 f) ii. - Performance changes as function of patterns
  - 1 f) iii. - Performance changes and the number of data patterns
  - 1 f) iv. Number of training patterns per class
2 - Thoughts on a Ray Kurzweil quote

1)

Description of Task 1):

[65 points] Consider the dataset available here: project_data.txt.
It consists of two-dimensional patterns, $\pmb{x} = [x_1, x_2]^t$ , pertaining to
3 classes $(\omega_1,\omega_2,\omega_3)$. The feature values are indicated in the first two
columns while the class labels are specified in the last column. The priors
of all 3 classes are the same. Randomly partition this dataset into a training set
(70% of each class) and a test set (30% of each class).

About the dataset

The data set consists of 1500 rows and 3 columns, where

column 1: $x_1$ values
column 2: $x_2$ values
column 3: class labels

Excerpt from the dataset:
1.8569 -3.4702 1 -0.2096 -2.8342 1 -1.0265 2.1614 1 [...] 9.3851 4.0336 2 10.1375 1.1495 2 11.7569 0.8005 2 [...] 3.9854 5.1360 3 2.7592 5.9536 3 4.1379 4.3258 3 [...]

Reading in and analyzing the dataset



In [2]:

    
import numpy as np

np.random.seed(1234568)



In [3]:

    
all_projdata = np.genfromtxt('./data/project_data.txt', delimiter=' ')

# Test if the data is read in in the correct dimensions:
assert(all_projdata.shape == (1500, 3))
for c_label in range(1,4): # class labels 1-3 are in column 3
    assert(all_projdata[all_projdata[:,2] == c_label].shape == (500,3))

# Print basic statistics:
for column in range(2):
    print(50 * '-')
    print("range of x_{}: ({}, {})".format(column+1, 
                min(all_projdata[:,column]), max(all_projdata[:,0])))
    print("mean of x_{}: {}".format(column+1, 
                np.mean(all_projdata[:,column])))
    print("standard_deviation of x_{}: {}".format(
            column+1, np.std(all_projdata[:,column])))









    



--------------------------------------------------
range of x_1: (-6.8114, 17.4559)
mean of x_1: 5.027464333333326
standard_deviation of x_1: 4.574803380930654
--------------------------------------------------
range of x_2: (-7.9943, 17.4559)
mean of x_2: 1.6648250000000002
standard_deviation of x_2: 3.156841470283494

Dividing the dataset radomly in training (70%) and test (30%) sets



In [4]:

    
# Accomodation, since numpy arrays cannot be empty
test_set = np.zeros(shape=(1,3))
train_set = np.zeros(shape=(1,3))


for i in range(0,3):
    cl_shuffle = np.copy(all_projdata[all_projdata[:,2] == i+1])
    np.random.shuffle(cl_shuffle)
    test_set = np.append(test_set, 
                         cl_shuffle[0:150,:], axis=0)
    train_set = np.append(train_set, 
                          cl_shuffle[150:,:], axis=0)
                                  
# delete the first placeholder 
# row used for initializing the array                                  
test_set = np.delete(test_set, 0, axis=0)
train_set = np.delete(train_set, 0, axis=0)

assert(test_set.shape == (1500*0.3,3))
assert(train_set.shape == (1500*0.7,3))

Visualizing the 3 classes in a scatter plot



In [8]:

    
#%pylab inline

import numpy as np
from matplotlib import pyplot as plt

# Plotting training and test datasets
for dset,title in zip((test_set, train_set), ['Test', 'Training']):
    f, ax = plt.subplots(figsize=(7, 7))
    ax.scatter(dset[dset[:,2] == 1][:,0], 
            dset[dset[:,2] == 1][:,1], 
            marker='o', color='green', s=40, alpha=0.5, label='$\omega_1$')
    ax.scatter(dset[dset[:,2] == 2][:,0], 
               dset[dset[:,2] == 2][:,1],
               marker='^', color='red', s=40, alpha=0.5, label='$\omega_2$')
    ax.scatter(dset[dset[:,2] == 3][:,0], 
               dset[dset[:,2] == 3][:,1], 
               marker='s', color='blue', s=40, alpha=0.5, label='$\omega_3$')
    plt.legend(loc='upper right') 
    plt.title('{} Dataset'.format(title), size=20)
    plt.ylabel('$x_2$', size=20)
    plt.xlabel('$x_1$', size=20)
    plt.show()

1a)

Description

Let

$p([x_1, x_2]^t |\omega_1) ∼ N([0,0]^t,4I), \\ p([x_1, x_2]^t |\omega_2) ∼ N([10,0]^t,4I), \\ p([x_1, x_2]^t |\omega_3) ∼ N([5,5]^t,5I),$

where $I$ is the $2 \times 2$ identity matrix.
What is the error rate on the test set when the
Bayesian decision rule is employed for classification?

Bayes' Rule:

$ P(\omega_j|x) = \frac{p(x|\omega_j) * P(\omega_j)}{p(x)}$

with prior probabilities: $P(\omega_1) \; = P(\omega_2) \; = P(\omega_3)\; = \frac{1}{3}$

Discriminant Functions:

The objective function is to maximize the discriminant function,
which we define as the posterior probability here to perform
a minimum-error classification (Bayes classifier).

$ g_1(\pmb x) = P(\omega_1 | \; \pmb{x}), \quad g_2(\pmb{x}) = P(\omega_2 | \; \pmb{x}), \quad g_3(\pmb{x}) = P(\omega_2 | \; \pmb{x})$

$ \Rightarrow g_1(\pmb{x}) = P(\pmb{x}\;|\;\omega_1) \;\cdot\; P(\omega_1) \quad | \; ln \\ \quad g_2(\pmb x) = P(\pmb{x}\;|\;\omega_2) \;\cdot\; P(\omega_2) \quad | \; ln \\ \quad g_3(\pmb x) = P(\pmb{x}\;|\;\omega_3) \;\cdot\; P(\omega_3) \quad | \; ln$

We can drop the prior probabilities (since we have equal priors in this case):

$ \Rightarrow g_1(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_1)) \\ \quad g_2(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_2)) \\ \quad g_3(\pmb{x}) = ln(P(\pmb{x}\;|\;\omega_3))$

The equations for the general multivariate normal case
with arbitrary covariance matrices
(i.e., different covariance matrices for each case):

$ \Rightarrow g_1(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_1^{-1} \bigg) \pmb{x} + \bigg( \Sigma_1^{-1} \pmb{\mu}_{\,1}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,1}^{\,t} \Sigma_{1}^{-1} \pmb{\mu}_{\,1} -\frac{1}{2} ln(|\Sigma_1|)\bigg)$

$g_2(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_2^{-1} \bigg) \pmb{x} + \bigg( \Sigma_2^{-1} \pmb{\mu}_{\,2}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,2}^{\,t} \Sigma_{2}^{-1} \pmb{\mu}_{\,2} -\frac{1}{2} ln(|\Sigma_2|)\bigg)$

$g_3(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_3^{-1} \bigg) \pmb{x} + \bigg( \Sigma_3^{-1} \pmb{\mu}_{\,3}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,3}^{\,t} \Sigma_{3}^{-1} \pmb{\mu}_{\,3} -\frac{1}{2} ln(|\Sigma_3|)\bigg)$

Let:

$\pmb{W}_{i} = - \frac{1}{2} \Sigma_i^{-1}\\ \pmb{w}_i = \Sigma_i^{-1} \pmb{\mu}_{\,i}\\ \omega_{i0} = \bigg( -\frac{1}{2} \pmb{\mu}_{\,i}^{\,t}\; \Sigma_{i}^{-1} \pmb{\mu}_{\,i} -\frac{1}{2} ln(|\Sigma_i|)\bigg)$

$ \pmb{W}_{1} = \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] $

$ \pmb{w}_{1} = \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg]$

$ \omega_{10} = -\frac{1}{2} [0 \quad 0 ] \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 0 \\ 0 \\ \end{array} \bigg]

ln(4) = -ln(4)$

with $ \quad g_1(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_1^{-1} \bigg) \pmb{x} + \bigg( \Sigma_1^{-1} \pmb{\mu}_{\,1}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,1}^{\,t} \Sigma_{1}^{-1} \pmb{\mu}_{\,1} -\frac{1}{2} ln(|\Sigma_1|)\bigg) $

$ \Rightarrow g_1(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] \pmb{x} + [0 \quad 0 ] \; \pmb x - ln(4) = \pmb{x}^{t} (-\frac{1}{8} \; \pmb{x}) - 2ln(2) $

$ \Rightarrow g_1(\pmb{x}) = - \frac{1}{8} (x^{2}_1 + x^{2}_2 ) - 2ln(2)$

$ \pmb{W}_{2} = \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] $

$ \pmb{w}_{2} = \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 10 \\ 0 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 2.5 \\ 0 \\ \end{array} \bigg]$

$ \omega_{20} = -\frac{1}{2} [10 \quad 0 ] \bigg[ \begin{array}{cc} (1/4) & 0\\ 0 & (1/4) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 10 \\ 0 \\ \end{array} \bigg]

ln(4) = -12.5-ln(4)$

with $ \quad g_2(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_2^{-1} \bigg) \pmb{x} + \bigg( \Sigma_2^{-1} \pmb{\mu}_{\,2}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,2}^{\,t} \Sigma_{2}^{-1} \pmb{\mu}_{\,2} -\frac{1}{2} ln(|\Sigma_2|)\bigg) $

$ \Rightarrow g_2(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/8) & 0\\ 0 & -(1/8) \\ \end{array} \bigg] \pmb{x} + [2.5 \quad 0 ]\;\pmb x - 12.5 - ln(4) = \pmb{x}^{t} \cdot( -\frac{1}{8} \; \pmb{x}) + 2.5 x_1 - 12.5 - 2ln(2) $

$ \Rightarrow g_2(\pmb{x}) = - \frac{1}{8} (x^{2}_1 + x^{2}_2) +2.5 x_1 - 12.5 - 2ln(2)$

$ \pmb{W}_{3} = \bigg[ \begin{array}{cc} -(1/10) & 0\\ 0 & -(1/10) \\ \end{array} \bigg] $

$ \pmb{w}_{3} = \bigg[ \begin{array}{cc} (1/5) & 0\\ 0 & (1/5) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 5 \\ 5 \\ \end{array} \bigg] = \bigg[ \begin{array}{c} 1 \\ 1 \\ \end{array} \bigg]$

$ \omega_{30} = -\frac{1}{2} [5 \quad 5 ] \bigg[ \begin{array}{cc} (1/5) & 0\\ 0 & (1/5) \\ \end{array} \bigg] \cdot \bigg[ \begin{array}{c} 5 \\ 5 \\ \end{array} \bigg]

ln(5) = 5 -ln(5)$

with $ \quad g_3(\pmb{x}) = \pmb{x}^{\,t} \bigg( - \frac{1}{2} \Sigma_3^{-1} \bigg) \pmb{x} + \bigg( \Sigma_3^{-1} \pmb{\mu}_{\,3}\bigg)^t \pmb x + \bigg( -\frac{1}{2} \pmb{\mu}_{\,3}^{\,t} \Sigma_{3}^{-1} \pmb{\mu}_{\,3} -\frac{1}{2} ln(|\Sigma_3|)\bigg)$

$ \Rightarrow g_3(\pmb{x}) = \pmb{x}^{t} \bigg[ \begin{array}{cc} -(1/10) & 0\\ 0 & -(1/10) \\ \end{array} \bigg] \pmb{x} + [1 \quad 1 ]\; \pmb x + 5 - ln(5) = \pmb{x}^{t} \cdot ( - \frac{1}{10})\; \pmb x )+ \; {x_1} + {x_2} + 5 - ln(5) $

$ \Rightarrow g_3(\pmb{x}) = - \frac{1}{10} (x^{2}_1 + x^{2}_2)+ \; {x_1} + {x_2} + 5 - ln(5)$

Implementing the Discriminant Function for arbitrary covariance matrices



In [10]:

    
def discriminant_function(x_vec, cov_mat, mu_vec):
    """
    Calculates the value of the discriminant function for a dx1 dimensional
    sample given covariance matrix and mean vector.
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        cov_mat: numpy array of the covariance matrix.
        mu_vec: dx1 dimensional numpy array of the sample mean.
    
    Returns a float value as result of the discriminant function.
    
    """
    W_i = (-1/2) * np.linalg.inv(cov_mat)
    assert(W_i.shape[0] > 1 and W_i.shape[1] > 1),\
        'W_i must be a matrix'
    
    w_i = np.linalg.inv(cov_mat).dot(mu_vec)
    assert(w_i.shape[0] > 1 and w_i.shape[1] == 1),\
        'w_i must be a column vector'
    
    omega_i_p1 = (((-1/2) * (mu_vec).T).dot(np.linalg.inv(cov_mat))).dot(mu_vec)
    omega_i_p2 = (-1/2) * np.log(np.linalg.det(cov_mat))
    omega_i = omega_i_p1 - omega_i_p2
    assert(omega_i.shape == (1, 1)), 'omega_i must be a scalar'
    
    g = ((x_vec.T).dot(W_i)).dot(x_vec) + (w_i.T).dot(x_vec) + omega_i
    return float(g)

Implementing the decision rule



In [11]:

    
import operator

def classify_data(x_vec, g, mu_vecs, cov_mats):
    """
    Classifies an input sample into 1 out of 3 classes determined by
    maximizing the discriminant function g_i().
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        g: The discriminant function.
        mu_vecs: A list of mean vectors as input for g.
        cov_mats: A list of covariance matrices as input for g.
    
    Returns a tuple (g_i()_value, class label).
    
    """
    assert(len(mu_vecs) == len(cov_mats)),\
        'Number of mu_vecs and cov_mats must be equal.'
    
    g_vals = []
    for m,c in zip(mu_vecs, cov_mats): 
        g_vals.append(g(x_vec, mu_vec=m, cov_mat=c))
    
    max_index, max_value = max(enumerate(g_vals), 
                               key=operator.itemgetter(1))
    return (max_value, max_index + 1)

Classifying data and calculating the empirical error



In [18]:

    
# Known parameters
cov_mat_1 = 4 * np.eye(2,2)
mu_vec_1 = np.array([[0],[0]])

cov_mat_2 = 4 * np.eye(2,2)
mu_vec_2 = np.array([[10],[0]])

cov_mat_3 = 5 * np.eye(2,2)
mu_vec_3 = np.array([[5],[5]])

Empirical error of the training dataset



In [19]:

    
def empirical_error(data_set, classes, classifier_func, classifier_func_args):
    """
    Keyword arguments:
        data_set: 'n x d'- dimensional numpy array, 
            class label in the last column.
        classes: List of the class labels.
        classifier_func: Function that returns the 
            max argument from the discriminant function.
            evaluation and the class label as a tuple.
        classifier_func_args: List of arguments for the 'classifier_func'.
    
    Returns a tuple, consisting of a dictionary 
        with the classif. counts and the error.
    
    e.g., ( {1: {1: 321, 2: 5}, 2: {1: 0, 2: 317}}, 0.05)
    where keys are class labels, and values are 
    sub-dicts counting for which class (key)
    how many samples where classified as such.
    
    """
    class_dict = {i:{j:0 for j in classes} for i in classes}

    for cl in classes:
        for row in data_set[data_set[:,-1] == cl][:,:-1]:
            g = classifier_func(row, *classifier_func_args)
            class_dict[cl][g[1]] += 1
    
    correct = 0
    for i in classes:
        correct += class_dict[i][i]
    
    misclass = data_set.shape[0] - correct
    return (class_dict, misclass / data_set.shape[0])



In [231]:

    
import prettytable

classification_dict, error = empirical_error(train_set, [1,2,3], 
        classify_data, [discriminant_function,
        [mu_vec_1, mu_vec_2, mu_vec_3],
        [cov_mat_1, cov_mat_2, cov_mat_3]])

labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'training dataset')

train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = [classification_dict[i][j] for j in [1,2,3]]
    # workaround to unpack (since Python does not support just '*a')
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))









    



+------------------+----------------+----------------+----------------+
| training dataset | w1 (predicted) | w2 (predicted) | w3 (predicted) |
+------------------+----------------+----------------+----------------+
|   w1 (actual)    |      321       |       5        |       24       |
|   w2 (actual)    |       0        |      317       |       33       |
|   w3 (actual)    |       8        |       13       |      329       |
+------------------+----------------+----------------+----------------+
Empirical Error: 0.08 (7.90%)

Empirical error of the test dataset



In [20]:

    
import prettytable

classification_dict, error = empirical_error(test_set, [1,2,3], 
        classify_data, [discriminant_function,
        [mu_vec_1, mu_vec_2, mu_vec_3],
        [cov_mat_1, cov_mat_2, cov_mat_3]])

labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')

train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = [classification_dict[i][j] for j in [1,2,3]]
    # workaround to unpack (since Python does not support just '*a')
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))









    



+--------------+----------------+----------------+----------------+
| test dataset | w1 (predicted) | w2 (predicted) | w3 (predicted) |
+--------------+----------------+----------------+----------------+
| w1 (actual)  |      142       |       0        |       8        |
| w2 (actual)  |       1        |      142       |       7        |
| w3 (actual)  |       1        |       3        |      146       |
+--------------+----------------+----------------+----------------+
Empirical Error: 0.04 (4.44%)

1b)

Suppose $p([x_1,x_2]^t\;|\;\omega_i) \;∼ \; N(\pmb \mu_i,\pmb \Sigma_i), \; i = 1,2,3$, where the $\pmb \mu_i$’s and $\pmb \Sigma_i$’s are unknown.
Use the training set to compute the MLE of the μi’s and the $\pmb \Sigma_i$’s.
What is the error rate on the test set when the Bayes decision rule using
the estimated parameters is employed for classification?

About the Maximum Likelihood Estimate (MLE)

In contrast to task one, now we only know the number of parameters for the class conditional densities
$p (\; \pmb x \; | \; \omega_i)$ and we want to use a Maximum Likelihood Estimatation (MLE)
to estimate the quantities of these parameters from the training data.

Given the information about the form of the model - the data is normal distributed -
the 2 parameters to be estimated are $\pmb \mu_i$ and $\pmb \Sigma_i$,
which are summarized by the parameter vector
$\pmb \theta_i = \bigg[ \begin{array}{c} \ \theta_{i1} \\ \ \theta_{i2} \\ \end{array} \bigg]= \bigg[ \begin{array}{c} \pmb \mu_i \\ \pmb \Sigma_i \\ \end{array} \bigg]$

For the Maximum Likelihood Estimate (MLE), we assume that we have a set of samles
$D = \left\{ \pmb x_1, \pmb x_2,..., \pmb x_n \right\} $ that are i.i.d.
(independent and identically distributed, drawn with probability
$p(\pmb x \; | \; \omega_i, \; \pmb \theta_i) $).
Thus, we can work with each class separately and omit the class labels,
so that we write the probability density as $p(\pmb x \; | \; \pmb \theta)$

Likelihood of $ \pmb \theta $

Thus, the probability of observing $D = \left\{ \pmb x_1, \pmb x_2,..., \pmb x_n \right\} $ is:

$p(D\; | \; \pmb \theta\;) = p(\pmb x_1 \; | \; \pmb \theta\;)\; \cdot \; p(\pmb x_2 \; | \;\pmb \theta\;) \; \cdot \;... \; p(\pmb x_n \; | \; \pmb \theta\;) = \prod_{k=1}^{n} \; p(\pmb x_k \pmb \; | \; \pmb \theta \;)$

Where $p(D\; | \; \pmb \theta\;)$ is also called the likelihood of $\pmb\ \theta$.

We are given the information that $p([x_1,x_2]^t) \;∼ \; N(\pmb \mu,\pmb \Sigma) $
(remember that we dropped the class labels,
since we are working with every class separately).

And the mutlivariate normal density is given as:
$\quad \quad p(\pmb x) = \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]$

So that
$p(D\; | \; \pmb \theta\;) = \prod_{k=1}^{n} \; p(\pmb x_k \pmb \; | \; \pmb \theta \;) = \prod_{k=1}^{n} \; \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]$

Since it is easier to work with the logarithm, we define the log-likelihood function $l(\pmb \theta)$ as
$l(\pmb \theta) \equiv ln\; p(D\; | \; \pmb \theta\;) = \sum\limits_{k=1}^{n} \;ln\; p(\pmb x_k \pmb \; | \; \pmb \theta \;)$

and the log of the multivariate density

$ l(\pmb\theta) = \sum\limits_{k=1}^{n} - \frac{1}{2}(\pmb x - \pmb \mu)^t \pmb \Sigma^{-1} \; (\pmb x - \pmb \mu) - \frac{d}{2} \; ln \; 2\pi - \frac{1}{2} \;ln \; |\pmb\Sigma|$

Maximum Likelihood Estimate (MLE)

In order to obtain the MLE $\boldsymbol{\hat{\theta}}$, we maximize $l (\pmb \theta)$, which can be done via differentiation:

with $\nabla_{\pmb \theta} \equiv \begin{bmatrix} \frac{\partial \; }{\partial \; \theta_1} \\ \frac{\partial \; }{\partial \; \theta_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial \; }{\partial \; \pmb \mu} \\ \frac{\partial \; }{\partial \; \pmb \sigma} \end{bmatrix}$

$\nabla_{\pmb \theta} l = \sum\limits_{k=1}^n \nabla_{\pmb \theta} \;ln\; p(\pmb x| \pmb \theta) = 0 $

MLE of the mean vector $\pmb \mu$

The MLE of the parameter $\pmb\mu$ is given by the equation:
${\hat{\pmb\mu}} = \frac{1}{n} \sum\limits_{k=1}^{n} \pmb x_k$



In [21]:

    
import prettytable

mu_est_1 = np.sum(train_set[train_set[:,2] == 1], 
                  axis=0)/len(train_set[train_set[:,2] == 1])
mu_est_1 = mu_est_1[0:2].reshape(2,1)
mu_est_2 = np.sum(train_set[train_set[:,2] == 2], 
                  axis=0)/len(train_set[train_set[:,2] == 2])
mu_est_2 = mu_est_2[0:2].reshape(2,1)
mu_est_3 = np.sum(train_set[train_set[:,2] == 3], 
                  axis=0)/len(train_set[train_set[:,2] == 3])
mu_est_3 = mu_est_3[0:2].reshape(2,1)

print(mu_est_1)
mu_mle = prettytable.PrettyTable(["", "mu_1", "mu_2", "mu_3"])
mu_mle.add_row(["MLE",mu_est_1, mu_est_2, mu_est_3])
mu_mle.add_row(["actual",mu_vec_1, mu_vec_2, mu_vec_3])

print(mu_mle)









    



[[ 0.19298829]
 [-0.04618914]]
+--------+-----------------+-----------------+-----------------+
|        |       mu_1      |       mu_2      |       mu_3      |
+--------+-----------------+-----------------+-----------------+
|  MLE   |  [[ 0.19298829] |  [[ 9.916212  ] |  [[ 5.06426086] |
|        |  [-0.04618914]] |  [ 0.05908629]] |  [ 4.88450171]] |
| actual |       [[0]      |      [[10]      |       [[5]      |
|        |       [0]]      |       [ 0]]     |       [5]]      |
+--------+-----------------+-----------------+-----------------+

MLE of the covariance matrix $\pmb \Sigma$

The MLE of the parameter $\pmb\Sigma$ is given by the equation:

${\hat{\pmb\Sigma}} = \frac{1}{n} \sum\limits_{k=1}^{n} (\pmb x_k - \hat{\mu})(\pmb x_k - \hat{\mu})^t$



In [22]:

    
def mle_est_cov(x_samples, mu_est):
    """
    Calculates the Maximum Likelihood Estimate for the covariance matrix.
    
    Keyword Arguments:
        x_samples: np.array of the samples for 1 class, n x d dimensional 
        mu_est: np.array of the mean MLE, d x 1 dimensional
        
    Returns the MLE for the covariance matrix as d x d numpy array.
    
    """
    cov_est = np.zeros((2,2))
    for x_vec in x_samples:
        x_vec = x_vec.reshape(2,1)
        assert(x_vec.shape == mu_est.shape),\
        'mean and x vector hmust be of equal shape'
        cov_est += (x_vec - mu_est).dot((x_vec - mu_est).T)
    return cov_est / len(x_samples)



In [23]:

    
import prettytable

class1_train_samples1 = train_set[train_set[:,2] == 1][:,0:2]
cov_est_1 = mle_est_cov(class1_train_samples1, mu_est_1)

class1_train_samples2 = train_set[train_set[:,2] == 2][:,0:2]
cov_est_2 = mle_est_cov(class1_train_samples2, mu_est_2)

class1_train_samples3 = train_set[train_set[:,2] == 3][:,0:2]
cov_est_3 = mle_est_cov(class1_train_samples3, mu_est_3)

cov_mle = prettytable.PrettyTable(["", "covariance_matrix_1",\
                                   "covariance_matrix_2", "covariance_matrix_3"])
cov_mle.add_row(["MLE", cov_est_1, cov_est_2, cov_est_3])
cov_mle.add_row(['','','',''])
cov_mle.add_row(["actual", cov_mat_1, cov_mat_2, cov_mat_3])

print(cov_mle)









    



+--------+---------------------------------------+-----------------------------+-----------------------------+
|        |          covariance_matrix_1          |     covariance_matrix_2     |     covariance_matrix_3     |
+--------+---------------------------------------+-----------------------------+-----------------------------+
|  MLE   |  [[  4.99895563e+00   2.48237701e-03] |  [[ 5.4868865  -0.02521011] |  [[ 3.95020679 -0.02051392] |
|        |  [  2.48237701e-03   4.73000833e+00]] |  [-0.02521011  5.05639885]] |  [-0.02051392  4.51383839]] |
|        |                                       |                             |                             |
| actual |               [[ 4.  0.]              |          [[ 4.  0.]         |          [[ 5.  0.]         |
|        |               [ 0.  4.]]              |          [ 0.  4.]]         |          [ 0.  5.]]         |
+--------+---------------------------------------+-----------------------------+-----------------------------+

Error on the test set using the estimated parameters

Using the estimated parameters $\pmb \mu_i$ and $\pmb \Sigma_i$, which we obtained
via MLE, we calculate the error on the test data set again.



In [24]:

    
import prettytable

classification_dict, error = empirical_error(test_set, 
        [1,2,3], classify_data, [discriminant_function,
        [mu_est_1, mu_est_2, mu_est_3],
        [cov_est_1, cov_est_2, cov_est_3]])

labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')

train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = [classification_dict[i][j] for j in [1,2,3]]
    # workaround to unpack (since Python does not support just '*a')
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))









    



+--------------+----------------+----------------+----------------+
| test dataset | w1 (predicted) | w2 (predicted) | w3 (predicted) |
+--------------+----------------+----------------+----------------+
| w1 (actual)  |      143       |       0        |       7        |
| w2 (actual)  |       1        |      147       |       2        |
| w3 (actual)  |       6        |       5        |      139       |
+--------------+----------------+----------------+----------------+
Empirical Error: 0.05 (4.67%)

1 c)

Description

Suppose the form of the distributions of $p([x_1, x_2]^t\;|\;\omega_i),\; i = 1,2,3$ is unknown.
Assume that the training dataset can be used to estimate the density at a
point using the Parzen window technique (a spherical Gaussian kernel with h = 1).
What is the error rate on the test set when the Bayes decision rule is employed for classification?

The density for a point x is defined as

$p(\pmb x) = \frac{k / n}{V}$

where k is the number of points inside a specified region, n is the
number of total points, and V is the volume of that region.

Starting with a simple example to count the number $k_n$, the
equation using a hypercube would be
$k_n = \sum\limits_{i=1}^{n} \phi \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$

where $\phi(\pmb u)$ is the window function, and here: $\pmb u = \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$

Based on this window, we can now formulate the Parzen-window
estimation with a hypercube kernel as follows:

$P_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg]$

where
$h^d = V_n\quad$ and $\quad\phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg] = k$

For the Gaussian kernel (instead of the hypercube), we can just simply swap the terms of the window function by:

$\frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg]$

So that we have the Parzen window estimation becomes:

$P_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \Bigg[ \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg] \Bigg]$

Using the Parzen-window technique for Gaussian kernel desnity estimation



In [25]:

    
from scipy.stats import kde
class1_kde = kde.gaussian_kde(
    train_set[train_set[:,2] == 1].T[0:2], bw_method=1.0)
class2_kde = kde.gaussian_kde(
    train_set[train_set[:,2] == 2].T[0:2], bw_method=1.0)
class3_kde = kde.gaussian_kde(
    train_set[train_set[:,2] == 3].T[0:2], bw_method=1.0)



In [26]:

    
%pylab inline
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.mlab import bivariate_normal
from mpl_toolkits.mplot3d import Axes3D

X = np.linspace(-10, 20, 100)
Y = np.linspace(-10, 20, 100)
X,Y = np.meshgrid(X,Y)

for bwmethod,title in zip([class1_kde, class2_kde, class3_kde], 
                          ['class1', 'class2', 'class3']):
    fig = plt.figure(figsize=(10, 7))
    ax = fig.gca(projection='3d')
    Z = bwmethod(np.array([X.ravel(),Y.ravel()]))
    Z = Z.reshape(100,100)
    surf = ax.plot_surface(X, Y, Z, rstride=1, 
        cstride=1, cmap=plt.cm.coolwarm,
        linewidth=0, antialiased=False)

    ax.set_zlim(0, 0.02)
    ax.zaxis.set_major_locator(plt.LinearLocator(10))
    ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.set_zlabel('p(x)')
    
    plt.title('Gaussian kernel, bw_method %s' %title)
    fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
    plt.show()









    



Populating the interactive namespace from numpy and matplotlib

Implementing the classifier using Bayes' decision rule

$decide \; \omega_j \; for \;\; max \;\bigg[ P(\omega_j|x) = \frac{p(x|\omega_j) * P(\omega_j)}{p(x)} \bigg]\;, \;\; j = 1, 2, 3 $

We can remove the prior probabilities (equal priors) and the scale factor:

$decide \; \omega_j \; for \;\; max \;[ p(x|\omega_j) ]\;, \;\; j = 1, 2, 3 $



In [27]:

    
import operator

def bayes_classifier(x_vec, kdes):
    """
    Classifies an input sample into class w_j determined by
    maximizing the class conditional probability for p(x|w_j).
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        kdes: List of the gausssian_kde (kernel density) estimates
      
    Returns a tuple ( p(x|w_j)_value, class label ).
    
    """
    p_vals = []
    for kde in kdes:
        p_vals.append(kde.evaluate(x_vec))
    max_index, max_value = max(enumerate(p_vals), key=operator.itemgetter(1))
    return (max_value, max_index + 1)

Error on the test set using the likelihoods

from the Gaussian kernel estimate of the Parzen-window technique



In [28]:

    
import prettytable

classification_dict, error = empirical_error(test_set, [1,2,3],  
                bayes_classifier, [[class1_kde, class2_kde, class3_kde]])

labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')

train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = [classification_dict[i][j] for j in [1,2,3]]
    # workaround to unpack (since Python does not support just '*a')
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))









    



+--------------+----------------+----------------+----------------+
| test dataset | w1 (predicted) | w2 (predicted) | w3 (predicted) |
+--------------+----------------+----------------+----------------+
| w1 (actual)  |      142       |       0        |       8        |
| w2 (actual)  |       1        |      144       |       5        |
| w3 (actual)  |       3        |       3        |      144       |
+--------------+----------------+----------------+----------------+
Empirical Error: 0.04 (4.44%)

1d)

Description

Suppose the 1-nearest neighbor method is used for classifying the patterns in the test set.
What is the error rate of this approach?

About the k-Nearest Neighbor Technique

As mentioned previously for the Parzen-window approach,
The density for a point x is defined as

$p(\pmb x) = \frac{k / n}{V}$

where k is the number of points inside a specified region, n is the
number of total points, and V is the volume of that region.

However, in contrast to the Parzen-window technique, we don't have a
fixed volume parameter, but control k (the number of points in a region
of specfic volume). In other words, we set the number of points that we want to
capture and compute the volume of the region. Also, the same convergence assumptions
like for the Parzen-window technique apply for the k-Nearest Neighbor technque.

Special case: 1-Nearest Neighbor

Here, we are using a simpler variant of the knn-algorithm,
where we classify a point as belonging to the class of its nearest neighbor.

Implementing the code for finding the 1-Nearest Neighbor



In [ ]:

    
def find_1nearest_neighbor(p, dataset):
    """
    Finds the nearest neighbor of a point in a dataset.
    
    Keyword arguments:
        p: Query point as dx1-dimensional numpy array.
        dataset: Dataset as nxd-dimensional numpy array 
            with class label in the last column.
            
    Returns the nearest neighbor as 1xd dimensional numpy
        array from the dataset.

"""
    p_index = np.argmin([np.inner(p-x, p-x) for x in dataset[:,0:-1]])
    return dataset[p_index]   

p = np.array([[1],[1]])
find_1nearest_neighbor(p, test_set)



In [29]:

    
def one_nn_classify(train_set, data_set):
    """
    Classifies points in a dataset based on the
    nearest neighbor in the training dataset
    (1-nearest neighbor technique).
    
    Keyword arguments:
        train_set: training points as nx3-dimensional numpy array.
            with class label in the last column
        data_set: training points as nx2-dimensional numpy array.
            
    Returns a list of the predicted class labels.

"""
    class_labels = [np.argmin([np.inner(p-x, p-x)  
                               for x in train_set[:,0:-1]])for p in data_set]
    return [int(train_set[cl,-1]) for cl in class_labels]

true_class = test_set[:,-1]
predicted_class = one_nn_classify(train_set, test_set[:,0:-1])   
error = sum([1 for pred,true in zip(predicted_class,true_class)  
             if pred!=true])/len(test_set)
print(error)









    



0.0866666666667



In [30]:

    
import prettytable

c1_pred = [0, 0, 0]
c2_pred = [0, 0, 0]
c3_pred = [0, 0, 0]

for pred,true in zip(predicted_class,true_class):
    if true == 1:
        c1_pred[pred-1] += 1
    elif true == 2:
        c2_pred[pred-1] += 2
    else:
        c3_pred[pred-1] += 3        


labels_predicted = ['w{} (predicted)'.format(i) for i in [1,2,3]]
labels_predicted.insert(0,'test dataset')

train_conf_mat = prettytable.PrettyTable(labels_predicted)
for i in [1,2,3]:
    a, b, c = c1_pred[i-1], c2_pred[i-1], c3_pred[i-1]
    # workaround to unpack (since Python does not support just '*a')
    train_conf_mat.add_row(['w{} (actual)'.format(i), a, b, c])
print(train_conf_mat)
print('Empirical Error: {:.2f} ({:.2f}%)'.format(error, error * 100))









    



+--------------+----------------+----------------+----------------+
| test dataset | w1 (predicted) | w2 (predicted) | w3 (predicted) |
+--------------+----------------+----------------+----------------+
| w1 (actual)  |      139       |       6        |       24       |
| w2 (actual)  |       0        |      276       |       24       |
| w3 (actual)  |       11       |       18       |      402       |
+--------------+----------------+----------------+----------------+
Empirical Error: 0.09 (8.67%)

1e)

Description

Repeat the above three classification procedures by varying the size of the
training set as follows: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% of each class.
Plot the error rate as a function of the size of the training set for each of the 3 cases.

Creating training sets with variable size



In [31]:

    
# This workaround is necessary to make sure that every subset 
# contains same number of
# samples for each of the three classes

train_subsets_c1 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 1]
    train_subsets_c1[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]
    
train_subsets_c2 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 2]
    train_subsets_c2[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]
    
train_subsets_c3 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 3]
    train_subsets_c3[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]

# combine separate-class subsets for convenience
train_subsets = {}
for set_size in sorted(train_subsets_c1.keys()):
    t_set = np.zeros(shape=(1,3))
    for subset in [train_subsets_c1, train_subsets_c2, train_subsets_c3]:        
        t_set = np.append(t_set, subset[set_size], axis=0)
        # delete the first placeholder row used for initializing the array                                  
    t_set = np.delete(t_set, 0, axis=0)
    train_subsets[set_size] = t_set

for i in sorted(train_subsets):
    print(i, train_subsets[i].shape)









    



10 (105, 3)
20 (210, 3)
30 (315, 3)
40 (420, 3)
50 (525, 3)
60 (630, 3)
70 (732, 3)
80 (840, 3)
90 (945, 3)
100 (1050, 3)



In [168]:

    
#Plotting one subset to confirm that the reduction worked error-free

f, ax = plt.subplots(figsize=(7, 7))
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 1][:,0], 
           train_subsets[30][train_subsets[30][:,2] == 1][:,1], \
           marker='o', color='green', s=40, alpha=0.5, label='$\omega_1$')
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 2][:,0], 
           train_subsets[30][train_subsets[30][:,2] == 2][:,1], \
           marker='^', color='red', s=40, alpha=0.5, label='$\omega_2$')
ax.scatter(train_subsets[30][train_subsets[30][:,2] == 3][:,0], 
           train_subsets[30][train_subsets[30][:,2] == 3][:,1], \
           marker='s', color='blue', s=40, alpha=0.5, label='$\omega_3$')
plt.legend(loc='upper right') 
plt.title('Training datset after reducing the size to 30%', size=20)
plt.ylabel('$x_2$', size=20)
plt.xlabel('$x_1$', size=20)
plt.show()

MLE error rate for varying training set sizes

The MLE of the parameter $\pmb\mu$ is given by the equation:
${\hat{\pmb\mu}} = \frac{1}{n} \sum\limits_{k=1}^{n} \pmb x_k$



In [32]:

    
mu_estimates = {}

for set_size in sorted(train_subsets):
    mu1_est = np.sum(train_subsets[set_size]\
                     [train_subsets[set_size][:,2] == 1], axis=0)/\
        len(train_subsets[set_size]\
            [train_subsets[set_size][:,2] == 1])
    mu1_est = mu1_est[0:2].reshape(2,1)
    mu2_est = np.sum(train_subsets[set_size]\
                     [train_subsets[set_size][:,2] == 2], axis=0)/\
        len(train_subsets[set_size]\
            [train_subsets[set_size][:,2] == 2])
    mu2_est = mu2_est[0:2].reshape(2,1)
    mu3_est = np.sum(train_subsets[set_size]\
                     [train_subsets[set_size][:,2] == 3], axis=0)/\
        len(train_subsets[set_size]\
            [train_subsets[set_size][:,2] == 3])
    mu3_est = mu3_est[0:2].reshape(2,1)
    
    mu_estimates[set_size] = [mu1_est, mu2_est, mu3_est]

The MLE of the parameter $\pmb\Sigma$ is given by the equation:
${\hat{\pmb\Sigma}} = \frac{1}{n} \sum\limits_{k=1}^{n} (\pmb x_k - \hat{\mu})(\pmb x_k - \hat{\mu})^t$



In [33]:

    
def mle_est_cov(x_samples, mu_est):
    """
    Calculates the Maximum Likelihood Estimate for the covariance matrix.
    
    Keyword Arguments:
        x_samples: np.array of the samples for 1 class, n x d dimensional 
        mu_est: np.array of the mean MLE, d x 1 dimensional
        
    Returns the MLE for the covariance matrix as d x d numpy array.
    
    """
    cov_est = np.zeros((2,2))
    for x_vec in x_samples:
        x_vec = x_vec.reshape(2,1)
        assert(x_vec.shape == mu_est.shape),\
        'mean and x vector hmust be of equal shape'
        cov_est += (x_vec - mu_est).dot((x_vec - mu_est).T)
    return cov_est / len(x_samples)


cov_estimates = {}

for set_size in sorted(train_subsets):    
    cov1_est = mle_est_cov(train_subsets[set_size]\
        [train_subsets[set_size][:,2] == 1][:,0:2], 
        mu_estimates[set_size][0])       
    cov2_est = mle_est_cov(train_subsets[set_size]\
        [train_subsets[set_size]\
        [:,2] == 2][:,0:2], mu_estimates[set_size][1])   
    cov3_est = mle_est_cov(train_subsets[set_size]\
        [train_subsets[set_size][:,2] == 3][:,0:2], 
        mu_estimates[set_size][2])
    
    cov_estimates[set_size] = [cov1_est, cov2_est, cov3_est]



In [175]:

    
def discriminant_function(x_vec, cov_mat, mu_vec):
    """
    Calculates the value of the discriminant function for a dx1 dimensional
    sample given the covariance matrix and mean vector.
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        cov_mat: numpy array of the covariance matrix.
        mu_vec: dx1 dimensional numpy array of the sample mean.
    
    Returns a float value as result of the discriminant function.
    
    """
    W_i = (-1/2) * np.linalg.inv(cov_mat)
    assert(W_i.shape[0] > 1 and W_i.shape[1] > 1), 'W_i must be a matrix'
    
    w_i = np.linalg.inv(cov_mat).dot(mu_vec)
    assert(w_i.shape[0] > 1 and w_i.shape[1] == 1), 'w_i must be a column vector'
    
    omega_i_p1 = (((-1/2) *\
                   (mu_vec).T).dot(np.linalg.inv(cov_mat))).dot(mu_vec)
    omega_i_p2 = (-1/2) *\
    np.log(np.linalg.det(cov_mat))
    omega_i = omega_i_p1 - omega_i_p2
    assert(omega_i.shape == (1, 1)),\
        'omega_i must be a scalar'
    
    g = ((x_vec.T).dot(W_i)).dot(x_vec) + (w_i.T).dot(x_vec) + omega_i
    return float(g)


def classify_data(x_vec, g, mu_vecs, cov_mats):
    """
    Classifies an input sample into 1 out of 3 classes determined by
    maximizing the discriminant function g_i().
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        g: The discriminant function.
        mu_vecs: A list of mean vectors as input for g.
        cov_mats: A list of covariance matrices as input for g.
    
    Returns a tuple (g_i()_value, class label).
    
    """
    assert(len(mu_vecs) == len(cov_mats)),\
        'Number of mu_vecs and cov_mats must be equal.'
    
    g_vals = []
    for m,c in zip(mu_vecs, cov_mats): 
        g_vals.append(g(x_vec, mu_vec=m, cov_mat=c))
    
    max_index, max_value = max(enumerate(g_vals), 
                               key=operator.itemgetter(1))
    return (max_value, max_index + 1)



In [34]:

    
mle_errors = []
for i in range(10,101,10):
    classification_dict, error =\
            empirical_error(test_set, [1,2,3], classify_data, 
            [discriminant_function,
            mu_estimates[i],
            cov_estimates[i]])
    mle_errors.append(error)

Error rate of kernel density estimaton

via Parzen-window for varying training sizes



In [36]:

    
import operator

def bayes_kde_class(x_vec, kdes):
    """
    Classifies an input sample into class w_j determined by
    maximizing the class conditional probability for p(x|w_j).
    
    Keyword arguments:
        x_vec: A dx1 dimensional numpy array representing the sample.
        kdes: List of the gausssian_kde (kernel density) estimates
      
    Returns a tuple ( p(x|w_j)_value, class label ).
    
    """
    p_vals = []
    for kde in kdes:
        p_vals.append(kde.evaluate(x_vec))
    max_index, max_value = max(enumerate(p_vals), key=operator.itemgetter(1))
    return (max_value, max_index + 1)



In [45]:

    
from scipy.stats import kde
kde_errors = []
for i in range(10,101,10):
    class1_kde = kde.gaussian_kde(train_subsets[i]\
                [train_subsets[i][:,2] == 1].T[0:2], bw_method=1.0)
    class2_kde = kde.gaussian_kde(train_subsets[i]\
                [train_subsets[i][:,2] == 2].T[0:2], bw_method=1.0)
    class3_kde = kde.gaussian_kde(train_subsets[i]\
                [train_subsets[i][:,2] == 3].T[0:2], bw_method=1.0)
    classification_dict, error = empirical_error(test_set, 
            [1,2,3], bayes_kde_class, [[class1_kde, class2_kde, class3_kde]])
    kde_errors.append(error)

Error rate of the k-nearest-neighbor technique for varying training sizes

Plotting the error rate as a

function of the size of the training set for each of the 3 cases



In [40]:

    
def one_nn_classify(train_set, data_set):
    """
    Classifies points in a dataset based on the
    nearest neighbor in the training dataset
    (1-nearest neighbor technique).
    
    Keyword arguments:
        train_set: training points as nx3-dimensional numpy array.
            with class label in the last column
        data_set: training points as nx2-dimensional numpy array.
            
    Returns a list of the predicted class labels.

"""
    class_labels = [np.argmin([np.inner(p-x, p-x)  
                               for x in train_set[:,0:-1]])for p in data_set]
    return [int(train_set[cl,-1]) for cl in class_labels]



In [41]:

    
one_nn_errors = []
for i in train_subsets.keys():
    true_class = test_set[:,-1]
    predicted_class = one_nn_classify(train_subsets[i], test_set[:,0:-1])   
    error = sum([1 for pred,true in zip(predicted_class,true_class)  
             if pred!=true])/len(test_set)
    one_nn_errors.append(error)



In [72]:

    
labels = ['MLE', 'Parzen-window (h=1)', '1=Nearest Neighbor']
f, ax = plt.subplots(figsize=(10, 8))
for err in [mle_errors, kde_errors, one_nn_errors]:
    plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)], err, lw=2)
#plt.ylim([0,0.2])
plt.xlabel('training set size')
plt.ylabel('Error rate on test data set')
plt.title('Error rate for different techniques')
ylim([0,0.12])
plt.legend(labels, loc='lower right')
plt.show()

1f) - Discussion

Discuss you results along the following lines:

i. Which classifier results in the best performance on the test set?

1 f) i. Best Classifier

Which classifier results in the best performance on the test set?

Both parametric methods (here MLE) and non-parametric methods
(here: Parzen-window and 1-nearest neighbor) have good convergence properties.
In practice, it depends on the fact if we know the underlying model of the
data distribution when we are about to choose a parametric vs. a non-parametric approach.
If the model of the distribution is incorrect, the parametric approach yields an more
inaccurate estimate of the parameters.
The non-parametric approaches do not require knowledge about the underlying
distributions, since the estimate the probability density for local regions,
not the whole range of the distribution.
Assuming that the model for the Maximum Likelihood Estimate is correct
(here: multivariate Gaussian distribution) I would expect to observe the
largest error for the 1-nearest neighbor technique, since it is a suboptimal
technique, which can be bounded being 2-times larger than the Bayes error - due to
the fact that the "nearest neighbor" can be far away in regions
of sparsity (esp. in smaller datasets).
The performance of the Parzen-window approach largely depends on the chosen window size.
In order to improve this approach, we would test the classifier-performance for different
window width, which we haven't done here.

Indeed, we can observe the worst performance on the training dataset for the 1-nearest
neighbor technique. A relatively low error rate for the MLE and Parzen-window kernel
density estimation suggest that the underlying assumption of the distribution for the
parametric approach (MLE) is accurate, and the chosen window width for the Parzen-window
approach is appropriate (however, it doesn't imply that the classifier cannot improved by
choosing a different window width).

1 f) ii. Performance Changes as a function of patterns

ii. Does the performance of a classifier change significantly
depending upon the patterns used in the training set?
Substantiate your answer with an actual experiment on this dataset.

I would assume that the performance of the classifier should not change significantly
for different patterns in large training datasets, because we assume that the samples are i.i.d.
However, for small training dataset sizes, the performance of the clasifier can change
significantly because the estimates become more inaccurate (according to the principle of
convergence for large datasets) as we have seen in exercise 1 e). Reasons for such a significant
changes in the performance - especially for small datasets - can be the the local sparsity
of datapoints so that the MLE will be biased (because of a skew in the distribution)
and the non-parametric techniques cannot accurately predict the local denisties for certain points.

An experiment to explore this hypothesis would be to draw different subsets of the same training
randomly multiple times and compare them against each other with respect to the error rates of the
different classifiers.

Drawing random training subsets again

As in exercise 1 d), we will draw random training subsets from the
training dataset again with the sizes compared to the
original training dataset as follows: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%
of each class.
Then we will compute the MLE and estimates for the non-parametric techniques for
those new subsets and compare the error rates on the test dataset to the error rates from the the old training subsets.



In [6]:

    
# This workaround is necessary to make sure that every subset 
# contains same number of
# samples for each of the three classes

train_subsets2_c1 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 1]
    train_subsets2_c1[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]
    
train_subsets2_c2 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 2]
    train_subsets2_c2[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]
    
train_subsets2_c3 = {}
for p in reversed([i/10 for i in range(1,11)]):
    t = train_set[train_set[:,2] == 3]
    train_subsets2_c3[int(p*100)] =\
        t[np.random.choice(t.shape[0], t.shape[0] * p, replace=False),:]

# combine separate-class subsets for convenience
train_subsets2 = {}
for set_size in sorted(train_subsets_c1.keys()):
    t_set = np.zeros(shape=(1,3))
    for subset in [train_subsets2_c1, train_subsets2_c2, train_subsets2_c3]:        
        t_set = np.append(t_set, subset[set_size], axis=0)
        # delete the first placeholder row used for initializing the array                                  
    t_set = np.delete(t_set, 0, axis=0)
    train_subsets2[set_size] = t_set

for i in sorted(train_subsets2):
    print(i, train_subsets2[i].shape)









    



10 (105, 3)
20 (210, 3)
30 (315, 3)
40 (420, 3)
50 (525, 3)
60 (630, 3)
70 (732, 3)
80 (840, 3)
90 (945, 3)
100 (1050, 3)



In [47]:

    
one_nn_errors2 = []
for i in train_subsets.keys():
    true_class = test_set[:,-1]
    predicted_class = one_nn_classify(train_subsets2[i], test_set[:,0:-1])   
    error = sum([1 for pred,true in zip(predicted_class,true_class)  
             if pred!=true])/len(test_set)
    one_nn_errors2.append(error)



In [48]:

    
mu_estimates2 = {}

for set_size in sorted(train_subsets2):
    mu1_est = np.sum(train_subsets2[set_size]\
                     [train_subsets2[set_size][:,2] == 1], axis=0)/\
        len(train_subsets2[set_size]\
            [train_subsets2[set_size][:,2] == 1])
    mu1_est = mu1_est[0:2].reshape(2,1)
    mu2_est = np.sum(train_subsets2[set_size]\
                     [train_subsets2[set_size][:,2] == 2], axis=0)/\
        len(train_subsets2[set_size]\
            [train_subsets2[set_size][:,2] == 2])
    mu2_est = mu2_est[0:2].reshape(2,1)
    mu3_est = np.sum(train_subsets2[set_size]\
                     [train_subsets2[set_size][:,2] == 3], axis=0)/\
        len(train_subsets2[set_size]\
            [train_subsets2[set_size][:,2] == 3])
    mu3_est = mu3_est[0:2].reshape(2,1)
    
    mu_estimates2[set_size] = [mu1_est, mu2_est, mu3_est]

cov_estimates2 = {}

for set_size in sorted(train_subsets2):    
    cov1_est = mle_est_cov(train_subsets2[set_size]\
        [train_subsets2[set_size][:,2] == 1][:,0:2], 
        mu_estimates2[set_size][0])       
    cov2_est = mle_est_cov(train_subsets2[set_size]\
        [train_subsets2[set_size]\
        [:,2] == 2][:,0:2], mu_estimates2[set_size][1])   
    cov3_est = mle_est_cov(train_subsets2[set_size]\
        [train_subsets2[set_size][:,2] == 3][:,0:2], 
        mu_estimates2[set_size][2])
    
    cov_estimates2[set_size] = [cov1_est, cov2_est, cov3_est]    
    
mle_errors2 = []
for i in range(10,101,10):
    classification_dict, error =\
            empirical_error(test_set, [1,2,3], classify_data, 
            [discriminant_function,
            mu_estimates2[i],
            cov_estimates[i]])
    mle_errors2.append(error)



In [49]:

    
from scipy.stats import kde

kde_errors2 = []
for i in range(10,101,10):
    class1_kde = kde.gaussian_kde(train_subsets2[i]\
                [train_subsets[i][:,2] == 1].T[0:2], bw_method=1.0)
    class2_kde = kde.gaussian_kde(train_subsets2[i]\
                [train_subsets[i][:,2] == 2].T[0:2], bw_method=1.0)
    class3_kde = kde.gaussian_kde(train_subsets2[i]\
                [train_subsets[i][:,2] == 3].T[0:2], bw_method=1.0)
    classification_dict, error = empirical_error(test_set, 
            [1,2,3], bayes_kde_class, [[class1_kde, class2_kde, class3_kde]])
    kde_errors2.append(error)



In [69]:

    
def give_color():
    for i in  ['red', 'blue', 'green']:
        yield i
colors = give_color()

def give_labels():
    for i in ['MLE tr_set1', 'MLE tr_set2',
               'Parzen-wind. tr_set1', 'Parzen-wind. tr_set2',
               '1-nn tr_set1', '1-nn tr_set2']:
        yield i
        
colors = give_color()
labels = give_labels()

f, ax = plt.subplots(figsize=(10, 8))
for err1, err2 in zip(
        [mle_errors, kde_errors, one_nn_errors], 
        [mle_errors2, kde_errors2, one_nn_errors2]):
    col = next(colors)
    plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)], 
             err1, color=col, lw=2, label=next(labels))
    plt.plot([len(train_set)*i for i in np.arange(0.1,1.1,0.1)], 
             err2, color=col, linestyle='--', lw=2, label=next(labels))

plt.xlabel('training set size')
plt.ylabel('Error rate on test data set')
plt.title('Error rate for different techniques')
ylim([0,0.12])
plt.legend(loc='lower right')
plt.show()

Significant changes in the performance of the classifier can be observed throughout
the different sample sizes. As mentioned, this most likely occurs for small
training dataet sizes, here: n=300 (30% of the original training dataset size).

1 f) iii. Performance changes and the number of data patterns

It is expected that the performance for all 3 techniques
(MLE, Parzen-window kernel density estimation, 1-nearest neighbor) increase,
since all underly the assumption of convergence.

$\lim\limits_{n \rightarrow \infty}{p_n(\pmb x)} = p(\pmb x)$

However, we don't see this effect so pronounced for our data set.
The error rate for all 3 approaches seems to be fluctuating around an average value.
Probably a wider range of different sample sizes would be necessary
to make this effect more obvious.

1 f) iv. Number of training patterns per class

iv. Do you think it is necessary for the number of training patterns per class to be the same?

It is not necessary. However, if the prior probabilities are not equal,
we would have to normalize the weights for the observation for every class
in the Parzen-window technique and the 1-nearest neighbor technique.
For the MLE, we wouldn't be able to drop the prior probabilities from Bayes' rule,
which would lead to different discriminant functions.

2. Thoughts on a Ray Kurzweil quote

[10 points] In a New York Times article from 2003
(see http://writing.upenn.edu/~afilreis/88v/kurzweil.html), Ray Kurzweil
is quoted as saying “The real power of human thinking is
based on recognizing patterns. The better computers get at pattern recognition,
the more humanlike they will become.” What are your thoughts on this statement?

To assess this statement, one needs to know what exactly Ray Kurzweil
defines as "human like", however, if he only refers to the ability to recognize
patterns, I would agree that computers will definitely become more "human-like"
in terms of making "appropriate" decisions (or performing an specific action)
for a particular situation.
Limitations of pattern recognition for computers are the amount of available data
and the complexity of the data. The more powerful computers become, the more data
can be processed (given a infinite number of possible training samples) for learning
to distinguish more patterns. Also, more complex input data can be processed, and more
complex features can be incorporated into the classifier design.

However, computers are still limited to distinguish patterns based on
numerical data - input data has to be processed into numbers. This may not be
applicable for all sorts of patterns. For example, for emotions on a human face
it may be challenging to break them down into numerical data.
Also, humans have the ability to make good decisions based on intuition,
where "data" or "experience" may not be available (i.e., for novel patterns
that are encountered for the first time). Since computers have to be "pre-computed"
and rely on data for the learning task, the recognition and choice of an appropriate
pattern might be very challenging (even if clever scientists develop recursive ways
so that a computer may learn novel patterns, it still requires previous data).

Overall, computers are improving in pattern recognition tasks through recursive
self-learning algorithms, however the complexity of the humanlike behavior includes
more aspects than just the recognition of patterns. So, I agree that computer may become
"more" humanlike, but without becoming "entirely" human-like.



In [ ]: