Save this file as studentid1_studentid2_lab#.ipynb

(Your student-id is the number shown on your student card.)

E.g. if you work with 3 people, the notebook should be named: 12301230_3434343_1238938934_lab1.ipynb.

This will be parsed by a regexp, so please double check your filename.

Before you turn this problem in, please make sure everything runs correctly. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your names and email addresses below.


In [1]:
NAME = "Michelle Appel"
NAME2 = "Verna Dankers"
NAME3 = "Yves van Montfort"
EMAIL = "michelle.appel@student.uva.nl"
EMAIL2 = "verna.dankers@student.uva.nl"
EMAIL3 = "yves.vanmontfort@student.uva.nl"

Lab 3: Gaussian Processes and Support Vector Machines

Machine Learning 1, September 2017

Notes on implementation:

  • You should write your code and answers in this IPython Notebook: http://ipython.org/notebook.html. If you have problems, please contact your teaching assistant.
  • Please write your answers right below the questions.
  • Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.
  • Refer to last week's lab notes, i.e. http://docs.scipy.org/doc/, if you are unsure about what function to use. There are different correct ways to implement each problem!
  • Use the provided test cells to check whether your answers are correct

In [2]:
%pylab inline
plt.rcParams["figure.figsize"] = [20,10]


Populating the interactive namespace from numpy and matplotlib

Part 1: Gaussian Processes

For Part 1 we will refer to Bishop sections 6.4.2 and 6.4.3. You may also want to refer to Rasmussen's Gaussian Process text, which is available online at http://www.gaussianprocess.org/gpml/chapters/, and especially to the project found at https://www.automaticstatistician.com/index/ by Ghahramani for some intuition about GPs. To understand Gaussian processes, it is highly recommended to understand how marginal, partitioned Gaussian distributions can be converted into conditional Gaussian distributions. This is covered in Bishop 2.3 and summarized in Eqns 2.94-2.98. $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\ba}{\mathbf{a}}$
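As a quick reminder of the conditional Gaussian identity summarized in Bishop 2.3: if a joint Gaussian over $(\bx_a, \bx_b)$ has mean $(\mu_a, \mu_b)$ and covariance blocks $\Sigma_{aa}$, $\Sigma_{ab}$, $\Sigma_{bb}$, then

\begin{align} p(\bx_a | \bx_b) = \mathcal{N}\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(\bx_b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right). \end{align}

This is exactly the manipulation used later on to turn the joint distribution over training and test targets into a predictive distribution.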

Periodic Data

We will use the same data generating function that we used previously for regression.


In [3]:
def true_mean_function(x):
    return np.cos(2*pi*(x+1))

def add_noise(y, sigma):
    return y + sigma*np.random.randn(len(y))

def generate_t(x, sigma):
    return add_noise(true_mean_function(x), sigma)

In [4]:
sigma = 0.2
beta  = 1.0 / pow(sigma, 2)
N_test = 100

x_test = np.linspace(-1, 1, N_test) 
mu_test = np.zeros(N_test)
y_test = true_mean_function(x_test)
t_test = add_noise(y_test, sigma)

plt.plot( x_test, y_test, 'b-', lw=2)
plt.plot( x_test, t_test, 'go')
plt.show()


1. Sampling from the Gaussian process prior (30 points)

We will implement Gaussian process regression using the kernel function in Bishop Eqn. 6.63.

1.1 k_n_m( xn, xm, thetas ) (5 points)

To start, implement the function k_n_m(xn, xm, thetas) that takes scalars $x_n$ and $x_m$ and a vector of $4$ thetas, and computes the kernel function of Bishop Eqn. 6.63 (5 points). NB: usually the kernel function takes $D$ by $1$ vectors, but since we are working with a univariate problem, this makes things easier.
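For reference, the kernel of Bishop Eqn. 6.63 is

\begin{align} k(x_n, x_m) = \theta_0 \exp\left(-\frac{\theta_1}{2}\lVert x_n - x_m\rVert^2\right) + \theta_2 + \theta_3\, x_n^T x_m, \end{align}

which for our scalar inputs reduces to $(x_n - x_m)^2$ inside the exponential and the ordinary product $x_n x_m$ in the last term.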


In [5]:
def k_n_m(xn, xm, thetas):
    theta0, theta1, theta2, theta3 = thetas # Unpack thetas
    
    if xn == xm:
        # The exponential term equals theta0 when xn == xm, so we can skip computing exp(0)
        k = theta0 + theta2 + theta3*xn*xm
    else:
        k = theta0 * np.exp(-(theta1/2)*(xn-xm)**2) + theta2 + theta3*xn*xm
    
    return k

In [ ]:

1.2 computeK( X1, X2, thetas ) (10 points)

Eqn 6.60 gives the marginal distribution over the mean outputs of $N$ data vectors: $p(\mathbf{y}) = \mathcal{N}(0, \mathbf{K})$. Notice that the expected mean function is $0$ at all locations, and that the covariance is an $N$ by $N$ kernel matrix $\mathbf{K}$. Write a function computeK(x1, x2, thetas) that computes the kernel matrix. Use k_n_m as part of an inner loop (of course, there are more efficient ways of computing the kernel matrix that make better use of vectorization, but that is not necessary) (10 points).


In [6]:
def computeK(x1, x2, thetas):

    K = np.zeros(shape=(len(x1), len(x2))) # Create empty kernel matrix
    
    for row, xn in enumerate(x1): # Iterate over x1
        for column, xm in enumerate(x2): # Iterate over x2
            K[row, column] = k_n_m(xn, xm, thetas) # Fill in each kernel evaluation
        
    return K

x1 = [0, 0, 1]
x2 = [0, 0, 1]
thetas = [1, 2, 3, 1]
K = computeK(x1, x2, thetas)

In [7]:
### Test your function
x1 = [0, 1, 2]
x2 = [1, 2, 3, 4]
thetas = [1, 2, 3, 4]
K = computeK(x1, x2, thetas)

assert K.shape == (len(x1), len(x2)), "the shape of K is incorrect"

1.3 Plot function samples (15 points)

Now sample mean functions at the x_test locations for the theta values in Bishop Figure 6.5. Make a figure with a 2 by 3 subplot grid and make sure each subplot title reflects the theta values (make sure everything is legible). In other words, sample $\by_i \sim \mathcal{N}(0, \mathbf{K}_{\theta})$. Make use of numpy.random.multivariate_normal(). On your plots include the expected value of $\by$ with a dashed line and use fill_between to shade 2 standard deviations of the uncertainty due to $\mathbf{K}$ (the diagonal of $\mathbf{K}$ is the variance of the model uncertainty) (15 points).


In [8]:
import matplotlib.pyplot as plt

# The thetas
thetas0 = [1, 4, 0, 0]
thetas1 = [9, 4, 0, 0]
thetas2 = [1, 64, 0, 0]
thetas3 = [1, 0.25, 0, 0]
thetas4 = [1, 4, 10, 0]
thetas5 = [1, 4, 0, 5]

f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3) # Subplot setup

all_thetas = [thetas0, thetas1, thetas2, thetas3, thetas4, thetas5] # List of all thetas
all_plots = [ax1, ax2, ax3, ax4, ax5, ax6] # List of all plots

n = 5 # Number of samples per subplot

for subplot_, theta_ in zip(all_plots, all_thetas): # Iterate over all subplots and thetas
    K = computeK(x_test, x_test, theta_) # Compute K
    
    # Fix numerical issues: eigenvalues that should be 0 can come out slightly negative (~ -1e-15)
    min_eig = np.min(np.real(np.linalg.eigvals(K)))
    if min_eig < 0:
        K -= 10*min_eig * np.eye(*K.shape)

    mean = np.zeros(len(K)) # Zero mean of the GP prior
    
    # Draw n random samples y ~ N(0, K)
    samples = [np.random.multivariate_normal(mean, K) for i in range(n)]
    
    # Under the prior, the expected value of y is zero; the model variance is the diagonal of K
    expected_y = np.zeros(len(K))
    uncertainties = np.sqrt(K.diagonal())

    for sample in samples:
        subplot_.plot(sample) 

    x = np.arange(0, N_test) # Indices of the test locations

    # Plot +/- 2 standard deviations of the model uncertainty around the expected value
    subplot_.fill_between(
        x, expected_y - 2 * uncertainties, expected_y + 2 * uncertainties, 
        alpha=0.3, color='pink'
    )
    subplot_.plot(expected_y, 'g--') # Plot the expected value of y
    subplot_.set_title(theta_) # Show the theta values in the title

plt.show()


2. Predictive distribution (35 points)

So far we have sampled mean functions from the prior. We can draw actual data $\bt$ in two ways. The first way is generatively: first sample $\by | \mathbf{K}$, then sample $\bt | \by, \beta$ (Eqn 6.60 followed by Eqn 6.59). The second way is to integrate over $\by$ (the mean draw) and directly sample $\bt | \mathbf{K}, \beta$ using Eqn 6.61. This is the generative process for $\bt$. Note that we have not specified a distribution over the inputs $\bx$; this is because Gaussian processes are conditional models. Because of this we are free to generate input locations $\bx$ when playing around with the GP; obviously a real dataset will give us input-output pairs.

Once we have data, we are interested in the predictive distribution (note: the prior is the predictive distribution when there is no data). Consider the joint distribution for $N+1$ targets, given by Eqn 6.64. Its covariance matrix is composed of block components $C_N$, $\mathbf{k}$, and $c$. The covariance matrix $C_N$ for $\bt_N$ is $C_N = \mathbf{K}_N + \beta^{-1}\mathbf{I}_N$. We have just made explicit the size $N$ of the matrix; $N$ is the number of training points. The kernel vector $\mathbf{k}$ is a $N$ by $1$ vector of kernel function evaluations between the training input data and the test input vector. The scalar $c$ is a kernel evaluation at the test input.
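A minimal sketch of the two sampling routes described above, assuming the computeK function from Part 1 and the noise precision beta defined earlier; the _demo names are illustrative only:

In [ ]:
# Sketch: two equivalent ways to draw targets t at a handful of input locations
x_demo = np.linspace(-1, 1, 5)      # illustrative input locations
thetas_demo = [1.0, 4.0, 0.0, 0.0]  # illustrative kernel parameters
K_demo = computeK(x_demo, x_demo, thetas_demo)

# Route 1: sample y | K (Eqn 6.60), then add noise to get t | y, beta (Eqn 6.59)
y_demo = np.random.multivariate_normal(np.zeros(len(x_demo)), K_demo)
t_generative = y_demo + np.random.randn(len(x_demo)) / np.sqrt(beta)

# Route 2: integrate y out and sample t | K, beta directly from N(0, C), with C = K + (1/beta) I (Eqn 6.61)
C_demo = K_demo + np.eye(len(x_demo)) / beta
t_direct = np.random.multivariate_normal(np.zeros(len(x_demo)), C_demo)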

2.1 gp_predictive_distribution(...) (10 points)

Write a function gp_predictive_distribution(x_train, t_train, x_test, theta, beta, C=None) that computes Eqns 6.66 and 6.67, except allow for an arbitrary number of test points (not just one); the kernel matrix is now computed from the training data. By having C as an optional parameter, we can avoid computing it more than once (for this problem it is unimportant, but for real problems this is an issue). The function should compute $\mathbf{C}$ and $\mathbf{k}$, and return the mean, variance and $\mathbf{C}$. Do not forget: the computeK function computes $\mathbf{K}$, not $\mathbf{C}$! (10 points)
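For reference, for a single test input Bishop Eqns 6.66 and 6.67 read

\begin{align} m(\bx_{N+1}) &= \mathbf{k}^T \mathbf{C}_N^{-1} \bt,\\ \sigma^2(\bx_{N+1}) &= c - \mathbf{k}^T \mathbf{C}_N^{-1} \mathbf{k}. \end{align}

With $M$ test points, $\mathbf{k}$ becomes an $N$ by $M$ matrix of train-test kernel evaluations and $c$ an $M$ by $M$ matrix, so the function returns a full predictive covariance matrix over the test points.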


In [9]:
def computeC(x1, x2, theta, beta):
    K = computeK(x1, x2, theta)
    return K + np.eye(len(x1)) / beta # C = K + beta^{-1} I

def gp_predictive_distribution(x_train, t_train, x_test, theta, beta, C=None):
    # Calculate C for the training data, or reuse it if provided
    if C is None:
        C = computeC(x_train, x_train, theta, beta)
    
    # Predictive mean and covariance (Bishop Eqns 6.66 and 6.67)
    c = computeC(x_test, x_test, theta, beta) # test-test covariance, including target noise
    K = computeK(x_train, x_test, theta)      # train-test kernel evaluations
    KC = np.matmul(np.linalg.inv(C), K)       # C^{-1} k for every test point
    mean_test = np.dot(t_train, KC)           # k^T C^{-1} t
    var_test = c - np.matmul(KC.T, K)         # c - k^T C^{-1} k
    return mean_test.squeeze(), var_test.squeeze(), C

In [10]:
### Test your function
N = 2
train_x = np.linspace(-1, 1, N)
train_t = 2*train_x
test_N = 3
test_x = np.linspace(-1, 1, test_N) 
theta = [1, 2, 3, 4]
beta = 25
test_mean, test_var, C = gp_predictive_distribution(train_x, train_t, test_x, theta, beta, C=None)

assert test_mean.shape == (test_N,), "the shape of mean is incorrect"
assert test_var.shape == (test_N, test_N), "the shape of var is incorrect"
assert C.shape == (N, N), "the shape of C is incorrect"

C_in = np.array([[0.804, -0.098168436], [-0.098168436, 0.804]])
_, _, C_out = gp_predictive_distribution(train_x, train_t, test_x, theta, beta, C=C_in)

assert np.allclose(C_in, C_out), "C is not reused!"

2.2 gp_log_likelihood(...) (10 points)

To learn the hyperparameters, we would need to compute the log-likelihood of the training data. Implicitly, this is conditioned on the value setting for $\mathbf{\theta}$. Write a function gp_log_likelihood(x_train, t_train, theta, C=None, invC=None, beta=None), where C and invC can be stored and reused. It should return the log-likelihood, C and invC (10 points).
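For reference, the log-likelihood of the training targets (Bishop Eqn 6.69) is

\begin{align} \ln p(\bt | \mathbf{\theta}) = -\frac{1}{2}\ln |\mathbf{C}_N| - \frac{1}{2}\bt^T \mathbf{C}_N^{-1}\bt - \frac{N}{2}\ln(2\pi). \end{align}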


In [11]:
def gp_log_likelihood(x_train, t_train, theta, beta, C=None, invC=None):
    # Calculate C and its inverse, or reuse them if provided
    if C is None:
        C = computeC(x_train, x_train, theta, beta)
    if invC is None:
        invC = np.linalg.inv(C)
    
    t_train = np.asarray(t_train)
    
    # Data log-likelihood as given in Bishop Eqn 6.69 (page 311)
    lp = -0.5 * np.log(np.linalg.det(C)) \
         - 0.5 * np.dot(t_train, np.dot(invC, t_train)) \
         - len(x_train) / 2 * np.log(2*np.pi)
    return lp, C, invC

In [12]:
### Test your function
N = 2
train_x = np.linspace(-1, 1, N)
train_t = 2 * train_x
theta = [1, 2, 3, 4]
beta = 25
lp, C, invC = gp_log_likelihood(train_x, train_t, theta, beta, C=None, invC=None)
    
assert lp < 0, "the log-likelihood should be smaller than 0"
assert C.shape == (N, N), "the shape of C is incorrect"
assert invC.shape == (N, N), "the shape of invC is incorrect"

C_in = np.array([[0.804, -0.098168436], [-0.098168436, 0.804]])
_, C_out, _ = gp_log_likelihood(train_x, train_t, theta, beta, C=C_in, invC=None)

assert np.allclose(C_in, C_out), "C is not reused!"

invC_in = np.array([[1.26260453, 0.15416407], [0.15416407, 1.26260453]])
_, _, invC_out = gp_log_likelihood(train_x, train_t, theta, beta, C=None, invC=invC_in)

assert np.allclose(invC_in, invC_out), "invC is not reused!"

2.3 Plotting (10 points)

Repeat the 6 plots above, but this time conditioned on the training points. Use the periodic data generator to create 2 training points where x is sampled uniformly between $-1$ and $1$. For these plots, feel free to use the provided function "gp_plot". Make sure you put the parameters in the title and this time also the log-likelihood. Try to understand the two types of uncertainty! If you do not use gp_plot(...), please add a fill between for the model and target noise. (10 points)


In [13]:
def gp_plot( x_test, y_test, mean_test, var_test, x_train, t_train, theta, beta ):
    # x_test: 
    # y_test:    the true function at x_test
    # mean_test: predictive mean at x_test
    # var_test:  predictive covariance at x_test 
    # t_train:   the training values
    # theta:     the kernel parameters
    # beta:      the precision (known)
    
    # the reason for the manipulation is to allow plots separating model and data stddevs.
    std_total = np.sqrt(np.diag(var_test))       # includes all uncertainty, model and target noise 
    std_model = np.sqrt(std_total**2 - 1.0/beta) # remove data noise to get model uncertainty in stddev
    std_combo = std_model + np.sqrt(1.0/beta)    # add stddev (note: not the same as full)
    
    plt.plot(x_test, y_test, 'b', lw=3)
    plt.plot(x_test, mean_test, 'k--', lw=2)
    plt.fill_between(x_test, mean_test+2*std_combo,mean_test-2*std_combo, color='k', alpha=0.25)
    plt.fill_between(x_test, mean_test+2*std_model,mean_test-2*std_model, color='r', alpha=0.25)
    plt.plot(x_train, t_train, 'ro', ms=10)

In [14]:
np.random.seed(70121327)

# Number of data points
n = 2

def plot_conditioned_on_training(n):
    # Use the periodic data generator to create n training points, with x sampled uniformly in [-1, 1]
    x_train = np.random.uniform(low=-1.0, high=1.0, size=n)
    t_train = generate_t(x_train, sigma)

    # 100 data points for testing
    x_test = np.linspace(-1, 1, 100)
    y_test = true_mean_function(x_test)

    # Iterate over all plots and thetas
    for i, theta in enumerate(all_thetas):
        plt.subplot(2, 3, i+1)
        mean, var, C = gp_predictive_distribution(x_train, t_train, x_test, theta, beta)
        lp, C, invC = gp_log_likelihood(x_train, t_train, theta, beta, C)

        # Put theta info and log likelihood in title
        plt.title("thetas : {}, lp : {}".format(theta, lp))
        gp_plot( x_test, y_test, mean, var, x_train, t_train, theta, beta)

    plt.show()

plot_conditioned_on_training(n)


2.4 More plotting (5 points)

Repeat the 6 plots above, but this time conditioned on a new set of 10 training points. (5 points)


In [15]:
# Number of data points
n = 10

plot_conditioned_on_training(n)


Part 2: Support Vector Machines (45 points)

As seen in Part 1: Gaussian Processes, one of the significant limitations of many such algorithms is that the kernel function $k(\bx_n , \bx_m)$ must be evaluated for all possible pairs $\bx_n$ and $\bx_m$ of training points, which can be computationally infeasible during training and can lead to excessive computation times when making predictions for new data points. In Part 2: Support Vector Machines, we shall look at kernel-based algorithms that have sparse solutions, so that predictions for new inputs depend only on the kernel function evaluated at a subset of the training data points.

2.1 Generating a linearly separable dataset (15 points)

a) (5 points) First of all, we are going to create our own 2D toy dataset $X$. The dataset will consist of two i.i.d. subsets $X_1$ and $X_2$; each subset will be sampled from a multivariate Gaussian distribution,

\begin{align} X_1 \sim &\mathcal{N}(\mu_1, \Sigma_1)\\ &\text{ and }\\ X_2 \sim &\mathcal{N}(\mu_2, \Sigma_2). \end{align}

In the following, $X_1$ will have $N_1=20$ samples and a mean $\mu_1=(1,1)$. $X_2$ will have $N_2=30$ samples and a mean $\mu_2=(3,3)$.

Plot the two subsets in one figure and choose two colors to indicate which sample belongs to which subset. In addition, you should choose $\Sigma_1$ and $\Sigma_2$ such that the two subsets become linearly separable. (Hint: which form does the covariance matrix of an i.i.d. dataset have?)


In [16]:
np.random.seed(1)
plt.rcParams["figure.figsize"] = [10,10]

# The covariance should be diagonal (independent) with equal entries (identically distributed), i.e. a*I.
def create_X(mean, sig, N):
    return np.random.multivariate_normal(mean, sig * np.identity(2), N)

m1 = [1, 1]; m2 = [3, 3]
s1 = 1/2; s2 = 1/2
N1 = 20; N2 = 30

X1 = create_X(m1, s1, N1)
X2 = create_X(m2, s2, N2)

plt.figure()
plt.axis('equal')
plt.scatter(X1[:, 0], X1[:, 1], c='b', marker='o')
plt.scatter(X2[:, 0], X2[:, 1], c='r', marker='o')
plt.show()


b) (10 points) In the next step we will combine the two datasets X_1, X_2 and generate a vector t containing the labels. Write a function create_X_and_t(X1, X2) that returns the combined data set X and the corresponding target vector t.


In [17]:
def create_X_and_t(X1, X2):
    X1_len = X1.shape[0]
    X2_len = X2.shape[0]
    
    X = np.vstack((X1, X2))
    t = np.hstack((-np.ones(X1_len), np.ones(X2_len)))
    
    # Shuffle the combined data so the two classes are not ordered
    indices = np.arange(X1_len + X2_len)
    np.random.shuffle(indices)
    
    return X[indices], t[indices]

In [18]:
### Test your function
dim = 2
N1_test = 2
N2_test = 3
X1_test = np.arange(4).reshape((N1_test, dim))
X2_test = np.arange(6).reshape((N2_test, dim))
X_test, t_test = create_X_and_t(X1_test, X2_test)

assert X_test.shape == (N1_test + N2_test, dim), "the shape of X is incorrect"
assert t_test.shape == (N1_test + N2_test,), "the shape of t is incorrect"

2.2 Finding the support vectors (15 points)

Finally, we are going to use an SVM to obtain the decision boundary for which the margin is maximized. We have to solve the optimization problem

\begin{align} \arg \min_{\bw, b} \frac{1}{2} \lVert \bw \rVert^2, \end{align}

subject to the constraints

\begin{align} t_n(\bw^T \phi(\bx_n) + b) \geq 1, n = 1,...,N. \end{align}

In order to solve this constrained optimization problem, we introduce Lagrange multipliers $a_n \geq 0$. We obtain the dual representation of the maximum margin problem in which we maximize

\begin{align} \sum_{n=1}^N a_n - \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N a_n a_m t_n t_m k(\bx_n, \bx_m), \end{align}

with respect to $\ba$, subject to the constraints

\begin{align} a_n &\geq 0, n=1,...,N,\\ \sum_{n=1}^N a_n t_n &= 0. \end{align}

This takes the form of a quadratic programming problem in which we optimize a quadratic function of $\ba$ subject to a set of inequality constraints.

a) (5 points) In this example we will use a linear kernel $k(\bx, \bx') = \bx^T\bx'$. Write a function computeK(X) that computes the kernel matrix $K$ for the 2D dataset $X$.


In [19]:
def computeK(X):
    K = np.dot(X, X.T).astype('float')
    return K

In [20]:
dim = 2
N_test = 3
X_test = np.arange(6).reshape((N_test, dim))
K_test = computeK(X_test)

assert K_test.shape == (N_test, N_test)

Next, we will rewrite the dual representation so that we can make use of computationally efficient vector-matrix multiplication. The objective becomes

\begin{align} \min_{\ba} \frac{1}{2} \ba^T K' \ba - 1^T\ba, \end{align}

subject to the constraints

\begin{align} a_n &\geq 0, n=1,...,N,\\ \bt^T\ba &= 0. \end{align}

Where \begin{align} K'_{nm} = t_n t_m k(\bx_n, \bx_m), \end{align} and in the special case of a linear kernel function, \begin{align} K'_{nm} = t_n t_m k(\bx_n, \bx_m) = k(t_n \bx_n, t_m \bx_m). \end{align}

To solve the quadratic programming problem we will use a python module called cvxopt. You first have to install the module in your virtual environment (you have to activate it first), using the following command:

conda install -c conda-forge cvxopt

The quadratic programming solver can be called as

cvxopt.solvers.qp(P, q[, G, h[, A, b[, solver[, initvals]]]])

This solves the following problem,

\begin{align} \min_{\bx} \frac{1}{2} \bx^T P \bx + q^T\bx, \end{align}

subject to the constraints,

\begin{align} G\bx &\leq h,\\ A\bx &= b. \end{align}

All we need to do is to map our formulation to the cvxopt interface.
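Concretely, one mapping that matches the implementation below is

\begin{align} P = K', \quad q = -\mathbf{1}, \quad G = -I_N, \quad h = \mathbf{0}, \quad A = \bt^T, \quad b = 0, \end{align}

so that $G\ba \leq h$ encodes $a_n \geq 0$ and $A\ba = b$ encodes $\bt^T\ba = 0$.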

b) (10 points) Write a function compute_multipliers(X, t) that solves the quadratic programming problem using the cvxopt module and returns the Lagrange multipliers for every sample in the dataset.


In [21]:
import cvxopt

def compute_multipliers(X, t):
    # K'_{nm} = t_n t_m k(x_n, x_m) = k(t_n x_n, t_m x_m) for the linear kernel
    K = computeK(np.dot(np.diag(t), X))
    
    # Map the dual problem onto the cvxopt interface
    P = cvxopt.matrix(K)                                        # quadratic term (1/2) a^T K' a
    q = cvxopt.matrix(-np.ones_like(t, dtype='float'))          # linear term -1^T a
    G = cvxopt.matrix(np.diag(-np.ones_like(t, dtype='float'))) # -a_n <= 0, i.e. a_n >= 0
    h = cvxopt.matrix(np.zeros_like(t, dtype='float'))
    A = cvxopt.matrix(t).T                                      # equality constraint t^T a = 0
    b = cvxopt.matrix(0.0)
    
    sol = cvxopt.solvers.qp(P, q, G, h, A, b)
    a = np.array(sol['x'])
    return a

In [22]:
### Test your function
dim = 2
N_test = 3
X_test = np.arange(6).reshape((N_test, dim))
t_test = np.array([-1., 1., 1.])
a_test = compute_multipliers(X_test, t_test)

assert a_test.shape == (N_test, 1)


     pcost       dcost       gap    pres   dres
 0: -7.2895e-01 -1.3626e+00  6e+00  2e+00  2e+00
 1: -3.0230e-01 -6.8816e-01  8e-01  1e-01  1e-01
 2: -2.3865e-01 -3.3686e-01  1e-01  2e-16  4e-15
 3: -2.4973e-01 -2.5198e-01  2e-03  6e-17  3e-16
 4: -2.5000e-01 -2.5002e-01  2e-05  6e-17  2e-16
 5: -2.5000e-01 -2.5000e-01  2e-07  3e-18  2e-16
Optimal solution found.

2.3 Plot support vectors (5 points)

Now that we have obtained the Lagrange multipliers $\ba$, we use them to find our support vectors. Repeat the plot from 2.1, but this time use a third color to indicate which samples are the support vectors.


In [23]:
np.random.seed(420)

X, t = create_X_and_t(X1, X2)
a_opt = compute_multipliers(X, t)

# Support vectors have (numerically) non-zero Lagrange multipliers
sv_ind = np.nonzero(a_opt[:, 0] > 1e-6)

X_sv = X[sv_ind]
t_sv = t[sv_ind]
a_sv = a_opt[sv_ind]

plt.figure()
plt.axis('equal')
plt.scatter(X1[:, 0], X1[:, 1], c='b', marker='o')
plt.scatter(X2[:, 0], X2[:, 1], c='r', marker='o')
plt.scatter(X_sv[:, 0], X_sv[:, 1], s=200, facecolors='none', edgecolors='lime', linewidth=3)
plt.show()


     pcost       dcost       gap    pres   dres
 0: -5.4914e+00 -1.0778e+01  1e+02  1e+01  2e+00
 1: -9.4837e+00 -7.9788e+00  5e+01  4e+00  7e-01
 2: -2.0082e+01 -9.3354e+00  2e+01  2e+00  3e-01
 3: -3.0687e+00 -4.4361e+00  2e+00  2e-02  4e-03
 4: -3.8320e+00 -3.8662e+00  5e-02  5e-04  9e-05
 5: -3.8596e+00 -3.8599e+00  5e-04  5e-06  9e-07
 6: -3.8598e+00 -3.8598e+00  5e-06  5e-08  9e-09
 7: -3.8598e+00 -3.8598e+00  5e-08  5e-10  9e-11
Optimal solution found.

2.4 Plot the decision boundary (10 points)

The decision boundary is fully specified by a (usually very small) subset of training samples, the support vectors. Make use of

\begin{align} \bw &= \sum_{n=1}^N a_n t_n \mathbf{\phi}(\bx_n)\\ b &= \frac{1}{N_S}\sum_{n \in S} (t_n - \sum_{m \in S} a_m t_m k(\bx_n, \bx_m)), \end{align}

where $S$ denotes the set of indices of the support vectors, to calculate the slope and intercept of the decision boundary. Generate a last plot that contains the two subsets, support vectors and decision boundary.
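Since we use a linear kernel, $\phi(\bx) = \bx$ and the decision boundary is the line $\bw^T\bx + b = 0$. Solving for the second coordinate gives

\begin{align} x_2 = -\frac{w_1 x_1 + b}{w_2}, \end{align}

which the code below evaluates at the two endpoints of x_lim to draw the boundary.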


In [24]:
w_opt = np.squeeze(np.dot(a_opt.T, np.dot(np.diag(t), X))) # w = sum_n a_n t_n x_n
K_sv = computeK(X_sv)
N_sv = len(sv_ind[0]) # Number of support vectors

# b = (1/N_S) * sum_{n in S} (t_n - sum_{m in S} a_m t_m k(x_n, x_m))
atk_sv = np.dot(a_sv.T * t_sv, K_sv)
b = np.sum(t_sv - atk_sv) / N_sv

# Decision boundary w^T x + b = 0, i.e. x2 = -(w1*x1 + b) / w2
x_lim = np.array([1, 4])
y_lim = (-w_opt[0] * x_lim - b) / w_opt[1]

plt.figure()
plt.axis('equal')
plt.scatter(X1[:, 0], X1[:, 1], c='b', marker='o')
plt.scatter(X2[:, 0], X2[:, 1], c='r', marker='o')
plt.scatter(X_sv[:, 0], X_sv[:, 1], s=200, facecolors='none', edgecolors='lime', linewidth=3)
plt.plot(x_lim, y_lim, c='black')
plt.show()



In [ ]: