Lab 1: Linear Regression and Overfitting

Machine Learning and Pattern Recognition, September 2015

  • The lab exercises should be made in groups of two or three people.
  • The deadline is Sunday, September 20, 23:59.
  • The assignment should be sent to Philip Versteeg (p.j.j.p.versteeg@uva.nl). The subject line of your email should be "lab#_lastname1_lastname2_lastname3".
  • Put your and your teammates' names in the body of the email.
  • Attach the .IPYNB (IPython Notebook) file containing your code and answers. Naming of the file follows the same rule as the subject line. For example, if the subject line is "lab01_Kingma_Hu", the attached file should be "lab01_Kingma_Hu.ipynb". Only use underscores ("_") to connect names, otherwise the files cannot be parsed.
  • Make sure we can run your notebook / scripts!

Notes on implementation:

  • You should write your code and answers in this IPython Notebook (see http://ipython.org/notebook.html). If you have problems, please contact me.
  • Please write your answers right below the questions.
  • Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.
  • Refer to last week's lab notes (e.g. http://docs.scipy.org/doc/) if you are unsure which function to use. There are different correct ways to implement each problem!
  • For this lab, your regression solutions should be in closed form, i.e., should not perform iterative gradient-based optimization but find the exact optimum directly.

$\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\bI}{\mathbf{I}}$

Part 1: Polynomial Regression

1.1. Generate sinusoidal data (5 points)

Write a method gen_sinusoidal(N) that generates toy data like that in fig 1.2 of the MLPR book. The method should have a parameter $N$, and should return $N$-dimensional vectors $\bx$ and $\bt$, where $\bx$ contains evenly spaced values from 0 up to and including $2\pi$, and the elements $t_i$ of $\bt$ are distributed according to:

$$t_i \sim \mathcal{N}(\mu_i, \sigma^2)$$

where $x_i$ is the $i$-th element of $\bx$, the mean is $\mu_i = \sin(x_i)$, and the standard deviation is $\sigma = 0.2$.


In [7]:
%matplotlib inline
import numpy as np
import math
import matplotlib.pyplot as plt

In [5]:
def gen_sinusoidal(N):
    # N evenly spaced x values from 0 up to and including 2*pi
    xs = tuple(i * 2 * math.pi / (N - 1) for i in range(N))
    # targets: sin(x) plus zero-mean Gaussian noise with standard deviation 0.2
    ts = tuple(np.random.normal(math.sin(x), 0.2) for x in xs)
    return xs, ts

# Test
x, t = gen_sinusoidal(5)
print(x)
print(t)


(0.0, 1.5707963267948966, 3.141592653589793, 4.71238898038469, 6.283185307179586)
(0.10869911628181522, 1.1917532332739875, -0.080726098526626627, -0.88243693073060059, -0.17455650453549201)

1.2 Polynomial regression (15 points)

Write a method fit_polynomial(x, t, M) that finds the maximum-likelihood solution of an unregularized $M$-th order polynomial for a dataset (x, t). The error function to minimize w.r.t. $\bw$ is:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt)$

where $\bPhi$ is the feature matrix (or design matrix), as explained in section 3.1.1 of the MLPR book, and $\bt$ is the vector of target values. Your method should return a vector $\bw$ with the maximum-likelihood parameter estimates.
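
For reference, setting the gradient of $E(\bw)$ to zero yields the normal equations, whose closed-form solution is what the implementation below computes:

$$\nabla_{\bw} E(\bw) = \bPhi^T(\bPhi\bw - \bt) = 0 \quad\Longrightarrow\quad \bw_{\mathrm{ML}} = \left(\bPhi^T\bPhi\right)^{-1}\bPhi^T\bt$$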


In [8]:
# Design matrix: row n contains the monomial features [1, x_n, x_n^2, ..., x_n^M]
get_phi = lambda X, M: np.array([[x ** i for i in range(M + 1)] for x in X])

def fit_polynomial(X, t, M):
    # Maximum-likelihood weights via the normal equations: w = (Phi^T Phi)^-1 Phi^T t
    Phi = get_phi(X, M)
    PhiT = Phi.transpose()
    w = np.linalg.inv(PhiT.dot(Phi)).dot(PhiT).dot(t)
    return w

# Test (note: 3 points with M=4 is under-determined, so this is only a smoke test)
print(fit_polynomial(np.array([1, 2, 3]), np.array([4, 5, 6]), 4))


[  8.35976562e+01  -5.01484375e+01   8.27343750e+00   4.68750000e-02
  -1.81884766e-01]

1.3 Plot (5 points)

Sample a dataset with $N=9$, and fit four polynomials with $M \in (0, 1, 3, 9)$. For each value of $M$, plot the prediction function, along with the data and the original sine function. The resulting figure should look similar to fig 1.4 of the MLPR book. Note that you can use matplotlib's plt.subplot(...) functionality for creating grids of figures.
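
The cell further below animates the fit as $M$ grows. As a simpler point of reference, the static 2$\times$2 grid the question asks for could be produced along the following lines; this is a minimal sketch reusing gen_sinusoidal, fit_polynomial and get_phi from above, with figure size and styling as arbitrary choices.

In [ ]:
# Static 2x2 grid of fits for M in (0, 1, 3, 9), reusing the functions defined above.
x_data, t_data = map(np.array, gen_sinusoidal(9))
x_grid = np.linspace(0, 2 * np.pi, 200)

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, M in zip(axes.ravel(), (0, 1, 3, 9)):
    # Note: with N=9 points, the M=9 fit has more parameters than data, so the
    # normal equations are ill-conditioned and the curve chases the noise.
    w = fit_polynomial(x_data, t_data, M)
    ax.plot(x_grid, np.sin(x_grid), 'g', label='sin(x)')
    ax.plot(x_data, t_data, 'o', label='data')
    ax.plot(x_grid, get_phi(x_grid, M).dot(w), 'r', label='fit')
    ax.set_title('M = %d' % M)
    ax.set_ylim(-1.5, 1.5)
    ax.legend(loc='upper right')
plt.show()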


In [ ]:
from matplotlib import animation
from JSAnimation import IPython_display

N = 9
fig = plt.figure()
fig.set_size_inches(12, 6)
ax = plt.subplot(121)

line, = ax.plot([], [], lw=2)

# Generate the training data before plotting it
X_train, t_train = gen_sinusoidal(N + 4)

X = np.arange(-0.1, 2.1 * math.pi, 0.1)
t = np.sin(X)
ax.plot(X, t, label='original function')
ax.plot(X_train, t_train, '*', label='Points with noise')

ax.set_xlim([-0.1, 2.1 * math.pi])
ax.set_ylim([-1.5, 1.5])

ax1 = plt.subplot(122)
m = []
error = []
terror = []
error_line, = ax1.plot([], [], lw=2, label='Testing error')
terror_line, = ax1.plot([], [], lw=2, label='Training error')

def init():
    line.set_data([], [])
    error_line.set_data([], [])
    terror_line.set_data([], [])

    return line, error_line, terror_line

def animate(M):
    w = fit_polynomial(X_train, t_train, M)
    t_fit = get_phi(X, M).dot(w)
    line.set_data(X, t_fit)
    line.set_label('M=%d' % (M))
    ax.legend(loc=1)
        
    m.append(len(m))
    trains_error = t_train - get_phi(X_train, M).dot(w)
    terror.append(math.sqrt(trains_error.transpose().dot(trains_error)/len(X_train)) / 2)
    terror_line.set_data(m, terror)
    squared_error = np.array(t_fit) - np.array(t)
    error.append(math.sqrt(squared_error.transpose().dot(squared_error)/len(X)) / 2)
    ax1.set_xlim([m[0], m[-1]])
    ax1.set_ylim([-0.1 + min(error), max(error) + 0.1])
    error_line.set_data(m, error)
    ax1.legend(loc=1)
    return line, error_line, terror_line

animation.FuncAnimation(fig, animate, init_func=init, frames=15, interval=500, blit=True)

1.4 Regularized linear regression (10 points)

Write a method fit_polynomial_reg(x, t, M, lamb) that fits a regularized $M$-th order polynomial to the sinusoidal data, as discussed in the lectures, where lamb is the regularization parameter $\lambda$. (Note that 'lambda' cannot be used as a variable name in Python, since it has a special meaning.) The error function to minimize w.r.t. $\bw$ is:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$

For background, see section 3.1.4 of the MLPR book.
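
Setting the gradient of this regularized error to zero again gives a closed-form solution, which the implementation below computes:

$$\bw = \left(\lambda \bI + \bPhi^T\bPhi\right)^{-1}\bPhi^T\bt$$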


In [62]:
def fit_polynomial_reg(X, t, M, lamb):
    # Regularized normal equations: w = (Phi^T Phi + lambda*I)^-1 Phi^T t
    Phi = get_phi(X, M)
    PhiT = Phi.transpose()
    w = np.linalg.inv(PhiT.dot(Phi) + lamb * np.eye(M + 1)).dot(PhiT).dot(t)
    return w

# Test
print(fit_polynomial_reg(np.array([1,2,3]), np.array([4,5,6]), 4, 0.5))


[ 1.43017172  1.11215238  0.60769662 -0.00646621 -0.05054576]

1.5 Model selection by cross-validation (10 points)

Use cross-validation to find a good choice of $M$ and $\lambda$, given a dataset of $N=9$ datapoints generated with gen_sinusoidal(9). You should write a function that tries (loops over) a reasonable range of choices of $M$ and $\lambda$, and returns the choice with the best cross-validation error. In this case you can use $K=9$ folds, corresponding to leave-one-out cross-validation.

You can let $M \in (0, 1, ..., 10)$, and let $\lambda \in (e^{-10}, e^{-9}, ..., e^{0})$.

To get you started, here's a method you can use to generate indices of cross-validation folds.


In [63]:
def kfold_indices(N, k):
    all_indices = np.arange(N, dtype=int)
    np.random.shuffle(all_indices)
    # Fold boundaries; cast to int so they can be used as slice indices
    idx = np.floor(np.linspace(0, N, k + 1)).astype(int)
    train_folds = []
    valid_folds = []
    for fold in range(k):
        valid_indices = all_indices[idx[fold]:idx[fold + 1]]
        valid_folds.append(valid_indices)
        train_folds.append(np.setdiff1d(all_indices, valid_indices))
    return train_folds, valid_folds

k = 9
X_gen, t_gen = map(np.array, gen_sinusoidal(N))

def plot_approximation(M, lamb, Xi, ti):
    w = fit_polynomial_reg(Xi, ti, M, lamb)
    t_fit = get_phi(X, M).dot(w)
    plt.plot(X, t_fit, label="M=" + str(M) + ", lamb=" + str(lamb))

def plot_orig():
    X = np.arange(-0.1, 2.1 * math.pi, 0.1)
    t = np.sin(X)
    plt.plot(X, t, linewidth=5, label='Original function')
    plt.plot(X_gen, t_gen, '*', label='Points with noise')  

def cross_validate(M, lamb):
    # Leave-one-out: each validation fold holds a single point, so the fold
    # error is the squared error on that one held-out point.
    for train_set, test_set in zip(*kfold_indices(N, k)):
        X_train = X_gen[train_set]
        t_train = t_gen[train_set]
        X_test = X_gen[test_set]
        t_test = t_gen[test_set]

        w = fit_polynomial_reg(X_train, t_train, M, lamb)
        t_fit = get_phi(X_test, M).dot(w)
        t_fits = get_phi(X, M).dot(w)  # fit on the full plotting grid, used by the plots below
        yield math.pow(t_fit[0] - t_test[0], 2), t_fits

M_opt = 0
lamb_opt = 0
min_avg_error = 10000
for M in range(11):
    for lamb in [math.exp(-i) for i in range(11)]:
        avg_error = np.average([error for error, _ in cross_validate(M, lamb)])
        if min_avg_error > avg_error:
            min_avg_error = avg_error
            M_opt = M
            lamb_opt = lamb

print 'Min Error:', min_avg_error
print 'M_opt:', M_opt
print 'lamb_opt:', lamb_opt


Min Error: 0.0739147338943
M_opt: 3
lamb_opt: 0.0497870683679

In [149]:
from mpl_toolkits.mplot3d import Axes3D
def plot_cross_validation_error():
    points = [(lamb, M, np.average([error for error, _ in cross_validate(M, lamb)]))
              for M in range(11)
              for lamb in [math.exp(-i) for i in range(11)]]
    lambs, Ms, errors = zip(*points)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(lambs, Ms, errors, c='g', marker='o', s=10)
    ax.scatter([lamb_opt], [M_opt], [min_avg_error], c='r', marker='^', s=100)
    ax.set_xlabel('Lambda')
    ax.set_ylabel('M')
    ax.set_zlabel('Square Error')
    ax.view_init(elev=200)

    plt.show()

plot_cross_validation_error()



In [150]:
def plot_cross_validation(M, lamb):
    fig = plt.figure()
    fig.set_size_inches(14, 8)

    ax = plt.subplot(121)
    ax.set_title('M=' + str(M) + ' lamb=' + str(lamb))
    mean = np.sum(np.array([t_fit for _, t_fit in cross_validate(M, lamb)]), axis=0) / 9
    for i, (_, t_fit) in enumerate(cross_validate(M, lamb)):
        plt.plot(X, t_fit, label="approximation " + str(i))
    plt.plot(X, mean, linewidth=3, label="Mean", color='r')

    ax = plt.subplot(122)
    plot_orig()
    plt.plot(X, mean, linewidth=3, label="Mean", color='r')
    plt.legend(bbox_to_anchor=(1.05, 1), loc=1, borderaxespad=0.)
    plt.show()

plot_cross_validation(M_opt, lamb_opt)
plot_cross_validation(6, 0.01)
plot_cross_validation(6, 0.1)
plot_cross_validation(6, 1)


Training / Testing error with regularization


In [69]:
N = 9
fig = plt.figure()
fig.set_size_inches(12, 6)
ax = plt.subplot(121)

line, = ax.plot([], [], lw=2)

# Generate the training data before plotting it
X_train, t_train = gen_sinusoidal(N + 4)

X = np.arange(-0.1, 2.1 * math.pi, 0.1)
t = np.sin(X)
ax.plot(X, t, label='original function')
ax.plot(X_train, t_train, '*', label='Points with noise')

ax.set_xlim([-0.1, 2.1 * math.pi])
ax.set_ylim([-1.5, 1.5])

ax1 = plt.subplot(122)
m = []
error = []
terror = []
error_line, = ax1.plot([], [], lw=2, label='Testing error')
terror_line, = ax1.plot([], [], lw=2, label='Training error')

def init():
    line.set_data([], [])
    error_line.set_data([], [])
    terror_line.set_data([], [])

    return line, error_line, terror_line

def animate(M):
    w = fit_polynomial_reg(X_train, t_train, M, lamb_opt)
    t_fit = get_phi(X, M).dot(w)
    line.set_data(X, t_fit)
    line.set_label('M=%d' % (M))
    ax.legend(loc=1)
        
    m.append(len(m))
    trains_error = t_train - get_phi(X_train, M).dot(w)
    terror.append(math.sqrt(trains_error.transpose().dot(trains_error)/len(X_train)) / 2)
    terror_line.set_data(m, terror)
    squared_error = np.array(t_fit) - np.array(t)
    error.append(math.sqrt(squared_error.transpose().dot(squared_error)/len(X)) / 2)
    ax1.set_xlim([m[0], m[-1]])
    ax1.set_ylim([-0.1 + min(error), max(error) + 0.1])
    error_line.set_data(m, error)
    ax1.legend(loc=1)
    return line, error_line, terror_line

animation.FuncAnimation(fig, animate, init_func=init, frames=15, interval=500, blit=True)


Out[69]:



Create a comprehensible plot of the cross-validation error for each choice of $M$ and $\lambda$. Highlight the best choice.

Question: Explain over-fitting and underfitting, illuminated by your plot. Explain the relationship with model bias and model variance.

Answer: A model with too little flexibility (small $M$, or a very large $\lambda$) underfits: it cannot follow the sine curve, so both training and cross-validation error are high. This corresponds to high bias and low variance, since the fitted curve hardly depends on the particular noisy sample. A model with too much flexibility and little regularization (large $M$, very small $\lambda$) overfits: it passes close to the training points (low training error) but oscillates between them, so the cross-validation error is high. This corresponds to low bias and high variance, since the fit changes strongly from one noisy dataset to the next. The best cross-validated choice, highlighted in the plot, balances the two (see the decomposition below).
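
For reference, the relationship is summarized by the standard decomposition of the expected squared error (MLPR section 3.2), where the expectation is taken over datasets and observation noise:

$$\mathbb{E}\left[(t - \hat{y}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{y}(x)] - \mathbb{E}[t \mid x]\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{y}(x) - \mathbb{E}[\hat{y}(x)]\right)^2\right]}_{\text{variance}} + \text{noise}$$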

1.6 Plot best cross-validated fit (5 points)

For some dataset with $N = 9$, plot the model with the optimal $M$ and $\lambda$ according to the cross-validation error, using the method you just wrote. Let the plot make clear which $M$ and $\lambda$ were found.


In [123]:
fig = plt.figure()
fig.set_size_inches(14, 8)

print 'M_opt:', M_opt
print 'lamb_opt:', lamb_opt
plot_orig()
X_gen, t_gen = map(np.array, gen_sinusoidal(N))
print 'New dataset:', X_gen, t_gen
plot_approximation(M_opt, lamb_opt, X_gen, t_gen)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()


M_opt: 5
lamb_opt: 0.00673794699909
New dataset: [ 0.          0.78539816  1.57079633  2.35619449  3.14159265  3.92699082
  4.71238898  5.49778714  6.28318531] [ 0.20621935  0.61198152  0.92027229  0.48217294 -0.2398803  -0.68704286
 -1.10174638 -0.68681102 -0.00923546]

Part 2: Bayesian Linear (Polynomial) Regression

2.1 Sinusoidal data 2 (5 points)

Write a function gen_sinusoidal2(N) that behaves identically to gen_sinusoidal(N) except that the generated values $x_i$ are not linearly spaced, but drawn from a uniform distribution between $0$ and $2 \pi$.


In [117]:
import numpy as np
import math

def gen_sinusoidal2(N):
    # x values drawn uniformly from [0, 2*pi] (not evenly spaced)
    xs = tuple(np.random.uniform(0, 2 * math.pi, N))
    # targets: sin(x) plus zero-mean Gaussian noise with standard deviation 0.2
    ts = tuple(np.random.normal(math.sin(x), 0.2) for x in xs)
    return xs, ts

# Test
x, t = gen_sinusoidal2(5)
print(x)
print(t)


(3.8121055255843057, 1.1274002963328071, 1.875502803607882, 3.1554130593548599, 5.0608395540644304)
(-0.60547964546166122, 1.0925214619716874, 0.79136159118064198, 0.25265138416167598, -0.72095780835483736)

2.2 Compute Posterior (15 points)

You're going to implement a Bayesian linear regression model, and fit it to the sinusoidal data. Your regression model has a zero-mean isotropic Gaussian prior over the parameters, governed by a single (scalar) precision parameter $\alpha$, i.e.:

$$p(\bw \;|\; \alpha) = \mathcal{N}(\bw \;|\; 0, \alpha^{-1} \bI)$$

The covariance and mean of the posterior are given by:

$$\bS_N = \left( \alpha \bI + \beta \bPhi^T \bPhi \right)^{-1}$$

$$\bm_N = \beta\, \bS_N \bPhi^T \bt$$

where $\alpha$ is the precision of the prior over the parameters and $\beta$ is the noise precision. See MLPR chapter 3.3 for background.

Write a method fit_polynomial_bayes(x, t, M, alpha, beta) that returns the mean $\bm_N$ and covariance $\bS_N$ of the posterior for a $M$-th order polynomial, given a dataset, where x, t and M have the same meaning as in question 1.2.


In [118]:
def fit_polynomial_bayes(X, t, M, alpha, beta):
    # Posterior over the weights: S_N = (alpha*I + beta*Phi^T Phi)^-1, m_N = beta * S_N Phi^T t
    Phi = get_phi(X, M)
    PhiT = Phi.transpose()
    Sn = np.linalg.inv(alpha * np.eye(M + 1) + beta * PhiT.dot(Phi))
    mn = beta * Sn.dot(PhiT).dot(t)
    return mn, Sn

# Test
X = np.array([1, 2, 3])
t = np.array([4, 5, 6])
mn, Sn = fit_polynomial_bayes(X, t, 5, 1, 2)
print(mn)
print(Sn)


[ 1.3462863   1.13697024  0.77621538  0.22833748 -0.34652283  0.06644293]
[[ 0.64885686 -0.29285689 -0.19273219 -0.04182615  0.11195577 -0.02457077]
 [-0.29285689  0.74160043 -0.19801882 -0.10285902  0.01065551  0.00726936]
 [-0.19273219 -0.19801882  0.79591878 -0.2026734  -0.15925992  0.04936061]
 [-0.04182615 -0.10285902 -0.2026734   0.66445186 -0.40103551  0.06877604]
 [ 0.11195577  0.01065551 -0.15925992 -0.40103551  0.4095791  -0.08669425]
 [-0.02457077  0.00726936  0.04936061  0.06877604 -0.08669425  0.01946171]]

2.3 Prediction (10 points)

The predictive distribution of Bayesian linear regression is:

$$ p(t \;|\; \bx, \bt, \alpha, \beta) = \mathcal{N}(t \;|\; \bm_N^T \phi(\bx), \sigma_N^2(\bx))$$

$$ \sigma_N^2(\bx) = \frac{1}{\beta} + \phi(\bx)^T \bS_N \phi(\bx) $$

where $\phi(\bx)$ are the computed features for a new datapoint $\bx$, and $t$ is the predicted variable for datapoint $\bx$.

Write a function predict_polynomial_bayes(x, m, S, beta) that returns the predictive mean and variance given a new datapoint x, the posterior mean m, the posterior covariance S, and the noise precision beta.


In [119]:
def predict_polynomial_bayes(X, m, S, beta):
    M = len(m) - 1  # infer the polynomial order from the posterior mean
    Phi = get_phi(X, M)
    # Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x), per point
    mean = Phi.dot(m)
    variance = 1 / beta + np.sum(Phi.dot(S) * Phi, axis=1)
    return mean, variance

2.4 Plot predictive distribution (10 points)

a) (5 points) Generate 7 datapoints with gen_sinusoidal2(7). Compute the posterior mean and covariance for a Bayesian polynomial regression model with $M=5$, $\alpha=\frac{1}{2}$ and $\beta=\frac{1}{0.2^2}$. Plot the Bayesian predictive distribution: for $x$ between 0 and $2\pi$, plot the predictive mean of $t$ together with a 1-sigma band (one predictive standard deviation) using plt.fill_between(..., alpha=0.1) (the alpha argument controls transparency).

Include the datapoints in your plot.

b) (5 points) For a second plot, draw 100 samples from the posterior distribution of the parameters. Each sample is a particular choice of parameters for 5th-order polynomial regression. Plot each of these 100 polynomials.


In [120]:
X_gen, t_gen = gen_sinusoidal2(7)
M = 5
alpha = 0.5             # note: 1 / 2 would be integer division (0) under Python 2
beta = 1 / (0.2 * 0.2)  # noise precision 1 / sigma^2

fig = plt.figure()
fig.set_size_inches(14, 8)

ax = plt.subplot(111)

X = np.arange(-0.00001, 2.0001 * math.pi, 0.01)
t = np.sin(X)
ax.plot(X, t, label='Original function')
ax.plot(X_gen, t_gen, '*', label='Points with noise')

mn, Sn = fit_polynomial_bayes(X_gen, t_gen, M, alpha, beta)
means, variances = predict_polynomial_bayes(X, mn, Sn, beta)
std = np.sqrt(variances)  # 1-sigma predictive standard deviation

ax.plot(X, means, 'r', label='Mean')
ax.fill_between(X, means - std, means + std, alpha=0.1)

ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

fig = plt.figure()
fig.set_size_inches(14, 8)
ax = plt.subplot(111)
ax.plot(X_gen, t_gen, '*', label='Points with noise')
ax.fill_between(X, means - std * 3, means + std * 3, alpha=0.5)
# Draw 100 weight vectors from the posterior and plot the corresponding polynomials
ws = np.random.multivariate_normal(mn, Sn, 100)
for w_sample in ws:
    sample_mean, _ = predict_polynomial_bayes(X, w_sample, Sn, beta)
    ax.plot(X, sample_mean, 'r')
plt.show()


2.5 Additional questions (10 points)

a) (5 points) Why is $\beta=\frac{1}{0.2^2}$ the best choice of $\beta$ in section 2.4?

b) (5 points) In the case of Bayesian linear regression, both the posterior of the parameters $p(\bw \;|\; \bt, \alpha, \beta)$ and the predictive distribution $p(t \;|\; \bw, \beta)$ are Gaussian. In consequence (and conveniently), $p(t \;|\; \bt, \alpha, \beta)$ is also Gaussian (see MLPR section 3.3.2 and homework 2 question 4). This is actually one of the (rare) cases where we can make Bayesian predictions without resorting to approximate methods.

Suppose you have to work with some model $p(t\;|\;x,\bw)$ with parameters $\bw$, where the posterior distribution $p(\bw\;|\;\mathcal{D})$ given dataset $\mathcal{D}$ cannot be integrated out when making predictions, but where you can still generate samples from the posterior distribution of the parameters. Explain how you can still make approximate Bayesian predictions using samples from the parameters' posterior distribution.

a) Because the data were generated with noise standard deviation $\sigma = 0.2$, the true noise precision is $\beta = \frac{1}{\sigma^2} = \frac{1}{0.2^2}$. Choosing this value matches the model's assumed noise level to the actual data-generating process, which is why it is the best choice of $\beta$ in section 2.4.

b) Draw $S$ samples $\bw^{(1)}, \dots, \bw^{(S)}$ from the posterior $p(\bw \;|\; \mathcal{D})$ and approximate the predictive distribution by the Monte Carlo average $p(t \;|\; x, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} p(t \;|\; x, \bw^{(s)})$. Predictive quantities such as the mean are obtained the same way, e.g. $\mathbb{E}[t \;|\; x, \mathcal{D}] \approx \frac{1}{S} \sum_{s} \mathbb{E}[t \;|\; x, \bw^{(s)}]$, and the approximation improves as $S$ grows.
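
As a concrete illustration of the sampling scheme in (b), the sketch below reuses the Bayesian polynomial model fitted above (mn, Sn, M and the get_phi helper). For this Gaussian model the predictive distribution is available exactly, so the sample average only demonstrates the mechanics; n_samples and x_new are arbitrary choices.

In [ ]:
# Monte Carlo approximation of the predictive mean using posterior samples (sketch for 2.5b)
n_samples = 1000
w_samples = np.random.multivariate_normal(mn, Sn, n_samples)  # w^(s) ~ p(w | D)

x_new = np.array([1.0])      # a new input point
phi_new = get_phi(x_new, M)  # its polynomial features, shape (1, M+1)

# Each sample predicts phi(x)^T w^(s); averaging over samples approximates E[t | x, D]
mc_mean = np.mean(phi_new.dot(w_samples.T), axis=1)
print(mc_mean)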


In [120]: