Save this file as studentid1_studentid2_lab#.ipynb

(Your student-id is the number shown on your student card.)

E.g. if you work with 3 people, the notebook should be named: 12301230_3434343_1238938934_lab1.ipynb.

This will be parsed by a regexp, so please double check your filename.

Before you turn this problem in, please make sure everything runs correctly. First, restart the kernel (in the menubar, select Kernel$\rightarrow$Restart) and then run all cells (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says YOUR CODE HERE or "YOUR ANSWER HERE", as well as your names and email addresses below.


In [1]:
NAME = "Michelle Appel"
NAME2 = "Verna Dankers"
NAME3 = "Yves van Montfort"
EMAIL = "michelle.appel@student.uva.nl"
EMAIL2 = "verna.dankers@student.uva.nl"
EMAIL3 = "yves.vanmontfort@student.uva.nl"

Lab 1: Linear Regression and Overfitting

Machine Learning 1, September 2017

Notes on implementation:

  • You should write your code and answers in this IPython Notebook: http://ipython.org/notebook.html. If you have problems, please contact your teaching assistant.
  • Please write your answers right below the questions.
  • Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.
  • Refer to last week's lab notes, i.e. http://docs.scipy.org/doc/, if you are unsure about what function to use. There are different correct ways to implement each problem!
  • For this lab, your regression solutions should be in closed form, i.e., should not perform iterative gradient-based optimization but find the exact optimum directly.
  • Use the provided test boxes to check whether your answers are correct.

In [2]:
%pylab inline
plt.rcParams["figure.figsize"] = [20,10]


Populating the interactive namespace from numpy and matplotlib

$\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\bI}{\mathbf{I}}$

Part 1: Polynomial Regression

1.1. Generate periodic data (5 points)

Write a method gen_cosine(N) that generates toy data like in fig 1.2 of Bishop's book. The method should have a parameter $N$, and should return $N$-dimensional vectors $\bx$ and $\bt$, where $\bx$ contains evenly spaced values from 0 to 2$\pi$ (inclusive), and the elements $t_i$ of $\bt$ are distributed according to:

$$t_i \sim \mathcal{N}(\mu_i, \sigma^2)$$

where $x_i$ is the $i$-th element of $\bx$, the mean $\mu_i = \cos(x_i)$ and the standard deviation $\sigma = 0.2$.


In [3]:
import numpy as np
import matplotlib.pyplot as plt

lims = (0, 2*np.pi)

def gen_cosine(n):
    # n evenly spaced points on [0, 2*pi], endpoint included
    x = np.linspace(lims[0], lims[1], num=n, endpoint=True)

    # Targets drawn from a Gaussian with mean cos(x_i) and standard deviation 0.2
    s = 0.2
    mu = np.cos(x)
    t = np.random.normal(loc=mu, scale=s)

    return x, t

In [4]:
### Test your function
np.random.seed(5)
N = 10
x, t = gen_cosine(N)

assert x.shape == (N,), "the shape of x is incorrect"
assert t.shape == (N,), "the shape of t is incorrect"

1.2 Polynomial regression (10 points)

Write a method fit_polynomial(x, t, M) that finds the maximum-likelihood solution of an unregularized $M$-th order polynomial for a dataset of inputs x and targets t. The error function to minimize w.r.t. $\bw$ is:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt)$

where $\bPhi$ is the feature matrix (or design matrix) as explained in section 3.1.1 of Bishop's book, and $\bt$ is the vector of target values. Your method should return a vector $\bw$ with the maximum-likelihood parameter estimates, as well as the feature matrix $\bPhi$.
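
For reference, setting the gradient of $E(\bw)$ with respect to $\bw$ to zero yields the closed-form (normal-equations) solution that the code below implements:

$$\bw_{\text{ML}} = \left(\bPhi^T \bPhi\right)^{-1} \bPhi^T \bt$$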


In [5]:
def designmatrix(x, M): # it is highly recommended to write a helper function that computes Phi
    # Polynomial design matrix: column j contains x**j for j = 0, ..., M
    Phi = []
    for i in range(M+1):
        Phi.append(np.power(x, i))
    Phi = np.matrix(Phi).transpose()

    return Phi

def fit_polynomial(x, t, M):
    Phi = designmatrix(x, M)
    # Maximum-likelihood solution via the normal equations: w = (Phi^T Phi)^(-1) Phi^T t
    w_ml = (np.linalg.inv(Phi.T*Phi)*Phi.T)*np.matrix(t).T

    return np.squeeze(np.asarray(w_ml)), Phi

In [6]:
### Test your function
N = 10
x = np.square((np.linspace(-1, 1, N)))
t = 0.5*x + 1.5
m = 2
w, Phi = fit_polynomial(x,t,m)

assert w.shape == (m+1,), "The shape of w is incorrect"
assert Phi.shape == (N, m+1), "The shape of Phi is incorrect"

1.3 Plot (5 points)

Sample a dataset with $N=10$, and fit four polynomials with $M \in (0, 2, 4, 8)$. For each value of $M$, plot the prediction function, along with the data and the original cosine function. The resulting figure should look similar to fig 1.4 of Bishop's book. Note that you can use matplotlib's plt.subplot(...) functionality for creating grids of figures.


In [7]:
np.random.seed(5)

M = (0, 2, 4, 8)
N = 10
Nc = 1000

x, t = gen_cosine(N)
xc = np.linspace(lims[0], lims[1], Nc)
tc = np.cos(xc)

fig = plt.figure()

for i, m in enumerate(M):
    ax = plt.subplot(2,2,i+1)
    w, Phi = fit_polynomial(x,t,m)
    tf = w*designmatrix(xc, m).T
    
    # Plot the prediction function
    plt.plot(xc, tf.T, color='r')
    
    # Plot the cosine function
    plt.plot(xc, tc, color='g')
    
    # Plot the data
    plt.scatter(x,t, marker='x')

    ax.annotate("M={}".format(m), xy=(4,1), xytext=(4, 1), size=15)

    plt.xlabel("x")
    plt.ylabel("t")

plt.show()


1.4 Regularized linear regression (10 points)

Write a method fit_polynomial_reg(x, t, M, lamb) that fits a regularized $M$-th order polynomial to the periodic data, as discussed in the lectures, where lamb is the regularization parameter $\lambda$. (Note that 'lambda' cannot be used as a variable name in Python since it is a reserved keyword.) The error function to minimize w.r.t. $\bw$ is:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$

For background, see section 3.1.4 of Bishop's book.

The function should return $\bw$ and $\bPhi$.
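
For reference, minimizing this regularized error in closed form gives the solution implemented below (cf. section 3.1.4 of Bishop's book):

$$\bw = \left(\lambda \bI + \bPhi^T \bPhi\right)^{-1} \bPhi^T \bt$$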


In [8]:
def fit_polynomial_reg(x, t, m, lamb):
    Phi = designmatrix(x, m)
    # Regularized (ridge) solution: w = (lambda*I + Phi^T Phi)^(-1) Phi^T t
    w_ml = (np.linalg.inv(lamb*np.identity(m+1) + Phi.T*Phi)*Phi.T)*np.matrix(t).T

    return np.squeeze(np.asarray(w_ml)), Phi

In [9]:
### Test your function
N = 10
x = np.square((np.linspace(-1, 1, N)))
t = 0.5*x + 1.5
m = 2
lamb = 0.1
w, Phi = fit_polynomial_reg(x,t,m, lamb)

assert w.shape == (m+1,), "The shape of w is incorrect"
assert Phi.shape == (N, m+1), "The shape of Phi is incorrect"

1.5 Model selection by cross-validation (15 points)

Use cross-validation to find a good choice of $M$ and $\lambda$, given a dataset of $N=10$ datapoints generated with gen_cosine(10). You should write a function that tries (loops over) a reasonable range of choices of $M$ and $\lambda$, and returns the choice with the best cross-validation error. Use $K=5$ folds.

You can let $M \in (0, 1, ..., 10)$, and let $\lambda \in (e^{-10}, e^{-9}, ..., e^{0})$.

a) (5 points) First, write a method pred_error(x_train, x_valid, t_train, t_valid, M, lamb) that fits a regularized polynomial with parameters $M$ and $\lambda$ to the training fold using fit_polynomial_reg, and compares its predictions on x_valid to t_valid. It should return the prediction error for a single fold.


In [10]:
def pred_error(x_train, x_valid, t_train, t_valid, M, reg):
    # Fit a regularized polynomial on the training fold
    w_train, Phi_train = fit_polynomial_reg(x_train, t_train, M, reg)
    Phi_valid = designmatrix(x_valid, M)
    w_train = np.matrix(w_train).T

    # Sum-of-squares prediction error on the validation fold
    err_t = Phi_valid * w_train - np.matrix(t_valid).T
    pred_err = err_t.T * err_t

    return pred_err

In [11]:
### Test your function
N = 10
x = np.linspace(-1, 1, N)
t = 0.5*np.square(x) + 1.5
M = 2
reg = 0.1
pred_err = pred_error(x[:-2], x[-2:], t[:-2], t[-2:], M, reg)

assert pred_err < 0.01, "pred_err is too big"

b) (10 points) Now write a method find_best_m_and_lamb(x, t) that finds the best values for $M$ and $\lambda$. The method should return the best $M$ and $\lambda$. To get you started, here is a method you can use to generate indices of cross-validation folds.


In [12]:
def kfold_indices(N, k):
    all_indices = np.arange(N,dtype=int)
    np.random.shuffle(all_indices)
    idx = [int(i) for i in np.floor(np.linspace(0,N,k+1))]
    train_folds = []
    valid_folds = []
    for fold in range(k):
        valid_indices = all_indices[idx[fold]:idx[fold+1]]
        valid_folds.append(valid_indices)
        train_folds.append(np.setdiff1d(all_indices, valid_indices))
    return train_folds, valid_folds

In [13]:
def find_best_m_and_lamb(x, t, Ms, lambs, K):
    n = np.size(x)
    folds = kfold_indices(n, K)

    # Grid of all (M, lambda) combinations
    Mv, lambv = np.meshgrid(Ms, lambs)
    errs = np.zeros(Mv.shape)  # zeros, not empty: errors are accumulated into this array

    for i in np.ndindex(Mv.shape):
        for k in range(K):
            ftr = folds[0][k]
            fva = folds[1][k]

            # Accumulate the validation error over the K folds
            errs[i] += pred_error(x[ftr], x[fva], t[ftr], t[fva], Mv[i], lambv[i])

    # Return the (M, lambda) pair with the lowest total cross-validation error
    best_idx = np.unravel_index(np.argmin(errs), errs.shape)
    return Mv[best_idx], lambv[best_idx]

In [14]:
### If you want you can write your own test here
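# A possible sanity check (a sketch, not part of the assignment): on noise-free
# quadratic data the selected order should be at least 2, since M < 2 cannot
# represent the target. The names below (x_cv, t_cv, ...) are illustrative only.
np.random.seed(42)
x_cv = np.linspace(-1, 1, 20)
t_cv = 0.5*np.square(x_cv) + 1.5
Ms_cv = np.arange(11)
lambs_cv = np.exp(np.arange(-10, 1))
M_cv, lamb_cv = find_best_m_and_lamb(x_cv, t_cv, Ms_cv, lambs_cv, 5)
assert M_cv >= 2, "a quadratic target needs at least a 2nd-order polynomial"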

1.7 Plot best cross-validated fit (5 points)

For some dataset with $N = 10$, plot the model with the optimal $M$ and $\lambda$ according to the cross-validation error, using the method you just wrote. In addition, the plot should show the dataset itself and the function that we try to approximate. Let the plot make clear which $M$ and $\lambda$ were found.


In [15]:
np.random.seed(5)
N = 10
Nc = 1000
M = 10
k = 5
lamb_p = 10

Ms = np.arange(M+1)
lambs = np.exp(-np.arange(lamb_p + 1)[::-1])

x, t = gen_cosine(N)
xc = np.linspace(lims[0], lims[1], Nc)
tc = np.cos(xc)

M_best, lamb_best = find_best_m_and_lamb(x, t, Ms, lambs, k)
print("The best values are M = {} and lambda = {:.6f}".format(M_best, lamb_best))

w, Phi = fit_polynomial_reg(x, t, M_best, lamb_best)
tf = w*designmatrix(xc, M_best).T 

fig = plt.figure()

# Make clear which M and lambda were found
ax = fig.add_subplot(111)
ax.annotate("M={}, $\lambda$={:.6f}".format(M_best, lamb_best), xy=(4,1), xytext=(4, 1), size=15)

# Plot the dataset
plt.scatter(x, t, marker='x')

# Plot the function we try to approximate
plt.plot(xc, tc, color='g')

# Plot the model found through cross-validation
plt.plot(xc, tf.T, color='r')
plt.xlabel("x")
plt.ylabel("t")
plt.xlim(0, 2*np.pi)
plt.show()


The best values are M = 4 and lambda = 0.049787

Part 2: Bayesian Linear (Polynomial) Regression

2.1 Cosine 2 (5 points)

Write a function gen_cosine2(N) that behaves identically to gen_cosine(N) except that the generated values $x_i$ are not linearly spaced, but drawn from a uniform distribution between $0$ and $2 \pi$.


In [16]:
import numpy as np
import matplotlib.pyplot as plt

def gen_cosine2(n):
    """
    Generate n x-values from a uniform distribution between 0 and 2pi,
    with targets t drawn from a Gaussian with mean cos(x) and sigma = 0.2.
    """
    x = np.random.uniform(0, 2*np.pi, n)
    sigma = 0.2
    mu = np.cos(x)
    t = np.random.normal(loc=mu, scale=sigma)
    return x, t

In [17]:
### Test your function
np.random.seed(5)
N = 10
x, t = gen_cosine2(N)

assert x.shape == (N,), "the shape of x is incorrect"
assert t.shape == (N,), "the shape of t is incorrect"

2.2 Compute Posterior (15 points)

You're going to implement a Bayesian linear regression model, and fit it to the periodic data. Your regression model has a zero-mean isotropic Gaussian prior over the parameters, governed by a single (scalar) precision parameter $\alpha$, i.e.:

$$p(\bw \;|\; \alpha) = \mathcal{N}(\bw \;|\; 0, \alpha^{-1} \bI)$$

The covariance and mean of the posterior are given by:

$$\bS_N = \left( \alpha \bI + \beta \bPhi^T \bPhi \right)^{-1}$$

$$\bm_N = \beta\; \bS_N \bPhi^T \bt$$

where $\alpha$ is the precision of the prior over the weights $\bw$, and $\beta$ is the noise precision. See section 3.3 of Bishop's book for background.

Write a method fit_polynomial_bayes(x, t, M, alpha, beta) that returns the mean $\bm_N$ and covariance $\bS_N$ of the posterior for an $M$-th order polynomial. In addition it should return the design matrix $\bPhi$. The arguments x, t and M have the same meaning as in question 1.2.


In [18]:
import matplotlib.pyplot as plt

def fit_polynomial_bayes(x, t, M, alpha, beta):
    """
    Fit a polynomial to data x with corresponding targets t. 
    M indicates the order of the polynomial, alpha is the precision of the
    predicitve distribution and beta is the noise precision.
    """
    # Calculate S and m
    Phi = designmatrix(x, M)
    S = np.linalg.inv(alpha * np.identity(M+1) + beta * Phi.T * Phi)
    m = np.array((beta * S * Phi.T * np.matrix(t).T).T)[0]

    return m, S, Phi

In [19]:
### Test your function
N = 10
x = np.linspace(-1, 1, N)
t = 0.5*np.square(x) + 1.5
M = 2
alpha = 0.5
beta = 25
m, S, Phi = fit_polynomial_bayes(x, t, M, alpha, beta)

assert m.shape == (M+1,), "the shape of m is incorrect" 
assert S.shape == (M+1, M+1), "the shape of S is incorrect"
assert Phi.shape == (N, M+1), "the shape of Phi is incorrect"
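
# Extra sanity check (a sketch, not part of the assignment): with a very small
# alpha (an almost flat prior) the posterior mean should approach the
# unregularized maximum-likelihood fit from question 1.2.
m_flat, _, _ = fit_polynomial_bayes(x, t, M, 1e-10, beta)
w_ml, _ = fit_polynomial(x, t, M)
assert np.allclose(m_flat, w_ml), "posterior mean should match the ML fit for a flat prior"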

2.3 Prediction (10 points)

The predictive distribution of Bayesian linear regression is:

$$ p(t \;|\; \bx, \bt, \alpha, \beta) = \mathcal{N}(t \;|\; \bm_N^T \phi(\bx), \sigma_N^2(\bx))$$

$$ \sigma_N^2(\bx) = \frac{1}{\beta} + \phi(\bx)^T \bS_N \phi(\bx) $$

where $\phi(\bx)$ are the computed features for a new datapoint $\bx$, and $t$ is the predicted variable for datapoint $\bx$.

Write a function predict_polynomial_bayes(x, m, S, beta) that returns the predictive mean, variance and design matrix $\bPhi$ given new datapoints x, posterior mean m, posterior covariance S and noise precision beta.


In [20]:
def predict_polynomial_bayes(x, m, S, beta):
    """
    Evaluate the Bayesian predictive distribution at the inputs x:
    return the predictive mean, the predictive variance and Phi.
    """
    Phi = designmatrix(x, len(m)-1)
    # Predictive variance: 1/beta + phi(x)^T S phi(x) for each datapoint
    sigma = [(1 / beta + np.asscalar(Phi[i] * S * Phi[i].T))
             for i in range(len(x))]
    # Predictive mean: m^T phi(x) for each datapoint
    mean = [np.asscalar(m*Phi[i].T) for i in range(len(x))]

    return np.array(mean), np.array(sigma), Phi

In [21]:
### Test your function
np.random.seed(5)
N = 10
x = np.linspace(-1, 1, N)
m = np.empty(3)
S = np.empty((3, 3))
beta = 25
mean, sigma, Phi = predict_polynomial_bayes(x, m, S, beta)

assert mean.shape == (N,), "the shape of mean is incorrect"
assert sigma.shape == (N,), "the shape of sigma is incorrect"
assert Phi.shape == (N, m.shape[0]), "the shape of Phi is incorrect"
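
# Extra sanity check (a sketch, not part of the assignment): with a zero
# posterior mean and zero posterior covariance, the predictive mean should be
# zero and the predictive variance should reduce to the noise variance 1/beta.
mean0, sigma0, _ = predict_polynomial_bayes(x, np.zeros(3), np.zeros((3, 3)), beta)
assert np.allclose(mean0, 0), "predictive mean should be zero when m = 0"
assert np.allclose(sigma0, 1/beta), "predictive variance should equal 1/beta when S = 0"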

2.4 Plot predictive distribution (10 points)

a) (5 points) Generate 10 datapoints with gen_cosine2(10). Compute the posterior mean and covariance for a Bayesian polynomial regression model with $M=4$, $\alpha=\frac{1}{2}$ and $\beta=\frac{1}{0.2^2}$. Plot the Bayesian predictive distribution: for $x$ between 0 and $2\pi$, plot the predictive mean of $t$ together with a one-standard-deviation band, using plt.fill_between(..., alpha=0.1) (the alpha argument controls the transparency).

Include the datapoints in your plot.


In [22]:
import matplotlib.pyplot as plt

# Generate 10 datapoints
x3, t3 = gen_cosine2(10)

# Compute the posterior mean and covariance
alpha = 1/2
beta = 1/(0.2*0.2)
M = 4
posterior_mean, covariance, Phi = fit_polynomial_bayes(x3, t3, M, alpha, beta)

# Evaluate the Bayesian predictive distribution on a fine grid over [0, 2pi]
x = np.arange(0.0, 2*np.pi, 0.01)
mean, sigma, Phi = predict_polynomial_bayes(x, posterior_mean, covariance, beta)

# Plot the predictive mean
p1 = plt.plot(x, mean, label="Predictive mean")

# Plot a one-standard-deviation band around the predictive mean
p2 = plt.fill_between(x, mean - np.sqrt(sigma), mean + np.sqrt(sigma), alpha=0.1,
                      label="Predictive variance")

# Include the datapoints in the plot
p3 = plt.scatter(x3, t3, label="Datapoints")

# Control layout
axes = plt.gca()
axes.set_xlim([0, 2*np.pi])
axes.set_ylim([-1.5, 1.5])
plt.xlabel("x")
plt.ylabel("t")
plt.legend()
plt.show()


b) (5 points) For a second plot, draw 100 samples from the parameters' posterior distribution. Each sample is a particular choice of parameters for a 4th-order polynomial. Plot each of these 100 polynomials.


In [23]:
# Draw 100 samples from the parameters' posterior distribution
samples = np.random.multivariate_normal(posterior_mean, covariance, size=100)

# Evaluate each sampled 4th-order polynomial on a fine grid and plot it
x = np.arange(0.0, 2*np.pi, 0.01)
for s in samples:
    plt.plot(x, s[0] + s[1]*x + s[2]*(x*x) + s[3]*np.power(x,3) + s[4]*np.power(x,4))

# Control layout
plt.xlabel("x")
plt.ylabel("target")
axes = plt.gca()
axes.set_xlim([0, 2*np.pi])
axes.set_ylim([-1.5, 1.5])

plt.show()


2.5 Additional questions (10 points)

a) (5 points) Why is $\beta=\frac{1}{0.2^2}$ the best choice of $\beta$ in section 2.4?

In assignments 1.1 and 2.1 (gen_cosine and gen_cosine2), the targets were drawn from a Gaussian with mean $\mu_i = \cos(x_i)$ and standard deviation $\sigma = 0.2$. Since $\beta$ is the precision of this observation noise, the best choice is $\beta = \frac{1}{\sigma^2} = \frac{1}{0.2^2} = 25$.

b) (5 points) What problems do we face when it comes to choosing basis functions in linear models?

The basis functions have to be fixed before the training data are observed; they are not learned from the data. As a consequence, the number of basis functions needed to cover the input space grows very rapidly (often exponentially) with the dimensionality of the input, an instance of the curse of dimensionality.
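
As a rough illustration of this growth (a small sketch, not part of the assignment): a full polynomial basis of total degree at most $M$ over a $D$-dimensional input has $\binom{D+M}{M}$ basis functions, which already runs into the millions for $M=4$ and $D=100$.


In [ ]:
from math import factorial

# Number of monomials of total degree <= M_deg in D input dimensions: C(D + M_deg, M_deg).
# Purely illustrative; the values of M_deg and D below are arbitrary.
M_deg = 4
for D in (1, 2, 10, 100):
    n_basis = factorial(D + M_deg) // (factorial(D) * factorial(M_deg))
    print("D = {:3d}: {} basis functions".format(D, n_basis))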


In [ ]: