Lab 1: Linear Regression and Overfitting

Machine Learning and Pattern Recognition, September 2013

The lab exercises should be made in groups of three people, or at most two people.
The deadline is TBA.
Assignment should be sent to taco.cohen@gmail.comn (Taco Cohen). The subject line of your email should be "#lab_lastname1_lastname2_lastname3".
Put your and your teammates' names in the body of the email
Attach the .IPYNB (IPython Notebook) file containing your code and answers. Naming of the file follows the same rule as the subject line. For example, if the subject line is "lab01_Kingma_Hu", the attached file should be "lab01_Kingma_Hu.ipynb". Only use underscores ("_") to connect names, otherwise the files cannot be parsed.

Notes on implementation:

You should write your code and answers in an IPython Notebook: http://ipython.org/notebook.html. If you have problems, please contact us.
Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.
For this lab, your regression solutions should be in closed form, i.e., should not perform iterative gradient-based optimization but find the exact optimum directly.
NOTE: Make sure we can run your notebook / scripts!



In [10]:

    
%pylab inline









    



Populating the interactive namespace from numpy and matplotlib

$\newcommand{\bPhi}{\mathbf{\Phi}}$ $\newcommand{\bx}{\mathbf{x}}$ $\newcommand{\bw}{\mathbf{w}}$ $\newcommand{\bt}{\mathbf{t}}$ $\newcommand{\by}{\mathbf{y}}$ $\newcommand{\bm}{\mathbf{m}}$ $\newcommand{\bS}{\mathbf{S}}$ $\newcommand{\bI}{\mathbf{I}}$

Part 1: Polynomial Regression

1.1. Generate sinusoidal data (5 points)

Write a method gen_sinusoidal(N) that generates toy data like in fig 1.2 of the MLPR book. The method should have a parameter $N$, and should return $N$-dimensional vectors $\bx$ and $\bt$, where $\bx$ contains evenly spaced values from 0 to (including) 2$\pi$, and the elements $t_i$ of $\bt$ are distributed according to:

$$t_i \sim \mathcal{N}(\mu, \sigma^2)$$

where $x_i$ is the $i$-th elements of $\bf{x}$, the mean $\mu = sin(x_i)$ and the standard deviation $\sigma = 0.2$.



In [15]:

    
from pylab import *                                                              
import math                                                                      
import matplotlib.pyplot as plt                                                  
import numpy as np  

def gen_sinusoidal(n):                                                           
    '''                                                                          
        QUESTION 1.1                                                             
                                                                                 
    Generates toy data like in MLPR book                                         
    Returns N-dimensional vectors x and t, where.                                
    x contains evenly spaced values from 0 to 2pi                                
    and elements ti of t are distributed according                               
    to ti ~ N(mean, variance) where xi is the ith                                
    element of x, the mean = sin(xi) and                                         
    standard deviation = 0.2                                                     
                                                                                 
    x, t = gen_sinusoidal(10)                                                    
    '''                                                                          
    x = np.linspace(0, 2*math.pi, n)                                                
    t = []                                                                       
    sigma = 0.2                                                                  
    for i in x:                                                                  
        mu = math.sin(i)                                                         
        s = np.random.normal(mu, sigma)                                          
        t.append(s)                                                              
    return x, array(t)

1.2 Polynomial regression (15 points)

Write a method fit_polynomial(x, t, M) that finds the maximum-likelihood solution of an unregularized $M$-th order polynomial for some dataset x. The error function to minimize w.r.t. $\bw$ is:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt)$

where $\bPhi$ is the feature matrix (or design matrix) as explained in the MLPR book, $\bt$ is the vector of target values. Your method should return a vector $\bw$ with the maximum-likelihood parameter estimates.



In [16]:

    
def create_phi(x, m):                                                            
    """                                                                          
    x, t = gen_sinusoidal(10)                                                    
    create_phi(x, 5)                                                             
    """                                                                          
    if m < 0:                                                                    
        raise ValueError('m can not be negative')                                
                                                                                 
    # plus one for the non-optional first element of the bias vector ^0          
    m += 1                                                                       
    # create array of exponents [0, 1, ..., m-1]                                 
    phi = array(range(m))                                                        
    # reserve space for NxM design matrix                                        
    Phi = zeros((size(x), m))                                                    
                                                                                 
    for n, x_elem in enumerate(x):                                               
        # create array filled with m copies of the nth datapoint                 
        x_ar = array([x_elem] * m)                                               
        # multiply with the bias vector                                          
        Phi[n] = x_ar ** phi                                                     
    return matrix(Phi)                                                           
                                                                                 
def fit_polynomial(x, t, m):                                                     
    '''                                                                          
        QUESTION 1.2                                                             
                                                                                 
    Finds maximum-likelihood solution of.                                        
    unregularized M-th order fit_polynomial for dataset x using t as the target vector.
    Returns w -> maximum-likelihood parameter                                    
    estimates                                                                    
                                                                                 
    x, t = gen_sinusoidal(10)                                                    
    w = fit_polynomial(x, t, 3)                                                  
    '''                                                                          
                                                                                 
    phi = create_phi(x, m)                                                       
                                                                                 
    # solve for w                                                                
    return phi.T.dot(phi).I.dot(phi.T).dot(t)

1.3 Plot (5 points)

Sample a dataset with $N=9$, and fit four polynomials with $M \in (0, 1, 3, 9)$. For each value of $M$, plot the prediction function, along with the data and the original sine function. The resulting figure should look similar to fig 1.4 of the MLPR book. Note that you can use matplotlib's plt.pyplot(.) functionality for creating grids of figures.



In [19]:

    
def plot_sine(linspace, label=None):                                             
    t = array([math.sin(i) for i in linspace])                                   
    if label:                                                                    
        plt.plot(linspace, t, label=label)                                       
    else:                                                                        
        plt.plot(linspace, t)                                                    
                                                                                 
def plot_polynomial(linspace, w, color='g', label=None):                         
    """ plots a function for a w over a given range for x"""                     
                                                                                 
    # for each x: sum for all m:  w[m]*x**m                                      
    f = [sum(w.item(p) * (x_point ** p) for p in range(size(w, 1))) for x_point in linspace]
                                                                                 
    # make pretty plot                                                           
    if label:                                                                    
        plt.plot(linspace, f, color=color, label=label)                          
    else:                                                                        
        plt.plot(linspace, f, color=color)       

def one_point_three():                                                           
    """                                                                          
        QUESTION 1.3                                                             
    """                                                                          
                                                                                 
    N = 10                                                                       
    orders = [0, 1, 3, 9]                                                        
    res = 1000                                                                   
                                                                                 
    x, t = gen_sinusoidal(N)                                                     
    linspace = np.linspace(0, 2*math.pi, res)                                    
                                                                                 
    for i, m in enumerate(orders):                                               
        w = fit_polynomial(x, t, m)                                              
                                                                                 
        plt.subplot(2, 2, i+1)                                                   
        plt.xlabel('x')                                                          
        plt.ylabel('t')                                                          
        plt.xlim(0, 2*math.pi)                                                   
        plt.ylim(-1.5, 1.5)                                                      
        plt.title('m=%d'%m)                                                      
        plt.tight_layout()                                                       
                                                                                 
        plt.plot(x, t, 'o')                                                      
                                                                                 
        plot_sine(linspace)                                                      
        plot_polynomial(linspace, w)                                             
                                                                                 
    plt.show() 
    
one_point_three()

1.4 Regularized linear regression (10 points)

Write a method fit_polynomial_reg(x, t, M, lamb) that fits a regularized $M$-th order polynomial to the sinusoidal data, as discussed in the lectures, where lamb is the regularization term lambda. (Note that 'lambda' cannot be used as a variable name in Python since it has a special meaning). The error function to minimize w.r.t. $\bw$:

$E(\bw) = \frac{1}{2} (\bPhi\bw - \bt)^T(\bPhi\bw - \bt) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$

For background, see section 3.1.4 of the MLPR book.



In [ ]:

    
def fit_polynomial_reg(x, t, m, lamb):                                           
    """                                                                          
    x, t = gen_sinusoidal(10)                                                    
    w = fit_polynomial_reg(x, t, 3)                                              
    """                                                                          
    Phi = create_phi(x, m)                                                       
                                                                                 
    i = np.eye(np.size(Phi, 1))                                                  
                                                                                 
    w = (lamb * i)                                                               
    w = (w + Phi.T.dot(Phi)).I                                                   
    w = w.dot(Phi.T).dot(t)                                                      
    return w

1.5 Model selection by cross-validation (10 points)

Use cross-validation to find a good choice of $M$ and $\lambda$, given a dataset of $N=9$ datapoints generated with gen_sinusoidal(9). You should write a function that tries (loops over) a reasonable range of choices of $M$ and $\lambda$, and returns the choice with the best cross-validation error. In this case you can use $K=9$ folds, corresponding to leave-one-out crossvalidation.

You can let $M \in (0, 1, ..., 10)$, and let $\lambda \in (e^{-10}, e^{-9}, ..., e^{0})$.

To get you started, here's a method you can use to generate indices of cross-validation folds.



In [11]:

    
def kfold_indices(N, k):
    all_indices = np.arange(N,dtype=int)
    np.random.shuffle(all_indices)
    idx = np.floor(np.linspace(0,N,k+1))
    train_folds = []
    valid_folds = []
    for fold in range(k):
        valid_indices = all_indices[idx[fold]:idx[fold+1]]
        valid_folds.append(valid_indices)
        train_folds.append(np.setdiff1d(all_indices, valid_indices))
    return train_folds, valid_folds

Create a comprehensible plot of the cross-validation error for each choice of $M$ and $\lambda$. Highlight the best choice.

Question: Explain over-fitting and underfitting, illuminated by your plot. Explain the relationship with model bias and model variance.

Answer: Overfitting occurs when the model has learned the training data too well. This means that the noise is learned and not the real distribution over the data. The error for the training data will be low but for new data it will be (much) higher. This can be see in the plot at higher values for m.

Underfitting occurs when a model is learned that is too simple. This means it can not get good results for either the test set or the training set. This can be seen in the plot at the lower values for m.

Low bias and high variance leads to underfitting (model too simple) High bias and low variance leads to overfitting (model too complex). In this example, we do not the lambda parameter for third order polynomials because the model is too simple to overfit. More complex models with higher order polynomials do require a lambda parameter, of about $e^{-5}$.

1.6 Plot best cross-validated fit (5 points)

For some dataset with $N = 9$, plot the model with the optimal $M$ and $\lambda$ according to the cross-validation error, using the method you just wrote. Let the plot make clear which $M$ and $\lambda$ were found.



In [25]:

    
def model_selection_by_cross_validation(data=None, plot=True):                   
    """                                                                          
        QUESTION 1.5                                                             
                                                                                 
    Selects the optimal model using cross validation.                            
                                                                                 
    The keyword argument data indicated whether data is given or should be       
    generated (default=None, means it will be generated)                         
                                                                                 
    The keyword argument plot indicates whether the errors for the different m   
        and k is plotted (default=True).                                         
    """                                                                          
    n = k = 9                                                                    
    if not(data):                                                                
        x, t = gen_sinusoidal(n)                                                 
    else:                                                                        
        x, t = data                                                              
                                                                                 
    indices = kfold_indices(n, k)                                                
                                                                                 
    min_error = np.inf                                                           
    max_error = -np.inf                                                          
                                                                                 
    if plot:                                                                     
        afig, ax = plt.subplots()                                                
                                                                                 
    # loop over m and lambda                                                     
    for lambda_exp in range(-10, 1):                                             
        errors = []                                                              
        lamb = np.e ** lambda_exp                                                
        for m in range(9):                                                       
                                                                                 
            # set avg. error to 0 and calculate actual lambda value              
            error = 0                                                            
                                                                                 
            # loop over the folds                                                
            for train, heldout in zip(*indices):                                 
                                                                                 
                # get the indices of the current fold                            
                xs = [x[i] for i in train]                                       
                ts = [t[i] for i in train]                                       
                                                                                 
                # fit model, on selected points                                  
                w = fit_polynomial_reg(xs, ts, m, lamb)                          
                #w = fit_polynomial(xs, ts, m)                                   
                                                                                 
                # get the value were going to predict on                         
                t_value = t[heldout[0]]                                          
                x_value = x[heldout[0]]                                          
                # predict: t = w0 * x ** 0 + w1 ** x ** 1 + ...                  
                prediction = [np.sum(w.item(p) * (x_value** p) for p in range(size(w, 1)))][0]
                                                                                 
                error += .5 * float(prediction - t_value) ** 2 + (lamb/2.) * float(w.dot(w.T))
                                                                                 
            errors.append(error)                                                 
                                                                                 
            if error < min_error:                                                
                min_error = error                                                
                best_model = (m, lambda_exp)                                     
            if error > max_error:                                                
                max_error = error                                                
                                                                                 
        if plot:                                                                 
            ax.plot(range(9), errors, label="lambda = e^" + str(lambda_exp))     
                                                                                 
    if plot:                                                                     
        legend = ax.legend(loc='upper left')                                     
        ax.set_ylim((0, max_error))                                              
        ax.set_xlabel("m")                                                       
        ax.set_ylabel("average absolute error")                                  
        ax.set_title("Error for lambda and m")                                   
                                                                                 
        plt.show()                                                               
                                                                                 
    return best_model 

def plot_best_cross_validated_fit():                                             
    """                                                                          
        QUESTION 1.6                                                             
    """                                                                          
    n = 9                                                                        
    x, t = gen_sinusoidal(n)                                                     
    best_m, best_lamb = model_selection_by_cross_validation(data=(x,t), plot=False)
    w = fit_polynomial_reg(x, t, 4, 0)                              
                                                                                 
    print 'best_m', best_m, 'best lamb', best_lamb                               
                                                                                 
    linspace = np.linspace(0, 2*math.pi, 1000)                                   
    plot_polynomial(linspace, w, label='best fit')                               
    plot_sine(linspace, label='true sin')                                        
    plot(x, t, 'o')                                                              
    plt.xlabel('x')                                                              
    plt.ylabel('y')                                                              
    plt.title('m = %d, lambda = e^%d' % (best_m, best_lamb))                     
    legend()                                                                     
    plt.show() 
    
plot_best_cross_validated_fit()









    



best_m 6 best lamb -4

Part 2: Bayesian Linear (Polynomial) Regression

2.1 Sinusoidal data 2 (5 points)

Write a function gen_sinusoidal2(N) that behaves identically to gen_sinusoidal(N) except that the generated values $x_i$ are not linearly spaced, but drawn from a uniform distribution between $0$ and $2 \pi$.



In [26]:

    
def gen_sinusoidal2(n):                                                          
    x = 2*math.pi*(rand(1,n))[0]                                                 
    t = []                                                                       
    sigma = 0.2                                                                  
    for i in x:                                                                  
        mu = math.sin(i)                                                         
        s = np.random.normal(mu, sigma)                                          
        t.append(s)                                                              
    return x, array(t)

2.2 Compute Posterior (15 points)

You're going to implement a Bayesian linear regression model, and fit it to the sinusoidal data. Your regression model has a zero-mean isotropic Gaussian prior over the parameters, governed by a single (scalar) precision parameter $\alpha$, i.e.:

$$p(\bw \;|\; \alpha) = \mathcal{N}(\bw \;|\; 0, \alpha^{-1} \bI)$$

The covariance and mean of the posterior are given by:

$$\bS_N= $ \alpha \bI + \beta \bPhi^T \bPhi $^{-1} $$$$\bm_N = \beta\; \bS_N \bPhi^T \bt$$

where $\alpha$ is the precision of the predictive distribution. See MLPR chapter 3.3 for background.

Write a method fit_polynomial_bayes(x, t, M, alpha, beta) that returns the mean $\bm_N$ and covariance $\bS_N$ of the posterior for a $M$-th order polynomial, given a dataset, where x, t and M have the same meaning as in question 1.2.



In [31]:

    
def fit_polynomial_bayes(x, t, M, alpha, beta):                                  
    Phi = create_phi(x, M)                                                    
    N = size(Phi, 1)                                                             
    I = eye(N, N)                                                                
                                                                                 
    Sn = (beta * Phi.T.dot(Phi) + alpha * I).I                                   
    mn = beta * Sn.dot(Phi.T).dot(t)                                             
                                                                                 
    return Sn, mn

2.3 Prediction (10 points)

The predictive distribution of Bayesian linear regression is:

$$ p(t \;|\; \bx, \bt, \alpha, \beta) = \mathcal{N}(t \;|\; \bm_N^T \phi(\bx), \sigma_N^2(\bx))$$$$ \sigma_N^2 = \frac{1}{\beta} + \phi(\bx)^T \bS_N \phi(\bx) $$

where $\phi(\bx)$ are the computed features for a new datapoint $\bx$, and $t$ is the predicted variable for datapoint $\bx$.

Write a function that predict_polynomial_bayes(x, m, S, beta) that returns the predictive mean and variance given a new datapoint x, posterior mean m, posterior variance S and a choice of model variance beta.



In [32]:

    
def predict_polynomial_bayes(x, m, S, beta):                                     
    phi = []                                                                     
    for i in range(0,m.size):                                                    
        phi.append(x**i)                                                         
                                                                                 
    p_mean = np.dot(m,phi)                                                       
    p_var = beta**(-1) + matrix(phi).dot(S).dot(matrix(phi).T)                   
                                                                                 
    return p_mean, p_var

2.4 Plot predictive distribution (10 points)

a) (5 points) Generate 7 datapoints with gen_sinusoidal2(7). Compute the posterior mean and covariance for a Bayesian polynomial regression model with $M=5$, $\alpha=\frac{1}{2}$ and $\beta=\frac{1}{0.2^2}$. Plot the Bayesian predictive distribution, where you draw you plot (for $x$ between 0 and $2 \pi$) $t$'s predictive mean and a 1-sigma predictive variance using plt.fill_between(..., alpha=0.1) (the alpha argument induces transparency).

Include the datapoints in your plot.

b) (5 points) For a second plot, draw 100 samples from the parameters' posterior distribution. Each of these samples is a certain choice of parameters for 5-th order polynomial regression. Display each of these 100 polynomials.



In [33]:

    
def question_2_4_a():                                                            
    N = 7                                                                        
    M = 5                                                                        
    alpha = 0.5                                                                  
    beta = 1/0.2**2                                                              
    res = 1000                                                                   
    X, t = gen_sinusoidal2(N)                                              
    Sn, mn = fit_polynomial_bayes(X, t, M, alpha, beta)                          
                                                                                 
    ls = linspace(0, 2*math.pi, res)                                             
    plot_sine(ls)                                                                
    plot_polynomial(ls, mn)                                                      
    plt.plot(X, t, 'o')                                                          
    upper = []                                                                   
    lower = []                                                                   
    for x in ls:                                                                 
        mean, var = predict_polynomial_bayes(x, mn, Sn, beta)                    
        mean = float(mean)                                                       
        var = float(var)                                                         
        upper.append(mean + sqrt(var))                                           
        lower.append(mean - sqrt(var))                                           
                                                                                 
    plt.fill_between(ls, upper, lower, alpha=0.1)                                
    plt.show()     
    
question_2_4_a()
                                                                                 
def question_2_4_b():                                                            
    N = 7                                                                        
    M = 5                                                                        
    alpha = 0.5                                                                  
    beta = 1/0.2**2                                                              
    res = 1000                                                                   
    X, t = gen_sinusoidal2(N)                                                    
    Sn, mn = fit_polynomial_bayes(X, t, M, alpha, beta)                          
    ls = linspace(0, 2*math.pi, res)                                             
    plot_sine(ls)                                                                
    plot_polynomial(ls, mn)                                                      
    plt.plot(X, t, 'o')                                                          
                                                                                 
    mn = [mn.item(i) for i in range(6)]                                          
    for i in range(100):                                                         
        mu = np.random.multivariate_normal(mn, Sn)                               
        mu = matrix(mu)                                                          
        plot_polynomial(ls, mu, color='r')                                       
                                                                                 
    plt.show()    

question_2_4_b()

2.5 Additional questions (10 points)

a) (5 points) Why is $\beta=\frac{1}{0.2^2}$ the best choice of $\beta$ in section 2.4?

b) (5 points) In the case of Bayesian linear regression, both the posterior of the parameters $p(\bw \;|\; \bt, \alpha, \beta)$ and the predictive distribution $p(t \;|\; \bw, \beta)$ are Gaussian. In consequence (and conveniently), $p(t \;|\; \bt, \alpha, \beta)$ is also Gaussian (See MLPR section 3.3.2 and homework 2 question 4). This is actually one of the (rare) cases where we can make Bayesian predictions without resorting to approximative methods.

Suppose you have to work with some model $p(t\;|\;x,\bw)$ with parameters $\bw$, where the posterior distribution $p(\bw\;|\;\mathcal{D})$ given dataset $\mathcal{D}$ can not be integrated out when making predictions, but where you can still generate samples from the posterior distribution of the parameters. Explain how you can still make approximate Bayesian predictions using samples from the parameters' posterior distribution.

2.5 Answers

a) As we can see in the definition of the variance parameter of the predictive distribution, the $\beta$ parameter controls the intrinsic noise in the data. The $\phi(\bx)^T \bS_N \phi(\bx)$ part we have control over through the model $\phi(\bx)$, but $\frac{1}{\beta}$ is fixed. Formally, $\beta$ corresponds to the precision of the intrinsic noise. Since in our example we set the noise standard deviation $\sigma$ to 0.2, we should set $\beta$ to $(\sigma^2)^{-1} = \frac{1}{0.2^2}$.

b) A prediction can be made by finding an optimum point estimate of the parameters, by sampling many times from the posterior distribution; this is done by using the MLE or MAP estimate, which need the posterior to be gaussian. The estimate found is then put into the formula for the distribution of a data point. This approach will underestimate the variance of the predictive distribution, because it does not consider doubts in the value of the new input parameter.



In [ ]: