The aim of this lab is to have you code linear regression in Python. We will do it in two ways: first using iterative updates (coordinate ascent), and then using linear algebra. Firstly, we will import the relevant libraries.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
Next we want to make sure our plots appear inline rather than in separate windows, so we use the IPython magic command:
In [ ]:
%matplotlib inline
Remember that to check what a command does, simply type:
In [ ]:
np.random.randn?
For this part we are going to load in some real data; we will use the example from the Olympics: the pace of marathon winners. To load the data (which is in comma-separated values [CSV] format) we first need to download it from this link, and then load it as follows:
In [ ]:
olympics = np.genfromtxt('olympicMarathonTimes.csv', delimiter=',')
This loads the data into a numpy array. You can extract the years and the pace of the winners, respectively, into column vectors as follows:
In [ ]:
x = olympics[:, 0:1]
y = olympics[:, 1:2]
You can see what these values are by typing:
In [ ]:
print(x)
print(y)
And you can make a plot of $y$ vs $x$ with the following command:
In [ ]:
plt.plot(x, y, 'rx')
Now we are going to fit a line, $y_i=mx_i + c$, to the data you've plotted. We are trying to minimize the error function
$$E(m, c, \sigma^2) = \frac{N}{2}\log \sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - mx_i - c)^2$$
with respect to $m$, $c$ and $\sigma^2$. We can start with an initial guess for $m$ and $c$,
In [ ]:
m = -0.4
c = 80
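Before updating the parameters, it can be instructive to check how large the error is for this initial guess. The following is a minimal sketch (not part of the original lab) that evaluates the sum-of-squares part of the error using the current `m` and `c`:
In [ ]:
# sum of squared residuals for the initial guess of m and c
E = ((y - m*x - c)**2).sum()
print(E)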
Then we use the maximum likelihood update, derived in this week's lecture, to find an estimate for the offset, $c$,
In [ ]:
c = (y - m*x).mean()
print(c)
And now we can make an estimate for the gradient of the line,
In [ ]:
m = ((y - c)*x).sum()/(x**2).sum()
print(m)
We can have a look at how good our fit is by computing the prediction across the input space. First create a vector of 'test points',
In [ ]:
x_test = np.linspace(1890, 2020, 130)[:, None]
Now use this vector to compute some test predictions,
In [ ]:
f_test = m*x_test + c
Now plot those test predictions with a blue line on the same plot as the data,
In [ ]:
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
The fit isn't very good. To improve it we need to iterate between these two parameter updates in a loop, repeating them several times:
In [ ]:
for i in np.arange(10):
    m = ((y - c)*x).sum()/(x*x).sum()
    c = (y - m*x).sum()/y.shape[0]
print(m)
print(c)
And let's try plotting the result again
In [ ]:
f_test = m*x_test + c
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
Clearly we need more than 10 iterations! Try adding more iterations above to get closer to the solution.
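To get a feel for how slowly the coordinate updates converge, here is a hedged sketch: the iteration cap of 100000 and the tolerance of 1e-10 are arbitrary choices of mine, not values from the lab. It repeats the two updates until neither parameter changes appreciably:
In [ ]:
for i in np.arange(100000):
    m_old, c_old = m, c
    m = ((y - c)*x).sum()/(x*x).sum()
    c = (y - m*x).sum()/y.shape[0]
    # stop once both parameters have effectively converged
    if abs(m - m_old) < 1e-10 and abs(c - c_old) < 1e-10:
        break
print(i, m, c)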
Hopefully, you are now persuaded of the merits of solving the entire system simultaneously using linear algebra. To do that, we need to make a design matrix of the data which includes the $x_0=1$ column to represent the bias. Remember (from the lecture notes) that we are now moving to a system where our prediction is given by an inner product,
$$f(\mathbf{x}_i) = \mathbf{x}_i^\top \mathbf{w},$$
where each vector $\mathbf{x}_i$ is given by appending a 1 onto the original vector,
$$\mathbf{x}_i = \begin{bmatrix} 1 \\ x_i \end{bmatrix}.$$
We can do this for the entire data set to form a design matrix $\mathbf{X}$,
$$\mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix},$$
which in numpy is done with the following commands:
In [ ]:
X = np.hstack((np.ones_like(x), x))
print(X)
From the multivariate regression solution we derived in this week's lecture, the maximum likelihood solution for $\mathbf{w}^*$ is given by
$$\mathbf{w}^* = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}.$$
First let's persuade ourselves of a few things. We suggested in the lecture that
$$\mathbf{X}^\top\mathbf{X} = \sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^\top.$$
We can show that this is indeed the case for our data. First we need to know how to do matrix multiplication and transposition using numpy; this is done as
In [ ]:
np.dot(X.T, X)
Now we will compute the same thing with a for loop
In [ ]:
store = np.zeros((2, 2))
for i in range(X.shape[0]):
    store += np.outer(X[i, :], X[i, :])
print(store)
I hope that you agree that the first version is a little more compact.
The solution for $\mathbf{w}$ is given in terms of a matrix inverse, but numerically this isn't the best way to compute it. What we actually want Python to do is to solve the system of linear equations
$$\mathbf{X}^\top\mathbf{X}\,\mathbf{w} = \mathbf{X}^\top\mathbf{y}$$
for $\mathbf{w}$. This can be done in numpy using the command
In [ ]:
np.linalg.solve?
so we can obtain the solution using
In [ ]:
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(w)
Allowing us to plot the fit as follows
In [ ]:
m = w[1]; c = w[0]
f_test = m*x_test + c
print(m)
print(c)
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
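Before moving on, it is worth convincing yourself why we used np.linalg.solve rather than forming the matrix inverse explicitly. The following is a small illustrative sketch (not part of the original lab) that computes the same solution via np.linalg.inv, purely to show that the two approaches agree to numerical precision, even though the explicit inverse is generally less stable:
In [ ]:
# same solution computed via an explicit inverse (less stable numerically)
w_inv = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
print(w - w_inv)  # differences should be tiny, down to round-off error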
Now we will fit a quadratic model using basis functions. Given everything we've learnt above, this is now quite easy to do. Firstly, we need to create a design matrix that contains the quadratic basis,
$$\mathbf{\Phi} = \begin{bmatrix}\mathbf{1} & \mathbf{x} & \mathbf{x}^2\end{bmatrix},$$
where this notation means that each column of $\mathbf{\Phi}$ is derived from the entire set of input years.
In [ ]:
Phi = np.hstack([np.ones(x.shape), x, x**2])
Now we can solve this system for $\mathbf{w}$ just as we did for the linear case, so we have,
In [ ]:
w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
print(w)
We can plot the solution in two different ways. Either we compute the polynomial directly,
In [ ]:
f_test = w[2]*x_test**2 + w[1]*x_test + w[0]
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
Or we can use the matrix form of this equation, which first involves creating a design matrix for the test points,
In [ ]:
Phi_test = np.hstack((np.ones_like(x_test), x_test, x_test**2))
and then computing the value of the function using a matrix multiply
In [ ]:
f_test = np.dot(Phi_test, w)
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')
Note the value of the coefficient $w_2$ in particular. It is relatively small, because it is multiplying a large number (the square of 2000 is 4 million). This need for small coefficients becomes worse as we increase the order of the fit. As an exercise for this week, try fitting higher-order polynomials to the data. See what happens as you increase the polynomial order.
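As a starting point for the exercise, here is a hedged sketch of how you might build a general polynomial basis. The degree (here 4) and the rescaling of the years are my own choices, not part of the lab; rescaling the inputs to a small range is one common way to avoid the huge powers of raw year values mentioned above:
In [ ]:
degree = 4  # an arbitrary choice; try varying it
# rescale the years so that their powers stay small (assumed centring, not from the lab)
x_scaled = (x - 1950)/100
x_test_scaled = (x_test - 1950)/100
# columns are x**0 (the bias), x**1, ..., x**degree
Phi = np.hstack([x_scaled**k for k in range(degree + 1)])
Phi_test = np.hstack([x_test_scaled**k for k in range(degree + 1)])
w = np.linalg.solve(np.dot(Phi.T, Phi), np.dot(Phi.T, y))
f_test = np.dot(Phi_test, w)
plt.plot(x_test, f_test, 'b-')
plt.plot(x, y, 'rx')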