Machines and Intelligence Lab Class

8th November 2013 Neil Lawrence

Welcome to the IPython notebook! We will be using the IPython notebook for our lab class on robotic inference. The IPython notebook is a very convenient way to interact with data using Python. We will start the lab session by familiarising ourselves with the notebook and getting used to Python whilst we review some of the material from the first lecture.

Python is a generic programming language with 'numerical' and scientific capabilities added on through the numpy and scipy libraries. There are excellent 2-D plotting facilities available through matplotlib.$$\newcommand{\inputScalar}{x} \newcommand{\lengthScale}{\ell} \newcommand{\mappingVector}{\mathbf{w}} \newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)} \newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)} \newcommand{\zerosVector}{\mathbf{0}} \newcommand{\eye}{\mathbf{I}} \newcommand{\dataStd}{\sigma} \newcommand{\dataScalar}{y} \newcommand{\dataVector}{\mathbf{y}} \newcommand{\dataMatrix}{\mathbf{Y}} \newcommand{\noiseScalar}{\epsilon} \newcommand{\noiseVector}{\mathbf{\epsilon}} \newcommand{\noiseMatrix}{\mathbf{\Epsilon}} \newcommand{\latentScalar}{x} \newcommand{\latentVector}{\mathbf{x}} \newcommand{\latentMatrix}{\mathbf{X}} \newcommand{\inputScalar}{x} \newcommand{\inputVector}{\mathbf{x}} \newcommand{\inputMatrix}{\mathbf{X}} \newcommand{\kernelMatrix}{\mathbf{K}} \newcommand{\basisMatrix}{\mathbf{\Phi}} \newcommand{\basisVector}{\mathbf{\phi}} \newcommand{\basisScalar}{\phi} \newcommand{\expSamp}[1]{\left<#1\right>} \newcommand{\expDist}[2]{\left<#1\right>_{#2}} \newcommand{\covarianceMatrix}{\mathbf{C}} \newcommand{\numData}{n} \newcommand{\mappingScalar}{w} \newcommand{\mappingFunctionScalar}{f} \newcommand{\mappingFunctionVector}{\mathbf{f}} \newcommand{\meanVector}{\boldsymbol{\mu}} \newcommand{\meanScalar}{\mu} \newcommand{\errorFunction}{E}$$

Since we only have one lab session, I thought I'd try to cover some of the material we've seen in the module again in the lab. You won't have time to do it all now, but you can use this sheet to keep the material fresh in your mind and to review things we've talked about in the lectures; you can always come back to it later. Before we start on the material we've seen in the lectures though, let's introduce some simple Python programming. You don't have to write any programs in the lab, they've all been written for you. But with luck you might want to play with the code a bit, change a few things and see what effect the change has (that's a really good way to learn!). So it will be helpful to start with a bit of background to the language the lab class is constructed in.

Python Programming

Being a computer used to be a job. A hundred years ago a computer was a person, often a young woman, who performed computations for creating statistical tables. Each person would be given a particular piece of arithmetic (which was often fairly repetitive), or perhaps if they were lucky, slightly complicated mathematics. Automatic computers, or electronic computers, were machines that did this job instead of the young women. From the very start then, computers were seen as machines that could help humans avoid repetitive tasks. However, I'm sure a lot of people's experience of computers is that they are machines that make you do a lot of repetitive tasks. For example, data entry. Sometimes data entry involves copying and pasting lots of different data from the web to an Excel spreadsheet. That's incredibly boring! Computer scientists hate repetitive tasks, because they know that computers should be doing them for us. This can be taken to extremes, as this xkcd comic illustrates, but generally if you find yourself doing something repetitive, you really want to find a way of making the computer do it for you.

When I first started using computers, the prevailing idea was that a language called BASIC was the right way to do this. As time evolved, the idea of writing scripts, short sets of instructions to avoid repetitive tasks, emerged, along with a new class of languages known as scripting languages (http://en.wikipedia.org/wiki/Scripting_language). Python is a scripting language that is also designed to be a general purpose language. It's designed to allow you to write small scripts, but it's also a fully featured language, like Java, allowing it to be used for larger applications.

Python is a programming language, just like Java. It is also a high level programming language, just like Java, and it supports object oriented programming. However, its design philosophy is quite different. Java is a programming language designed for software engineering: it has its origins in a language called C. Because Python arises from scripting languages, it was designed to be easy to operate 'straight out of the box'. So whereas your first Java program is of the form

/* 
A simple Java program 
Written by: Guy J. Brown 
First written: 19/8/02 
Last rewritten: 24/8/02 
*/ 
public class Simple { 
  public static void main(String[] args) { 
    System.out.print("Running a Java application"); 
    System.out.println("...finished."); 
  } 
}

Your first python program can be of the form:


In [ ]:
"""A symple Python program
Written by: Neil D. Lawrence
First written: 2013/10/11
Last written: 2013/10/11"""

print "Running a Python application",
print "... finished."

Python is freely available and easy to install. I like (love) running programs on Linux. On Ubuntu, Python comes preinstalled (and also on the Raspberry Pi). On Windows and OS X machines we are recommending the Enthought Python distribution, which is freely available for academic use. However, if you don't want to install Python on your local machine you can also do what I'm doing here, and run it on Wakari.

We are currently running the IPython notebook. This is a web driven interface to Python that allows us to treat programming as a conversation with a machine, rather than as a series of instructions to the machine. This means I can use Python more like a calculator. Java doesn't have facilities like this.


In [ ]:
a = 2*24
print a

In [ ]:
a = a + 10
print a

Note that there is no compilation stage here: we are calling the Python interpreter directly to get these results. To do something a little more exciting than this though, it is useful to import some modules.

Importing Modules

Modules in Python are like Java class libraries. They contain code that someone else has written that you can include and call yourself. Let's import a library that allows us to play with values sampled from a uniform distribution between 0 and 1.


In [ ]:
import numpy
x = numpy.random.uniform() # This gives us a random number between 0 and 1, uniformly distributed
print x

Most of machine learning comes down to manipulation of matrices and vectors: linear algebra. Python wasn't originally designed for this, but the numpy module provides most of the manipulations we need for arrays in Python. We may not want to keep writing numpy at the beginning of our commands, so very often we use a special type of import that allows us to abbreviate the module name:


In [ ]:
import numpy as np
x = np.random.uniform()
print x

numpy is short for numerical python, but as well as providing the numerics, numpy provides contiguous array objects. From our perspective, we can think of these objects as matrices or vectors. We can sample a vector of values from a uniform distribution as well.


In [ ]:
x = np.random.uniform(size=10)
print x

To get help about any command in the notebook simply type that command followed by a question mark.


In [ ]:
np.random.uniform?

As well as sampling from a uniform distribution, we can sample from the Gaussian or 'normal' distribution. The command for doing this in python is


In [ ]:
# sample 10 values from a normal (Gaussian) distribution with mean 0.0 (loc) and standard deviation 1.0 (scale) 
x=np.random.normal(loc=0.0, scale=1.0, size=10)

The numpy module also allows us to perform linear algebra operations, like the inner product (or dot product). We can try a few different ways of doing it. First we'll create two vectors of the same length, then we'll compute their inner product in a couple of different ways.


In [ ]:
x = np.random.normal(size=4)
y = np.random.normal(size=4)
# let's see what we generated
print x
print y
# inner product (or dot product) in python can be computed with np.dot
ip = np.dot(x, y)
print "Inner product is ", ip
# the function np.inner also does the same thing
print "Inner product is also ", np.inner(x, y)
# Or we can do it explicitly, by multiplying all array elements together and summing
print "Inner product is also ", np.sum(x*y)
# which can also be written as
print "Finally, inner product is also ", (x*y).sum()

Computing these inner products with numpy is very fast because the underlying code is optimized, whereas doing the same computation with a loop in Python is a lot slower. We'll show this by introducing Python for loops. In Python, a for loop always iterates through a list (in some languages this is called a foreach loop, see http://en.wikipedia.org/wiki/Foreach_loop#Python). To create a for loop more like the counter-based loop available in Java, we first need to create a list of integers to iterate over. We can use the xrange command to create that list of indices.


In [ ]:
# here we create a long array for x and y
length = 1000000
x = np.random.normal(size=length)
y = np.random.normal(size=length)

# computing with a for loop for 1,000,000 values takes a noticeable amount of time
ip = 0
for i in xrange(len(x)):
    ip+=x[i]*y[i]
print ip

In [ ]:
# inner product computed as a built in
np.dot(x, y)
# this takes no noticeable amount of time
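
If you want to quantify the difference rather than just eyeball it, a rough timing sketch like the following should work (the exact numbers will depend on your machine; time.time() is a crude but adequate stopwatch here):


In [ ]:
import time

# time the explicit python loop
start = time.time()
ip = 0
for i in xrange(len(x)):
    ip += x[i]*y[i]
print "Loop took ", time.time() - start, " seconds"

# time the numpy built-in
start = time.time()
ip = np.dot(x, y)
print "np.dot took ", time.time() - start, " seconds"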

TODO Need to introduce indexing in mathematics how it maps to indexing in python.

Now we've explored python a bit, let's compute some standard statistics with it, by sampling from a Gaussian density and computing means and variances.

Estimating the Mean

The sample mean of a set of values is given by the following formula:

$$\mu = \frac{1}{N}\sum_{i=1}^N x_i$$

We can compute the sample mean in python by adding all the samples together and dividing by the total number of samples.


In [ ]:
x = np.random.normal(size=10)
mean = x.sum()/10
print mean

Or we can extract the dimension of the array to compute the mean.


In [ ]:
mean = x.sum()/x.size
print mean

Or finally we can just compute the mean directly.


In [ ]:
mean = x.mean()
print mean

The variance of a set of samples is given by

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N x_i^2 - \mu^2$$


In [ ]:
variance = (x*x).mean() - mean*mean
print variance
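
numpy also provides this computation as a built-in: x.var() normalizes by $N$, matching the formula above, so we can check our hand-computed value against it.


In [ ]:
# x.var() uses the same 1/N normalization as the formula above
print variance
print x.var()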

Convergence of Sample Moments to True Moments

Underlying these samples is an actual distribution: the Gaussian (or normal) distribution. Because we sampled from a standard normal (i.e. one with a mean of 0 and a variance of 1) we know that in theory the mean and variance of the samples should be 0 and 1. In practice they aren't. Let's use matplotlib, Python's plotting library, to explore what happens as we increase the number of samples. We will increase the number of samples using loops and Python lists.

We start by creating empty lists for the means and variances. Then we create a list of integers of sample size to iterate through.


In [ ]:
means = []
variances = []
samples = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000] 
for N in samples:
    x = np.random.normal(loc=0.0, scale=1.0, size=N)
    mean = x.mean()
    variance = (x**2).mean() - mean**2
    means.append(mean)
    variances.append(variance)

Plotting in Python

We'll now plot the variance and the mean against the number of samples. To do this, we need to first convert the samples, variances and means from Python lists to numpy arrays.


In [ ]:
means = np.array(means)
variances = np.array(variances)
samples = np.array(samples)

Next we need to include the plotting functionality from matplotlib, and instruct the IPython notebook to include the plots inline with the notebook, rather than in a different window. First we import the plotting library, matplotlib.


In [ ]:
import matplotlib.pyplot as plt
# This is an IPython magic command to make plots appear inline. 
# These "magic" commands start with % and are sent to the IPython 
# interpreter rather than the Python language interpreter.
%pylab inline

In [ ]:
plt.plot(samples, means)
plt.xlabel('samples')
plt.ylabel('mean')

Here we plot the estimated mean against the number of samples. However, since the sample sizes increase multiplicatively, it's better to use a logarithmic scale for the $x$ axis, as follows.


In [ ]:
plt.semilogx(samples, means)
plt.xlabel('samples (log scale)')
plt.ylabel('mean')

The Gaussian Density

Finally, let's try plotting a Gaussian distribution. The Gaussian density is given by the following formula:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$

where $\sigma$ is the standard deviation ($\sigma^2$ is the variance) and $\mu$ is the mean. First let's set values for the location and scale


In [ ]:
scale = 1.0 # set the standard deviation
loc = 0.0 # set the mean

Now let's choose some $x$ points at which to plot our density.


In [ ]:
x_plot = np.linspace(loc - 4*scale, loc + 4*scale, 100) # the x-values to use in the plot

Now we use those $x$ values to plot the form of the normal density. In this computation we use the Python operator ** to square a value.


In [ ]:
# compute the values of this density at the locations given by x_plot
py = 1/np.sqrt(2*np.pi*scale**2)*np.exp(-0.5*(x_plot-loc)**2/(scale**2))

Finally, let's sample some values from the same density using np.random.normal.


In [ ]:
# sample some random values from this density
x_samps = np.random.normal(loc=loc, scale=scale, size=100)

Now let's plot the results. We'll plot the form of the density, and then we'll plot where the samples landed below.


In [ ]:
# Plot the density
plt.plot(x_plot, py)

# Plot crosses where the samples landed. Here we plot their x-location
# against an array filled with 0.05, just to show them at the bottom of the plot.
plt.plot(x_samps, 0.05*np.ones_like(x_samps), 'gx')
plt.title('The Normal Density')

Quadratic Error Function

Remember in class we introduced the normal density and then said that probability distributions were related to error functions by taking the negative log of the probability. We showed that the normal density was related to a quadratic error function. Let's review that here, by taking the log of this density, and plotting it. The result should look like a quadratic.


In [ ]:
plt.plot(x_plot, -np.log(py), 'b-')

Note that we get a quadratic form, with the minimum at the same point that the maximum was at for the probability density. This is the essence of the idea that minimizing the error with respect to a parameter is the same as maximizing the probability with respect to the parameter (although normally we call it maximum likelihood, not maximum probability).

Supervised Learning

Classification

The first approach to machine learning we are going to look at is classification. In classification we take in a feature matrix and make predictions of class labels given the features. The model for classification we saw in class was one where we predict the class label, $\dataScalar_i$, given the features associated with that data point, $\inputVector_i$, using the following formula:

$$\dataScalar_i = \text{sign}\left(\mappingVector^\top \inputVector_i + b\right)$$

This implies that the decision boundary for the classification is given by a hyperplane. The vector $\mappingVector$ is the normal vector to the hyperplane (http://en.wikipedia.org/wiki/Normal_(geometry)). The hyperplane is described by the formula $\mappingVector^\top \inputVector = -b$ which uniquely defines the hyperplane in the input dimensions.
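
To make this concrete, here is a minimal sketch of the prediction rule in numpy. The names w_example, b_example and x_test are purely illustrative placeholders; we construct a real weight vector and bias in the perceptron section below.


In [ ]:
# a minimal sketch of the linear classifier's prediction rule
w_example = np.asarray([1.0, -0.5]) # an illustrative normal vector to the decision boundary
b_example = 0.2 # an illustrative bias
x_test = np.asarray([0.5, 2.0]) # a single input point
print "Predicted class label: ", np.sign(np.dot(w_example, x_test) + b_example)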

Perceptron Algorithm

Now we have for loops and some other Python basics, let's see how the perceptron algorithm looks in Python. First of all, we will create a data set to play with. We'll do this by sampling from a Gaussian.


In [ ]:
# let's sample first from the positive class. To separate 
# it from the negative, we'll add 1.5 to the samples.
x_plus = np.random.normal(loc=1.5, size=(50, 2))
# And now the negative class, to separate from the positive
# we subtract 1.5 from the samples
x_minus = np.random.normal(loc=-1.5, size=(50, 2))

In [ ]:
# now let's plot our data. We'll make the
# positive class red crosses and the negative
# class green circles
plt.plot(x_plus[:, 0], x_plus[:, 1], 'rx')
plt.plot(x_minus[:, 0], x_minus[:, 1], 'go')

Now the perceptron algorithm. The way we taught it in class, the first step involves initializing the weight vector (which remember is normal to the decision boundary) with a data point. First we decide which data point to use, positive or negative, then we set the normal vector to be equal to that point multiplied by the class label,

$$\mappingVector = \dataScalar_i \inputVector_i$$

As we saw in class, the predicted label of the $i$th point is given by

$$\text{sign}(\mappingVector^\top\inputVector_i)$$

So setting the initial vector to $\dataScalar_i\inputVector_i$ means that the $i$th point's prediction will be given by

$$\text{sign}(\mappingVector^\top\inputVector_i) = \text{sign}(\dataScalar_i\inputVector_i^\top \inputVector_i)$$

Because $\inputVector_i^\top \inputVector_i$ is the inner product between a vector and itself, it is always positive, so the sign of the argument will be determined only by the sign of $\dataScalar_i$. So setting the initial vector to this value ensures that the $i$th data point is correctly classified. Let's see the code now for doing this.


In [ ]:
# flip a coin (i.e. generate a random number and check if it is greater than 0.5)
choose_plus = np.random.rand(1)>0.5
if choose_plus:
    # choose a random point from the positive class
    index = np.random.randint(0, x_plus.shape[0])
    w = x_plus[index, :] # set the normal vector to that point.
    b = 1
else:
    # choose a random point from the negative class
    index = np.random.randint(0, x_minus.shape[0])
    w = -x_minus[index, :] # set the normal vector to minus that point.
    b = -1

Decision Boundary

Now we will try and plot the decision boundary. To do this we need to do a little bit of maths. It's not part of the perceptron algorithm as such; it's just to help you understand how the plot below is made.

The decision boundary is defined as the point at which a data point moves from being classified as -1 to being classified as +1. For the model which predicts the label as $\text{sign}(\inputVector^\top \mappingVector)$, using the definition of the inner product, we can see that if the input dimension is two dimensional, the model we use for prediction can be written as

$$\text{sign}(\mappingScalar_0 + \mappingScalar_1\inputScalar_{i,1} + \mappingScalar_2 \inputScalar_{i, 2})$$

where $\inputScalar_{i, 1}$ is the first feature from the $i$th data point and $\inputScalar_{i, 2}$ is the second feature from the $i$th data point. Here we are using the trick of including an extra input $\inputScalar_{i,0}=1$ to account for the bias. So if we think of $\mappingScalar_0 = b$ we have

$$\text{sign}\left(b + \mappingScalar_1 \inputScalar_{i, 1} + \mappingScalar_2 \inputScalar_{i, 2}\right)$$

The decision boundary is where the output of the function changes from -1 to +1 (or vice versa) so it's the point at which the argument of the $\text{sign}$ function is zero. So in other words, the decision boundary is given by the line defined by $\inputScalar_1 \mappingScalar_1 + \inputScalar_2 \mappingScalar_2 = -b$ (where we have dropped the index $i$ for convenience). In this two dimensional space the decision boundary is defined by a line. In a three dimensional space it would be defined by a plane and in higher dimensional spaces it is defined by something called a hyperplane (http://en.wikipedia.org/wiki/Hyperplane). This equation is therefore often known as the separating hyperplane because it defines the hyperplane that separates the data. To draw it in 2-D we can choose some values to plot from $\inputScalar_1$ and then find the corresponding values for $\inputScalar_2$ to plot using the rearrangement of the hyperplane formula as follows

$$\inputScalar_2 = -\frac{(b+\inputScalar_1\mappingScalar_1)}{\mappingScalar_2}$$

Of course, we can also choose to specify the values for $\inputScalar_2$ and compute the values for $\inputScalar_1$ given the values for $\inputScalar_2$,

$$\inputScalar_1 = -\frac{b + \inputScalar_2\mappingScalar_2}{\mappingScalar_1}$$

It turns out that sometimes we need to use the first formula, and sometimes we need to use the second. Which formula we use depends on how the separating hyperplane leaves the plot.

We want to draw the separating hyperplane in the bounds of the plot which is showing our data. To think about which equation to use, let's consider two separate situations (actually there are a few more).

  1. If the separating hyperplane leaves the top and bottom of the plot then we want to plot a line with values in the $y$ direction (given by $x_2$) given by the upper and lower limits of our plot. The values in the $x$ direction can then be computed from the formula for the plane.

  2. Conversely if the line leaves the sides of the plot then we want to plot a line with values in the $x$ direction given by the limits of the plot. Then the values in the $y$ direction can be computed from the formula. Whether the line leaves the top/bottom or the sides of the plot is dependent on the relative values of $w_1$ and $w_2$.

This motivates a simple if statement to check which situation we're in.


In [ ]:
# Compute coordinates for the separating hyperplane
plot_limits = np.asarray([-5, 4])
if abs(w[1])>abs(w[0]):
    # If w[1]>w[0] in absolute value, plane is likely to be leaving tops of plot.
    x0 = plot_limits
    x1 = -(b + x0*w[0])/w[1]
else:
    # otherwise plane is likely to be leaving sides of plot.
    x1 = plot_limits
    x0 = -(b + x1*w[1])/w[0]

# Plot the data again
plt.plot(x_plus[:, 0], x_plus[:, 1], 'rx')
plt.plot(x_minus[:, 0], x_minus[:, 1], 'go')
# plot a line to represent the separating 'hyperplane'
plt.plot(x0, x1, 'b-')

This little drawing routine we've defined is going to be useful for updating the hyperplane when we update the parameters (to visualize how things change). We don't want to paste in the code every time we need it, so let's define it as a function which computes the coordinates of the hyperplane. To define a function we declare it as follows:


In [ ]:
def hyperplane_coordinates(w, b, plot_limits):
    
    if abs(w[1])>abs(w[0]):
        # If w[1]>w[0] in absolute value, plane is likely to be leaving tops of plot.
        x0 = plot_limits
        x1 = -(b + x0*w[0])/w[1]
    else:
        # otherwise plane is likely to be leaving sides of plot.
        x1 = plot_limits
        x0 = -(b + x1*w[1])/w[0]
    return x0, x1

Right, we are now almost ready to write some code that initializes the first value of the weight vector and draws the associated hyperplane. The final component we need is the perceptron weight update. For the perceptron algorithm, the next step is to test the weights we have. We can select a data point at random, and test to see if it is classified correctly. If it is classified correctly, we do nothing. However, if it isn't we need to try and change the weight vector to make the data point classified correctly. We already know that if the weight vector is set to $\mappingVector = \dataScalar_i \inputVector_i$ then the $i$th data point will be correctly classified, but that's no good, because it means forgetting everything we've learnt so far. Instead, we choose to move the weight vector by adding some portion of $\dataScalar_i \inputVector_i$ to $\mappingVector$,

$$\mappingVector_\text{new} \leftarrow \mappingVector_\text{old} + \eta\dataScalar_i \inputVector_i$$

so that the new classification is given by

$$\text{sign}(\mappingVector_\text{old}^\top\inputVector_i + \eta\dataScalar_i \inputVector_i^\top\inputVector_i)$$

In other words, because $\inputVector_i^\top\inputVector_i$ is positive, the argument of $\text{sign}()$ will be adapted in the direction of $\dataScalar_i$, i.e. the weight vector is pulled in a direction which makes the data vector $\inputVector_i$ more likely to be classified correctly. In fact, if $\eta$ is large enough, then $\inputVector_i$ is guaranteed to be correctly classified.

Let's put it all together. To animate the drawing in the IPython notebook, we need two further modules: time, which will allow us to pause the algorithm between updates of the hyperplane drawing, and IPython.display, whose clear_output function will allow us to clear the plot between updates.


In [ ]:
# import the time module to allow python to pause.
import time
# import the IPython display module to clear the output and redraw the figure.
from IPython.display import clear_output, display

learn_rate = 0.1
num_iterations = 1000
for i in xrange(num_iterations):
    w_updated = False
    # select a point at random from the data
    choose_plus = np.random.uniform(size=1)>0.5
    if choose_plus:
        # choose a point from the positive data
        index = np.random.randint(x_plus.shape[0]) # randint(n) samples an index from 0 to n-1 inclusive
        if not np.sign(np.dot(w, x_plus[index, :])+b) == 1:
            # point is currently incorrectly classified
            w = w + learn_rate*x_plus[index, :]
            b = b + learn_rate
            w_updated = True
    else:
        # choose a point from the negative data
        index = np.random.randint(x_minus.shape[0])
        if not np.sign(np.dot(w, x_minus[index, :])+b) == -1:
            # point is currently incorrectly classified
            w = w - learn_rate*x_minus[index, :]
            b = b - learn_rate
            w_updated = True
    if w_updated or i==0: 
        # Plot the data again
        plt.gca().clear()
        plt.plot(x_plus[:, 0], x_plus[:, 1], 'rx')
        plt.plot(x_minus[:, 0], x_minus[:, 1], 'go')
        # plot a line to represent the separating 'hyperplane'
        x0, x1 = hyperplane_coordinates(w, b, np.asarray([-5, 4]))
        plt.plot(x0, x1, 'b-')
        display(plt.gcf())
        time.sleep(1)
        clear_output()

Perceptron Reflection

How could we improve this algorithm? Think about the following issues. What happens when the classes aren't linearly separable? Why are we only using 1000 iterations, and how do we know we have the right result when we stop? How could we select the learning rate, $\eta$?

Regression

Classification is about mapping an input point to a class label which is either positive or negative. Regression is mapping an input point to a real value. One example of a model for regression is a linear regression involving a slope and an intercept.

$$\dataScalar_i = m\inputScalar_i + c$$

We now need an algorithm to fit this model. We are going to use regression to introduce the idea of an error function. The error function is an objective function that we minimize to fit our model. One commonly used error function for regression is least squares. The idea in least squares is to minimize the sum of squared differences between our prediction and the observed data. We can write our error function in the following form

$$\errorFunction(m, c) = \sum_{i=1}^\numData (\dataScalar_i - m\inputScalar_i - c)^2$$

Let's create an artificial data set and plot what the contours of this error function look like in Python. Artificial data sets are a really useful way to understand a model. By creating an artificial data set that is sampled from the model, we are forced to think about the type of data we are expecting. To create an artificial data set we first sample some inputs, let's sample 4 points.


In [ ]:
x = np.random.normal(size=4)

We now need to decide on a true value for $m$ and a true value for $c$ to use for generating the data.


In [ ]:
m_true = 1.4
c_true = -3.1

We can use these values to create our artificial data. The formula $\dataScalar_i = m\inputScalar_i + c$ is translated to code as follows:


In [ ]:
y = m_true*x+c_true

We can now plot the artificial data we've created.


In [ ]:
plt.plot(x, y, 'rx') # plot data as red crosses

These points lie exactly on a straight line. That's not very realistic, so let's corrupt them with a bit of Gaussian 'noise'.


In [ ]:
noise = np.random.normal(scale=0.5, size=4) # standard deviation of the noise is 0.5
y = m_true*x + c_true + noise
plt.plot(x, y, 'rx')
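
Before visualizing the whole error function, we can sanity-check it by evaluating it at the true parameter values. It won't be exactly zero because of the noise we added; a quick sketch:


In [ ]:
# evaluate the sum of squares error at the true parameters
E_true = ((y - m_true*x - c_true)**2).sum()
print "Error at the true parameters: ", E_true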

Contour Plot of Error Function

Now we are going to do something a little bit complicated. We are going to compute the error function for a grid of different values for $m$ and $c$. Doing this will help us visualize the shape of the error function. What we need is to compute a grid of all different values of $m$ and $c$ which we think are sensible. Since we know the true values, let's centre our grid around these. In python, a special command is provided for making the grid. First of all, we use the linspace command for creating an array of linearly spaced values of $m$ and $c$, then we use meshgrid to create a grid from those two vectors.


In [ ]:
m_vals = np.linspace(m_true-3, m_true+3, 100) # create an array of linearly spaced values around m_true
c_vals = np.linspace(c_true-3, c_true+3, 100) # create an array of linearly spaced values around c_true

Now we create the grid on which to compute the error function. meshgrid creates a matrix of values that form the grid, for both $c$ and $m$.


In [ ]:
m_grid, c_grid = np.meshgrid(m_vals, c_vals) 
# we can see that this is now a 100x100 matrix of values with the print command
print "The dimensionality of m_grid is", m_grid.shape

Now we compute the error function at each of those combinations of $c$ and $m$.


In [ ]:
E_grid = np.zeros((100, 100))
for i in xrange(100):
    for j in xrange(100):
        E_grid[i, j] = ((y - m_grid[i, j]*x - c_grid[i, j])**2).sum()
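
As an aside, the double loop above can also be written as a single broadcast computation in numpy. This is just a sketch of one way to do it; the loop version is easier to read, so don't worry if the broadcasting looks mysterious:


In [ ]:
# the same computation using numpy broadcasting instead of loops:
# y[:, None, None] has shape (4, 1, 1), which broadcasts against the
# (100, 100) grids to give a (4, 100, 100) array of residuals.
E_grid_vec = ((y[:, None, None] - m_grid[None, :, :]*x[:, None, None] - c_grid[None, :, :])**2).sum(axis=0)
print "Maximum difference from the loop version: ", np.abs(E_grid - E_grid_vec).max()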

This loop computes the error function for each combination of values in m_grid and c_grid, storing the result in E_grid. This allows us to make a contour plot of the error. We can do this using the matplotlib module and the command contour.


In [ ]:
plt.contour(m_vals, c_vals, E_grid)
plt.xlabel('$m$')
plt.ylabel('$c$')

We fit a regression model by minimizing this sum of squares error function. One way of doing that is gradient descent. We initialize with a guess for $m$ and $c$ and we update that guess by subtracting a portion of the gradient from it: like walking down a hill in the steepest direction in order to get to the bottom. So let's start with a guess for $m$ and $c$.


In [ ]:
m_star = 3.0
c_star = -5.0

Now we need to compute the gradient of the error function, firstly with respect to $c$,

$$\frac{\text{d}\errorFunction(m, c)}{\text{d} c} = -2\sum_{i=1}^\numData (\dataScalar_i - m\inputScalar_i - c)$$

We'll review how this gradient is derived in a moment, but first let's just see how it's computed in numpy.


In [ ]:
c_grad = -2*(y-m_star*x - c_star).sum()
print "Gradient with respect to c is ", c_grad

To see how the gradient was derived, first note that $c$ appears in every term in the sum. So we are just differentiating $(\dataScalar_i - m\inputScalar_i - c)^2$ for each term in the sum. The gradient of this term with respect to $c$ is simply the gradient of the outer quadratic, multiplied by the gradient with respect to $c$ of the part inside the quadratic. The gradient of a quadratic is two times its argument, and the gradient of the inside linear term with respect to $c$ is just minus one. This is true for all terms in the sum, so we are left with the sum in the gradient. The gradient with respect to $m$ is similar, but now the gradient of the quadratic's argument is $-\inputScalar_i$, so the gradient with respect to $m$ is

$$\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i - m\inputScalar_i - c)$$

which can be implemented in python (numpy) as


In [ ]:
m_grad = -2*(x*(y-m_star*x - c_star)).sum()
print "Gradient with respect to m is ", m_grad

This gives us the gradients with respect to $m$ and $c$. We can now update our initial guesses for $m$ and $c$ using the gradient. However, we don't want to just add the gradient to $m$ and $c$; we need to take a small step in the gradient direction, otherwise we might overshoot the minimum. We want to follow the gradient to get to the minimum, and the gradient changes all the time. The step size has already been introduced: it's again known as the learning rate and is denoted by $\eta$.

$$c_\text{new} \leftarrow c_{\text{old}} - \eta \frac{\text{d}\errorFunction(m, c)}{\text{d}c}$$

gives us an update for our estimate of $c$ (which in the code we've been calling c_star to represent a common way of writing a parameter estimate, $c^*$) and

$$m_\text{new} \leftarrow m_{\text{old}} - \eta \frac{\text{d}\errorFunction(m, c)}{\text{d}m}$$

gives us an update for our estimate of $m$. These two updates can be coded as


In [ ]:
print "Original m was ", m_star, " and original c was ", c_star
learn_rate = 0.01
c_star = c_star - learn_rate*c_grad
m_star = m_star - learn_rate*m_grad
print "New m is ", m_star, " and new c is ", c_star

Of course, to fit the model we need to keep computing the gradients (they change as we change $m$ and $c$) and keep doing the parameter updates. Let's watch a gradient descent in action with the following code.


In [ ]:
# first let's plot the error surface
f, (ax1, ax2) = plt.subplots(1, 2) # this is to create 'side by side axes'
ax1.contour(m_vals, c_vals, E_grid) # this makes the contour plot on axes 1.
plt.xlabel('$m$')
plt.ylabel('$c$')
m_star = 3.0
c_star = -5.0
for i in xrange(100): # do 100 iterations (parameter updates)
    # compute the gradients
    c_grad = -2*(y-m_star*x - c_star).sum()
    m_grad = -2*(x*(y-m_star*x - c_star)).sum()
    
    # update the parameters
    m_star = m_star - learn_rate*m_grad
    c_star = c_star - learn_rate*c_grad
    
    # update the location of our current best guess on the contour plot
    ax1.plot(m_star, c_star, 'g*')
    
    # show the current status on the plot of the data
    ax2.plot(x, y, 'rx')
    plt.ylim((-9, -1)) # set the y limits of the plot fixed
    x_plot = np.asarray(plt.xlim()) # get the x limits of the plot for plotting the current best line fit.
    y_plot = m_star*x_plot + c_star
    ax2.plot(x_plot, y_plot, 'b-')
    display(plt.gcf())
    time.sleep(0.25) # pause between iterations to see update
    ax2.cla()
    clear_output()

Stochastic Gradient Descent

This approach of steepest gradient descent works fine if the number of data points, $\numData$, isn't too large. But what do Google do when they have 1 billion data points? We can get an algorithm that is a bit more similar to the perceptron (it looks at one data point at a time rather than summing across all data points) by using stochastic gradient descent. In this case, instead of computing the true gradient of $\errorFunction(m, c)$, we just look at one term from the full sum over $\numData$. The real gradient with respect to $m$ is given by

$$\frac{\text{d}\errorFunction(m, c)}{\text{d} m} = -2\sum_{i=1}^\numData \inputScalar_i(\dataScalar_i - m\inputScalar_i - c)$$

but it has $\numData$ terms in the sum. Substituting in the gradient we can see that the full update is of the form

$$m_\text{new} \leftarrow m_\text{old} + 2\eta \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old}) + \inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old}) + \dots + \inputScalar_\numData (\dataScalar_\numData - m_\text{old}\inputScalar_\numData - c_\text{old})\right]$$

This could be split up into lots of individual updates

$$m_1 \leftarrow m_\text{old} + 2\eta \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old})\right]$$

$$m_2 \leftarrow m_1 + 2\eta \left[\inputScalar_2 (\dataScalar_2 - m_\text{old}\inputScalar_2 - c_\text{old})\right]$$

$$m_3 \leftarrow m_2 + 2\eta \left[\dots\right]$$

$$m_\numData \leftarrow m_{\numData-1} + 2\eta \left[\inputScalar_\numData (\dataScalar_\numData - m_\text{old}\inputScalar_\numData - c_\text{old})\right]$$

which would lead to the same final update. Note, though, that we aren't changing the $m$ and $c$ we use for computing the gradient term at each update. In stochastic gradient descent we do update the $m$ and $c$ we use, as the updates below show. This means that we aren't quite following the steepest gradient, but a stochastic approximation to it. However, it also means we can just present each data point in a random order, like we did for the perceptron. This makes the algorithm suitable for large scale web use (recently this domain has become known as 'Big Data') and algorithms like this are widely used by Google, Microsoft, Yahoo! and Facebook.

$$m_1 \leftarrow m_\text{old} + 2\eta \left[\inputScalar_1 (\dataScalar_1 - m_\text{old}\inputScalar_1 - c_\text{old})\right]$$

$$m_2 \leftarrow m_1 + 2\eta \left[\inputScalar_2 (\dataScalar_2 - m_1\inputScalar_2 - c_1)\right]$$

$$m_3 \leftarrow m_2 + 2\eta \left[\dots\right]$$

$$m_\numData \leftarrow m_{\numData-1} + 2\eta \left[\inputScalar_\numData (\dataScalar_\numData - m_{\numData-1}\inputScalar_\numData - c_{\numData-1})\right]$$

Or more accurately, since the data is normally presented in a random order, we can just write

$$m_\text{new} = m_\text{old} + 2\eta\left[\inputScalar_i (\dataScalar_i - m_\text{old}\inputScalar_i - c_\text{old})\right]$$

which is easily implemented as code


In [ ]:
# choose a random point for the update; randint(n) samples an index from 0 to n-1 inclusive
i = np.random.randint(x.shape[0])
# update m
m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))
# update c
c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)

Putting it all together in an algorithm, we can do stochastic gradient descent for our regression data.


In [ ]:
# first let's plot the error surface
f, (ax1, ax2) = plt.subplots(1, 2) # this is to create 'side by side axes'
ax1.contour(m_vals, c_vals, E_grid) # this makes the contour plot on axes 1.
plt.xlabel('$m$')
plt.ylabel('$c$')
m_star = 3.0
c_star = -5.0
for i in xrange(100): # do 100 iterations (parameter updates)
    # choose a random point
    i = np.random.randint(x.shape[0])

    # update m
    m_star = m_star + 2*learn_rate*(x[i]*(y[i]-m_star*x[i] - c_star))
    # update c
    c_star = c_star + 2*learn_rate*(y[i]-m_star*x[i] - c_star)
    
    # update the location of our current best guess on the contour plot
    ax1.plot(m_star, c_star, 'g*')
    
    # show the current status on the plot of the data
    ax2.plot(x, y, 'rx')
    plt.ylim((-9, -1)) # set the y limits of the plot fixed
    x_plot = np.asarray(plt.xlim()) # get the x limits of the plot for plotting the current best line fit.
    y_plot = m_star*x_plot + c_star
    ax2.plot(x_plot, y_plot, 'b-')
    display(plt.gcf())
    time.sleep(0.25)  # pause between iterations to see update
    ax2.cla()
    clear_output()

Reflection on Linear Regression and Supervised Learning

  1. Easy What effect does the learning rate have in the optimization? What's the effect of making it too small, what's the effect of making it too big? Do you get the same result for both stochastic and steepest gradient descent?

  2. Harder The stochastic gradient descent doesn't help very much for such a small data set. It might be fun to try it with more than 4 data points. Can you modify the code above to run with thousands of data points? Do you also need to change the number of iterations? Which code base is faster? Does it make sense to summarize both algorithms in terms of the number of iterations?

  3. Very Hard Can you use what you've learnt to come up with a batch algorithm for the perceptron? We justified the perceptron by geometric arguments to construct the algorithm. Does it have an error function associated with it? What is the error function? Is the perceptron doing gradient descent on this error function?

Nonlinear Regression with Basis Functions

In the lectures, we also looked at the idea of doing nonlinear regression with the same algorithms. The way we chose to do this was to introduce the concept of a basis function. A basis function maps our original data into a feature space. A very common choice is a polynomial basis, for example a quadratic basis. In a quadratic basis we assume our regression is given by

$$\dataScalar_i = \mappingScalar_2 \inputScalar_i^2 + \mappingScalar_1 \inputScalar_i + \mappingScalar_0.$$

This gives a nonlinear relationship between our observation and the inputs. Here the basis vector is given by $\basisVector = \left[1\ \inputScalar_i\ \inputScalar_i^2\right]$. Although the set up is nonlinear, we can still use the same type of algorithm we used above. For convenience we have replaced $c$ with $\mappingScalar_0$, $m$ with $\mappingScalar_1$, and we have added $\mappingScalar_2$ as the coefficient associated with the quadratic term. This approach to making linear algorithms nonlinear is very common, and all sorts of different nonlinearities can be used (not just polynomials). The algorithms stay quite simple as long as they remain linear in the parameters. Here linear in the parameters means that the parameters can only appear in multiplications and sums, not inside nonlinear functions (like the basis functions). In other words they need to be writable as

$$\dataScalar_i = \mappingScalar_3 \basisScalar_3 + \mappingScalar_2 \basisScalar_2 + \mappingScalar_1 \basisScalar_1 + \mappingScalar_0 \basisScalar_0.$$

This is what makes linear algebra so important (we've already seen the importance of differential calculus in finding a minimum!). Linear algebra gives us a short-cut to writing down functions like the one above, and makes algorithm derivation with such models much easier. For example, I would always write such a model down as an inner product,

$$\dataScalar_i = \mappingVector^\top \basisVector$$

which is a much more compact form. Here I've written the inner product in matrix notation. This turns out to be necessary when we are using more advanced methods to design algorithms to fit the models described above. We need to use linear algebra (i.e. matrices, determinants, inverses etc.) and multivariate calculus.
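
We won't fit a basis function model in this lab, but as a sketch of how the quadratic basis would look in numpy (the matrix Phi and the weights w_example below are purely illustrative, following the notation above):


In [ ]:
# a sketch of a quadratic basis: each row of Phi is [1, x_i, x_i**2]
Phi = np.ones((x.shape[0], 3))
Phi[:, 1] = x
Phi[:, 2] = x**2
# with a weight vector [w_0, w_1, w_2], all predictions come from one matrix-vector product
w_example = np.asarray([c_true, m_true, 0.0]) # illustrative weights: these recover the linear model
print np.dot(Phi, w_example)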

Unsupervised Learning

The examples of regression and classification assume that we are given some data, $\inputVector_i$ and some associated labels for the data $\dataScalar_i$. Whether we use regression or classification is dependent on whether the label information is discrete (like in classifying whether someone is a friend of yours on Facebook) or whether the label information is continuous (like predicting the demand for bandwidth you might have on a particular day). Because there are labels present, these two examples of learning are both known as supervised learning. Another type of learning you might be interested in is when there are no labels present. That is known as unsupervised learning. Broadly speaking there are two types of unsupervised learning. Firstly, when you are looking to automatically categorize your data into different groups (like political parties, or types of animal: mammal, reptile, bird etc). This is known as clustering or in some contexts vector quantisation (we will refer to it as clustering). It is the unsupervised equivalent of classification.

Clustering

We will look at one clustering algorithm: $k$-means clustering. In $k$-means clustering the idea is to come up with a representative 'centre' associated with each of the clusters. This is the mean of the cluster. Points are allocated to a cluster centre if they are closer to that centre than to any of the other centres. To start the algorithm, you first need a number of centres (these might be chosen randomly from the data). You then allocate each data point to the centre it's closest to. Then you update the values for each centre by setting it to the mean of the points that are allocated to it. To illustrate this in practice, we will first create a data set containing three clusters that we'll create by sampling from Gaussian densities.


In [ ]:
# Generate data from three Gaussian densities
# X0 is Gaussian centred at 2.5, 2.5
X0 = np.random.normal(loc=2.5, size=(30, 2))
# X1 is Gaussian centred at -2.5, -2.5
X1 = np.random.normal(loc=-2.5, size=(30, 2))
# X2 empty for the moment
X2 = np.zeros((30, 2))

# put first column as Gaussian centred at 2.5
X2[:, 0] = np.random.normal(loc=2.5, size=30)

# put second column as Gaussian centred at -2.5
X2[:, 1] = np.random.normal(loc=-2.5, size=30)

plt.plot(X0[:, 0], X0[:, 1], 'rx')
plt.plot(X1[:, 0], X1[:, 1], 'go')
plt.plot(X2[:, 0], X2[:, 1], 'b+')

In [ ]:
# Put all the data together in one matrix.
X = np.vstack((X0, X1, X2))
num_centres = 3
num_data = X.shape[0]
# choose three points from the data to be the initial centres of the clusters
centre_indices = np.random.randint(0, num_data, num_centres)
centres = X[centre_indices, :]
allocation = np.zeros(num_data)

# allocate each data point to its closest centre
for i in xrange(num_data):
    dists = ((centres - X[i, :])**2).sum(1)
    allocation[i] = np.argmin(dists)

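The code above performs the allocation step of $k$-means. A full implementation alternates this with an update step, where each centre is moved to the mean of the points currently allocated to it, and repeats until the allocations stop changing. Here is a minimal sketch of the update step (using the centres and allocation arrays from above):


In [ ]:
# the update step of k-means: move each centre to the mean of its allocated points
for centre in xrange(num_centres):
    points = X[allocation==centre, :]
    if points.shape[0] > 0: # guard against a centre with no allocated points
        centres[centre, :] = points.mean(axis=0)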

Dimensionality Reduction

Dimensionality reduction is the other main approach to unsupervised learning. We haven't looked closely at any dimensionality reduction algorithms, but the idea is to try and summarize your data set with a reduced number of 'latent' variables: like a codebook. Dimensionality reduction algorithms were first studied in the social sciences, in particular in psychology. The IQ measurement is an attempt to reduce all facets of your intelligence down to one single value. Talking about how left wing or right wing someone is is an attempt to reduce all their political opinions down to one value. In practice, to make good models, we often need more than one value to summarize the complexity of a data set. Common approaches for dimensionality reduction include principal components analysis and factor analysis.
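
We haven't covered the mathematics of principal components analysis in this module, but to give a flavour of what dimensionality reduction looks like in code, here is a minimal sketch of PCA via an eigendecomposition of the sample covariance matrix, reusing the clustering data X from above (this is one standard construction, kept deliberately simple):


In [ ]:
# a minimal sketch of principal components analysis on the cluster data X
X_centred = X - X.mean(axis=0) # centre the data
C = np.dot(X_centred.T, X_centred)/X.shape[0] # sample covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C) # eigh is for symmetric matrices; eigenvalues come in ascending order
# project onto the eigenvector with the largest eigenvalue to get a 1-D summary of the 2-D data
X_reduced = np.dot(X_centred, eig_vecs[:, -1])
print "Proportion of variance retained: ", eig_vals[-1]/eig_vals.sum()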

Probability: The Calculus of Uncertainty

We have spent most of our introduction to artificial intelligence and machine learning focussing on particular problem types which we have categorized into supervised and unsupervised learning. The wider field of artificial intelligence contains many challenges that aren't so easy to categorize. In fact there are times when such categorizations can become a hindrance to progress. For example, is robot navigation supervised or unsupervised learning? I'm not sure, but I think it depends on whether the robot has a map or not. But how does the map appear as labels?

In general modern artificial intelligence is about modelling your data and propagating the uncertainty around your system. The main way we do this is through probability, and in particular Bayes' rule. All the algorithms we have introduced have a probabilistic interpretation. The common way of finding it is to look at the error function and see what the equivalent maximum likelihood model is. Remember that the error function can often be derived as the negative log of the probability distribution we use for defining the model.

When using probabilities for artificial intelligence, we think about two things: observations of the state of the world (represented by $\dataMatrix$) and our belief about the state of the world (represented by a probability distribution over $\latentMatrix$). The model involves specifying a probability distribution over what we think the most likely state of the world is, making observations, and updating that probability distribution. A nice example is robot navigation.

A robot's state includes its $x$, $y$ position, the direction it's facing, and perhaps a number of other variables, such as where its arms and legs are, etc. You can see quite quickly how the state space of the robot grows. Maybe the robot doesn't know exactly where it is: it is uncertain. The robot expresses this uncertainty by placing a probability distribution over its location: $p(\latentMatrix)$. This is known as the prior distribution. When the robot makes an observation, for example of a landmark, like a wall, or perhaps if the robot is in the Peak District, a particular hill, the robot can absorb that information through probability. It does this by modelling the observations, $\dataMatrix$, with a probability distribution that says what the observations should be given the state of the robot, $p(\dataMatrix|\latentMatrix)$. This distribution is known as the likelihood. What we want to know is what effect the observation has had on the robot's updated belief about its new location. This distribution is known as the posterior distribution.

These three distributions are related by Bayes' rule. Bayes' rule of probability comes about from combining the two main rules of probability.

Product Rule of Probability

The product rule of probability says that the joint distribution, $p(\latentMatrix, \dataMatrix)$ can be computed from the conditional and marginal distributions as follows

$$p(\latentMatrix, \dataMatrix) = p(\dataMatrix | \latentMatrix) p(\latentMatrix)$$

or conversely

$$p(\latentMatrix, \dataMatrix) = p(\latentMatrix |\dataMatrix) p(\dataMatrix).$$

The product rule gives the relationship between the marginal probability, the conditional probability and the joint probability.

Sum Rule of Probability

The sum rule of probability gives the relationship between the marginal probability and the joint probability.

$$p(\dataMatrix) = \sum_{\latentMatrix} p(\dataMatrix, \latentMatrix)$$

where the sum is over all possible values that $\latentMatrix$ can take (in the robot example, over all possible states of the robot!). For continuous density functions, the sum takes the form of an integral.

Bayes' Rule

Bayes' rule is so simple to derive from these two rules that it hardly deserves a name (in fact I don't like the name Bayes' rule, because the rule was first used by other people, including Laplace). However, that's the common name for it. Bayes' rule gives us the relationship between what we knew before we observed the location of the hill ($p(\latentMatrix)$) and what we know after ($p(\latentMatrix|\dataMatrix)$). The relationship is given by equating the two possible forms of the joint distribution as given by the product rule, so we have

$$p(\latentMatrix|\dataMatrix)p(\dataMatrix) = p(\dataMatrix|\latentMatrix) p(\latentMatrix)$$

With a little rearrangement we have

$$p(\latentMatrix|\dataMatrix) = \frac{p(\dataMatrix|\latentMatrix) p(\latentMatrix)}{p(\dataMatrix)}$$

Everything in this rule is specified by our model of the world: our prior belief, $p(\latentMatrix)$, and the relationship between our measurements and the state of the world, known as the likelihood, $p(\dataMatrix|\latentMatrix)$. The only thing we're missing is the marginal likelihood, $p(\dataMatrix)$, but this can be derived through the sum rule and product rules,

$$p(\dataMatrix) = \sum_{\latentMatrix} p(\dataMatrix, \latentMatrix) = \sum_{\latentMatrix} p(\dataMatrix|\latentMatrix) p(\latentMatrix).$$
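
To see the sum rule, product rule and Bayes' rule working together on something concrete, here is a tiny numerical sketch with two discrete states and a single observation. All the numbers are invented purely for illustration:


In [ ]:
# a tiny made-up example of Bayes' rule with two discrete states
prior = np.asarray([0.7, 0.3]) # p(X): our prior belief over the two states
likelihood = np.asarray([0.1, 0.8]) # p(Y=observed|X) for each state
marginal = (likelihood*prior).sum() # p(Y=observed), from the sum and product rules
posterior = likelihood*prior/marginal # Bayes' rule
print "Posterior belief over the states: ", posterior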

Now we can express everything we need to get our updated belief about the state of the world through knowing the likelihood and the prior, i.e. our initial belief about the state of the world, and the relationship between the state of the world and the types of measurements we can make. The major difficulty is in computing the sum: it can involve many terms and therefore be intractable. Or in continuous systems, it involves a high dimensional integral, which is also difficult to do.

Despite the challenges associated with computing the marginal likelihood, Bayesian inference (as it's widely known, although I prefer the simpler term 'the calculus of uncertainty') is the most promising approach we have for practical Artificial Intelligence. Many of the advances you may have heard about in the media (such as Sebastian Thrun and the Google driverless car) rely on application of this formula. It's amazing what can emerge from a fairly simple set of rules, although, of course, practical application requires a great deal of engineering knowledge and approximations. In robot navigation the formula is applied repeatedly to absorb information as it arrives. When this is done the algorithm is known as a Bayes filter.
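
To close, here is a made-up one dimensional version of the robot localization idea: the robot lives on a circular corridor of ten cells, moves one cell to the right at each time step, and receives a noisy binary sensor reading of whether it is at a landmark. The landmark positions, sensor accuracy and observations are all invented for illustration, but the loop shows the repeated application of Bayes' rule that makes this a Bayes filter:


In [ ]:
# a made-up 1-D Bayes filter: a robot on a circular corridor of 10 cells
num_cells = 10
belief = np.ones(num_cells)/num_cells # prior: uniform over the cells
landmark = np.zeros(num_cells)
landmark[[2, 7]] = 1 # invented landmark positions
sensor_accuracy = 0.9 # invented sensor model: probability of a correct reading

observations = [1, 0, 0, 1] # invented sensor readings, one per time step
for obs in observations:
    # prediction step: the robot moves one cell to the right
    belief = np.roll(belief, 1)
    # update step: the likelihood of the observation in each cell
    likelihood = np.where(landmark==obs, sensor_accuracy, 1-sensor_accuracy)
    belief = belief*likelihood
    belief = belief/belief.sum() # normalize by the marginal likelihood
print "Belief over the cells after filtering: ", belief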