Lab Class 1: IPython and Expectations

Neil D. Lawrence COM4509/6509 Machine Learning and Adaptive Intelligence

Welcome to the IPython notebook! We will be using the IPython notebook for all our lab classes and assignments. It is a really convenient way to interact with data using python. In this first lab session we are going to familiarise ourselves with the notebook and start getting used to python whilst we review some of the material from the first lecture.

Python is a generic programming language with 'numerical' and scientific capabilities added on through the numpy and scipy libraries. There are excellent 2-D plotting facilities available through matplotlib.

Importing Libraries

The numpy library provides most of the manipulations we need for arrays in python. numpy is short for numerical python, but as well as providing the numerics, numpy provides contiguous array objects. These objects weren't available in the original python. The first step is to import numpy. We'll then use it to draw samples from a "standard normal". A standard normal is a Gaussian density with mean of zero and variance of one. We'll draw 10 samples from the standard normal.


In [ ]:
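The cell above is empty in this export; a minimal sketch of what it might contain, importing numpy under its conventional alias and drawing the 10 samples:

```python
import numpy as np

# Draw 10 samples from the standard normal, N(0, 1).
x = np.random.randn(10)
```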

To get help about any command in the notebook simply type that command followed by a question mark.


In [ ]:
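In the notebook you would type, for example, `np.random.randn?` to open the documentation pane. Outside IPython, the nearest equivalent is Python's built-in `help` function, as this sketch shows:

```python
import numpy as np

# In the IPython notebook:  np.random.randn?
# In plain Python, the built-in help gives the same docstring:
help(np.random.randn)
```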

Now let's look at the samples, we can show them using the print command.


In [ ]:
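A sketch of the empty cell above, assuming `x` holds the samples drawn earlier:

```python
import numpy as np
x = np.random.randn(10)  # the samples drawn earlier

# Display the samples.
print(x)
```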

Estimating Moments

We can compute the sample mean by adding all the samples together and dividing by the number of samples.


In [ ]:
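A sketch of this computation, assuming `x` holds the 10 samples: sum the samples and divide by their number.

```python
import numpy as np
x = np.random.randn(10)

# Sample mean: sum of the samples divided by the number of samples.
mean = x.sum() / 10
print(mean)
```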

Or we can extract the dimension of the array to compute the mean.


In [ ]:
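Rather than hard-coding the 10, we can read the number of samples from the array's shape, roughly as follows:

```python
import numpy as np
x = np.random.randn(10)

# Extract the number of samples from the array's dimensions.
N = x.shape[0]
mean = x.sum() / N
```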

Or finally we can just compute the mean directly.


In [ ]:
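The direct route uses numpy's built-in `mean` method:

```python
import numpy as np
x = np.random.randn(10)

# Compute the sample mean directly.
print(x.mean())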

Rember from the lecture (and your revision of probability) that an expectation is the integral of a function under the distribution. For example, the second moment of a distribution is given by

$$\langle x^2\rangle_{p(x)} = \int_{-\infty}^\infty x^2 p(x) \text{d}x$$

The law of large numbers allows us to approximate this moment through a sample based approximation. We have just computed the sample mean (which is also known as the first moment), we can now compute the sample based approximation to the second moment.

$$\langle x^2 \rangle_{p(x)} \approx \frac{1}{N}\sum_i x^2_i$$

In [ ]:

Distribution Variance

The variance of a distribution is given by the second moment minus the mean squared. This is computed as

$$\text{var}(x) = \langle x^2\rangle - \langle x\rangle^2$$

or from a sample based approximation of the moments it can be estimated as

$$\frac{1}{N}\sum_{i=1}^N x_i^2 - \left(\frac{1}{N}\sum_{i=1}^N x_i\right)^2$$

which is easy to write in code as follows


In [ ]:

Convergence of Sample Moments to True Moments

We know in this case, because we sampled from a standard normal, that the mean and variance of the distribution should be 0 and 1. Why do you not get a mean of 0 and a variance of 1? Let's explore what happens as we increase the number of samples. To do this we are going to use for loops and python lists. We start by creating empty lists for the means and variances. Then we create a list of integers to iterate through. In Python, a for loop always iterates through a list (in some languages this is called a foreach loop, its counterpart the counter for loop only exists by creating a list of integers, see http://en.wikipedia.org/wiki/Foreach_loop#Python). We can use the range command to create the numbers of samples.


In [ ]:
means = []
variances = []
samples = [10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000] 
for N in samples:
    x = np.random.randn(N)
    mean = x.mean()
    variance = (x**2).mean() - mean**2
    means.append(mean)
    variances.append(variance)

Plotting in Python

We'll now plot the variance and the mean against the number of samples. To do this, we need to first convert the samples, varianes and means from Python lists, to numpy arrays.


In [ ]:

Next we need to include the plotting functionality from matplotlib, and instruct IPyhton notebook to include the plots inline with the notebook, rather than in a different window. First we import the plotting library, matplotlib.


In [ ]:


In [ ]:

Here we plot the estimated mean against the number of samples. However, since the samples go up logarithmically it's better to use a logarithmic axis for the x axis, as follows.


In [ ]:

We can do the same for the variances, again using a logarithmic axis for the samples. This time, we're going to lavel the x axis using a latex formula.


In [ ]:
plt.semilogx(samples, variances)
xlabel('$\log_{10}N$')
ylabel('variance')

Entropy

We saw in class that the entropy is a special type of expectation. To compute the entropy of a distribution we compute

$$-\langle \log p(x)\rangle = -\int p(x) \log p(x) \text{d}x$$

But we can't compute it directly from samples, we need to know the functional form of $p(x)$. In this case, we know it's Gaussian. Let's plot it:


In [ ]:


In [ ]:

The log of the standard normal is given by

$$\log p(x) = -\frac{1}{2}\log 2\pi - \frac{1}{2} x^2$$

What is the expectation of this under a Gaussian?

$$-\frac{1}{2}\left[\log 2\pi +1\right]$$

Now let's repeat the above for sampling from, and computing the entropy of a normal density with mean 3 and variance 4.


In [ ]: