Here are a few links for your reference. You may want to refer back to them throughout the course.
To learn more about using SciPy/NumPy, have a look at the Getting Started Guide. You should also refer to the numpy documentation for a reference of the available functions.
If you want to learn more about creating plots in Python, check out the tutorials on matplotlib's website https://matplotlib.org/tutorials/index.html. Once you are more familiar with plotting, check out this excellent blog post http://pbpython.com/effective-matplotlib.html.
There are also more advanced libraries for interactive data visualization, such as bqplot or d3.js, which you may want to explore if you feel adventurous.
Although we use Jupyter notebooks for these exercises, you may also want to check out Jupyter Lab when you work on your own projects.
If you are having issues with the grader, be sure to check out the Q&A.
If you are stuck on the programming assignments, you can visit the discussion forum and discuss with your peers.
First, let's import the packages that we will use for the week. Run the cell below to import the packages.
In [1]:
# PACKAGE: DO NOT EDIT
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from sklearn.datasets import fetch_olivetti_faces  # fetch_lfw_people and fetch_mldata are unused here (fetch_mldata was removed from recent scikit-learn versions)
import timeit  # note: we define our own `time` helper below, so we do not import the `time` module
In [2]:
%matplotlib inline
from ipywidgets import interact
Next, we are going to retrieve the Olivetti faces dataset.
When working with a new dataset, before digging into any further analysis, it is almost always useful to do a few things to understand it first. For example, datasets are usually stored as 2D matrices, so it is important to know which dimension of the matrix represents the feature dimension of the data and which represents the data points in the dataset.
In [3]:
image_shape = (64, 64)
# Load faces data
dataset = fetch_olivetti_faces()
faces = dataset.data
print('Shape of the faces dataset: {}'.format(faces.shape))
print('{} data points'.format(faces.shape[0]))
When your dataset consists of images, it's a really good idea to see what they look like. One very convenient tool in Jupyter is the interact widget, which we use to visualize the images (faces). For more information on how to use interact, have a look at the documentation here.
In [4]:
@interact(n=(0, len(faces)-1))
def display_faces(n=0):
    plt.figure()
    plt.imshow(faces[n].reshape((64, 64)), cmap='gray')
    plt.show()
You will now implement functions that compute the mean and covariance of a dataset.
There are two ways to compute the mean and covariance. The naive way is to iterate over the dataset, which would be implemented as a for loop in Python. However, computing them this way for a large dataset is slow. Alternatively, you can use the functions provided by numpy, which are much faster because numpy executes them in optimized machine code. You will implement functions that compute the mean and covariance both in the naive way and in the fast way; later we will compare the performance of the two approaches. If you need to find out which numpy routine to call, have a look at the documentation https://docs.scipy.org/doc/numpy/reference/. It is a good exercise to refer to the official documentation whenever you are not sure about something.
When you implement the functions for your assignment, make sure you read the docstrings to find out which dimension of your inputs corresponds to the number of data points and which corresponds to the feature dimension of the dataset.
In [5]:
# ===YOU SHOULD EDIT THIS FUNCTION===
def mean_naive(X):
    """Compute the mean for a dataset by iterating over the dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.
       N is the number of samples in the dataset
       and D is the feature dimension of the dataset.

    Returns
    -------
    mean: (D,) ndarray which is the mean of the dataset.
    """
    N, D = X.shape
    mean = np.zeros(D)
    # The naive approach requires us to iterate over the whole dataset with a for loop.
    for n in range(N):
        mean = mean + X[n] / N
    return mean
# ===YOU SHOULD EDIT THIS FUNCTION===
def cov_naive(X):
    """Compute the covariance for a dataset by iterating over the dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.
       N is the number of samples in the dataset
       and D is the feature dimension of the dataset.

    Returns
    -------
    covariance: (D, D) ndarray which is the covariance matrix of the dataset.
    """
    N, D = X.shape
    covariance = np.zeros((D, D))
    mu = mean_naive(X)  # compute the mean once, rather than on every loop iteration
    for n in range(N):
        # X[n] - mu is a (D,) vector; np.outer gives the (D, D) outer product.
        # (np.dot on two 1-D arrays would return a scalar, which is not what we want.)
        diff = X[n] - mu
        covariance = covariance + np.outer(diff, diff) / N
    return covariance
In [69]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE
# ===YOU SHOULD EDIT THIS FUNCTION===
def mean(X):
    """Compute the mean for a dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.

    Returns
    -------
    mean: (D,) ndarray which is the mean of the dataset.
    """
    mean = np.sum(X, axis=0) / X.shape[0]
    return mean
# ===YOU SHOULD EDIT THIS FUNCTION===
def cov(X):
    """Compute the covariance for a dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.

    Returns
    -------
    covariance_matrix: (D, D) ndarray which is the covariance matrix of the dataset.
    """
    # It is possible to vectorize our code for computing the covariance, i.e. we do not
    # need to explicitly iterate over the entire dataset, as looping in Python tends to be slow.
    N, D = X.shape
    X_centered = X - mean(X)  # center the data once instead of computing the mean twice
    covariance_matrix = np.dot(X_centered.T, X_centered) / N
    return covariance_matrix
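Before moving on, it is a good idea to sanity-check the implementations. The optional cell below (not part of the graded assignment; the toy array X_toy is just made up for this check) verifies that the naive and vectorized versions agree with each other and with numpy's built-in routines:
In [ ]:
# Optional sanity check on a small toy dataset.
X_toy = np.arange(12).reshape(4, 3).astype(float)
# The naive and vectorized implementations should agree with each other...
np.testing.assert_allclose(mean_naive(X_toy), mean(X_toy))
np.testing.assert_allclose(cov_naive(X_toy), cov(X_toy))
# ...and match numpy's built-ins. For np.cov, rowvar=False treats rows as
# data points and bias=True normalizes by N, matching our convention.
np.testing.assert_allclose(mean(X_toy), np.mean(X_toy, axis=0))
np.testing.assert_allclose(cov(X_toy), np.cov(X_toy, rowvar=False, bias=True))
print('sanity checks passed')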
With the mean function implemented, let's take a look at the mean face of our dataset!
In [68]:
def mean_face(faces):
    """Compute the mean of the `faces` dataset.

    Arguments
    ---------
    faces: (N, 64 * 64) ndarray representing the faces dataset.

    Returns
    -------
    mean_face: (64 * 64,) ndarray which is the mean of the faces.
    """
    mean_face = mean(faces)
    return mean_face

plt.imshow(mean_face(faces).reshape((64, 64)), cmap='gray');
To put things into perspective, we can benchmark the two different implementations with the %time magic in the following way:
In [19]:
# Create a random data matrix whose mean we want to compute
X = np.random.randn(1000, 20)
# Benchmarking time for computing mean
%time mean_naive(X)
%time mean(X)
pass
In [20]:
# Benchmarking time for computing covariance
%time cov_naive(X)
%time cov(X)
pass
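Note that %time measures a single run, which can be noisy. If you want a more stable estimate, you can also use IPython's %timeit magic, which repeats the measurement automatically and reports an average, for example:
In [ ]:
# %timeit runs each statement several times and reports the average running time.
%timeit mean(X)
%timeit cov(X)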
Alternatively, we can see how the running time increases as we increase the size of our dataset. In the following cells, we run mean, mean_naive, cov, and cov_naive many times on datasets of different sizes and collect their running times. If you are less familiar with Python, you may want to spend some time understanding what the code does. Understanding how your code scales with the size (or dimensionality) of your dataset is crucial when you want to apply your algorithm to a larger dataset. This is especially important when we propose alternative, more efficient algorithms to solve the same problem. We will use these techniques again later in this course to analyze the running time of our code.
In [12]:
def time(f, repeat=10):
    """A helper function to time the execution of a function.

    Arguments
    ---------
    f: the function whose execution we want to time.
    repeat: the number of times we execute `f`.

    Returns
    -------
    the mean and standard deviation of the execution time, in seconds.
    """
    times = []
    for _ in range(repeat):
        start = timeit.default_timer()
        f()
        stop = timeit.default_timer()
        times.append(stop - start)
    return np.mean(times), np.std(times)
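As a quick usage example (X_demo below is just a made-up dataset for illustration), the helper returns the mean and standard deviation of the running time in seconds:
In [ ]:
# Time 10 repetitions of `mean` on a random 1000 x 20 dataset.
X_demo = np.random.randn(1000, 20)
avg, std = time(lambda: mean(X_demo))
print('average: {:.6f}s, std: {:.6f}s'.format(avg, std))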
In the cell below, we compare the running time of mean and mean_naive: we repeatedly run both functions on a "fake" dataset X of different sizes, record their running times (fast_time and slow_time), and then plot the running time against the dataset size to visualize how it scales with the size of the dataset.
In [21]:
fast_time = []
slow_time = []
for size in np.arange(100, 5000, step=100):
    X = np.random.randn(size, 20)  # Create a random dataset of shape (size, 20)
    # Record the running time for mean
    f = lambda: mean(X)
    mu, sigma = time(f)
    fast_time.append((size, mu, sigma))
    # Record the running time for mean_naive
    f = lambda: mean_naive(X)
    mu, sigma = time(f)
    slow_time.append((size, mu, sigma))

fast_time = np.array(fast_time)
slow_time = np.array(slow_time)
In [22]:
fig, ax = plt.subplots()
ax.errorbar(fast_time[:,0], fast_time[:,1], fast_time[:,2], label='fast mean', linewidth=2)
ax.errorbar(slow_time[:,0], slow_time[:,1], slow_time[:,2], label='naive mean', linewidth=2)
ax.set_xlabel('size of dataset')
ax.set_ylabel('running time (s)')
plt.legend();
In [23]:
## === FILL IN THIS, follow the approach we have above ===
fast_time_cov = []
slow_time_cov = []
for size in np.arange(100, 5000, step=100):
    X = np.random.randn(size, 20)
    # Record the running time for cov
    f = lambda: cov(X)
    mu, sigma = time(f)
    fast_time_cov.append((size, mu, sigma))
    # Record the running time for cov_naive
    f = lambda: cov_naive(X)
    mu, sigma = time(f)
    slow_time_cov.append((size, mu, sigma))

fast_time_cov = np.array(fast_time_cov)
slow_time_cov = np.array(slow_time_cov)
In [24]:
fig, ax = plt.subplots()
ax.errorbar(fast_time_cov[:,0], fast_time_cov[:,1], fast_time_cov[:,2], label='fast covariance', linewidth=2)
ax.errorbar(slow_time_cov[:,0], slow_time_cov[:,1], slow_time_cov[:,2], label='naive covariance', linewidth=2)
ax.set_xlabel('size of dataset')
ax.set_ylabel('running time (s)')
plt.legend();
This week we are also going to verify a few properties of the mean and covariance under affine transformations of random variables.
Consider a data matrix $\boldsymbol{X}$ of size (N, D). We would like to know what happens to the mean and covariance when we apply an affine transformation $\boldsymbol{A}\boldsymbol{x}_i + \boldsymbol{b}$, with a matrix $\boldsymbol A$ and a vector $\boldsymbol b$, to each data point $\boldsymbol{x}_i$ in $\boldsymbol{X}$.
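To see what the answer should look like, here is a short derivation of the two identities that the functions below implement. For the mean, linearity of the sum gives
$$\boldsymbol m' = \frac{1}{N}\sum_{i=1}^N (\boldsymbol A \boldsymbol x_i + \boldsymbol b) = \boldsymbol A \left(\frac{1}{N}\sum_{i=1}^N \boldsymbol x_i\right) + \boldsymbol b = \boldsymbol A \boldsymbol m + \boldsymbol b.$$
For the covariance, the translation $\boldsymbol b$ cancels inside each term, since $\boldsymbol A\boldsymbol x_i + \boldsymbol b - \boldsymbol m' = \boldsymbol A(\boldsymbol x_i - \boldsymbol m)$, so
$$\boldsymbol S' = \frac{1}{N}\sum_{i=1}^N \boldsymbol A(\boldsymbol x_i - \boldsymbol m)(\boldsymbol x_i - \boldsymbol m)^\top \boldsymbol A^\top = \boldsymbol A \boldsymbol S \boldsymbol A^\top.$$
In particular, translating the dataset does not change its spread.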
In [63]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE
# ===YOU SHOULD EDIT THIS FUNCTION===
def affine_mean(mean, A, b):
    """Compute the mean after an affine transformation.

    Args:
        mean: (D,) ndarray, the mean vector
        A, b: the affine transformation applied to each data point x, i.e. Ax + b

    Returns:
        (D,) ndarray: mean vector after the affine transformation
    """
    affine_m = A @ mean + b
    return affine_m

# ===YOU SHOULD EDIT THIS FUNCTION===
def affine_covariance(S, A, b):
    """Compute the covariance matrix after an affine transformation.

    Args:
        S: (D, D) ndarray, the covariance matrix
        A, b: the affine transformation applied to each data point x, i.e. Ax + b

    Returns:
        (D, D) ndarray: covariance matrix after the transformation
    """
    # The translation b does not affect the covariance, so it does not appear here.
    affine_cov = A @ S @ A.T
    return affine_cov
Once the two functions above are implemented, we can verify the correctness of our implementation. Assume that we have some matrix $\boldsymbol A$ and vector $\boldsymbol b$:
In [55]:
random = np.random.RandomState(42)
A = random.randn(4,4)
b = random.randn(4)
Next, we can generate a random dataset $\boldsymbol{X}$:
In [56]:
X = random.randn(100, 4)
If for some dataset $\boldsymbol X$ the mean and covariance are $\boldsymbol m$ and $\boldsymbol S$, and for the new dataset $\boldsymbol X'$ after the affine transformation the mean and covariance are $\boldsymbol m'$ and $\boldsymbol S'$, then we have the following identities:
$$\boldsymbol m' = \text{affine\_mean}(\boldsymbol m, \boldsymbol A, \boldsymbol b)$$
$$\boldsymbol S' = \text{affine\_covariance}(\boldsymbol S, \boldsymbol A, \boldsymbol b)$$
In [57]:
X1 = ((A @ (X.T)).T + b) # applying affine transformation once
X2 = ((A @ (X1.T)).T + b) # and again
One very useful way to compare whether arrays are equal or similar is to use the helper functions in numpy.testing. These functions throw an AssertionError when the output does not satisfy the assertion.
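For example, the (purely illustrative) cell below shows both a passing and a failing assertion; assert_almost_equal compares values up to a default of 7 decimal places:
In [ ]:
# Passes: 0.1 + 0.2 differs from 0.3 only by ~1e-16, well within 7 decimals.
np.testing.assert_almost_equal(0.1 + 0.2, 0.3)
# Fails: the difference of 0.1 is far too large, so an AssertionError is raised.
try:
    np.testing.assert_almost_equal(0.1 + 0.2, 0.4)
except AssertionError:
    print('AssertionError raised, as expected')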
In [70]:
np.testing.assert_almost_equal(mean(X1), affine_mean(mean(X), A, b))
np.testing.assert_almost_equal(cov(X1), affine_covariance(cov(X), A, b))
print('correct')
Fill in the ??? below.
In [71]:
np.testing.assert_almost_equal(mean(X2), affine_mean(mean(X1), A, b))
np.testing.assert_almost_equal(cov(X2), affine_covariance(cov(X1), A, b))
print('correct')
Check out the numpy documentation for details.
If you are interested in learning more about floating point arithmetic, here is a good paper.
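As a small illustration of why we compared arrays with assert_almost_equal rather than exact equality, note that even simple sums of floats are subject to rounding:
In [ ]:
# Floating point rounding makes exact equality unreliable.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004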