Here are a few links for your reference. You may want to refer back to them throughout the course.
To learn more about using SciPy/NumPy, have a look at the Getting Started Guide. You should also refer to the numpy documentation for a reference of the available functions.
If you want to learn more about creating plots in Python, check out the tutorials on matplotlib's website https://matplotlib.org/tutorials/index.html. Once you are more familiar with plotting, check out this excellent blog post http://pbpython.com/effective-matplotlib.html.
There are also more advanced libraries for interactive data visualization, such as bqplot or d3.js, which you may want to explore if you feel adventurous.
Although we use Jupyter notebooks for these exercises, you may also want to check out Jupyter Lab when you work on your own projects.
If you are having issues with the grader, be sure to check out the Q&A.
If you are stuck on the programming assignments, you can visit the discussion forum and discuss with your peers.
First, let's import the packages that we will use for the week. Run the cell below to import the packages.
In [1]:
# PACKAGE: DO NOT EDIT
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from sklearn.datasets import fetch_olivetti_faces  # fetch_lfw_people and fetch_mldata are unused here (fetch_mldata was removed from recent scikit-learn versions)
import timeit  # note: we define our own `time` helper below, so we do not import the `time` module
In [2]:
%matplotlib inline
from ipywidgets import interact
Next, we are going to retrieve the Olivetti faces dataset.
When working with a new dataset, before digging into any further analysis, it is almost always useful to do a few things to understand it first. For example, datasets are usually stored as 2D matrices, so it is important to know which dimension of the matrix represents the feature dimension of the data and which represents the data points in the dataset.
In [3]:
image_shape = (64, 64)
# Load faces data
dataset = fetch_olivetti_faces()
faces = dataset.data
print('Shape of the faces dataset: {}'.format(faces.shape))
print('{} data points'.format(faces.shape[0]))
When your dataset consists of images, it's a really good idea to see what they look like. One very convenient tool in Jupyter is the interact widget, which we use to visualize the images (faces). For more information on how to use interact, have a look at the documentation here.
In [4]:
@interact(n=(0, len(faces)-1))
def display_faces(n=0):
    plt.figure()
    plt.imshow(faces[n].reshape((64, 64)), cmap='gray')
    plt.show()
You will now implement functions that compute the mean and covariance of a dataset.
There are two ways to compute the mean and covariance. The naive way is to iterate over the dataset, which would be implemented as a for loop in Python. However, computing them this way for a large dataset is slow. Alternatively, you can use the functions provided by numpy, which are much faster because numpy executes them in optimized machine code. You will implement functions that compute the mean and covariance both in the naive way and in the fast way; later we will compare the performance of the two approaches. If you need to find out which numpy routine to call, have a look at the documentation https://docs.scipy.org/doc/numpy/reference/. It is a good exercise to refer to the official documentation whenever you are not sure about something.
When you implement the functions for your assignment, make sure you read the docstrings to find out which dimension of your inputs corresponds to the number of data points and which corresponds to the feature dimension of the dataset.
In [5]:
# ===YOU SHOULD EDIT THIS FUNCTION===
def mean_naive(X):
    """Compute the mean for a dataset by iterating over the dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.
       N is the number of samples in the dataset
       and D is the feature dimension of the dataset.

    Returns
    -------
    mean: (D,) ndarray which is the mean of the dataset.
    """
    N, D = X.shape
    mean = np.zeros(D)
    # The naive approach requires us to iterate over the whole dataset with a for loop.
    for n in range(N):
        mean = mean + X[n] / N
    return mean
# ===YOU SHOULD EDIT THIS FUNCTION===
def cov_naive(X):
    """Compute the covariance for a dataset by iterating over the dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.
       N is the number of samples in the dataset
       and D is the feature dimension of the dataset.

    Returns
    -------
    covariance: (D, D) ndarray which is the covariance matrix of the dataset.
    """
    N, D = X.shape
    covariance = np.zeros((D, D))
    mu = mean_naive(X)  # compute the mean once, rather than on every loop iteration
    for n in range(N):
        # X[n] - mu is a (D,) vector; np.outer gives the (D, D) outer product.
        # (np.dot on two 1-D arrays would return a scalar, which is not what we want.)
        diff = X[n] - mu
        covariance = covariance + np.outer(diff, diff) / N
    return covariance
In [69]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE
# ===YOU SHOULD EDIT THIS FUNCTION===
def mean(X):
    """Compute the mean for a dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.

    Returns
    -------
    mean: (D,) ndarray which is the mean of the dataset.
    """
    mean = np.sum(X, axis=0) / X.shape[0]
    return mean
# ===YOU SHOULD EDIT THIS FUNCTION===
def cov(X):
    """Compute the covariance for a dataset.

    Arguments
    ---------
    X: (N, D) ndarray representing the dataset.

    Returns
    -------
    covariance_matrix: (D, D) ndarray which is the covariance matrix of the dataset.
    """
    # It is possible to vectorize our code for computing the covariance, i.e. we do not
    # need to explicitly iterate over the entire dataset, as looping in Python tends to be slow.
    N, D = X.shape
    X_centered = X - mean(X)  # center the data once instead of computing the mean twice
    covariance_matrix = np.dot(X_centered.T, X_centered) / N
    return covariance_matrix
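Before moving on, it is a good idea to sanity-check the implementations. The optional cell below (not part of the graded assignment; the toy array X_toy is just made up for this check) verifies that the naive and vectorized versions agree with each other and with numpy's built-in routines:
In [ ]:
# Optional sanity check on a small toy dataset.
X_toy = np.arange(12).reshape(4, 3).astype(float)
# The naive and vectorized implementations should agree with each other...
np.testing.assert_allclose(mean_naive(X_toy), mean(X_toy))
np.testing.assert_allclose(cov_naive(X_toy), cov(X_toy))
# ...and match numpy's built-ins. For np.cov, rowvar=False treats rows as
# data points and bias=True normalizes by N, matching our convention.
np.testing.assert_allclose(mean(X_toy), np.mean(X_toy, axis=0))
np.testing.assert_allclose(cov(X_toy), np.cov(X_toy, rowvar=False, bias=True))
print('sanity checks passed')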
With the mean function implemented, let's take a look at the mean face of our dataset!
In [68]:
def mean_face(faces):
    """Compute the mean of the `faces` dataset.

    Arguments
    ---------
    faces: (N, 64 * 64) ndarray representing the faces dataset.

    Returns
    -------
    mean_face: (64 * 64,) ndarray which is the mean of the faces.
    """
    mean_face = mean(faces)
    return mean_face

plt.imshow(mean_face(faces).reshape((64, 64)), cmap='gray');
To put things into perspective, we can benchmark the two different implementations with the %time magic in the following way:
In [19]:
# Create a random data matrix whose mean we want to compute
X = np.random.randn(1000, 20)
# Benchmarking time for computing mean
%time mean_naive(X)
%time mean(X)
pass
In [20]:
# Benchmarking time for computing covariance
%time cov_naive(X)
%time cov(X)
pass
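Note that %time measures a single run, which can be noisy. If you want a more stable estimate, you can also use IPython's %timeit magic, which repeats the measurement automatically and reports an average, for example:
In [ ]:
# %timeit runs each statement several times and reports the average running time.
%timeit mean(X)
%timeit cov(X)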
Alternatively, we can see how the running time increases as we increase the size of our dataset. In the following cells, we run mean, mean_naive, cov, and cov_naive many times on datasets of different sizes and collect their running times. If you are less familiar with Python, you may want to spend some time understanding what the code does. Understanding how your code scales with the size (or dimensionality) of your dataset is crucial when you want to apply your algorithm to a larger dataset. This is especially important when we propose alternative, more efficient algorithms to solve the same problem. We will use these techniques again later in this course to analyze the running time of our code.
In [12]:
def time(f, repeat=10):
    """A helper function to time the execution of a function.

    Arguments
    ---------
    f: the function whose execution we want to time.
    repeat: the number of times we execute `f`.

    Returns
    -------
    the mean and standard deviation of the execution time, in seconds.
    """
    times = []
    for _ in range(repeat):
        start = timeit.default_timer()
        f()
        stop = timeit.default_timer()
        times.append(stop - start)
    return np.mean(times), np.std(times)
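As a quick usage example (X_demo below is just a made-up dataset for illustration), the helper returns the mean and standard deviation of the running time in seconds:
In [ ]:
# Time 10 repetitions of `mean` on a random 1000 x 20 dataset.
X_demo = np.random.randn(1000, 20)
avg, std = time(lambda: mean(X_demo))
print('average: {:.6f}s, std: {:.6f}s'.format(avg, std))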
In the cell below, we compare the running time of mean and mean_naive: we repeatedly run both functions on a "fake" dataset X of different sizes, record their running times (fast_time and slow_time), and then plot the running time against the dataset size to visualize how it scales with the size of the dataset.
In [21]:
fast_time = []
slow_time = []
for size in np.arange(100, 5000, step=100):
    X = np.random.randn(size, 20)  # Create a random dataset of shape (size, 20)
    # Record the running time for mean
    f = lambda: mean(X)
    mu, sigma = time(f)
    fast_time.append((size, mu, sigma))
    # Record the running time for mean_naive
    f = lambda: mean_naive(X)
    mu, sigma = time(f)
    slow_time.append((size, mu, sigma))

fast_time = np.array(fast_time)
slow_time = np.array(slow_time)
In [22]:
fig, ax = plt.subplots()
ax.errorbar(fast_time[:,0], fast_time[:,1], fast_time[:,2], label='fast mean', linewidth=2)
ax.errorbar(slow_time[:,0], slow_time[:,1], slow_time[:,2], label='naive mean', linewidth=2)
ax.set_xlabel('size of dataset')
ax.set_ylabel('running time (s)')
plt.legend();
In [23]:
## === FILL IN THIS, follow the approach we have above ===
fast_time_cov = []
slow_time_cov = []
for size in np.arange(100, 5000, step=100):
    X = np.random.randn(size, 20)
    # Record the running time for cov
    f = lambda: cov(X)
    mu, sigma = time(f)
    fast_time_cov.append((size, mu, sigma))
    # Record the running time for cov_naive
    f = lambda: cov_naive(X)
    mu, sigma = time(f)
    slow_time_cov.append((size, mu, sigma))

fast_time_cov = np.array(fast_time_cov)
slow_time_cov = np.array(slow_time_cov)
In [24]:
fig, ax = plt.subplots()
ax.errorbar(fast_time_cov[:,0], fast_time_cov[:,1], fast_time_cov[:,2], label='fast covariance', linewidth=2)
ax.errorbar(slow_time_cov[:,0], slow_time_cov[:,1], slow_time_cov[:,2], label='naive covariance', linewidth=2)
ax.set_xlabel('size of dataset')
ax.set_ylabel('running time (s)')
plt.legend();
This week we are also going to verify a few properties of the mean and covariance under affine transformations of random variables.
Consider a data matrix $\boldsymbol{X}$ of size (N, D). We would like to know what happens to the mean and covariance when we apply an affine transformation $\boldsymbol{A}\boldsymbol{x}_i + \boldsymbol{b}$, with a matrix $\boldsymbol A$ and a vector $\boldsymbol b$, to each data point $\boldsymbol{x}_i$ in $\boldsymbol{X}$.
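To see what the answer should look like, here is a short derivation of the two identities that the functions below implement. For the mean, linearity of the sum gives
$$\boldsymbol m' = \frac{1}{N}\sum_{i=1}^N (\boldsymbol A \boldsymbol x_i + \boldsymbol b) = \boldsymbol A \left(\frac{1}{N}\sum_{i=1}^N \boldsymbol x_i\right) + \boldsymbol b = \boldsymbol A \boldsymbol m + \boldsymbol b.$$
For the covariance, the translation $\boldsymbol b$ cancels inside each term, since $\boldsymbol A\boldsymbol x_i + \boldsymbol b - \boldsymbol m' = \boldsymbol A(\boldsymbol x_i - \boldsymbol m)$, so
$$\boldsymbol S' = \frac{1}{N}\sum_{i=1}^N \boldsymbol A(\boldsymbol x_i - \boldsymbol m)(\boldsymbol x_i - \boldsymbol m)^\top \boldsymbol A^\top = \boldsymbol A \boldsymbol S \boldsymbol A^\top.$$
In particular, translating the dataset does not change its spread.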
In [63]:
# GRADED FUNCTION: DO NOT EDIT THIS LINE
# ===YOU SHOULD EDIT THIS FUNCTION===
def affine_mean(mean, A, b):
    """Compute the mean after an affine transformation.

    Args:
        mean: (D,) ndarray, the mean vector
        A, b: the affine transformation applied to each data point x, i.e. Ax + b

    Returns:
        (D,) ndarray: mean vector after the affine transformation
    """
    affine_m = A @ mean + b
    return affine_m

# ===YOU SHOULD EDIT THIS FUNCTION===
def affine_covariance(S, A, b):
    """Compute the covariance matrix after an affine transformation.

    Args:
        S: (D, D) ndarray, the covariance matrix
        A, b: the affine transformation applied to each data point x, i.e. Ax + b

    Returns:
        (D, D) ndarray: covariance matrix after the transformation
    """
    # The translation b does not affect the covariance, so it does not appear here.
    affine_cov = A @ S @ A.T
    return affine_cov
Once the two functions above are implemented, we can verify the correctness of our implementation. Assume that we have some matrix $\boldsymbol A$ and vector $\boldsymbol b$:
In [55]:
random = np.random.RandomState(42)
A = random.randn(4,4)
b = random.randn(4)
Next, we can generate a random dataset $\boldsymbol{X}$:
In [56]:
X = random.randn(100, 4)
If for some dataset $\boldsymbol X$ the mean and covariance are $\boldsymbol m$ and $\boldsymbol S$, and for the new dataset $\boldsymbol X'$ after the affine transformation the mean and covariance are $\boldsymbol m'$ and $\boldsymbol S'$, then we have the following identities:
$$\boldsymbol m' = \text{affine\_mean}(\boldsymbol m, \boldsymbol A, \boldsymbol b)$$
$$\boldsymbol S' = \text{affine\_covariance}(\boldsymbol S, \boldsymbol A, \boldsymbol b)$$
In [57]:
X1 = ((A @ (X.T)).T + b) # applying affine transformation once
X2 = ((A @ (X1.T)).T + b) # and again
One very useful way to compare whether arrays are equal or similar is to use the helper functions in numpy.testing. These functions throw an AssertionError when the output does not satisfy the assertion.
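For example, the (purely illustrative) cell below shows both a passing and a failing assertion; assert_almost_equal compares values up to a default of 7 decimal places:
In [ ]:
# Passes: 0.1 + 0.2 differs from 0.3 only by ~1e-16, well within 7 decimals.
np.testing.assert_almost_equal(0.1 + 0.2, 0.3)
# Fails: the difference of 0.1 is far too large, so an AssertionError is raised.
try:
    np.testing.assert_almost_equal(0.1 + 0.2, 0.4)
except AssertionError:
    print('AssertionError raised, as expected')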
In [70]:
np.testing.assert_almost_equal(mean(X1), affine_mean(mean(X), A, b))
np.testing.assert_almost_equal(cov(X1), affine_covariance(cov(X), A, b))
print('correct')
Fill in the ??? below.
In [71]:
np.testing.assert_almost_equal(mean(X2), affine_mean(mean(X1), A, b))
np.testing.assert_almost_equal(cov(X2), affine_covariance(cov(X1), A, b))
print('correct')
Check out the numpy documentation for details.
If you are interested in learning more about floating point arithmetic, here is a good paper.
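As a small illustration of why we compared arrays with assert_almost_equal rather than exact equality, note that even simple sums of floats are subject to rounding:
In [ ]:
# Floating point rounding makes exact equality unreliable.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004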