In [1]:
%matplotlib notebook
In [2]:
import numpy as np
from dimensionality_reduction import generate_A, generate_B, generate_C
from dimensionality_reduction import plot_3d, plot_pca, plot_lowdim, PCA
Dimensionality reduction is a general technique to find low-dimensional representations for high-dimensional stimuli. In this problem, we'll examine the classical approach called principal component analysis (PCA).
Run each of the following cells to generate the three datasets and plot each one in a separate figure so you can see what it looks like. Note that the colors of the points have no meaning; they are only there to help you visualize how the points are distributed in space. You can click and drag the figures to rotate them and view the data from multiple angles, which will help you better visualize the points.
In [3]:
# Generate dataset A and show the first few values
A_data, A_colors = generate_A()
A_data[:10]
Out[3]:
In [4]:
# Show dataset A
plot_3d(A_data, A_colors, "Dataset A")
In [5]:
# Generate dataset B and show the first few values
B_data, B_colors = generate_B()
B_data[:10]
Out[5]:
In [6]:
plot_3d(B_data, B_colors, "Dataset B")
In [7]:
# Generate dataset C and show the first few values
C_data, C_colors = generate_C()
C_data[:10]
Out[7]:
In [8]:
plot_3d(C_data, C_colors, "Dataset C")
Dataset A is embedded in 2 dimensions but varies along only 1 dimension. Dataset B is embedded in 3 dimensions but varies in only 2 dimensions. Dataset C is also embedded in 3 dimensions and varies in 2 dimensions: it is like a rolled-up sheet of paper, and a sheet of paper has only 2 dimensions.
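To build intuition for what "embedded in 3 dimensions but varying in 2" means, here is a rough, hypothetical sketch of how similar-looking data could be constructed with NumPy. The actual generate_A, generate_B, and generate_C functions may be implemented differently; this is only an illustration.

rng = np.random.default_rng(0)
n = 500

# A line: one underlying parameter mapped linearly into 2 coordinates (like dataset A)
t = rng.uniform(-1, 1, n)
A_like = np.column_stack([t, 2 * t]) + 0.05 * rng.normal(size=(n, 2))

# A plane: two underlying parameters mapped linearly into 3 coordinates (like dataset B)
u, v = rng.uniform(-1, 1, (2, n))
B_like = np.column_stack([u, v, 0.5 * u + 0.5 * v])

# A rolled-up sheet: two parameters, one mapped nonlinearly into 3 coordinates (like dataset C)
theta = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)
C_like = np.column_stack([theta * np.cos(theta), v, theta * np.sin(theta)])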
Run the following cells, which plot each dataset again but additionally perform PCA on the data and plot some additional information.
In [9]:
# Plot the A dataset
plot_3d(A_data, A_colors, "A data")
plot_pca(A_data, 1)
In [10]:
# Plot the B dataset
plot_3d(B_data, B_colors, "B data")
plot_pca(B_data, 2)
In [11]:
# Plot the C dataset
plot_3d(C_data, C_colors, "C data")
plot_pca(C_data, 2)
In [15]:
# documentation for plot_pca
plot_pca??
In [16]:
# documentation for PCA
PCA??
In [17]:
# documentation for np.linalg.eig
np.linalg.eig??
The red and green bars represent the principal components that capture the greatest variation in the dataset. They are the eigenvectors corresponding to the largest eigenvalues of the dataset's covariance matrix. The special property the bars have with respect to each other is that they are orthogonal (perpendicular to one another): the covariance matrix is symmetric, and eigenvectors of a symmetric matrix that correspond to distinct eigenvalues are always orthogonal.
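As a concrete illustration of that claim, here is a minimal PCA sketch written directly in NumPy (this is not the provided PCA class, just the same computation spelled out): center the data, form the covariance matrix, take its eigendecomposition with np.linalg.eig, and keep the eigenvectors with the largest eigenvalues.

def pca_sketch(X, k):
    # Center the data so the covariance is computed about the mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix (features x features)
    cov = np.cov(Xc, rowvar=False)
    # np.linalg.eig does not sort its output, so order by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eig(cov)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order][:k], eigvecs[:, order[:k]]

# The top two eigenvectors of the (symmetric) covariance matrix are orthogonal
vals, vecs = pca_sketch(B_data, 2)
print(np.dot(vecs[:, 0], vecs[:, 1]))   # approximately 0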
Run the following cells, which plot each dataset in the low-dimensional coordinates found by the PCA algorithm.
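For reference, the low-dimensional coordinates are just the centered data projected onto the top eigenvectors. Assuming the hypothetical pca_sketch helper above (plot_lowdim may compute this differently internally), the projection looks like:

vals, vecs = pca_sketch(C_data, 2)
C_lowdim = (C_data - C_data.mean(axis=0)) @ vecs
C_lowdim.shape   # (number of points, 2)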
In [18]:
plot_lowdim(A_data, A_colors, "Dataset A")
In [19]:
plot_lowdim(B_data, B_colors, "Dataset B")
In [20]:
plot_lowdim(C_data, C_colors, "Dataset C")
PCA works well on dataset A because the data are originally just a line plotted in 2D. Dataset B can likewise be projected into 2 dimensions and PCA again works well, because the data lie on a 2D plane originally plotted in 3D. Dataset C is also projected into 2D, but the result does not preserve the structure of the data. PCA does not work well on this dataset because the structure is nonlinear: the points lie along what is called a manifold, and PCA can only find linear projections. A more sophisticated, nonlinear dimensionality reduction technique would be required to project dataset C down to two dimensions in a way that makes sense.
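For example, a manifold-learning method such as Isomap, available in scikit-learn, tries to preserve distances measured along the manifold rather than straight-line distances. A hedged sketch (not part of this assignment, and the best n_neighbors value depends on how densely the data is sampled):

from sklearn.manifold import Isomap

# Estimate the manifold from each point's nearest neighbors, then embed in 2D
iso = Isomap(n_neighbors=10, n_components=2)
C_unrolled = iso.fit_transform(C_data)
C_unrolled.shape   # (number of points, 2)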