Guillaume Lemaitre - Fabrice Meriaudeau - Joan Massich
In [1]:
%matplotlib inline
%pprint off
# Matplotlib library
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
# MPLD3 extension
import mpld3
# Plotly extension
import plotly.plotly as py
from plotly.graph_objs import *
py.sign_in('glemaitre', 'se04g0bmi2')
# Numpy library
import numpy as np
In this section, we will deal with the data located in ./data/pca_lda_data.mat.
In [2]:
# Import the module to import .mat file
from scipy.io import loadmat
# Read the data
data = loadmat('./data/pca_lda_data.mat')
# Extract the data
class_1 = np.asmatrix(data['class1'].T)
class_2 = np.asmatrix(data['class2'].T)
class_3 = np.asmatrix(data['class3'].T)
# Concatenate all the data into a single matrix
### Concatenate the class #1, class #2, class #3
data = np.concatenate((class_1,
                       class_2,
                       class_3),
                      axis=0)
# Concatenate the ground-truth and make sure that the output is a vector
### Concatenate the ground-truth of the class #1, class #2, class #3
gt = np.ravel(np.concatenate((np.ones((np.shape(class_1)[0], 1)) * 1,
                              np.ones((np.shape(class_2)[0], 1)) * 2,
                              np.ones((np.shape(class_3)[0], 1)) * 3),
                             axis=0))
(a) Plot the data using the Plotly toolbox and the Scatter3d function. Check the following example: https://plot.ly/python/3d-scatter-plots/. In order to plot the figure inside this notebook, do not use plot_url. Use the following properties:
In [ ]:
...
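One possible completion is sketched below; it assumes the data has three feature dimensions, and the trace names and the filename are arbitrary choices.
# Sketch of a possible solution: one Scatter3d trace per class, plotted inline
trace_1 = Scatter3d(x=np.ravel(class_1[:, 0]), y=np.ravel(class_1[:, 1]),
                    z=np.ravel(class_1[:, 2]), mode='markers', name='Class #1')
trace_2 = Scatter3d(x=np.ravel(class_2[:, 0]), y=np.ravel(class_2[:, 1]),
                    z=np.ravel(class_2[:, 2]), mode='markers', name='Class #2')
trace_3 = Scatter3d(x=np.ravel(class_3[:, 0]), y=np.ravel(class_3[:, 1]),
                    z=np.ravel(class_3[:, 2]), mode='markers', name='Class #3')
# py.iplot() embeds the figure in the notebook instead of returning a plot_url
py.iplot(Data([trace_1, trace_2, trace_3]), filename='pca-lda-data')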
Implement PCA by completing the following Python code in order to return the eigenvalues and eigenvectors. Your implementation should contain the small size trick in case it is necessary. To do so, we will decompose the problem into several steps:
(a) Complete the following function by computing the mean vector and subtracting it from the data.
In [ ]:
# Define a function to obtain mean invariance
def MeanInvariance(X):
    # Compute the mean vector
    ### Use the function np.mean() with the axis argument
    mean_data = ...
    # Subtract the mean vector and return the matrix
    return ...
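For reference, a minimal sketch of what the completed function might look like, taking the mean along the sample axis:
# Sketch: centre the data by subtracting the per-feature mean
def MeanInvariance(X):
    # Compute the mean vector along the samples (axis 0)
    mean_data = np.mean(X, axis=0)
    # Subtract the mean vector and return the centred matrix
    return X - mean_data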
(b) Complete the following function in order to obtain the eigenvalues and eigenvectors. This function neglects the small size trick. To do so, you will need to compute the covariance matrix, decompose it with np.linalg.eig(), and force the outputs to be 2-D with np.atleast_2d().
In [ ]:
# Define a function to obtain the eigen decomposition as in the original PCA
def NormalDecompositionPCA(X):
    # Compute the covariance matrix without the small size trick
    C = ...
    # Compute the eigenvalues and eigenvectors of C
    ### Use the function np.linalg.eig()
    w, v = ...
    ### Use the function np.atleast_2d()
    w = ...
    v = ...
    return (w, v)
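A possible completion, as a sketch: it assumes X is already centred and arranged as an L x N matrix (features x samples), following the convention used in PCAPr below, and normalises the covariance by N - 1.
def NormalDecompositionPCA(X):
    # Covariance matrix of the centred L x N data: C is L x L
    C = np.dot(X, X.T) / (float(np.shape(X)[1]) - 1.)
    # Eigendecomposition of the covariance matrix
    w, v = np.linalg.eig(C)
    # Force the outputs to be 2-D for the subsequent manipulations
    w = np.atleast_2d(w)
    v = np.atleast_2d(v)
    return (w, v)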
(c) Complete the following Python function to implement the small size trick. To do so, you will need to apply the decomposition to the transposed data and recompute the eigenvectors for the original data.
In [ ]:
# Define a function to obtain the eigen decomposition with the small size trick
def SmallSizeTrickDecompositionPCA(X):
    # Apply PCA by transposing the data
    w, v = ...
    # Recompute the eigenvectors for the original data
    v = ...
    return (w, v)
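A sketch of one way to fill it in: decompose the smaller N x N matrix obtained by transposing the data, then map the eigenvectors back with X and re-normalise them to unit length (the eigenvalue scaling is left untouched here).
def SmallSizeTrickDecompositionPCA(X):
    # Decompose the N x N matrix built from the transposed data
    w, v = NormalDecompositionPCA(X.T)
    # Recover the eigenvectors of the original L x L problem: u = X v
    v = np.dot(X, v)
    # Normalise each eigenvector to unit length
    v = v / np.linalg.norm(v, axis=0)
    return (w, v)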
(d) Put everything together to complete the PCA function.
In [ ]:
# Implement PCA returning the eigenvectors and eigenvalues
### data should be N x L with N the number of samples and L the number of feature dimensions
def PCAPr(X):
    # NOTE: Transpose the data in order to obtain an L x N matrix, as per the PCA convention
    # Apply the mean invariance
    data_pca = ...
    # Check if we apply the small size trick or not
    ### Apply the small size trick
    if ...:
        return ...
    ### Normal PCA
    else:
        return ...
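A sketch of a possible assembly, assuming the L x N convention of the helper functions above; the small size trick is used when there are fewer samples than feature dimensions.
def PCAPr(X):
    # Centre the N x L data, then transpose it to the L x N convention
    data_pca = MeanInvariance(X).T
    # Fewer samples than dimensions: apply the small size trick
    if np.shape(data_pca)[1] < np.shape(data_pca)[0]:
        return SmallSizeTrickDecompositionPCA(data_pca)
    # Otherwise, the normal decomposition
    else:
        return NormalDecompositionPCA(data_pca)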
The projection of the data to a lower-dimensional space using the eigenvalues and eigenvectors is similar for PCA and LDA. Thus, you will create a specific function to handle this projection.
(a) Complete the following function in order to project the data into a space with a lower dimensionality. To do so, you will have to sort the eigenvalues and eigenvectors by decreasing eigenvalue and project the data onto the leading eigenvectors.
In [ ]:
# Project the data onto a lower-dimensional space
### data should be N x L with N the number of samples and L the number of feature dimensions
def ProjectionData(X, v, w, nb_component=1):
    if nb_component > np.shape(X)[1]:
        raise ValueError('The number of components cannot be higher than the initial '
                         'number of dimensions of X. We do dimensionality reduction here!')
    # Start by sorting the eigenvalues and eigenvectors based on the eigenvalues
    ### Find the sorting index using np.argsort()
    ### Use np.ravel() to force the result to be a vector
    idx_sorted = ...
    # Re-order the eigenvalues and eigenvectors
    w = ...
    v = ...
    return ...
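A possible completion, sketched under the assumption that X is organised N x L, that v stores one eigenvector per column, and that any centring of X is done by the caller.
def ProjectionData(X, v, w, nb_component=1):
    if nb_component > np.shape(X)[1]:
        raise ValueError('The number of components cannot be higher than the '
                         'initial number of dimensions of X.')
    # Sort the eigenvalues in decreasing order
    idx_sorted = np.ravel(np.argsort(np.ravel(w)))[::-1]
    # Re-order the eigenvalues and eigenvectors accordingly
    w = w[:, idx_sorted]
    v = v[:, idx_sorted]
    # Project the data onto the nb_component leading eigenvectors
    return np.dot(X, v[:, :nb_component])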
(a) Apply PCA and the dimensionality reduction to each class.
In [ ]:
...
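One way to complete this cell, as a sketch: run PCAPr on the whole data set and project each class with ProjectionData (the variable names below are arbitrary; depending on your conventions, you may want to subtract the global mean from each class before projecting).
# Apply the custom PCA on the whole data set
w, v = PCAPr(data)
# Project each class onto the first principal component
class_1_pca_pr = ProjectionData(class_1, v, w, nb_component=1)
class_2_pca_pr = ProjectionData(class_2, v, w, nb_component=1)
class_3_pca_pr = ProjectionData(class_3, v, w, nb_component=1)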
(b) Plot the distribution of the different classes once projected in the low-dimensional space. Use the following functions:
plt.hist to plot the pdf of each class, and plt.setp to change the color of the bins.
In [ ]:
# Plot the different class
nb_bins = 50
fig = plt.figure()
# Distribution of the projection of the class #1
n, bins, patches = plt.hist(...)
plt.setp(...)
# Distribution of the projection of the class #2
n, bins, patches = plt.hist(...)
plt.setp(...)
# Distribution of the projection of the class #3
n, bins, patches = plt.hist(...)
plt.setp(...)
plt.legend(framealpha=0)
mpld3.display(fig)
As a reference, we give an example of how to use the PCA implemented in scikit-learn.
In [3]:
from sklearn.decomposition import PCA
pca_data = PCA(n_components = 1)
pca_data.fit(data)
class_1_pca = pca_data.transform(class_1)
class_2_pca = pca_data.transform(class_2)
class_3_pca = pca_data.transform(class_3)
# Plot the different class
nb_bins = 50
fig = plt.figure()
# Distribution of the projection of the class #1
n, bins, patches = plt.hist(class_1_pca, nb_bins, normed=1, histtype='stepfilled', label='Class #1')
plt.setp(patches, 'facecolor', 'b', 'alpha', 0.75)
# Distribution of the projection of the class #2
n, bins, patches = plt.hist(class_2_pca, nb_bins, normed=1, histtype='stepfilled', label='Class #2')
plt.setp(patches, 'facecolor', 'r', 'alpha', 0.75)
# Distribution of the projection of the class #3
n, bins, patches = plt.hist(class_3_pca, nb_bins, normed=1, histtype='stepfilled', label='Class #3')
plt.setp(patches, 'facecolor', 'g', 'alpha', 0.75)
plt.legend(framealpha=0)
mpld3.display(fig)
Out[3]:
LDA differs from PCA in that it takes the class properties into account. Thus, we define the following Python class pr_class, which will be used in LDA. The class pr_class is characterized by the following attributes:
n_dims: the number of feature dimensions,
n_samples: the number of samples,
data: a matrix with the entire data, of size n_samples by n_dims,
gt: a vector with the ground-truth labels,
mean_vec: a vector with the mean of each feature dimension,
cov_mat: a matrix containing the covariance of the data,
prior: a scalar with the prior of the class.
In [ ]:
# Define a class for all the information
class pr_class(object):
    def __init__(self, X, y, prior):
        self.n_dims = np.shape(X)[1]
        self.n_samples = np.shape(X)[0]
        self.data = X
        self.gt = y
        self.mean_vec = np.mean(X, axis=0)
        self.cov_mat = (self.data - self.mean_vec).T * (self.data - self.mean_vec) / (float(self.n_samples) - 1.)
        self.prior = prior
As for PCA, the implementation of LDA will be broken down into several steps.
We first need to split the data by class and compute the mean and covariance by calling the constructor of the class pr_class.
In [ ]:
# Define a function to build a list (or vector) of pr_class
def BuildListClass(X, y, priors):
    # Find the number of classes
    nb_classes = np.size(np.unique(y))
    # Create an object for each class
    return [pr_class(X[np.ravel(np.nonzero(y == (np.unique(y)[count]))), :],
                     y[np.ravel(np.nonzero(y == (np.unique(y)[count])))],
                     priors[count]) for count in xrange(nb_classes)]
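As a usage sketch, assuming equal priors for the three classes:
# Build the list of pr_class objects with equal priors (an assumption)
classes_pr = BuildListClass(data, gt, [1. / 3., 1. / 3., 1. / 3.])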
(a) Complete the following code to compute the within-class scatter matrix, i.e., sum the covariance matrices of the classes, each weighted by its prior.
In [ ]:
# Define a function to compute the within-class scatter matrix
### X is the list of `pr_class`
def ComputeSw(X):
    Sw = 0.
    for c in range(0, np.size(X)):
        Sw += ...
    return Sw
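A sketch of the completed loop, summing the class covariance matrices weighted by their priors:
def ComputeSw(X):
    Sw = 0.
    for c in range(0, np.size(X)):
        # Covariance of class c weighted by its prior
        Sw += X[c].prior * X[c].cov_mat
    return Sw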
(b) Complete the following code to compute the between-class scatter matrix. To do so, you will need to compute the global statistics of the mixture and accumulate the contribution of each class.
In [ ]:
# Define a function to compute the between-class scatter matrix
### X is the list of `pr_class`
def ComputeSb(X):
    # Concatenate all the data to get the mixture covariance
    all_data = ...
    all_data = ...
    Sb = 0.
    for c in range(0, np.size(X)):
        Sb += ...
    return Sb
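A possible completion, sketched with one common definition of the between-class scatter matrix; here the variable all_data is first the concatenated data and is then reused to hold the global (mixture) mean.
def ComputeSb(X):
    # Concatenate all the data and compute the global mean of the mixture
    all_data = np.concatenate([c.data for c in X], axis=0)
    all_data = np.mean(all_data, axis=0)
    Sb = 0.
    for c in range(0, np.size(X)):
        # Outer product of the class-to-global mean difference, weighted by the prior
        diff = X[c].mean_vec - all_data
        Sb += X[c].prior * np.dot(diff.T, diff)
    return Sb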
(a) Complete the following function to implement LDA. To do so, build the list of pr_class objects, compute the scatter matrices, and perform the eigendecomposition.
In [ ]:
# Definition of LDA
def LDAPr(X, y, priors):
    # Build the list of classes
    classes_pr = ...
    # Compute the matrix for the subsequent decomposition
    S = ...
    # Compute the eigenvalues and eigenvectors of S
    w, v = ...
    w = ...
    v = ...
    return (w, v)
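A sketch of a possible assembly, reusing the helpers above: S is taken as Sw^-1 Sb and decomposed with np.linalg.eig (the result may carry a small imaginary part due to numerical noise).
def LDAPr(X, y, priors):
    # Build the list of classes
    classes_pr = BuildListClass(X, y, priors)
    # Compute S = Sw^-1 * Sb for the subsequent decomposition
    S = np.dot(np.linalg.inv(ComputeSw(classes_pr)), ComputeSb(classes_pr))
    # Compute the eigenvalues and eigenvectors of S
    w, v = np.linalg.eig(S)
    w = np.atleast_2d(w)
    v = np.atleast_2d(v)
    return (w, v)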
(a) Apply LDA and the dimensionality reduction to each class.
In [ ]:
...
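A sketch of one possible completion, reusing ProjectionData with equal priors (an assumption) for the three classes; the variable names are arbitrary.
# Apply the custom LDA on the whole data set
w, v = LDAPr(data, gt, [1. / 3., 1. / 3., 1. / 3.])
# Project each class onto the first discriminant direction
class_1_lda_pr = ProjectionData(class_1, v, w, nb_component=1)
class_2_lda_pr = ProjectionData(class_2, v, w, nb_component=1)
class_3_lda_pr = ProjectionData(class_3, v, w, nb_component=1)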
(b) Plot the distribution of the different classes once projected in the low-dimensional space.
In [ ]:
...
As a reference, we give an example of how to use the LDA implemented in scikit-learn.
In [4]:
# Import LDA from scikit-learn
from sklearn.lda import LDA
# Project using LDA
lda = LDA(n_components=1)
lda.fit(data, gt)
class_1_lda = lda.transform(class_1)
class_2_lda = lda.transform(class_2)
class_3_lda = lda.transform(class_3)
# Plot the different class
nb_bins = 50
fig = plt.figure()
# Distribution of the projection of the class #1
n, bins, patches = plt.hist(class_1_lda, nb_bins, normed=1, histtype='stepfilled', label='Class #1')
plt.setp(patches, 'facecolor', 'b', 'alpha', 0.75)
# Distribution of the projection of the class #2
n, bins, patches = plt.hist(class_2_lda, nb_bins, normed=1, histtype='stepfilled', label='Class #2')
plt.setp(patches, 'facecolor', 'r', 'alpha', 0.75)
# Distribution of the projection of the class #3
n, bins, patches = plt.hist(class_3_lda, nb_bins, normed=1, histtype='stepfilled', label='Class #3')
plt.setp(patches, 'facecolor', 'g', 'alpha', 0.75)
plt.legend(framealpha=0)
mpld3.display(fig)
Out[4]:
(a) Explain in a few lines the difference between PCA and LDA.
(b) Explain the difference between feature selection and feature extraction.