If a vector is multiplied by a scalar, its direction stays the same, but its length can change.
If a vector is multiplied by a matrix, it can "point" in a new direction (it can be rotated).
Input data X is of shape (N, D).
Transformation matrix Q is of shape (D, D).
Transformed data Z is of shape (N, D).
Multiplying a vector by a matrix can be seen as rotating the coordinate system of the vector.
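As a minimal sketch of these shapes (the matrix built with np.linalg.qr below is just an arbitrary orthonormal transformation chosen for illustration):

import numpy as np

N, D = 100, 3
X = np.random.randn(N, D)                    # input data, shape (N, D)
Q = np.linalg.qr(np.random.randn(D, D))[0]   # an arbitrary orthonormal matrix, shape (D, D)
Z = X @ Q                                    # transformed (rotated) data, shape (N, D)
print(X.shape, Q.shape, Z.shape)             # (100, 3) (3, 3) (100, 3)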
The information carried by a variable can be assessed by measuring its variance.
A variable having variance equal to zero provides no information.
High correlation between two variables means that one of them can be disregarded, since they provide approximately the same information.
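A toy illustration of both points (the variables below are made up for this example):

import numpy as np

np.random.seed(0)
x = np.random.randn(500)
constant = np.full(500, 3.0)               # variance 0 -> carries no information
y = 2 * x + 0.01 * np.random.randn(500)    # almost a scaled copy of x

print(np.var(constant))                    # 0.0
print(np.corrcoef(x, y)[0, 1])             # close to 1, so one of x, y could be dropped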
The goal of PCA is to find a matrix Q that decorrelates the variables, so that the most crucial information is concentrated in the smallest number of dimensions.
Assuming that noise is not a major factor in the data, it is fair to state that the principal components capturing the noise will be somewhere down the sorted list. Thus, the PCA transformation also acts as a denoising pre-processing step.
Furthermore, by reducing the number of parameters, the risk of fitting the model to the noise (overfitting) is significantly reduced.
That explains why the transformed variables are largely uncorrelated.
Very important: uncorrelated does not necessarily mean independent, unless the distribution is Gaussian.
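A classic counterexample, sketched here as a toy check: with x drawn from a symmetric distribution, y = x**2 is completely determined by x (so the two are not independent), yet their correlation is approximately zero.

import numpy as np

np.random.seed(0)
x = np.random.randn(100000)
y = x ** 2                          # a deterministic function of x -> dependent
print(np.corrcoef(x, y)[0, 1])      # approximately 0 -> uncorrelated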
The covariance of X is given as: \begin{equation*} Cov(X)= \frac{1}{N} (X - \mu _{X})^{T} (X - \mu _{X}) \end{equation*}
Cov(X) is a D by D matrix.
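A minimal sketch of this formula on toy data; np.cov divides by N - 1 by default, so bias=True is passed to match the 1/N convention used above:

import numpy as np

N, D = 500, 4
X = np.random.randn(N, D)
mu = X.mean(axis=0)
cov_manual = (X - mu).T @ (X - mu) / N            # shape (D, D)
cov_numpy = np.cov(X, rowvar=False, bias=True)    # columns treated as variables
print(np.allclose(cov_manual, cov_numpy))         # True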
Eigenvectors (v) are non-zero vectors that only change by a scalar factor when a linear transformation is applied to them.
\begin{equation*} Av = \lambda v \end{equation*}The equation can be re-written as:
\begin{equation*} (A - \lambda I) v = 0 \end{equation*}It has a non-zero solution if and only if the determinant is zero: \begin{equation*} \det( A - \lambda I ) = 0 \end{equation*}
λ is known as the eigenvalue or characteristic root associated with the eigenvector v.
The goal is to find the eigenvalues and their corresponding eigenvectors.
There will be D eigenvalues (because X is of shape N by D, so Cov(X) is D by D) and, since the covariance matrix is positive semi-definite, their values will be greater than or equal to zero.
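A short check on toy data; np.linalg.eigh is used here because the covariance matrix is symmetric, and it returns real eigenvalues in ascending order:

import numpy as np

X = np.random.randn(500, 3)
C = np.cov(X, rowvar=False, bias=True)
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh is meant for symmetric matrices
print(eigenvalues)                              # D values, all >= 0 (up to rounding error)
print(eigenvectors.shape)                       # (D, D), one eigenvector per column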
After finding the eigenvalues and their corresponding eigenvectors, the eigenvalues are sorted in descending order and combined into a diagonal matrix:
\begin{equation*} \Lambda = \begin{pmatrix} \lambda_{1} & 0 & \cdots & 0\\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{D} \end{pmatrix} \end{equation*}A V matrix is constructed with the corresponding eigenvectors as columns: \begin{equation*} V = \begin{pmatrix} | & | & & | \\ v_{1} & v_{2} & \cdots & v_{D}\\ | & | & & | \end{pmatrix} \end{equation*}
The V matrix is orthonormal, which means that the dot product of any two distinct eigenvectors is zero and each eigenvector has unit length.
\begin{equation*} Cov(X)V = V \Lambda \end{equation*}Furthermore, the covariance of Z is given as: \begin{equation*} Cov(Z)= \frac{1}{N} (Z - \mu _{Z})^{T} (Z - \mu _{Z}) = \frac{1}{N} (XQ - \mu _{X}Q)^{T} (XQ - \mu _{X}Q) = \frac{1}{N} Q^{T}(X - \mu _{X})^{T} (X - \mu _{X})Q = Q^{T} Cov(X) Q \end{equation*}
By choosing Q = V we obtain: \begin{equation*} Cov(Z) = V^{T} Cov(X) V = V^{T} V \Lambda = \Lambda \end{equation*}
The off-diagonal elements are equal to zero, which means that no dimension is correlated with any other dimension. All values of λ were sorted in descending order, so the first column of Z has the most variance and thus carries the most information.
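A small numerical check of this diagonalisation (toy data; the mixing matrix below just introduces some correlation between the columns):

import numpy as np

np.random.seed(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X = np.random.randn(1000, 3) @ mixing          # correlated 3-D data
C = np.cov(X, rowvar=False, bias=True)
eigvals, V = np.linalg.eigh(C)
V = V[:, np.argsort(eigvals)[::-1]]            # columns sorted by descending eigenvalue
Z = (X - X.mean(axis=0)) @ V                   # Q = V
cov_Z = np.cov(Z, rowvar=False, bias=True)
print(np.round(cov_Z, 6))                      # approximately diagonal, variances in descending order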
PCA minimizes the reconstruction error, an objective function given as: \begin{equation*} J = \| X - X_{reconstructed} \|_{F}^{2} \end{equation*}
Perfect reconstruction (using all D eigenvectors) would lead to the error being equal to zero, because: \begin{equation*} X_{reconstructed} = Z Q^{-1} = X Q Q^{-1} = X \end{equation*} Since: \begin{equation*} Q Q^{-1} = I \end{equation*}
Furthermore, Q is orthonormal so: \begin{equation*} Q^{-1} = Q^{T} \end{equation*}
Finally, we can re-state the problem as: \begin{equation*} X_{reconstructed} = X Q Q^{T} \end{equation*}
The data can also be reconstructed with only K < D eigenvectors, which leads to a non-zero reconstruction error; this is the trade-off for lower dimensionality.
\begin{equation*} X_{reconstructed} = X Q_{K} Q_{K}^{T} \end{equation*}In that scenario the objective function is re-written as (the right-hand side is the squared Frobenius norm): \begin{equation*} J = \sum_{n=1}^{N} \| x(n) - Q_{K} Q_{K}^{T} x(n) \|^{2} = \| X - XQ_{K} Q_{K}^{T} \|_{F}^{2} \end{equation*}
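A sketch of the truncated reconstruction on toy, roughly zero-mean data (Q_K below holds the top-K eigenvectors as columns):

import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 5)
C = np.cov(X, rowvar=False, bias=True)
eigvals, V = np.linalg.eigh(C)
V = V[:, np.argsort(eigvals)[::-1]]          # eigenvectors as columns, sorted by eigenvalue

K = 2
Q_K = V[:, :K]                               # keep only the top-K eigenvectors
X_reconstructed = X @ Q_K @ Q_K.T
J = np.linalg.norm(X - X_reconstructed, 'fro') ** 2
print(J)                                     # non-zero; it shrinks toward 0 as K approaches D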
Based on Sebastian Raschka's blog post: http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
1) Load a dataset without labels. The data will be d-dimensional.
2) Calculate the mean of every dimension across all samples in the dataset.
3) Compute the covariance matrix for the whole dataset.
4) Compute the eigenvectors (denoted e_1, e_2, ..., e_d) and the corresponding eigenvalues (lambda_1, lambda_2, ..., lambda_d).
5) Sort the eigenvectors by their eigenvalues (descending) and choose the k eigenvectors with the largest eigenvalues.
6) Form a d by k dimensional matrix W, where every column represents an eigenvector.
7) Use the matrix W to transform the data onto the new subspace:
\begin{equation*} y = W^{T} x \end{equation*}where:
x is a d by 1-dimensional vector representing one sample and y is the transformed k by 1-dimensional sample in the new subspace.
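Before walking through these steps cell by cell below, here is a compact sketch that condenses them into one function (np.linalg.eigh is used because the covariance matrix is symmetric; the cells below use np.linalg.eig, which works as well, and the raw samples are projected as in step 7, although centring the data first is also common):

import numpy as np

def pca_transform(X, k):
    # Project X of shape (n_samples, d) onto its top-k principal components.
    cov = np.cov(X, rowvar=False)                      # steps 2-3: np.cov subtracts the per-dimension means internally
    eig_vals, eig_vecs = np.linalg.eigh(cov)           # step 4: eigenvalues and eigenvectors
    W = eig_vecs[:, np.argsort(eig_vals)[::-1][:k]]    # steps 5-6: top-k eigenvectors as the columns of W
    return X @ W                                       # step 7: y = W^T x for every sample

Called as pca_transform(all_samples.T, 2) on the dataset constructed below, it should produce the same 2-D projection as the step-by-step cells, up to transposition and the sign of each component.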
A 3-dimensional toy dataset drawn from multivariate Gaussian distributions is generated below.
More information can be found at:
https://faculty.math.illinois.edu/~r-ash/Stat/StatLec21-25.pdf
In [1]:
import numpy as np
np.random.seed(123456789)
In [2]:
# Number of samples per class
samples_per_class = 200
In [3]:
# Generate the first "class" samples
# Mean
mu_vec1 = np.array([0, 0, 0])
# Covariance
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])
# Samples
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, samples_per_class).T
In [4]:
class1_sample.shape
Out[4]:
In [5]:
# Generate the second "class" samples
# Mean
mu_vec2 = np.array([1, 1, 1])
# Covariance
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])
# Samples
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, samples_per_class).T
In [6]:
class2_sample.shape
Out[6]:
In [7]:
# Visualizing the data
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
%matplotlib inline
In [8]:
fig = plt.figure(figsize = (12, 12))
ax = fig.add_subplot(111, projection = '3d')
plt.rcParams['legend.fontsize'] = 10
ax.plot(class1_sample[0,:], class1_sample[1,:], class1_sample[2,:], 'o',
markersize = 8, color = 'blue', alpha = 0.5, label = 'class1')
ax.plot(class2_sample[0,:], class2_sample[1,:], class2_sample[2,:], '^',
markersize = 8, alpha = 0.5, color = 'red', label = 'class2')
plt.title('Samples for class 1 and class 2')
ax.legend(loc = 'upper right')
Out[8]:
In [9]:
# Gathering the dataset together
all_samples = np.concatenate((class1_sample, class2_sample),
axis = 1)
In [10]:
# Computing the d-dimensional mean vector
mean_x = np.mean(all_samples[0,:])
mean_y = np.mean(all_samples[1,:])
mean_z = np.mean(all_samples[2,:])
mean_vector = np.array([[mean_x],[mean_y],[mean_z]])
mean_vector
Out[10]:
In [11]:
# Computing the Scatter Matrix
# Intializing the scatter matrix as zeros
# It will be of size 3 by 3 in this specific example
dimensionality = all_samples.shape[0]
scatter_matrix = np.zeros((dimensionality, dimensionality))
for i in range(all_samples.shape[1]):
    sample = all_samples[:, i].reshape(dimensionality, 1)
    scatter_matrix += (sample - mean_vector).dot((sample - mean_vector).T)
scatter_matrix
Out[11]:
In [12]:
# Computing the Covariance Matrix
cov_mat = np.cov([all_samples[0,:], all_samples[1,:], all_samples[2,:]])
cov_mat
Out[12]:
In [13]:
# Calculating Eigenvectors and Eigenvalues from the scatter matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)
for i in range(len(eig_val_sc)):
eigvec_sc = eig_vec_sc[:,i].reshape(1, dimensionality).T
In [14]:
eig_val_sc
Out[14]:
In [15]:
eig_vec_sc
Out[15]:
In [16]:
# Calculating Eigenvectors and Eigenvalues from the covariance matrix
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)
for i in range(len(eig_val_cov)):
eigvec_cov = eig_vec_cov[:,i].reshape(1, dimensionality).T
In [17]:
eig_val_cov
Out[17]:
In [18]:
eig_vec_cov
Out[18]:
The eigenvectors are the same, but the eigenvalues differ: np.cov divides by N - 1 by default, so the scatter matrix is the covariance matrix scaled by N - 1.
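As a quick sanity check (added here as a sketch), the scatter matrix should equal the covariance matrix scaled by N - 1, and the eigenvalues should differ by the same factor:

n_samples = all_samples.shape[1]
print(np.allclose(scatter_matrix, (n_samples - 1) * cov_mat))   # True
print(np.sort(eig_val_sc) / np.sort(eig_val_cov))               # every ratio should be N - 1 = 399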
In [19]:
# Sorting the eigenvectors by decreasing eigenvalues
# Creating tuples as follows (eigenvalues, eigenvector)
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i]) for i in range(len(eig_val_sc))]
eig_pairs
Out[19]:
In [20]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse = True)
eig_pairs
Out[20]:
In [21]:
# Choosing two eigenvectors with the largest eigenvalues
k_number_of_dimensions = 2
top_k_pairs = eig_pairs[:k_number_of_dimensions]
matrix_w = np.hstack((top_k_pairs[0][1].reshape(dimensionality,1), top_k_pairs[1][1].reshape(dimensionality,1)))
In [22]:
# Transforming the samples onto the new subspace (2 - dimensional)
transformed = matrix_w.T.dot(all_samples)
In [23]:
plt.figure(figsize = (12, 12))
plt.plot(transformed[0, 0:samples_per_class], transformed[1,0:samples_per_class],
'o',
markersize = 7,
color = 'blue',
alpha = 0.5,
label = 'class1')
plt.plot(transformed[0, samples_per_class:], transformed[1,samples_per_class :],
'^',
markersize = 7,
color = 'red',
alpha = 0.5,
label = 'class2')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')
plt.show()
In [24]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle
In [25]:
def getKaggleMNIST():
    # Column 0 is labels
    # Columns 1-785 are pixel data, with values 0 .. 255
    train = pd.read_csv('./data/Section 2/train.csv').values.astype(np.float32)
    train = shuffle(train)
    Xtrain = train[:-1000, 1:] / 255
    Ytrain = train[:-1000, 0].astype(np.int32)
    Xtest = train[-1000:, 1:] / 255
    Ytest = train[-1000:, 0].astype(np.int32)
    return Xtrain, Ytrain, Xtest, Ytest
In [26]:
Xtrain, Ytrain, Xtest, Ytest = getKaggleMNIST()
In [27]:
Xtrain.shape
Out[27]:
In [28]:
from sklearn.decomposition import PCA
In [29]:
pca = PCA()
reduced = pca.fit_transform(Xtrain)
In [30]:
reduced.shape
Out[30]:
In [31]:
# Variance of each column
pd.DataFrame(data=reduced).var()
Out[31]:
In [32]:
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 12))
plt.scatter(reduced[:, 0],
reduced[:, 1],
s = 100,
c = Ytrain,
alpha = 0.5)
plt.show()
In [33]:
plt.figure(figsize = (12, 12))
plt.plot(pca.explained_variance_ratio_)
plt.show()
In [34]:
# To avoid losing too much information, the number of dimensions the data is reduced to should be chosen
# so that the cumulative explained variance ratio stays above roughly 95%.
# The cumulative plot below shows where that threshold is reached.
cumulative = []
last = 0
for v in pca.explained_variance_ratio_:
cumulative.append(last + v)
last = cumulative[-1]
plt.figure(figsize = (12, 12))
plt.plot(cumulative)
plt.show()
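Following the comment above, the number of components needed to keep roughly 95% of the variance can be read off directly (a sketch using np.cumsum instead of the loop above):

cum_ratio = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cum_ratio >= 0.95)) + 1   # first index where the threshold is crossed
print(n_components_95)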