Principal Component Analysis, also known as the Karhunen-Loève Transform, is a technique used to search for patterns in high-dimensional data. PCA is commonly used to explore and visualize high-dimensional data sets. It can also be used to compress data and to preprocess it before it is passed to another estimator. PCA reduces a set of possibly correlated, high-dimensional variables to a lower-dimensional set of linearly uncorrelated synthetic variables called principal components. The lower-dimensional data preserve as much of the variance of the original data as possible.
In general, an n-dimensional dataset can be reduced by projecting the dataset onto a k-dimensional subspace, where k is less than n. More formally, PCA can be used to find a set of vectors that span a subspace, which minimizes the sum of the squared errors of the projected data. This projection will retain the greatest proportion of the original data set's variance.
PCA rotates the data set to align with its principal components to maximize the variance contained within the first several principal components.
PCA is most useful when the variance in a data set is distributed unevenly across the dimensions. Consider a three-dimensional data set with a spherical convex hull. PCA cannot be used effectively with this data set because there is equal variance in each dimension; none of the dimensions can be discarded without losing a significant amount of information. It is easy to visually identify the principal components of data sets with only two or three dimensions.
Variance is a measure of how spread out a set of values is. The sample variance is the sum of the squared differences between the values and their mean, divided by n - 1: $$ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{n-1} $$
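As a quick check of this formula, the sample variance can also be computed directly with NumPy; a minimal sketch using made-up values:
In [ ]:
import numpy as np
# Hypothetical values for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]
manual = sum((x - np.mean(values)) ** 2 for x in values) / (len(values) - 1)
# ddof=1 gives the n - 1 denominator used in the formula above
print(manual, np.var(values, ddof=1))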
Covariance is a measure of how much two variables change together; it is a measure of the strength of the correlation between two variables. If the covariance of two variables is zero, the variables are uncorrelated. Note that uncorrelated variables are not necessarily independent, as correlation is only a measure of linear dependence. The covariance of two variables is calculated using the following equation:
$$ cov(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{n-1} $$
A covariance matrix describes the covariance values between each pair of dimensions in a data set. The element (i, j) indicates the covariance of the ith and jth dimensions of the data.
In [34]:
# Imports used throughout the examples that follow
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
%matplotlib inline
In [8]:
# A small data set with four observations of three variables
X = [[2, 0, -1.4],
     [2.2, 0.2, -1.5],
     [2.4, 0.1, -1],
     [1.9, 0, -1.2]]
print(np.array(X))
print(np.array(X).T)
# np.cov treats each row as a variable, so transpose to get the
# covariance between the three columns (a 3x3 covariance matrix)
print(np.cov(np.array(X).T))
# Without the transpose, np.cov computes the covariance between the rows
print(np.cov(np.array(X)))
In [11]:
# np.linalg.eig returns the eigenvalues first and the eigenvectors second
w, v = np.linalg.eig(np.array([[1, -2], [2, -3]]))
print("eigenvalues : ", w)
print("eigenvectors : ", v)
In [12]:
x1 = [ 0.9, 2.4, 1.2, 0.5, 0.3, 1.8, 0.5, 0.3, 2.5, 1.3 ]
x2 = [ 1, 2.6, 1.7, 0.7, 0.7, 1.4, 0.6, 0.6, 2.6, 1.1 ]
In [13]:
# Subtract the mean from each variable to center the data
x11 = x1 - np.mean(x1)
x21 = x2 - np.mean(x2)
Principal components are the eigenvectors of the data's covariance matrix, ordered by their eigenvalues. There are two common ways to compute them.
Way 1: The first technique requires calculating the covariance matrix of the data. Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues directly.
Way 2: The second technique uses singular value decomposition of the data matrix to find the eigenvectors of the covariance matrix and the square roots of its eigenvalues.
In [15]:
# Covariance matrix of the two centered variables
cov = np.cov(x11, x21)
print(cov)
In [16]:
w, v = np.linalg.eig(cov)
print("eigenvalues : ", w)
print("eigenvectors : ", v)
The eigenvector with the greatest eigenvalue is the first principal component. We will build a transformation matrix in which each column is the eigenvector of a retained principal component. If we were reducing a five-dimensional data set to three dimensions, we would build a matrix with three columns.
In [27]:
# Project the centered data onto the first principal component;
# l is the eigenvector with the largest eigenvalue found above
A = np.array([x11, x21]).T
l = np.array([[0.73251454],
              [0.68075138]])
np.dot(A, l)
Out[27]:
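As a sanity check, scikit-learn's PCA fitted on the same two-feature data should produce the same one-dimensional projection, up to an overall sign flip, since eigenvectors are only defined up to sign. A minimal sketch, assuming x1 and x2 are the lists defined above:
In [ ]:
# Cross-check against scikit-learn's PCA, which centers the data internally
sk_pca = PCA(n_components=1)
sk_projection = sk_pca.fit_transform(np.array([x1, x2]).T)
# These values should match np.dot(A, l) above, possibly with the sign flipped
print(sk_projection)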
Many implementations of PCA, including scikit-learn's, use singular value decomposition to calculate the eigenvectors and eigenvalues. SVD is given by the following equation:
$$ X = U \Sigma V^T $$ The columns of U are called the left singular vectors of the data matrix, the columns of V are its right singular vectors, and the diagonal entries of Σ are its singular values.
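To connect this with the eigendecomposition above: for the centered data matrix, the rows of $V^T$ should match the covariance-matrix eigenvectors up to sign, and the eigenvalues can be recovered as $\sigma_i^2 / (n - 1)$. A minimal sketch using the centered variables x11 and x21 from earlier:
In [ ]:
# SVD of the centered data matrix
A = np.array([x11, x21]).T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(Vt)                        # right singular vectors; compare with v above (up to sign)
print(s ** 2 / (len(x11) - 1))   # should reproduce the covariance-matrix eigenvalues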
In [36]:
data = load_iris()
y = data.target
X = data.data
# Reduce the four-dimensional iris data to two principal components
pca = PCA(n_components=2)
reduced_X = pca.fit_transform(X)
red_x, red_y = [], []
blue_x, blue_y = [], []
green_x, green_y = [], []
# Separate the projected points by class so each species gets its own marker
for i in range(len(reduced_X)):
    if y[i] == 0:
        red_x.append(reduced_X[i][0])
        red_y.append(reduced_X[i][1])
    elif y[i] == 1:
        blue_x.append(reduced_X[i][0])
        blue_y.append(reduced_X[i][1])
    else:
        green_x.append(reduced_X[i][0])
        green_y.append(reduced_X[i][1])
plt.figure(figsize=(12, 9))
plt.scatter(red_x, red_y, c='r', marker='x')
plt.scatter(blue_x, blue_y, c='b', marker='D')
plt.scatter(green_x, green_y, c='g', marker='.')
plt.show()
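How much of the original variance do the two retained components preserve? The fitted estimator exposes this through its explained_variance_ratio_ attribute; a brief check on the pca object fitted above:
In [ ]:
# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())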