Data sources can have many dimensions. To get a sense of the relative variances, Principal Component Analysis (PCA) can be effective. PCA is an orthogonal transformation that converts a set of samples into a new set of dimensions that are uncorrelated (i.e. orthogonal), ordered by decreasing variance. PCA is useful when the original dimensions are not individually informative but there is an apparent relationship between them; for that reason, it can be helpful in reducing the number of dimensions.
The 'principal components' of Principal Component Analysis (PCA) are the eigenvectors of the covariance matrix for a given dataset: they provide the weights for each feature (column) of that dataset. The explained variance is given by the corresponding eigenvalues, the relative variance along each component.
By the end of this file you should have seen simple examples of:
- computing a covariance (and correlation) matrix by hand and with numpy
- finding principal components via eigen decomposition and via the SVD
- projecting data onto the new basis and recovering the original data
- using scikit-learn's PCA, and reducing a 4D dataset to 2D for visualization
Further Reading:
Section 7.3, Principal Component Analysis (PCA by the SVD), in Gilbert Strang, Introduction to Linear Algebra, Fifth Edition, 2016.
In [1]:
import numpy as np
from matplotlib import pyplot as plt
rand_seed = 1 # set the random number generator so results are repeatable
In [2]:
# Generate some 2D data:
x = np.linspace(0,1,100)
y1 = 2.2*x
y2 = 5 - x
dat = np.vstack([y1,y2]).T
np.random.seed(rand_seed)
dat_n = dat + np.random.rand(*dat.shape)*(np.array([1,1])) # add noise
# plot as scatter
plt.scatter(dat_n[:, 0], dat_n[:, 1])
plt.title('Feature 1 vs Feature 2')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.axis('equal')
plt.show()
Next we find the covariance, the amount by which each feature varies with another:
$$ cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})(y_i - \overline{y})$$We can do this for all features in a dataset, creating an $n \times n$ matrix representing how each feature co-varies with another:
$$ cov = \frac{1}{n-1}\sum_{j=1}^{n}(x_j-\overline{x})(x_j - \overline{x})^T$$Where $\overline{x}$ is the expectation of $x$ and each $x_j$ is a sample vector. If each value is equally weighted, then:
$$ \overline{x} = \frac{1}{n}\sum_{j=1}^{n}x_j$$
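As a quick check of the scalar formula, we can compute the covariance of the two noisy features directly and compare it against np.cov (the variable names below are just for illustration):
In [ ]:
# Covariance of feature 1 with feature 2, straight from the formula:
xf = dat_n[:, 0]  # feature 1
yf = dat_n[:, 1]  # feature 2
n = len(xf)
cov_xy = np.sum((xf - xf.mean()) * (yf - yf.mean())) / (n - 1)
print('Formula: {:.6f}, np.cov: {:.6f}'.format(cov_xy, np.cov(xf, yf)[0, 1]))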
Note that the data here is stored in row format (a list of rows) rather than columnar format (a list of columns). Our math is performed on a per-sublist (per-feature) basis, so a transpose is necessary:
In [3]:
datc_n = dat_n.T
In [4]:
# The covariance matrix is the inner product of the mean-subtracted dataset with its transpose:
datc_ms = datc_n - datc_n.mean(axis=1)[:, np.newaxis] # Mean-subtract the dataset
dat_cov = datc_ms.dot(datc_ms.T)/(datc_n.shape[1]-1)
print('Covariance matrix is:\n{}'.format(dat_cov))
In [5]:
# Similarly, we could use a built-in numpy package:
dat_cov2 = np.cov(datc_n)
# Are these the same (check if they are within floating point error)?
np.isclose(dat_cov, dat_cov2).all()
Out[5]:
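As a side note, the diagonal of the covariance matrix holds each feature's variance with itself, which we will make use of again when building the correlation matrix later:
In [ ]:
# The diagonal entries of the covariance matrix are the per-feature variances:
print(np.isclose(np.diag(dat_cov), dat_n.var(axis=0, ddof=1)).all())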
Next we perform an eigen decomposition of the covariance matrix. This determines the set of vectors (eigenvectors, or components) and scalars (eigenvalues, or explained variance) that satisfy the relation:
$$ M\vec{\nu} = \lambda \vec{\nu}$$Where, in our case, $M$ is the covariance matrix, $\vec{\nu}$ is an eigenvector (component), and $\lambda$ is its eigenvalue (explained variance).
In words: multiplying the matrix $M$ by its eigenvector $\vec{\nu}$ is the same as scaling $\vec{\nu}$ by the scalar eigenvalue $\lambda$.
From the perspective of PCA, the eigenvectors are the directions in terms of the original features (the components), and the eigenvalues are the magnitudes along those directions (the explained variance).
Usually, software packages (e.g. numpy, MATLAB) return eigenvectors normalized to unit length. Computation by hand doesn't necessarily work out this way, but the unit-length convention makes the decomposition easier to interpret: the eigenvectors provide direction, and the eigenvalues provide magnitude.
The eigenvalues are the explained variance, or relative 'importance' of each PCA component, and the eigenvectors are the components, or relative weights of the original features within each component.
In [6]:
eig_vals, eig_vecs = np.linalg.eig(dat_cov)
print('Eigenvalues (explained variance):\n {}'.format(eig_vals))
print('Eigenvectors (components):\n {}'.format(eig_vecs))
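We can optionally verify the relation $M\vec{\nu} = \lambda \vec{\nu}$ numerically, and confirm that numpy returns unit-length eigenvectors:
In [ ]:
# Each column of eig_vecs is an eigenvector; M v should equal lambda v:
print(np.isclose(dat_cov.dot(eig_vecs), eig_vecs * eig_vals).all())
# Column norms should all be 1:
print(np.linalg.norm(eig_vecs, axis=0))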
Singular value decomposition (SVD) is a more general approach than the eigen decomposition and is frequently used on the back end of software that computes PCA.
SVD is another matrix factorization, like the eigen decomposition, but it has all kinds of uses, including computing the pseudoinverse, least-squares fitting of data, and determining the rank, range, and null space of a matrix.
$$X = U \Sigma V^T $$where $$ U^TU = I $$ and $$ V^TV = I $$ and:
- $X$ is the (mean-subtracted) data matrix
- $U$ holds the left singular vectors
- $\Sigma$ is a diagonal matrix of singular values
- $V$ holds the right singular vectors (for PCA, the components)
SVD can be used to determine the eigenvalues and eigenvectors:
In [7]:
dat_ms = dat_n - dat_n.mean(axis=0)
u,s,v = np.linalg.svd(dat_ms, full_matrices=False) # Note: we use the row-formatted data (samples as rows) here, not the transposed datc_n; sklearn's svd_flip expects this orientation
# The explained variance is the square of the SVD singular matrix:
expl_var = (s ** 2) / (dat_n.shape[0] - 1)
print('Explained Variance:\n{}'.format(expl_var))
print('Components:\n {}'.format(v))
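As a sanity check, the squared singular values divided by $n-1$ should match the eigenvalues found above (note that np.linalg.eig does not sort its output, so we compare sorted values):
In [ ]:
# Compare explained variance from SVD with the eigenvalues from the eigen decomposition:
print(np.isclose(np.sort(expl_var)[::-1], np.sort(eig_vals)[::-1]).all())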
SVD returns two sets of singular vectors (the columns of $U$ and the rows of $V^T$), and the signs of each corresponding pair can be flipped together and still yield the same result. To standardize, scikit-learn has a helper function so the output sign is deterministic.
The negative sign doesn't change the magnitude, only the direction. scikit-learn's helper fixes the signs of our eigenvectors (termed 'components' in PCA) so results are reproducible.
In [8]:
def skl_svd_flip(u, v):
    # Set the signs deterministically - taken from the svd_flip function in sklearn:
    # sklearn.utils.extmath.svd_flip
    max_abs_cols = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_cols, range(u.shape[1])])
    u *= signs
    v *= signs[:, np.newaxis]
    return u, v
# Fix the output sign:
u, components = skl_svd_flip(u, v)
print('Flipped eigenvectors are:\n {}'.format(components))
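The rows of components should now match the eigenvectors from np.linalg.eig up to ordering and sign; a quick check (the order variable below is just a helper for this comparison):
In [ ]:
# Reorder the eig output by decreasing eigenvalue, then compare absolute values
# (absolute values ignore the sign ambiguity):
order = np.argsort(eig_vals)[::-1]
print(np.isclose(np.abs(components), np.abs(eig_vecs[:, order].T)).all())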
In [9]:
# Visualize explained variance:
tot = sum(expl_var)
var_exp = [(i / tot)*100 for i in sorted(expl_var, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
xlabels = ['{}'.format(x) for x in range(1, len(var_exp)+1)]
plt.bar(xlabels, var_exp, label='Individual Variance')
plt.plot(xlabels, cum_var_exp, label='Cumulative Variance', marker='o', color='red')
plt.ylabel('Percent Explained Variance')
plt.xlabel('Principal Component')
plt.title('Explained Variance')
plt.legend()
plt.show()
In [10]:
def draw_vector(head, tail, ax=None):
    # Draw an arrow from head to tail on the given (or current) axes:
    if ax is None:
        ax = plt.gca()
    arrowprops = dict(arrowstyle='-|>',
                      linewidth=2,
                      shrinkA=0,
                      shrinkB=0)
    ax.annotate('', tail, head, arrowprops=arrowprops)
# Plot data
plt.scatter(dat_n[:, 0], dat_n[:, 1], alpha=0.5)
for length, vector in zip(expl_var, components):
v = vector*1.5*np.sqrt(length)
draw_vector(dat_n.mean(axis=0), dat_n.mean(axis=0)+ v)
plt.title('Feature 1 vs Feature 2 With\nOrthogonal Vectors in Decreasing Variance')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.axis('equal');
In [11]:
# Use the mean subtracted values
dat_proj = (dat_n - dat_n.mean(axis=0)).dot(components.T) # Use the transpose of the components - numpy is row formatted
# plot as scatter
plt.scatter(dat_proj[:, 0], dat_proj[:, 1])
plt.title('PCA 1 vs PCA 2')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.axis('equal')
plt.show()
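A useful check: the variance of the projected data along each new axis should equal the explained variance found earlier:
In [ ]:
# Per-column (per-component) variance of the projection vs. the explained variance:
print(np.isclose(dat_proj.var(axis=0, ddof=1), expl_var).all())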
In [12]:
dat_n_recovered = dat_proj.dot(components) + dat_n.mean(axis=0)
plt.scatter(dat_n_recovered[:,0], dat_n_recovered[:,1])
plt.title('Feature 1 vs Feature 2')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.axis('equal')
plt.show()
In [13]:
# Did we recover our original dataset?
np.isclose(dat_n, dat_n_recovered).all()
Out[13]:
In [14]:
from sklearn.decomposition import PCA
# Perform PCA, obtain explained variance and basis components:
pca = PCA().fit(dat_n)
expl_var2 = pca.explained_variance_
components2 = pca.components_
# Project onto new basis:
dat_proj2 = pca.transform(dat_n)
# Map the projected data (usually modified in some way) from the new basis back to the original feature space
dat_n_recovered2 = pca.inverse_transform(dat_proj2)
In [15]:
# Does the manual method match the built-in functions?
# np.isclose checks if they are within floating point error:
print(np.isclose(expl_var, expl_var2).all())
print(np.isclose(components, components2).all())
print(np.isclose(dat_proj, dat_proj2).all())
print(np.isclose(dat_n_recovered, dat_n_recovered2).all())
In [16]:
# Generate some 4D data:
x = np.linspace(0,1,100)
y1 = x*-3
y2 = 5/(1+x)
y3 = x*.003
y4 = x*0.0022
dat_4d = np.vstack([y1,y2,y3,y4]).T
np.random.seed(rand_seed)
dat_n_4d = dat_4d + np.random.rand(*dat_4d.shape)*(np.array([1,2,1,4])) # add noise
# Plot data:
plt.plot(dat_n_4d)
plt.legend(['hats','shoes','knickers','potatoes'])
plt.show()
Three of these features have nearly the same variance. While we could plot in 3D, it's easy to imagine a dataset with 4 or more features, each with high variance. How do we visualize them?
In [17]:
# Perform PCA, obtain explained variance and basis components:
from sklearn.decomposition import PCA
pca_4d = PCA().fit(dat_n_4d)
expl_var_4d = pca_4d.explained_variance_
components_4d = pca_4d.components_
In [18]:
# Visualize explained variance:
tot = sum(expl_var_4d)
var_exp_4d = [(i / tot)*100 for i in sorted(expl_var_4d, reverse=True)]
cum_var_exp_4d = np.cumsum(var_exp_4d)
xlabels_4d = ['{}'.format(x) for x in range(1, len(var_exp_4d)+1)]
plt.bar(xlabels_4d, var_exp_4d, label='Individual Variance')
plt.plot(xlabels_4d, cum_var_exp_4d, label='Cumulative Variance', marker='o', color='red')
plt.ylabel('Percent Explained Variance')
plt.xlabel('Principal Component')
plt.title('Explained Variance')
plt.legend()
plt.show()
Components 3 and 4 don't have much variance. Let's view the first two and look for any structure. This way, we can view the dataset in a reduced dimensional subspace (selected according to variance).
Each component is a 4-element vector (one weight per original feature, since the data is 4-dimensional). Simply repeat the projection from above, using only the first two components:
In [19]:
# Take the components of the first two vectors:
dat_n_4d_ms = dat_n_4d - dat_n_4d.mean(axis=0)
dat_proj2d = (dat_n_4d_ms).dot(components_4d[0:2,:].T) # First two vectors:
plt.scatter(dat_proj2d[:,0], dat_proj2d[:,1])
plt.title('PCA 1 vs PCA 2')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.axis('equal')
plt.show()
Similarly, we could use the built-in function to plot the 2D projection, but we must specify the number of components when constructing the PCA to do so (and thus couldn't plot the explained variance for all dimensions). Also, only the highest-variance dimensions are returned:
In [20]:
dat_proj2d2 = PCA(2).fit(dat_n_4d).transform(dat_n_4d)
print(np.isclose(dat_proj2d, dat_proj2d2).all())
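Reconstructing from only the first two components is lossy. As a rough sketch (the variable names below are illustrative), the total variance of the reconstruction residual should equal the explained variance of the components we dropped:
In [ ]:
# Reconstruct the 4D data from the 2D projection and measure what was lost:
dat_4d_approx = dat_proj2d.dot(components_4d[0:2, :]) + dat_n_4d.mean(axis=0)
resid = dat_n_4d - dat_4d_approx
# Total residual variance vs. the explained variance of the dropped components:
print(np.isclose(resid.var(axis=0, ddof=1).sum(), expl_var_4d[2:].sum()))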
PCA can be performed on the covariance or the correlation matrix. Correlation is different because it is standardized to the range $-1 \le \text{corr} \le 1$.
Correlation is useful where units are difficult to compare, e.g. kg vs. miles. With covariance, different units can yield different results, but with correlation they are normalized away. The correlation is:
$$ corr(x,y) = \frac{1}{\sigma_x \sigma_y} \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\overline{x})(y_i - \overline{y})$$where: $$ \sigma_x = \text{standard deviation of variable x} $$
or just: $$ corr(x,y) = \frac{1}{\sigma_x \sigma_y} cov(x,y) $$
The correlation matrix is the covariance matrix scaled element-wise by the inverse product of the standard deviations: $$\frac{1}{\sigma_x \sigma_y}$$
The variances lie along the diagonal of the covariance matrix; taking their square roots gives the standard deviations. We can use these to create a new matrix of $\frac{1}{\sigma_x \sigma_y}$ and multiply it element-wise by the covariance matrix to obtain the correlation matrix:
In [35]:
# Use the square root of the variances along the diagonal to obtain the standard deviations:
dat_cov = np.cov(dat_n_4d.T)
stdevs = np.sqrt(np.diag(dat_cov))
# Then scale element-wise by the inverse product of the standard deviations:
dat_corr = 1/stdevs[:,np.newaxis]* 1/stdevs[np.newaxis,:] * dat_cov
print('The Correlation matrix is:\n{}'.format(dat_corr))
In [36]:
# Via built-in functions:
dat_corr2 = np.corrcoef(dat_n_4d.T)
np.isclose(dat_corr, dat_corr2).all()
Out[36]:
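To see why the choice of covariance vs. correlation matters, we can compare the eigenvalue fractions of the two matrices; with features on different scales, the relative 'importance' of each direction can change (a quick sketch using np.linalg.eigvalsh):
In [ ]:
# Eigenvalues of the 4D covariance and correlation matrices, as fractions of the total:
eig_vals_cov = np.linalg.eigvalsh(dat_cov)[::-1]    # descending order
eig_vals_corr = np.linalg.eigvalsh(dat_corr)[::-1]
print('From covariance (fraction of total):\n{}'.format(eig_vals_cov / eig_vals_cov.sum()))
print('From correlation (fraction of total):\n{}'.format(eig_vals_corr / eig_vals_corr.sum()))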
In [30]:
# Scaled/Standardized values:
dat_n_s4d = (dat_n_4d - dat_n_4d.mean(axis=0))/(dat_n_4d.std(axis=0))
# Use sklearn:
from sklearn.preprocessing import StandardScaler
dat_n_s4d22 = StandardScaler().fit_transform(dat_n_4d)
# Are they the same?
np.isclose(dat_n_s4d, dat_n_s4d22).all()
Out[30]:
The covariance of StandardScaler-scaled data is close to, but not identical to, the correlation matrix: StandardScaler divides by the biased (population) standard deviation, while np.cov normalizes by $n-1$, so the two differ by a factor of $n/(n-1)$. Standardizing also discards the original scale information; use with caution.
In [33]:
# The covariance of the standard scaled result is different than the correlation
dat_n_s_cov = np.cov(dat_n_s4d.T)
dat_n_corr = np.corrcoef(dat_n_4d.T)
np.isclose(dat_n_s_cov, dat_n_corr).all()
Out[33]:
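Assuming the mismatch comes only from the normalization (StandardScaler divides by the population standard deviation while np.cov divides by $n-1$), rescaling by $(n-1)/n$ should make the two matrices match:
In [ ]:
# Rescale the covariance of the standardized data and compare to the correlation matrix:
n = dat_n_4d.shape[0]
print(np.isclose(dat_n_s_cov * (n - 1) / n, dat_n_corr).all())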