TL;DR This notebook provides an overview of Principal Component Analysis and its application.
In [3]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from pprint import pprint
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import ipywidgets as widgets
from scipy.optimize import fmin
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.figsize'] = (16, 8)
Principal Component Analysis is fundamentally a mechanism for reducing the dimensionality of large datasets while minimising the loss of information. By extension, PCA has a number of applications - classification, noise filtration, visualisation and more.
To build an intuition for how and why PCA works, we're going to use the IRIS dataset, which comprises measurements of petal and sepal widths and lengths, along with the species of each measured plant.
There are many excellent tutorials on applying PCA to the IRIS dataset as part of an unsupervised classification pipeline; we're instead going to use the data to build some intuition about how and why PCA works.
Let's take a look at the data.
In [4]:
data_set = load_iris()
data = data_set.data
target = data_set.target
In [5]:
df = pd.DataFrame(np.array(data), columns=data_set.feature_names)
df['species'] = data_set.target_names[target]
df.head(10)
Out[5]:
The first step we're going to take is to pre-process the data by mean-centring it. We'll come back to why this is necessary (and it is), but for now let's look at how to achieve it and verify that it doesn't affect the variance of our dataset in any way.
In [6]:
def demean(series):
    return series - series.mean()
demeaned_df = df[data_set.feature_names].apply(demean)
demeaned_df.head()
Out[6]:
In [7]:
df.var()
Out[7]:
In [8]:
demeaned_df.var()
Out[8]:
It's much easier to build an intuition for PCA when working with 2 dimensions. So we'll extract the petal measurements from the mean-centred data and plot one against the other.
In [9]:
axes = plt.gca()
axes.set_ylim([-4, 4])
axes.set_xlim([-4, 4])
plt.gca().set_aspect('equal', adjustable='box')
p_x = demeaned_df['petal length (cm)']
p_y = demeaned_df['petal width (cm)']
plt.scatter(p_x, p_y, alpha = 0.4, s=50)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
Out[9]:
There would appear to be an approximately linear relationship between petal length and width, which is intuitively reasonable.
In the plot below, we additionally fit a line through the origin (dashed red) and draw perpendicular lines (in green) from each data point back to that line.
In [10]:
def plot_line(angle_in_degrees):
    # original data
    plt.scatter(p_x, p_y, alpha=0.4, s=50)
    # our current fitted line; note the division by 360 (not 180), so the
    # 0-360 angle range maps onto 0-180 degrees of geometric slope
    m = np.tan(np.pi * angle_in_degrees / 360)
    x = np.linspace(-4, 4, 3)
    y = m * x
    plt.plot(x, y, 'r--')
    # perpendicular lines between the original data and the
    # current fitted line
    p_x_line = (p_x + m * p_y) / (m*m + 1)
    p_y_line = m * p_x_line
    for idx in range(len(p_x)):
        plt.plot([p_x[idx], p_x_line[idx]], [p_y[idx], p_y_line[idx]], color='g', alpha=0.1)
    # average sq distance from origin of perp line intercepts
    # i.e. the points where the green line touches the dashed red line
    var = np.mean(np.power(p_x_line, 2) + np.power(p_y_line, 2))
    plt.gca().set_aspect('equal', adjustable='box')
    axes = plt.gca()
    axes.set_ylim([-4, 4])
    axes.set_xlim([-4, 4])
    plt.title('Variance {0:.4f}'.format(var))
    plt.xlabel('petal length (cm)')
    plt.ylabel('petal width (cm)')
    plt.show()
In [11]:
plot_line(85)  # static plot for an arbitrary slope angle
We introduced a quantity called variance:
var = np.mean(np.power(p_x_line, 2) + np.power(p_y_line, 2))
If we define variance in the general sense for a discrete dataset as:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

Noting that $\mu$ is zero for our de-meaned data set, and that - by Pythagoras - our $x_i$ values are the hypotenuse lengths of triangles with sides p_x_line and p_y_line, we have:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\text{p\_x\_line}_i^2 + \text{p\_y\_line}_i^2\right)$$

which is exactly the quantity computed in the code above.
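As a quick numerical sanity check, we can compute the same quantity two ways for one arbitrary angle - once from the scalar projections onto a unit vector along the line, and once from the perpendicular feet as in plot_line - and confirm they agree. This is a minimal sketch using the p_x / p_y series defined above.
In [ ]:
# Sanity check: for a unit vector u along the fitted line, the scalar
# projection t_i = x_i . u is the (signed) distance of each perpendicular
# foot from the origin, so mean(t_i**2) equals the quantity we called 'var'.
# (Same angle convention as plot_line: angle 90 -> 45 geometric degrees.)
theta = np.pi * 90 / 360                        # geometric angle in radians
u = np.array([np.cos(theta), np.sin(theta)])    # unit vector along the line
points = np.column_stack([p_x, p_y])            # de-meaned petal data, shape (150, 2)
t = points @ u                                  # scalar projections onto u

m = np.tan(theta)
p_x_line = (p_x + m * p_y) / (m * m + 1)        # perpendicular feet, as in plot_line
p_y_line = m * p_x_line

print(np.mean(t ** 2))
print(np.mean(np.power(p_x_line, 2) + np.power(p_y_line, 2)))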
We could try to fit a straight line through the data as a means of generalising the petal width / length relationship. There are clearly infinitely many solutions, but certain solutions have interesting properties.
Try changing the slope of the line in the interactive plot below. As you change the angle of the line:
- Make a note of the plot title (variance)
- Take a look at the green lines
In [12]:
widgets.interact(plot_line,
angle_in_degrees=widgets.FloatSlider(min=0, max=360, step=1, value=85))
Out[12]:
As you vary the slope of the line, you should find that maximal variance is found at about 45 degrees.
Minimal variance is around 225 degrees - i.e. a line which is orthogonal to the line of maximum variance.
The values were about 3.63 and 0.036 respectively.
Fast-forwarding a little, these are the 'explained variances' which a fitted PCA model returns.
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].values
pca = PCA().fit(petal_data)
pca.explained_variance_

array([ 3.63497866,  0.03597779])
Let's programmatically vary the slope of the line and plot the explained variance as a function of angle.
In [13]:
def get_variance(angle_in_degrees):
    # our current fitted line
    m = np.tan(np.pi * angle_in_degrees / 360)
    # perpendicular lines between the original data and the
    # current fitted line
    p_x_line = (p_x + m * p_y) / (m*m + 1)
    p_y_line = m * p_x_line
    # average sq distance from origin of perp line intercepts
    # i.e. the points where the green line touches the dashed red line
    var = np.mean(np.power(p_x_line, 2) + np.power(p_y_line, 2))
    return var
In [14]:
df = pd.DataFrame({'angle': range(361)})
df['variance'] = df.angle.apply(get_variance)
df = df.set_index('angle')
df.plot()
plt.xlabel('angle (degrees)')
plt.ylabel('variance')
Out[14]:
We can use a solver to find the maxima and minima, which should correspond with our previous findings.
In [15]:
angle = fmin(lambda a: -1 * get_variance(a), 50)
var = get_variance(angle)
print('\nVariance: {0:.5f} obtained at angle: {1:.3f} degrees'.format(var, angle[0]))
In [16]:
angle = fmin(get_variance, 200)
var = get_variance(angle)
print('\nVariance: {0:.5f} obtained at angle: {1:.3f} degrees'.format(var, angle[0]))
In some ways, PCA provides us with an analytic mechanism for doing exactly what we did above.
The above procedure is perfectly valid and tractable for problems with 2 dimensions and small amounts of data, but it's intended purely for building intuition - there are analytic solutions to the problem which scale well.
What we've discovered so far is that (for our petal dataset) there exists exactly one axis which, when data points are projected onto it, exhibits maximal variance. This is in fact our first Principal Component.
So we need an analytic approach to decompose the covariance of our data points and recover the principal axes.
The elements of a covariance matrix are given by:

$$C_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \mu_j\right)\left(x_{ik} - \mu_k\right)$$

In matrix notation, with $X$ the $n \times p$ data matrix and $\bar{X}$ the matrix of column means:

$$C = \frac{1}{n-1}\left(X - \bar{X}\right)^{\mathsf T}\left(X - \bar{X}\right)$$

As we've already de-meaned our data, our covariance matrix is given by:

$$C = \frac{1}{n-1}X^{\mathsf T}X$$
In [17]:
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].values
n = len(petal_data)
cov = 1 / (n - 1) * petal_data.T @ petal_data
cov
Out[17]:
We can obtain this using numpy directly:
In [18]:
cov = np.cov(petal_data.T)
cov
Out[18]:
This connects directly to the covariance matrix: projecting the de-meaned data onto a unit vector $w$ gives a one-dimensional dataset whose variance is $w^{\mathsf T} C w$, and maximising that variance subject to $\|w\| = 1$ turns out to be an eigenvalue problem.
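A brief sketch of the standard argument: for the de-meaned data matrix $X$ and a unit vector $w$,

$$\operatorname{Var}(Xw) = \frac{1}{n-1}(Xw)^{\mathsf T}(Xw) = w^{\mathsf T} C w$$

Maximising this subject to $w^{\mathsf T} w = 1$ with a Lagrange multiplier $\lambda$ gives

$$\frac{\partial}{\partial w}\left[w^{\mathsf T} C w - \lambda\left(w^{\mathsf T} w - 1\right)\right] = 2Cw - 2\lambda w = 0 \quad\Rightarrow\quad Cw = \lambda w$$

so the stationary directions are the eigenvectors of $C$, and the variance attained along each one is the corresponding eigenvalue, since $w^{\mathsf T} C w = \lambda w^{\mathsf T} w = \lambda$.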
The eigenvalues and corresponding vectors (organised in ascending eigenvalue order):
In [19]:
eigenvalues, eigenvectors = np.linalg.eigh(cov)
In [20]:
eigenvalues
Out[20]:
In [21]:
eigenvectors
Out[21]:
The eigenvalues look very close to the variance minimum and maximum we found earlier. In fact, they're very closely related - the returned eigenvalues are just scaled differently: our manual variance used a 1/n normalisation (np.mean), whereas the covariance matrix we decomposed used 1/(n-1), hence the (n-1)/n rescaling below.
Recall that we previously wrote down:
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].values
pca = PCA().fit(petal_data)
pca.explained_variance_

array([ 3.63497866,  0.03597779])
In [22]:
n # number of data points
factor = (n - 1) /n
(factor * eigenvalues)[::-1] # apply factor and flip the order
Out[22]:
So what can we make of the eigenvectors?
The eigenvector corresponding to the largest eigenvalue is:
In [23]:
eigenvectors[:, -1]
Out[23]:
If we plot this over our original data, we can visualise it as the first principal component - i.e. the axis which explains maximal variance.
In [24]:
plt.scatter(p_x, p_y, alpha=0.4)
# slope
m = eigenvectors[:, -1][1]/eigenvectors[:, -1][0]
e_x = np.linspace(-4, 4, 3)
e_y = m * e_x
plt.plot(e_x, e_y, 'r--')
plt.gca().set_aspect('equal', adjustable='box')
axes = plt.gca()
axes.set_ylim([-4, 4])
axes.set_xlim([-4, 4])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
Out[24]:
We can check the angle implied by the first Principal Component against the value we solved for previously.
In [25]:
angle = np.arctan(eigenvectors[:, -1][1]/eigenvectors[:, -1][0])*360/np.pi  # same angle scale as the slider above (twice the geometric angle)
print('Angle implied by first eigenvector: {0:.3f} degrees'.format(angle))
We can trivially add the second eigenvector, which is orthogonal to the first and is in fact the only other Principal Component that our two-dimensional data has.
This gives us a new coordinate system in which the axes are orthogonal to each other and the variance of the data is maximal along the first axis.
In [26]:
plt.scatter(p_x, p_y, alpha=0.4)
# slope
m1 = eigenvectors[:, -1][1]/eigenvectors[:, -1][0]
m2 = eigenvectors[:, 0][1]/eigenvectors[:, 0][0]
e_x1 = np.linspace(-3, 3, 3)
e_y1 = m1 * e_x1
e_x2 = np.linspace(-0.3, 0.3, 3)
e_y2 = m2 * e_x2
plt.plot(e_x1, e_y1, 'r--')
plt.plot(e_x2, e_y2, 'r--')
plt.gca().set_aspect('equal', adjustable='box')
axes = plt.gca()
axes.set_ylim([-4, 4])
axes.set_xlim([-4, 4])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
Out[26]:
We can use the eigenvectors to transform our original data into our new coordinate space:
In [27]:
transformed_data = petal_data @ eigenvectors
df_trans = pd.DataFrame(transformed_data, columns=['pc2', 'pc1'])  # eigh's columns are in ascending eigenvalue order, so the second column is PC1
df_trans.head()
Out[27]:
These new features are in fact just linear combinations of our original features.
We can show this as follows. Recall our original data (demeaned):
In [28]:
petal_df = demeaned_df[['petal length (cm)', 'petal width (cm)']].copy()
petal_df.head()
Out[28]:
The eigenvector corresponding to the largest eigenvalue was:
In [29]:
eigenvectors[:, -1]
Out[29]:
So instead of recording petal width and length, suppose we had recorded a single quantity: (-0.9215469 × length) + (-0.3882669 × width).
In [30]:
petal_df['new_qty'] = -0.92154695 * petal_df['petal length (cm)'] - 0.38826694 * petal_df['petal width (cm)']
petal_df.head()
Out[30]:
We can verify that the pc1 values tally exactly with new_qty:
In [31]:
np.allclose(df_trans.pc1, petal_df.new_qty)
Out[31]:
The 'new_qty' values are what are usually called 'scores' - it's normal to refer to the transformed values as 'scores', i.e. the values each data point takes in the new Principal Component space.
So if we'd recorded the synthetic quantity (-0.9215469 × length) + (-0.3882669 × width) instead of the two original features (length and width), we'd have a single collection of data points which captures almost all of the information / variance of the original data. These values would be the PC1 scores, and the quick check below makes the 'almost all' concrete.
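Here's a minimal sketch (using the eigenvectors, df_trans and petal_data defined above) which rebuilds the two de-meaned features from the PC1 scores alone and measures what has been thrown away; the mean squared residual should come out close to the minimum variance (about 0.036) we found earlier.
In [ ]:
# Rank-one reconstruction: rebuild the de-meaned (length, width) pairs from
# the PC1 scores alone, then measure the average squared residual per point.
pc1_axis = eigenvectors[:, -1]                      # direction of the first PC
approx = np.outer(df_trans.pc1.values, pc1_axis)    # shape (150, 2)

np.mean(np.sum((petal_data - approx) ** 2, axis=1)) # should be ~0.036: the variance PC1 misses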
So what fraction of total variance would we retain?
The answer is given by the scaled eigenvalues.
In [32]:
scaled_eigenvalues = eigenvalues * (n - 1) / n
scaled_eigenvalues
Out[32]:
In [33]:
scaled_eigenvalues / sum(scaled_eigenvalues)
Out[33]:
This means that using PC1 alone explains 99% of the variance of our original data.
One other fact to note is that the transformed data for PC1 and PC2 are uncorrelated (a consequence of the orthogonal nature of the axes). This should feel intuitively reasonable, as moving along one axis does not affect the value on the other.
In [34]:
np.around(np.corrcoef(transformed_data.T), 3)
Out[34]:
So let's revisit sklearn PCA and see how we'd use it to recover the above results.
In [35]:
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].values
pca = PCA().fit(petal_data)
In [36]:
pca.explained_variance_
Out[36]:
In [37]:
pca.explained_variance_ratio_
Out[37]:
In [38]:
pca.components_
Out[38]:
Note that the transformed values have a flipped sign compared to the results we derived manually above. The sign of a principal component is arbitrary: flipping it has no statistical significance and doesn't affect the variance. It would be trivial to add a conditioning step to choose signs which match sklearn, as sketched below.
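One possible conditioning step, sketched here: flip each eigenvector so that its largest-magnitude entry is positive, then recompute the scores. This is just one convention and isn't guaranteed to reproduce sklearn's choice in every case.
In [ ]:
# Flip each eigenvector so its largest-magnitude entry is positive,
# then recompute the scores with the adjusted directions.
max_abs_rows = np.argmax(np.abs(eigenvectors), axis=0)
signs = np.sign(eigenvectors[max_abs_rows, np.arange(eigenvectors.shape[1])])
aligned_vectors = eigenvectors * signs            # flips whole columns where needed
aligned_scores = petal_data @ aligned_vectors
pd.DataFrame(aligned_scores[:, ::-1], columns=['pc1', 'pc2']).head()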
In [39]:
pd.DataFrame(pca.transform(petal_data), columns=['pc1', 'pc2']).head()
Out[39]:
In [40]:
pca.get_covariance() * n / (n - 1) # rescaled
Out[40]:
The power of the sklearn model is that we can very simply reduce down to our desired number of dimensions.
In [41]:
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].values
pca = PCA(n_components=1)
one_dimensional = pd.DataFrame(pca.fit_transform(petal_data), columns=['pc1'])
one_dimensional.head()
Out[41]:
The following plot shows the data points transformed into PC1 and then mapped back into the original coordinate system.
As with the perpendicular (green) lines in the interactive chart above, the green dots here represent the projection of each blue data point onto the PC1 best-fit line. The difference between the green and blue dots gives an indication of how much information / variance is lost by reducing to one dimension.
In [42]:
trans_data = pca.inverse_transform(one_dimensional.values)
x = trans_data[:, 0]
y = trans_data[:, 1]
plt.scatter(p_x, p_y, alpha=0.4)
plt.scatter(x, y, alpha=0.4)
plt.gca().set_aspect('equal', adjustable='box')
axes = plt.gca()
axes.set_ylim([-4, 4])
axes.set_xlim([-4, 4])
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
Out[42]:
PCA is a variance explanation technique. What would happen if we added a feature which had zero variance? Let's say we added a feature called 'animal, vegetable, mineral' which we one-hot encode into three columns: [animal, vegetable, mineral].
In [43]:
petal_df = demeaned_df[['petal length (cm)', 'petal width (cm)']].copy()
petal_df.head()
Out[43]:
In [44]:
petal_df['animal'] = 0
petal_df['vegetable'] = 1
petal_df['mineral'] = 0
petal_df.head()
Out[44]:
In [45]:
pca = PCA().fit(petal_df.values)
In [46]:
pca.explained_variance_ratio_
Out[46]:
In [47]:
pca.explained_variance_
Out[47]:
In [48]:
pca.components_
Out[48]:
In [49]:
pd.DataFrame(pca.transform(petal_df.values), columns=['pc1', 'pc2', 'pc3', 'pc4', 'pc5']).head()
Out[49]:
As you might expect, the features which have no variance are not useful in explaining the variance of the dataset, so PC1 and PC2 are unchanged.
What would happen if the dimensions we'd recorded had different scales? Let's say we recorded petal width in metres and petal length in millimetres.
In [50]:
petal_data = demeaned_df[['petal length (cm)', 'petal width (cm)']].copy()
In [51]:
petal_data.head()
Out[51]:
In [52]:
petal_data['petal length (mm)'] = petal_data['petal length (cm)'] * 10
petal_data['petal width (m)'] = petal_data['petal width (cm)'] /100
del petal_data['petal length (cm)']
del petal_data['petal width (cm)']
In [53]:
petal_data.head()
Out[53]:
In [54]:
pca = PCA().fit(petal_data)
In [55]:
pca.explained_variance_
Out[55]:
In [56]:
pca.explained_variance_ratio_
Out[56]:
So, perhaps unsurprisingly given that PCA 'works' by explaining the variance in the data, the enormously different scales of the inputs mean that one feature dominates the other. This is worth bearing in mind when working with cross-sectional data where features use very different scales.
So what can we do about it? One option is to z-score.
In [57]:
def zscore(series):
    return (series - series.mean()) / series.std()
petal_data_std = petal_data.apply(zscore)
petal_data_std.columns = ['petal length', 'petal width']
petal_data_std.head()
Out[57]:
In [58]:
pca = PCA().fit(petal_data_std)
In [59]:
pca.explained_variance_
Out[59]:
In [60]:
pca.explained_variance_ratio_
Out[60]:
There are many ways of normalising data; z-scoring is just one. So should features always be scaled before fitting a PCA model? That's a matter of some debate; a valid counter-argument is that scaling can artificially 'inflate' the contribution of an otherwise relatively unimportant feature (the synthetic sketch below illustrates this). In any event, it makes sense to be explicit about what preconditioning (if any) you've decided to use and why.
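To make the 'inflation' argument concrete, here's a minimal synthetic sketch (the data is made up purely for illustration): one informative feature alongside an essentially constant noise feature. On the raw data the noise feature barely registers; after z-scoring it accounts for roughly half of the explained variance.
In [ ]:
# Synthetic illustration (made-up data): a meaningful feature alongside a
# tiny-variance noise feature.
rng = np.random.RandomState(0)
signal = rng.normal(scale=1.0, size=500)
noise = rng.normal(scale=1e-3, size=500)
X = np.column_stack([signal, noise])

print(PCA().fit(X).explained_variance_ratio_)    # noise contributes almost nothing

X_z = (X - X.mean(axis=0)) / X.std(axis=0)       # z-score both features
print(PCA().fit(X_z).explained_variance_ratio_)  # roughly [0.5, 0.5]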
A corollary of z-scoring is that it makes the covariance matrix and correlation matrix equal.
In [61]:
np.cov(petal_data_std.T)
Out[61]:
In [62]:
np.corrcoef(petal_data_std.T)
Out[62]:
This should come as no great surprise, as the effect of z-scoring is to rescale each feature by its standard deviation, and by definition:

$$\operatorname{corr}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y}$$
We've hitherto chosen to decompose the data's covariance matrix, but it can be valid to decompose the correlation matrix instead (e.g. where data scaling is a significant factor). If the input features have already been preconditioned by z-scoring, it makes no difference - as the quick check below shows.
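This is a minimal sketch using the mm/m data and its z-scored counterpart from above: the eigenvalue ratios of the correlation matrix match the explained-variance ratios of a PCA fitted to the z-scored data.
In [ ]:
# Decomposing the correlation matrix of the (badly scaled) mm/m data gives the
# same explained-variance ratios as fitting PCA to the z-scored data.
corr = np.corrcoef(petal_data.T)
corr_eigenvalues = np.linalg.eigh(corr)[0][::-1]      # descending order

print(corr_eigenvalues / corr_eigenvalues.sum())             # ratios from the correlation matrix
print(PCA().fit(petal_data_std).explained_variance_ratio_)   # ratios from z-scored PCA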
What happens if we introduce a feature which is perfectly correlated with some other feature?
In [63]:
petal_df = petal_df[['petal length (cm)', 'petal width (cm)']].copy()
petal_df.head()
Out[63]:
In [64]:
petal_df['length_times_factor'] = petal_df['petal length (cm)'] * 0.8
petal_df.head()
Out[64]:
In [65]:
petal_df.corr()
Out[65]:
In [66]:
pca = PCA().fit(petal_df.values)
In [67]:
pca.explained_variance_ratio_
Out[67]:
In [68]:
pca.explained_variance_
Out[68]:
In [69]:
pca.components_
Out[69]:
In [70]:
df = pd.DataFrame(pca.transform(petal_df.values), columns=['pc1', 'pc2', 'pc3']).head()
df
Out[70]:
So what's happening here is that we end up with a third principal component which is of no use at all in explaining variance.
Indeed, the linear combination of features it defines is zero for every data point (i.e. all the scores are zero). Here's what happens when we apply the component's weights to the first data point.
In [71]:
pca.components_[-1]
Out[71]:
In [72]:
sum(pca.components_[-1] * petal_df.values[0])
Out[72]:
And note that if we sum pc3, it's approximately zero.
In [73]:
df.pc3.sum()
Out[73]: