PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. It is an unsupervised learning technique, which means we don't have a target variable.
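Concretely, for a mean-centered data matrix $X$ with $n$ rows (observations) and $d$ columns (features), PCA diagonalizes the sample covariance matrix: the eigenvectors give the new axes (the principal components) and the eigenvalues give the variance along each of them. The cells below walk through exactly this recipe, first by hand and then with scikit-learn; as a sketch of the math being computed:
$$ C = \frac{1}{n-1} X^\top X, \qquad C\,\vec{v}_j = \lambda_j \vec{v}_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d $$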
In [1]:
import numpy as np
import pandas as pd
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)
In [3]:
np.random.seed(123)
In [4]:
a = np.arange(12, 56, 0.5)
e = np.random.normal(0, 100, a.size)
b = 500 + 20*a + e
In [5]:
X = np.c_[a,b]
In [6]:
def plot2var(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
In [7]:
plot2var(X, 'a', 'b')
In [8]:
X_mean = np.mean(X, axis=0)
In [9]:
X_mean
Out[9]:
In [10]:
X_sd = np.std(X, axis=0)
In [11]:
X_sd
Out[11]:
In [12]:
X_std = np.subtract(X, X_mean) / X_sd
In [13]:
def plot2var_std(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xlim([-3, 3])
    plt.ylim([-3, 3])
In [14]:
plot2var_std(X_std, "a", "b")
In [15]:
cov_mat_2var = np.cov(X_std.T)
In [16]:
cov_mat_2var
Out[16]:
So now this is the symmetric matrix $A$ whose eigenvalue problem we want to solve:
$$ Ax = \lambda x $$ where $$ A \approx \begin{bmatrix} 1.01 & 0.92 \\ 0.92 & 1.01 \end{bmatrix} $$
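As a quick check, the eigenvalues of this $2 \times 2$ matrix follow directly from the characteristic equation (using the rounded entries above, so the numbers are approximate):
$$ \det(A - \lambda I) = (1.01 - \lambda)^2 - 0.92^2 = 0 \;\Rightarrow\; \lambda = 1.01 \pm 0.92 $$which gives $\lambda_1 \approx 1.93$ and $\lambda_2 \approx 0.09$, matching what `np.linalg.eig` returns below.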
In [17]:
eigen_val_2var, eigen_vec_2var = np.linalg.eig(cov_mat_2var)
In [18]:
eigen_val_2var
Out[18]:
In [19]:
eigen_vec_2var
Out[19]:
In [20]:
# dot product of the two eigenvectors (the *columns* of eigen_vec_2var) -- should be ~0
eigen_vec_2var[:, 0].dot(eigen_vec_2var[:, 1])
Out[20]:
So our eigenvalues and eigenvectors are: $$ \lambda_1 = 1.93, \quad \lambda_2 = 0.09 $$
$$ \vec{v_1} = \begin{bmatrix} 0.707 \\ 0.707\end{bmatrix} $$$$ \vec{v_2} = \begin{bmatrix} 0.707 \\ -0.707\end{bmatrix} $$(the overall sign of each eigenvector is arbitrary). These are orthogonal to each other. Let us plot the data together with these eigenvectors.
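Before plotting, a quick sanity check (a small added sketch): for a symmetric matrix with distinct eigenvalues, the normalized eigenvectors returned as the columns of `eigen_vec_2var` are orthonormal, so $V^\top V$ should be numerically the identity.
In [ ]:
# the eigenvector matrix should be orthonormal: V^T V ~ identity
print(eigen_vec_2var.T.dot(eigen_vec_2var))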
In [21]:
def plot2var_eigen(m, xlabel, ylabel):
    x = m[:, 0]
    y = m[:, 1]
    fig, ax = plt.subplots(figsize=(6, 6))
    plt.scatter(x, y, s=40, alpha=0.8)
    sns.rugplot(x, color="m", ax=ax)
    sns.rugplot(y, color="m", vertical=True, ax=ax)
    cov_mat = np.cov(m.T)
    eigen_val, eigen_vec = np.linalg.eig(cov_mat)
    # the eigenvectors are the *columns* of eigen_vec; draw each one as an arrow from the origin
    plt.quiver(0, 0, eigen_vec[0, 0], eigen_vec[1, 0], angles='xy', scale_units='xy', scale=1, color='brown')
    plt.quiver(0, 0, eigen_vec[0, 1], eigen_vec[1, 1], angles='xy', scale_units='xy', scale=1, color='brown')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xlim(-3, 3)
    plt.ylim(-3, 3)
In [22]:
plot2var_eigen(X_std, 'a' ,'b')
In [23]:
eigen_vec_2var
Out[23]:
In [24]:
X_std.T.shape
Out[24]:
In [25]:
eigen_vec_2var.shape
Out[25]:
In [26]:
# project the standardized data onto the eigenvectors (columns of eigen_vec_2var): Y^T = V^T X^T
X_proj = eigen_vec_2var.T.dot(X_std.T)
In [27]:
plot2var_eigen(X_proj.T, 'pca1' ,'pca2')
In [28]:
from sklearn.decomposition import PCA
In [29]:
pca = PCA(n_components=2)
In [30]:
pca.fit(X_std)
Out[30]:
In [31]:
X_pca_proj = pca.transform(X_std)
In [32]:
plot2var_std(X_pca_proj, 'pca1', 'pca2')
In [33]:
pca.explained_variance_
Out[33]:
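As a cross-check, sklearn's `explained_variance_` should be exactly the eigenvalues we computed from `np.cov` and `np.linalg.eig`, just reported in decreasing order:
In [ ]:
# sklearn's explained variances are the covariance-matrix eigenvalues, sorted
print(np.sort(pca.explained_variance_)[::-1])
print(np.sort(eigen_val_2var)[::-1])
print(np.allclose(np.sort(pca.explained_variance_), np.sort(eigen_val_2var)))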
In [34]:
pop = pd.read_csv('data/cars_small.csv')
In [35]:
pop.head()
Out[35]:
In [36]:
pop = pop.drop(['model'], axis = 1)
In [37]:
from sklearn import preprocessing
# encode every column (including the categorical 'type') as integer codes 0..k-1
le = preprocessing.LabelEncoder()
df = pop.apply(le.fit_transform)
In [38]:
df.head()
Out[38]:
In [39]:
g = sns.PairGrid(df, hue = 'type')
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter, alpha = 0.8)
Out[39]:
In [40]:
X = df.iloc[:,:4]
In [41]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
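`StandardScaler` does the same thing we did manually for the two-variable example: subtract each column's mean and divide by its (population, `ddof=0`) standard deviation. A quick way to convince ourselves:
In [ ]:
# StandardScaler is equivalent to (X - mean) / std with the population std
print(np.allclose(X_std, (X.values - X.values.mean(axis=0)) / X.values.std(axis=0)))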
In [42]:
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)
In [43]:
# Doing this directly using np.cov
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))
In [44]:
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
In [45]:
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
How do we select which two axes to keep?
In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so those are the ones that can be dropped.
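A common way to quantify this is the proportion of variance explained by each component, which the next few cells compute (as a percentage) and plot:
$$ \text{explained variance of } PC_j = \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i} $$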
In [46]:
# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
In [47]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)
In [48]:
# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])
In [49]:
tot = sum(eig_vals)
var_exp = [(i / tot)*100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
In [50]:
plt.bar(range(4), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance (%)')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
The “projection matrix” is just a matrix of our concatenated top $k$ eigenvectors. Here we reduce the 4-dimensional feature space to a 2-dimensional feature subspace by choosing the “top 2” eigenvectors with the highest eigenvalues to construct our $d \times k$-dimensional eigenvector matrix $W$ (here $4 \times 2$).
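In matrix form, the projection onto the new subspace is simply $Y = XW$; with the dimensions written out for this example:
$$ Y_{n \times 2} = X_{n \times 4} \, W_{4 \times 2} $$where $n$ is the number of cars.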
In [51]:
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))
print('Matrix W:\n', matrix_w)
In [52]:
X_proj = X_std.dot(matrix_w)
In [53]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj[:,0], X_proj[:,1], c = df.type, s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[53]:
In [54]:
from sklearn.decomposition import PCA
In [55]:
pca = PCA(n_components=2)
In [56]:
pca.fit(X_std)
Out[56]:
In [57]:
X_proj_sklearn = pca.transform(X_std)
In [58]:
fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_proj_sklearn[:,0], X_proj_sklearn[:,1], c = df.type,
s = 100, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[58]:
In [59]:
pca.explained_variance_
Out[59]:
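`explained_variance_` gives the variances themselves; `explained_variance_ratio_` gives each component's share of the total variance, which should line up with the cumulative explained-variance plot above:
In [ ]:
# fraction of total variance captured by each of the two components, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())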
In [60]:
digits = pd.read_csv('data/digits.csv')
In [61]:
digits.head()
Out[61]:
In [62]:
digits.shape
Out[62]:
In [63]:
digitsX = digits.iloc[:,1:785]
In [64]:
digitsX.head()
Out[64]:
In [65]:
pca = PCA(n_components=2)
In [66]:
pca.fit(digitsX)
Out[66]:
In [67]:
digits_trans = pca.transform(digitsX)
In [68]:
digits_trans
Out[68]:
In [69]:
plt.scatter(digits_trans[:,0], digits_trans[:,1], c = digits.num,
s = 20, alpha = 0.8, cmap = plt.cm.viridis)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
Out[69]:
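With 784 pixel features squeezed into just 2 components, it is worth checking how much of the total variance this 2-D view actually retains; `explained_variance_ratio_` tells us:
In [ ]:
# share of the 784-dimensional variance retained by the two components
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())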