PCA involves the following broad steps:
1. Standardize the d-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimensionality of the new feature subspace (k ≤ d).
5. Construct a projection matrix W from the "top" k eigenvectors.
6. Transform the d-dimensional input dataset X using the projection matrix W to obtain the new k-dimensional feature subspace.
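The six steps above can be sketched end to end in plain NumPy. This is a minimal illustration on synthetic data rather than the wine dataset; the function name `pca_project`, the random input, and the use of `np.linalg.eigh` (a routine for symmetric matrices) are assumptions made for the example.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features (d x d)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigendecomposition; eigh is intended for symmetric matrices
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 4. Indices of the k largest eigenvalues
    order = np.argsort(eig_vals)[::-1][:k]
    # 5. Projection matrix W (d x k), one eigenvector per column
    W = eig_vecs[:, order]
    # 6. Transform the data into the k-dimensional subspace
    return X_std @ W, W

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))  # synthetic stand-in data (assumption)
Z, W = pca_project(X_demo, k=2)
print(Z.shape, W.shape)
```

The returned `W` has orthonormal columns, so `W.T @ W` should be close to the k x k identity matrix.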
In [37]:
# Import the modules
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
In [7]:
# Read the dataset
dataset = pd.read_csv("Datasets/wine.data", header=None)
In [8]:
# Descriptive analytics
print("Shape of the dataset: ", dataset.shape)
In [10]:
# Displaying the top 5 rows of the dataset
dataset.head(5)
Out[10]:
In [17]:
# Check for null values
dataset.isnull().values.sum()
Out[17]:
The first attribute is the class identifier (1-3). The remaining 13 attributes are the features, so we will consider those 13 attributes for PCA.
In [18]:
# Excluding first attribute
X = dataset.iloc[:, 1:].values
In [22]:
# Standardize the dataset
sc_X = StandardScaler()
X_std = sc_X.fit_transform(X)
In [25]:
# Display the standardized dataset
X_std[:3, :]
Out[25]:
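As a sanity check on the standardization step, the sketch below applies the same formula that `StandardScaler` uses by default, (x - mean) / std with the population standard deviation, to synthetic data. The array `X_raw` and its parameters are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X_raw = rng.normal(loc=10.0, scale=3.0, size=(100, 4))  # synthetic data (assumption)

# (x - mean) / std per column, with the population std (ddof=0)
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Each column should now have mean ~0 and standard deviation ~1
print(np.allclose(X_scaled.mean(axis=0), 0.0))
print(np.allclose(X_scaled.std(axis=0), 1.0))
```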
In [33]:
# Compute the covariance matrix (np.cov expects one variable per row, hence the transpose)
cov_matrix = np.cov(X_std.transpose())
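For zero-mean data, the covariance matrix is just XᵀX / (n - 1). The sketch below checks this against `np.cov` on synthetic data (the arrays are assumptions for the example). It also shows that the diagonal is n / (n - 1) rather than exactly 1, because the standardization divides by the population standard deviation while `np.cov` divides by n - 1.

```python
import numpy as np

rng = np.random.default_rng(2)
X_raw = rng.normal(size=(30, 3))  # synthetic data (assumption)
X_s = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

n = X_s.shape[0]
cov_manual = X_s.T @ X_s / (n - 1)  # columns already have zero mean
cov_numpy = np.cov(X_s.T)

print(np.allclose(cov_manual, cov_numpy))
print(np.allclose(np.diag(cov_numpy), n / (n - 1)))
```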
In [49]:
# Plot the covariance matrix as an annotated heatmap
plt.figure(figsize=(15, 15))
sns.heatmap(cov_matrix, annot=True, cmap="Greens")
Out[49]:
In [51]:
# Pair plot for this dataset
sns.pairplot(pd.DataFrame(X_std))
Out[51]:
In [52]:
# Decompose the covariance matrix into eigenvalues and eigenvectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
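`np.linalg.eig` returns the eigenvectors as the columns of its second result, and it does not guarantee any particular ordering of the eigenvalues. The small check below, on a hand-picked 2 x 2 symmetric matrix (an assumption for illustration), confirms that column `i` satisfies C v = λ v for eigenvalue `i`.

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # small symmetric example matrix (assumption)
vals, vecs = np.linalg.eig(C)

# Column i of vecs pairs with vals[i]
for i in range(len(vals)):
    v = vecs[:, i]
    print(np.allclose(C @ v, vals[i] * v))
```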
In [68]:
0.04166172048200001 ** 0.5
Out[68]:
In [75]:
# Display the eigenvectors (each column is one eigenvector)
print("Eigenvectors:")
pd.DataFrame(eig_vecs)
Out[75]:
In [74]:
# Display the eigenvalues
print("Eigenvalues:")
pd.DataFrame(eig_vals).transpose()
Out[74]:
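One common way to choose k is to look at the fraction of total variance each eigenvalue explains. The sketch below uses hypothetical eigenvalues and an assumed 90% cumulative-variance threshold; neither comes from the wine dataset.

```python
import numpy as np

eig_vals_demo = np.array([4.0, 2.0, 1.0, 0.5, 0.5])  # hypothetical eigenvalues

# Sort in descending order and normalize by the total variance
order = np.argsort(eig_vals_demo)[::-1]
explained = eig_vals_demo[order] / eig_vals_demo.sum()
cumulative = np.cumsum(explained)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.90  # assumed cutoff
k = int(np.searchsorted(cumulative, threshold) + 1)

print(explained)
print(cumulative)
print(k)
```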
In [76]:
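# Eigenvectors are the columns of eig_vecs; keep the 7 with the largest eigenvalues
eig_vecs_selected = eig_vecs[:, np.argsort(eig_vals)[::-1][:7]]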
In [78]:
# Display the selected eigenvectors
print("Selected eigenvectors:")
pd.DataFrame(eig_vecs_selected)
Out[78]:
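The remaining step is to project the standardized data onto the selected eigenvectors. The following is a sketch on synthetic data with k = 2, not the notebook's actual result; it uses `np.linalg.eigh` in place of `np.linalg.eig` because the covariance matrix is symmetric, and all variable names are assumptions. After projection, the covariance of the new features should be diagonal, with the top eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated synthetic features (assumption, stands in for the wine data)
X_raw = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))
X_s = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

vals, vecs = np.linalg.eigh(np.cov(X_s.T))
order = np.argsort(vals)[::-1]

k = 2
W = vecs[:, order[:k]]  # projection matrix, shape (4, k)
X_pca = X_s @ W         # projected data, shape (60, k)

# The principal components are uncorrelated, so this is (near) diagonal
cov_pca = np.cov(X_pca.T)
print(X_pca.shape)
print(np.allclose(cov_pca, np.diag(vals[order[:k]])))
```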