PCA is a popular method for summarizing datasets. Suppose we have a dataset of different wine types. We describe each wine sample by its alcohol content, color, and so on (see this very nice visualization of wine properties taken from here). Some of these features measure related properties and are therefore redundant, so we can summarize each wine sample with fewer features! PCA is one way to do this; it is also known as a method for dimensionality reduction.
Here we have a scatter plot of different (synthetic) wine samples, based on two wine characteristics: color intensity and alcohol content.
Now we will look at the projections. The animation below shows what the projections of the data points look like for different lines (the red dots are projections of the blue dots):
PCA aims to find the best line according to the following two criteria.
The variation of the (projected) values along the line should be maximal. Have a look at how the "variance" of the red dots changes while the line rotates...
The line should give the lowest reconstruction error. By reconstruction, we mean constructing the original two characteristics (the position ($x$, $y$) of a blue dot) from the new one (the position of a red dot). This reconstruction error is proportional to the length of the connecting red line.
We will notice that the maximum variance and the minimum error occur at the same time, when the line points to the magenta ticks. This line corresponds to the first principal component constructed by PCA.
PCA objective: Given the data covariance matrix $C$, we look for a vector $u$ having unit length ($\|u\| = 1$) such that $u^TCu$ is maximal. We will see that we can do this with the help of eigenvectors and eigenvalues of the covariance matrix.
We will look at the intuition behind this approach using the example above.
Let $C$ be an $n \times n$ matrix and $u$ an $n \times 1$ vector, so that the operation $Cu$ is well-defined. An eigenvector of $C$ is, by definition, any vector $u$ such that $Cu = \lambda u$ for some scalar $\lambda$. For the dataset $A$ (an $n \times 2$ matrix) above, the covariance matrix $C$ (a $2 \times 2$ matrix) is (we assume that the data is centered):
\begin{equation*} C = \begin{bmatrix} 1.07 & 0.63 \\ 0.63 & 0.64 \end{bmatrix} \end{equation*}It's a square symmetric matrix. Thus, one can diagonalize it by choosing a new orthogonal coordinate system, given by its eigenvectors (spectral theorem):
\begin{equation*} C = U \Lambda U^{T} \end{equation*}where $U$ is a matrix of eigenvectors $u_i$'s (each column is an eigenvector) and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_i$'s on the diagonal.
In the new (eigen) space, the covariance matrix is diagonal, as follows:
\begin{equation*} \Lambda = \begin{bmatrix} 1.52 & 0 \\ 0 & 0.18 \end{bmatrix} \end{equation*}This means that there is no correlation between the coordinates in this new system. The maximum possible variance is $1.52$, which is given by the first eigenvalue. We achieve this variance by taking the projection onto the first principal axis. The direction of this axis is given by the first eigenvector of $C$.
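As a quick sanity check, here is a minimal NumPy sketch that diagonalizes the (rounded) covariance matrix above and recovers it from its eigenvectors and eigenvalues:

import numpy as np

# Diagonalize the 2 x 2 covariance matrix from the example above
C = np.array([[1.07, 0.63],
              [0.63, 0.64]])
# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
eig_vals, eig_vecs = np.linalg.eigh(C)
order = np.argsort(eig_vals)[::-1]                 # largest eigenvalue first
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(eig_vals)        # close to 1.52 and 0.18 (small differences come from rounding C)
print(eig_vecs[:, 0])  # direction of the first principal axis
print(eig_vecs @ np.diag(eig_vals) @ eig_vecs.T)   # U Lambda U^T recovers C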
This example/discussion is adapted from here.
For illustration, we will use the wine dataset. Each wine sample is described by 14 features as follows:
In [1]:
# We will first read the wine data headers
with open("wine.data") as f:
    header = f.readline().strip()
print(header)
Let's first look at two wine characteristics: Alcohol Content and Color Intensity.
We can draw a scatter plot:
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la
# Read the data file (text format): wine.data, delimiter=',', use columns 0, 1, 10, skip the header
wine_class, wine_alc, wine_col = np.loadtxt("wine.data", delimiter=',', usecols=(0, 1, 10), unpack=True, skiprows=1)
In [3]:
# draw a scatter plot of wine color intensity and alcohol content
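One possible way to draw this plot (using the arrays loaded in the previous cell) is sketched below:

# A possible sketch: color intensity on the x-axis, alcohol content on the y-axis,
# with points colored by the class label read above
plt.scatter(wine_col, wine_alc, c=wine_class)
plt.xlabel("Color intensity")
plt.ylabel("Alcohol")
plt.show()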
In [4]:
# Perform PCA on two wine characteristics: Alcohol Content and Color Intensity
col_alc = np.array([wine_col, wine_alc]).T
m, n = col_alc.shape
# compute column means
# center the data with column means
# calculate the covariance matrix
# calculate eigenvectors & eigenvalues of the covariance matrix
# sort eigenvalues and eigenvectors in decreasing order
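One way these steps could be filled in (building on col_alc from the cell above; the variable names are ours) is sketched here:

# A possible completion of the steps above
col_means = col_alc.mean(axis=0)                   # column means
col_alc_centered = col_alc - col_means             # center the data
cov_mat = np.cov(col_alc_centered, rowvar=False)   # 2 x 2 covariance matrix
eig_vals, eig_vecs = la.eig(cov_mat)               # eigenvalues & eigenvectors (scipy.linalg)
eig_vals = eig_vals.real                           # a covariance matrix has real eigenvalues
order = np.argsort(eig_vals)[::-1]                 # sort in decreasing order
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
print(cov_mat)
print(eig_vals)
print(eig_vecs)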
Let's visualize the normalized (centered) data and its principal components.
In [5]:
# Create a scatter plot of the normalized data
# color intensity on the x-axis and alcohol content on the y-axis
# Plot the principal component line
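A possible sketch of this plot, reusing the centered data and the sorted eigenpairs from the sketch above:

# Scatter plot of the centered data with the principal axes overlaid
plt.scatter(col_alc_centered[:, 0], col_alc_centered[:, 1])
plt.xlabel("Color intensity (centered)")
plt.ylabel("Alcohol (centered)")
# draw each principal axis as a line through the origin,
# scaled by the standard deviation along that axis
for val, vec in zip(eig_vals, eig_vecs.T):
    plt.plot([0, 3 * np.sqrt(val) * vec[0]], [0, 3 * np.sqrt(val) * vec[1]], linewidth=2)
plt.axis("equal")
plt.show()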
Let's transform the normalized (centered) data into the principal component space.
In [6]:
# the PCA transformation
# Plot the data points in the new space
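A minimal sketch of the transformation, assuming the centered data and sorted eigenvectors from above:

# Project the centered data onto the principal axes
transformed = col_alc_centered @ eig_vecs      # n x 2 matrix of principal-component scores
plt.scatter(transformed[:, 0], transformed[:, 1])
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()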
In [7]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cosine
We consider a toy document collection (corpus) and a query for this tutorial.
In [8]:
corpus = [
"Romeo and Juliet.", # document 1
"Juliet: O happy dagger!", # document 2
"Romeo died by dagger.", # document 3
"'Live free or die', that's the New-Hampshire's motto.", # document 4
"Did you know, New-Hampshire is in New-England." # document 5
]
key_words = [
'die',
'dagger'
]
We now build a term frequency (TF) matrix from the corpus using the Python sklearn package.
In [9]:
# initialize the CountVectorizer class
vectorizer = CountVectorizer(min_df=0, stop_words=None)
# transform the corpus based on the count vectorizer
# print the vocabulary
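A possible completion of the last two steps (get_feature_names_out is the method name in recent scikit-learn versions; older versions use get_feature_names):

# Fit the vectorizer on the corpus and inspect the result
tf_matrix = vectorizer.fit_transform(corpus)    # D x V sparse term-frequency matrix
print(vectorizer.get_feature_names_out())       # the corpus vocabulary
print(tf_matrix.toarray())                      # raw term counts per document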
Let's look at the corpus vocabulary terms.
Some of these terms are noninformative stopwords, e.g., a, an, the, and. One can use a standard or a custom stopword list to remove these terms.
The vocabulary also contains different forms of a single word, e.g., die, died. One can use methods such as stemming and lemmatization to get the root forms of words in a corpus.
There are several open-source libraries available to perform all of these steps for you, e.g., the Python Natural Language Toolkit (NLTK).
In [10]:
# A custom stopword list
stop_words = ["a", "an", "the", "and", "in", "by", "or", "did", "you", "is", "that"]
# Here, we assume that we preprocessed the corpus
preprocessed_corpus = [
"Romeo and Juliet",
"Juliet O happy dagger",
"Romeo die by dagger",
"Live free or die that the NewHampshire motto",
"Did you know NewHampshire is in NewEngland"
]
# Customize the vectorizer class
# transform the corpus based on the count vectorizer
# print the vocabulary
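The remaining steps might be filled in as below, passing the custom stopword list to the vectorizer:

# Vectorizer with the custom stopword list
vectorizer = CountVectorizer(min_df=0, stop_words=stop_words)
tf_matrix = vectorizer.fit_transform(preprocessed_corpus)
print(vectorizer.get_feature_names_out())       # vocabulary after stopword removal
print(tf_matrix.toarray())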
In [11]:
# query keywords
key_words = ['die', 'dagger']
# To keep the development simple, we build a composite model for both the corpus and the query
corpus = preprocessed_corpus + [' '.join(key_words)]
# transform the corpus based on the count vectorizer
# TF-IDF transform using TfidfTransformer
# transform the TF matrix to TF-IDF matrix
# D x V document-term matrix
# 1 x V query-term vector
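One way to complete this cell, reusing the vectorizer from the previous cell (the names doc_term and query_term are ours):

# TF counts for the composite corpus, then the TF-IDF transform
tf_matrix = vectorizer.fit_transform(corpus)               # counts for the 5 documents + the query
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(tf_matrix).toarray()
doc_term = tfidf_matrix[:-1, :]                            # D x V document-term matrix
query_term = tfidf_matrix[-1, :]                           # 1 x V query-term vector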
Now we solve the document ranking problem for the given query: die dagger. We use the cosine distance between each document vector and the query vector in the TF-IDF vector space. Once we have the distance scores, we can sort them to get a ranked list as follows.
In [12]:
# Find the cosine distance between the TF-IDF vectors of every document and the query
# Sort them and create the rank list
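A possible completion (document indices are printed 1-based to match the corpus listing above):

# Rank documents by cosine distance to the query (smaller distance = more similar)
distances = [cosine(doc_term[d], query_term) for d in range(doc_term.shape[0])]
rank_list = np.argsort(distances)
for rank, d in enumerate(rank_list, start=1):
    print(f"rank {rank}: document {d + 1} (cosine distance {distances[d]:.3f})")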
In [ ]:
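For the LSA part, the step that appears to belong here is a singular value decomposition (SVD) of the term-document matrix. A minimal sketch, assuming $A$ is the $V \times D$ TF-IDF term-document matrix (the transpose of doc_term above):

# SVD of the V x D term-document matrix
A = doc_term.T                      # V x D: terms as rows, documents as columns
U, s, Vt = np.linalg.svd(A)         # U: V x V, s: singular values, Vt: D x D
print(U.shape, s.shape, Vt.shape)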
Note that, for the singular value decomposition $A = U \Sigma V'$ (where $\Sigma$ carries the singular values $s$ on its diagonal):
$A$ is a $V \times D$ data matrix
$U$ is the matrix of the eigenvectors of $C = AA'$ (the term-term matrix). It's a $V \times V$ matrix.
$V$ is the matrix of the eigenvectors of $B = A'A$ (the document-document matrix). It's a $D \times D$ matrix.
$s$ is the vector of singular values, obtained as the square roots of the eigenvalues of $B$.
More info can be found in the NumPy SVD documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html
We now perform data reduction, i.e., transform documents from a $V$-dimensional space to a lower-dimensional space. Let's take the number of dimensions to be $K = 2$, i.e., the number of semantic components in the corpus.
Using LSA, we can represent vocabulary terms in the semantic space.
In [13]:
K = 2 # number of components
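A minimal sketch of the rank-$K$ representation, reusing U, s, and Vt from the SVD sketch above (the names terms_k and docs_k are ours; both are scaled by the top-$K$ singular values):

# Term and document coordinates in the K-dimensional semantic space
terms_k = U[:, :K] * s[:K]          # V x K term representation
docs_k = Vt[:K, :].T * s[:K]        # D x K document representation
print(terms_k.shape, docs_k.shape)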
In [ ]:
We now solve the document ranking problem given the query die dagger as follows.
In [14]:
# Find the cosine distance between the semantic-space (LSA) vectors of every document and the query
# Sort them and create the rank list
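A possible completion, folding the query into the $K$-dimensional space with the standard LSA fold-in $\hat{q} = q\,U_K\,\mathrm{diag}(s_K)^{-1}$ and then ranking by cosine distance (building on the sketches above):

# Rank documents in the LSA semantic space
query_k = (query_term @ U[:, :K]) / s[:K]       # 1 x K query representation
lsa_distances = [cosine(docs_k[d], query_k) for d in range(docs_k.shape[0])]
lsa_rank_list = np.argsort(lsa_distances)
for rank, d in enumerate(lsa_rank_list, start=1):
    print(f"rank {rank}: document {d + 1} (cosine distance {lsa_distances[d]:.3f})")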
Homework $2$: Compare the ranking scores of the documents in this list with the ranking scores generated by the TF-IDF scheme that we discussed above.
In [ ]: