PCA is a popular method for summarizing datasets. Suppose we have a dataset of different wine types, where each wine sample is described by its alcohol content, color intensity, and so on (see this very nice visualization of wine properties taken from here). Some of these features measure related properties and are therefore redundant, so we can summarize each wine sample with fewer features. PCA is one way to do this; it is also known as a method for dimensionality reduction.
Here we have a scatter plot of different (synthetic) wine samples, based on two wine characteristics: color intensity and alcohol content.
Now we will look at the projections. The animation below shows what the projections of the data points onto different lines look like (the red dots are the projections of the blue dots):
PCA aims to find the best line according to the following two criteria.
The variation of the (projected) values along the line should be maximal. Have a look at how the "variance" of the red dots changes while the line rotates.
The line should give the lowest reconstruction error. By reconstruction, we mean reconstructing the original two characteristics (the position ($x$, $y$) of a blue dot) from the new one (the position of a red dot). This reconstruction error is proportional to the length of the connecting red line.
Notice that the maximum variance and the minimum error are achieved at the same time, namely when the line points to the magenta ticks. This line corresponds to the first principal component constructed by PCA.
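To see this trade-off numerically, here is a minimal sketch (an addition, using synthetic 2-D data with roughly the covariance of the example below, not the actual wine samples): for every candidate direction, the projected variance and the mean squared reconstruction error sum to the same total, so maximizing one is the same as minimizing the other.
# A small numerical sketch: projected variance + mean squared reconstruction error
# is constant across directions, so the max-variance line is the min-error line.
import numpy as np

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1.07, 0.63], [0.63, 0.64]], size=500)
X -= X.mean(axis=0)                        # center the data

angles = np.linspace(0, np.pi, 180, endpoint=False)
variances, errors = [], []
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])   # unit direction of the line
    proj = X.dot(u)                                # 1-D projected coordinates
    variances.append(proj.var())
    errors.append(((X - np.outer(proj, u)) ** 2).sum(axis=1).mean())
variances, errors = np.array(variances), np.array(errors)

print(angles[variances.argmax()], angles[errors.argmin()])   # same best direction
print(np.allclose(variances + errors, X.var(axis=0).sum()))  # True: constant total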
PCA objective: Given the data covariance matrix $C$, we look for a vector $u$ having unit length ($\|u\| = 1$) such that $u^TCu$ is maximal. We will see that we can do this with the help of eigenvectors and eigenvalues of the covariance matrix.
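A short justification, added here for completeness (a standard constrained-optimization argument): maximizing $u^T C u$ subject to $\|u\| = 1$ with a Lagrange multiplier $\lambda$ gives
\begin{equation*} \frac{\partial}{\partial u}\left( u^{T} C u - \lambda \left( u^{T} u - 1 \right) \right) = 2 C u - 2 \lambda u = 0 \quad\Longrightarrow\quad C u = \lambda u, \end{equation*}so any maximizer is an eigenvector of $C$, and its objective value is $u^{T} C u = \lambda u^{T} u = \lambda$. Hence the best choice is the eigenvector with the largest eigenvalue.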
We will look at the intuition behind this approach using the example above.
Let $C$ be an $n \times n$ matrix and $u$ an $n \times 1$ vector, so that the product $C u$ is well-defined. An eigenvector of $C$ is, by definition, any nonzero vector $u$ such that $C u = \lambda u$ for some scalar $\lambda$ (the corresponding eigenvalue). For the dataset $A$ (an $n \times 2$ matrix) above, the covariance matrix $C$ (a $2 \times 2$ matrix) is (we assume the data is centered):
\begin{equation*} C = \begin{bmatrix} 1.07 & 0.63 \\ 0.63 & 0.64 \end{bmatrix} \end{equation*}It is a square symmetric matrix, so one can diagonalize it by choosing a new orthogonal coordinate system, given by its eigenvectors (spectral theorem):
\begin{equation*} C = U \Lambda U^{T} \end{equation*}where $U$ is a matrix of eigenvectors $u_i$'s (each column is an eigenvector) and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_i$'s on the diagonal.
In the new (eigen) space, the covariance matrix is diagonal, as follows:
\begin{equation*} \Lambda = \begin{bmatrix} 1.52 & 0 \\ 0 & 0.18 \end{bmatrix} \end{equation*}This means that there is no correlation between the coordinates in this new system. The maximum possible variance along a single direction is $1.52$, given by the first eigenvalue; we achieve it by projecting onto the first principal axis, whose direction is given by the first eigenvector of $C$.
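As a quick sanity check (a small sketch, using the covariance values quoted above rather than the dataset itself), we can verify numerically that the eigenvalues of this $C$ are about $1.52$ and $0.18$, and that no random unit vector beats the top eigenvector on the objective $u^T C u$:
import numpy as np

C = np.array([[1.07, 0.63],
              [0.63, 0.64]])            # covariance matrix quoted above

evals, evecs = np.linalg.eigh(C)        # eigh: symmetric case, eigenvalues in ascending order
print(evals)                            # approximately [0.18, 1.52]

u_top = evecs[:, -1]                    # eigenvector of the largest eigenvalue
rng = np.random.RandomState(0)
best_random = 0.0
for _ in range(10000):
    u = rng.randn(2)
    u /= np.linalg.norm(u)              # random unit direction
    best_random = max(best_random, u.dot(C).dot(u))

print(u_top.dot(C).dot(u_top))          # ~1.52, the largest eigenvalue
print(best_random)                      # never exceeds it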
This example/discussion is adapted from here.
For illustration, we will use the wine dataset. Each wine sample is described by a class label and 13 features, listed in the file header:
In [1]:
# We will first read the wine data header
with open("wine.data") as f:
    header = f.readline()
print(header)
Let's first look at two wine characteristics: Alcohol Content and Color Intensity.
We can draw a scatter plot:
In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la
wine_class, wine_alc, wine_col = np.loadtxt("wine.data", delimiter=',', usecols=(0, 1, 10), unpack=True, skiprows=1)
# print(wine_class)
In [3]:
# print(wine_alc)
# print(wine_col)
In [4]:
plt.scatter(wine_col, wine_alc, c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")
Out[4]:
In [5]:
col_alc = np.array([wine_col, wine_alc]).T  # n x 2 data matrix
m, n = col_alc.shape
# print(m, n)
# print(col_alc.mean(axis=0))
# center the data with column means
column_means = col_alc.mean(axis=0)
col_alc -= column_means
# calculate the covariance matrix
col_alc_cov = np.cov(col_alc, rowvar=False)
# calculate eigenvectors & eigenvalues of the covariance matrix
evals, evecs = la.eigh(col_alc_cov)
# sort eigenvalues and eigenvectors in decreasing order
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]
print(evecs)
print(evals)
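A small addition to the cell above: the eigenvalues also tell us how much of the total variance each principal component explains.
# fraction of total variance explained by each principal component
print(evals / evals.sum())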
Let's visualize the centered data and the first principal axis.
In [6]:
plt.scatter(col_alc[:,0], col_alc[:,1], c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")
# draw the first principal axis
slope = evecs[1, 0] / evecs[0, 0]  # slope of the first eigenvector (first column of evecs)
plt.plot(col_alc[:, 0], slope * col_alc[:, 0], c="blue", linestyle='--')  # first eigenvector
Out[6]:
Let's transform the centered data to the principal component space.
In [7]:
# PCA transformation
pca_data = np.dot(evecs.T, col_alc.T).T
# plot the data points in the new space
plt.scatter(pca_data[:,0], pca_data[:,1], c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")
Out[7]:
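As a quick check (an addition, using the variables computed above): in the principal component space, the covariance matrix of the transformed data should be approximately diagonal, with the eigenvalues found earlier on its diagonal.
# covariance of the transformed data: off-diagonal entries ~ 0,
# diagonal entries match the sorted eigenvalues
print(np.cov(pca_data, rowvar=False))
print(evals)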
Homework $1$: Apply PCA on the whole set of features and analyze its principal components.
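One possible starting point for Homework 1 (a sketch, assuming as above that wine.data has a header row and the class label in its first column; the analysis of the components is left to you):
# Load all 13 feature columns (column 0 is the class label), center them,
# and look at the eigen-decomposition of their covariance matrix.
all_data = np.loadtxt("wine.data", delimiter=',', skiprows=1)
features = all_data[:, 1:]                       # drop the class column
features = features - features.mean(axis=0)      # center each feature
cov_all = np.cov(features, rowvar=False)
evals_all, evecs_all = la.eigh(cov_all)
order = np.argsort(evals_all)[::-1]              # sort components by variance
evals_all, evecs_all = evals_all[order], evecs_all[:, order]
print(evals_all / evals_all.sum())               # explained variance ratios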
For another example and a discussion of the limitations of PCA, see
We now turn to a second example: ranking documents with TF-IDF and Latent Semantic Analysis (LSA). First, let's import numpy and a couple of other modules we'll need.
In [8]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cosine
We consider a toy document collection (corpus) and a query for this tutorial.
In [9]:
corpus = [
    "Romeo and Juliet.",  # document 1
    "Juliet: O happy dagger!",  # document 2
    "Romeo died by dagger.",  # document 3
    "'Live free or die', that's the New-Hampshire's motto.",  # document 4
    "Did you know, New-Hampshire is in New-England."  # document 5
]
key_words = [
    'die',
    'dagger'
]
We now build a term frequency (TF) matrix from the corpus using the Python sklearn package.
In [10]:
vectorizer = CountVectorizer(min_df=0, stop_words=None, strip_accents='ascii')
docs_tf = vectorizer.fit_transform(corpus)
vocabulary_terms = vectorizer.get_feature_names()
print(' '.join(vocabulary_terms))
Let's look at the corpus vocabulary terms.
Some of these terms are noninformative stopwords, e.g., a, an, the, and. One can use a standard or a custom stopword list to remove these terms.
The vocabulary also contains different forms of a single word, e.g., die and died. One can use methods such as stemming and lemmatization to reduce words in a corpus to their root forms.
There are several open source libraries that perform all of these steps for you, e.g., the Python Natural Language Toolkit (NLTK).
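For example, here is a minimal sketch using NLTK (an illustration only, not the preprocessing used below; it assumes the NLTK stopword list and tokenizer data have been downloaded, e.g., via nltk.download):
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords'); nltk.download('punkt')   # one-time downloads
stemmer = PorterStemmer()
stop_set = set(stopwords.words('english'))
for doc in corpus:
    tokens = nltk.word_tokenize(doc.lower())
    cleaned = [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_set]
    print(' '.join(cleaned))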
In [11]:
# A custom stopword list
stop_words = ["a", "an", "the", "and", "in", "by", "or", "did", "you", "is", "that"]
# Here, we assume that we preprocessed the corpus
preprocessed_corpus = [
    "Romeo and Juliet",
    "Juliet O happy dagger",
    "Romeo die by dagger",
    "Live free or die that the NewHampshire motto",
    "Did you know NewHampshire is in NewEngland"
]
vectorizer = CountVectorizer(min_df=0, stop_words=stop_words) # uses the stopwords list
# vectorizer = CountVectorizer(min_df=0, stop_words=None) # includes all stopwords
docs_tf = vectorizer.fit_transform(preprocessed_corpus)
vocabulary_terms = vectorizer.get_feature_names()
print(' '.join(vocabulary_terms))
In [12]:
key_words = ['die', 'dagger']
# To keep the development simple, we build a composite model for both the corpus and the query
docs_query_tf = vectorizer.transform(preprocessed_corpus + [' '.join(key_words)])
transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(docs_query_tf.toarray())
# D x V document-term matrix
tfidf_matrix = tfidf.toarray()[:-1]
# 1 x V query-term vector
query_tfidf = tfidf.toarray()[-1]
print(tfidf_matrix)
print(query_tfidf)
Now, we solve the document ranking problem for the given query: die dagger. We use the cosine distance (one minus the cosine similarity) between each document vector and the query vector in the TF-IDF vector space; the smaller the distance, the more relevant the document. Once we have the distance scores, we can sort them to obtain a ranked list as follows.
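To make the measure concrete, here is a tiny sketch (an addition) that computes the cosine distance by hand for the first document and checks it against scipy's cosine:
# Cosine distance = 1 - (u . v) / (||u|| ||v||); smaller means more similar.
u, v = query_tfidf, tfidf_matrix[0]
manual = 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(manual, cosine(u, v))   # the two values agree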
In [13]:
query_doc_tfidf_cos_dist = [cosine(query_tfidf, doc_tfidf) for doc_tfidf in tfidf_matrix]
query_doc_tfidf_sort_index = np.argsort(np.array(query_doc_tfidf_cos_dist))
for rank, sort_index in enumerate(query_doc_tfidf_sort_index):
    print(rank, query_doc_tfidf_cos_dist[sort_index], corpus[sort_index])
In [14]:
tf_matrix = docs_tf.toarray() # D x V matrix
A = tf_matrix.T
U, s, V = np.linalg.svd(A, full_matrices=True, compute_uv=True)
Note that
$A$ is a $V \times D$ data matrix
$U$ is the matrix of eigenvectors of $C = AA'$ (the term-term matrix). It is a $V \times V$ matrix.
$V$ is the matrix of eigenvectors of $B = A'A$ (the document-document matrix). It is a $D \times D$ matrix. Note that np.linalg.svd actually returns $V^T$ (each row is an eigenvector), which is why the code below slices rows with V[:K, :].
$s$ is the vector of singular values, obtained as the square roots of the (nonzero) eigenvalues of $B$ (equivalently, of $C$).
More information can be found in the NumPy SVD documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html
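As a quick check (an addition), the shapes and the reconstruction property can be verified directly:
# U: V x V, s: min(V, D) = D singular values, returned V: D x D, and A is
# recovered from U * diag(s) * V using only the first D columns of U.
print(A.shape, U.shape, s.shape, V.shape)
print(np.allclose(A, np.dot(U[:, :len(s)], np.dot(np.diag(s), V))))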
We now perform data reduction, i.e., transform the documents from the $V$-dimensional term space to a lower-dimensional space. Let's take the number of dimensions $K = 2$, i.e., the number of semantic components in the corpus.
Using LSA, we can represent vocabulary terms in the semantic space.
In [15]:
K = 2  # number of semantic components
A_reduced = np.dot(U[:,:K], np.dot(np.diag(s[:K]), V[:K, :]))  # V x D rank-K reconstruction of A
docs_rep = np.dot(np.diag(s[:K]), V[:K, :]).T  # D x K document representations
terms_rep = np.dot(U[:,:K], np.diag(s[:K]))  # V x K term representations
print(A_reduced)
print(docs_rep)
print(terms_rep)
In [16]:
key_word_indices = [vocabulary_terms.index(key_word) for key_word in key_words] # vocabulary indices
key_words_rep = terms_rep[key_word_indices,:]
query_rep = np.sum(key_words_rep, axis = 0)
print(query_rep)
We now solve the document ranking problem given the query die dagger as follows.
In [17]:
query_doc_cos_dist = [cosine(query_rep, doc_rep) for doc_rep in docs_rep]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))
for rank, sort_index in enumerate(query_doc_sort_index):
    print(rank, query_doc_cos_dist[sort_index], corpus[sort_index])
In [18]:
# plot documents in the new space
plt.scatter(docs_rep[:,0], docs_rep[:,1], c=query_doc_cos_dist) # all documents
plt.scatter(query_rep[0], query_rep[1], marker='+', c='red') # the query
plt.xlabel("Component 1")
plt.ylabel("Component 2")
Out[18]:
Homework $2$: Compare the ranking scores of the documents in this list with the ranking scores generated by the TF-IDF scheme that we discussed above.