Exploratory Data Analysis

In this tutorial we focus on two popular methods for exploring high dimensional datasets.

  1. Principal Component Analysis
  2. Latent Semantic Analysis

The first is a general-purpose dimensionality reduction technique, while the second is specific to the text domain.


Principal Component Analysis (PCA)

PCA is a popular method for summarizing datasets. Suppose we have a dataset of different wine types, where each wine sample is described by its alcohol content, color, and so on (see this very nice visualization of wine properties taken from here). Some of these features measure related properties and are therefore redundant, so we can summarize each wine sample with fewer features. PCA is one way to do this; it is also known as a dimensionality reduction method.

Here we have a scatter plot of different (synthetic) wine samples, based on two wine characteristics: color intensity and alcohol content. We notice a correlation between these two features, so we can construct a new feature that summarizes both by drawing a line through the center of the scatter plot and projecting all points onto it. Each such line is a linear combination of the $x$ and $y$ coordinates, i.e., $w_1 x + w_2 y$, and each configuration of $(w_1, w_2)$ gives us a different line.
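
As a minimal sketch (using made-up centered 2-D points rather than the wine data; the variable names are just for illustration), projecting points onto a candidate direction $(w_1, w_2)$ and measuring the spread of the projections looks like this:

import numpy as np

# hypothetical centered 2-D data points (rows are samples)
points = np.array([[1.0, 0.8], [-0.5, -0.3], [0.2, 0.4], [-0.7, -0.9]])

# a candidate direction (w1, w2), scaled to unit length
w = np.array([2.0, 1.0])
w = w / np.linalg.norm(w)

# position of each point along the line, and the projected points back in 2-D
projected_values = np.dot(points, w)
projected_points = np.outer(projected_values, w)

print projected_values
print projected_points
print np.var(projected_values)  # the "variance of the red dots" for this line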

Now let's look at the projections. The animation below shows what the projections of the data points look like for different lines (red dots are projections of the blue dots):

PCA aims to find the best line according to the following two criteria.

  1. The variation of the (projected) values along the line should be maximal. Have a look at how the "variance" of the red dots changes while the line rotates.

  2. The line should give the lowest reconstruction error. By reconstruction, we mean recovering the original two characteristics (the position $(x, y)$ of a blue dot) from the new one (the position of a red dot). The reconstruction error is proportional to the length of the red line connecting a blue dot to its projection.

We will notice that the maximum variance and the minimum reconstruction error are achieved at the same time, namely when the line points to the magenta ticks. This line corresponds to the first principal component constructed by PCA.

PCA objective: Given the data covariance matrix $C$, we look for a vector $u$ having unit length ($\|u\| = 1$) such that $u^TCu$ is maximal. We will see that we can do this with the help of eigenvectors and eigenvalues of the covariance matrix.
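
To see why eigenvectors enter the picture, one can maximize $u^T C u$ subject to $u^T u = 1$ using a Lagrange multiplier $\lambda$:

\begin{equation*} \mathcal{L}(u, \lambda) = u^{T} C u - \lambda (u^{T} u - 1), \qquad \nabla_{u} \mathcal{L} = 2 C u - 2 \lambda u = 0 \;\Longrightarrow\; C u = \lambda u \end{equation*}

So any stationary point is an eigenvector of $C$, and its objective value is $u^{T} C u = \lambda u^{T} u = \lambda$; the maximum variance is therefore the largest eigenvalue, attained at the corresponding eigenvector.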

We will look at the intuition behind this approach using the example above.

Let $C$ be an $n \times n$ matrix and $u$ an $n \times 1$ vector, so that the product $C u$ is well-defined. An eigenvector of $C$ is, by definition, any vector $u$ such that $C u = \lambda u$ for some scalar $\lambda$. For the dataset $A$ (an $n \times 2$ matrix) above, the covariance matrix $C$ (a $2 \times 2$ matrix) is (we assume the data is centered):

\begin{equation*} C = \begin{bmatrix} 1.07 & 0.63 \\ 0.63 & 0.64 \end{bmatrix} \end{equation*}

It's a square symmetric matrix. Thus, one can diagonalize it by choosing a new orthogonal coordinate system, given by its eigenvectors (spectral theorem):

\begin{equation*} C = U \Lambda U^{T} \end{equation*}

where $U$ is a matrix of eigenvectors $u_i$'s (each column is an eigenvector) and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_i$'s on the diagonal.

In the new (eigen) space, the covariance matrix is diagonal, as follows:

\begin{equation*} \Lambda = \begin{bmatrix} 1.52 & 0 \\ 0 & 0.18 \end{bmatrix} \end{equation*}

This means that there is no correlation between the coordinates in the new system. The maximum possible variance is $1.52$, given by the first eigenvalue. We achieve this variance by projecting onto the first principal axis, whose direction is given by the first eigenvector of $C$.
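
As a quick sanity check (a minimal sketch using the rounded covariance entries above, so the smaller eigenvalue comes out near $0.19$ rather than the $0.18$ obtained from the full-precision data), numpy recovers the same diagonal form:

import numpy as np

C = np.array([[1.07, 0.63],
              [0.63, 0.64]])

# eigh is appropriate for symmetric matrices; eigenvalues come back in ascending order
evals, U = np.linalg.eigh(C)

print evals                      # approximately [0.19, 1.52]
print np.dot(U.T, np.dot(C, U))  # approximately diag(0.19, 1.52)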

This example/discussion is adapted from here.


PCA on a Real Dataset

For illustration, we will use the wine dataset. Each wine sample is described by a class label and 13 features, as follows:


In [1]:
# We will first read the wine data headers 
with open("wine.data") as f:
    header = f.readline()

print header


Class,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280 OD315 of diluted wines,Proline

Let's first look at two wine characteristics: Alcohol Content and Color Intensity.

We can draw a scatter plot:


In [2]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la

# columns: 0 = class label, 1 = alcohol, 10 = color intensity
wine_class, wine_alc, wine_col = np.loadtxt("wine.data", delimiter=',', usecols=(0, 1, 10), unpack=True, skiprows=1)

# print wine_class

In [3]:
# print wine_alc
# print wine_col

In [4]:
plt.scatter(wine_col, wine_alc, c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")


Out[4]:
<matplotlib.text.Text at 0x78eec88>

PCA on a Subset of the Wine Data


In [5]:
col_alc = np.matrix([wine_col, wine_alc]).T
m, n = col_alc.shape

# print m, n
# print col_alc.mean(axis=0)

# center the data with column means 
column_means = col_alc.mean(axis=0)
col_alc -= column_means

# calculate the covariance matrix
col_alc_cov = np.cov(col_alc, rowvar=False)

# calculate eigenvectors & eigenvalues of the covariance matrix
evals, evecs = la.eigh(col_alc_cov)

# sort eigenvalues and eigenvectors in decreasing order
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]

print evecs, evals


[[-0.97893176  0.20418769]
 [-0.20418769 -0.97893176]] [ 5.58893076  0.44458095]
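
As a follow-up sketch reusing evals from the cell above, we can compute the proportion of variance explained by each principal component; with the eigenvalues printed above, the first component accounts for roughly 93% of the total variance:

# proportion of the total variance captured by each principal component
explained_variance_ratio = evals / evals.sum()
print explained_variance_ratio  # roughly [0.93, 0.07] for the values above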

Let's visualize the centered data and the first principal axis.


In [6]:
plt.scatter(col_alc[:,0], col_alc[:,1], c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")

# draw the first principal axis 
slope = evecs[1][0]/evecs[0][0]
plt.plot(col_alc[:,0], slope * col_alc[:,0], c="blue", linestyle='--') # first eigenvector


Out[6]:
[<matplotlib.lines.Line2D at 0x7ac21d0>]

Let's transform the centered data to the principal component space.


In [7]:
# PCA transformation 
pca_data = np.dot(evecs.T, col_alc.T).T

# plot the data points in the new space   
plt.scatter(pca_data[:,0], pca_data[:,1], c=wine_class)
plt.xlabel("Color Intensity")
plt.ylabel("Alcohol Content")


Out[7]:
<matplotlib.text.Text at 0x8128ac8>
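
To connect this back to the reconstruction view of PCA, here is a small sketch (reusing pca_data, evecs, and col_alc from the cells above) that rebuilds the centered data from the first component alone and measures the mean squared reconstruction error, the quantity the first principal axis minimizes:

# rank-1 reconstruction of the centered data from the first principal component
first_pc_scores = np.asarray(pca_data[:,0]).ravel()
rank1_reconstruction = np.outer(first_pc_scores, evecs[:,0])

# mean squared reconstruction error (approximately the discarded second eigenvalue)
residuals = np.asarray(col_alc) - rank1_reconstruction
print np.mean(np.sum(residuals**2, axis=1))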

Homework $1$: Apply PCA on the whole set of features and analyze its principal components.

For another example and limitations of PCA, see


Exploratory Text Analysis

First, let's import numpy and a couple other modules we'll need.


In [8]:
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cosine

We consider a toy document collection (corpus) and a query for this tutorial.


In [9]:
corpus = [
    "Romeo and Juliet.", # document 1 
    "Juliet: O happy dagger!", # document 2
    "Romeo died by dagger.", # document 3
    "'Live free or die', that's the New-Hampshire's motto.", # document 4
    "Did you know, New-Hampshire is in New-England." # document 5
]


key_words = [
    'die', 
    'dagger'
]

We now build a term frequency (TF) matrix from the corpus using the Python sklearn package.


In [10]:
vectorizer = CountVectorizer(min_df=0, stop_words=None, strip_accents='ascii')

docs_tf = vectorizer.fit_transform(corpus)
vocabulary_terms = vectorizer.get_feature_names()

print ' '.join(vocabulary_terms)


and by dagger did die died england free hampshire happy in is juliet know live motto new or romeo that the you

Let's look at the corpus vocabulary terms.

Some of these terms are noninformative or stopwords, e.g., a, an, the, and, etc. One can use a standard or a custom stopword list to remove these terms.

The vocabulary also contains different forms of a single word, e.g., die, died. One can use methods such as stemming and lemmatization to get the root forms of words in a corpus.

There are several open-source libraries available to perform all of these steps for you, e.g., the Python Natural Language Toolkit (NLTK).
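
For instance, here is a minimal sketch using NLTK's Porter stemmer (this assumes NLTK is installed; it is not used elsewhere in this notebook):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# inflected forms collapse to a common root, e.g., 'died' and 'die' share a stem
print [stemmer.stem(w) for w in ["die", "died", "dagger", "daggers"]]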


In [11]:
# A custom stopword list 
stop_words = ["a", "an", "the", "and", "in", "by", "or", "did", "you", "is", "that"]

# Here, we assume that we preprocessed the corpus  
preprocessed_corpus = [
    "Romeo and Juliet",
    "Juliet O happy dagger",
    "Romeo die by dagger",
    "Live free or die that the NewHampshire motto",
    "Did you know NewHampshire is in NewEngland"
]

vectorizer = CountVectorizer(min_df=0, stop_words=stop_words) # uses the stopwords list 
# vectorizer = CountVectorizer(min_df=0, stop_words=None) # includes all stopwords 

docs_tf = vectorizer.fit_transform(preprocessed_corpus)
vocabulary_terms = vectorizer.get_feature_names()


print ' '.join(vocabulary_terms)


dagger die free happy juliet know live motto newengland newhampshire romeo

TF-IDF

Here, we compute the TF-IDF matrix for the preprocessed corpus and the sample query die dagger. We treat the query as another document in the corpus.
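
For reference, with smooth_idf=False scikit-learn's TfidfTransformer weights a term $t$ in document $d$ as below (per its documented formula; the exact convention may vary slightly across versions), and then L2-normalizes each document row, which is why the rows printed below have unit length:

\begin{equation*} \text{tfidf}(t, d) = \text{tf}(t, d) \cdot \left( \ln \frac{N}{\text{df}(t)} + 1 \right) \end{equation*}

where $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing $t$.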


In [12]:
key_words = ['die', 'dagger']

# To keep the development simple, we build a composite model for both the corpus and the query 
docs_query_tf = vectorizer.transform(preprocessed_corpus + [' '.join(key_words)]) 

transformer = TfidfTransformer(smooth_idf = False)
tfidf = transformer.fit_transform(docs_query_tf.toarray())

# D x V document-term matrix 
tfidf_matrix = tfidf.toarray()[:-1] 

# 1 x V query-term vector 
query_tfidf = tfidf.toarray()[-1] 

print tfidf_matrix
print query_tfidf


[[ 0.          0.          0.          0.          0.70710678  0.          0.
   0.          0.          0.          0.70710678]
 [ 0.43622688  0.          0.          0.71927623  0.54069197  0.          0.
   0.          0.          0.          0.        ]
 [ 0.53177225  0.53177225  0.          0.          0.          0.          0.
   0.          0.          0.          0.659118  ]
 [ 0.          0.30581618  0.50424749  0.          0.          0.
   0.50424749  0.50424749  0.          0.37905127  0.        ]
 [ 0.          0.          0.          0.          0.          0.62438104
   0.          0.          0.62438104  0.46935767  0.        ]]
[ 0.70710678  0.70710678  0.          0.          0.          0.          0.
  0.          0.          0.          0.        ]

Information Retrieval via TF-IDF

Now, we solve the document ranking problem for the given query: die dagger. We use the cosine distance between each document vector and the query vector in the TF-IDF vector space; the smaller the distance, the more similar the document is to the query. Once we have the distance scores, we can sort them to obtain a ranked list as follows.
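
Here, scipy.spatial.distance.cosine returns the cosine distance, i.e., one minus the cosine similarity, so smaller values mean a closer match:

\begin{equation*} d_{\cos}(q, d) = 1 - \frac{q \cdot d}{\|q\| \, \|d\|} \end{equation*}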


In [13]:
query_doc_tfidf_cos_dist = [cosine(query_tfidf, doc_tfidf) for doc_tfidf in tfidf_matrix]
query_doc_tfidf_sort_index = np.argsort(np.array(query_doc_tfidf_cos_dist))

for rank, sort_index in enumerate(query_doc_tfidf_sort_index):
    print rank, query_doc_tfidf_cos_dist[sort_index], corpus[sort_index]


0 0.247960466684 Romeo died by dagger.
1 0.69154101474 Juliet: O happy dagger!
2 0.783755305037 'Live free or die', that's the New-Hampshire's motto.
3 1.0 Romeo and Juliet.
4 1.0 Did you know, New-Hampshire is in New-England.

Latent Semantic Analysis (LSA)

We perform LSA using the well-known matrix factorization technique Singular Value Decomposition (SVD).

We consider the TF matrix for SVD. In practice, one can also perform SVD on the TF-IDF matrix.


In [14]:
tf_matrix = docs_tf.toarray() # D x V matrix 
A = tf_matrix.T 

U, s, V = np.linalg.svd(A, full_matrices=1, compute_uv=1)

Note that

  • $A$ is a $V \times D$ data matrix

  • $U$ is the matrix of the eigenvectors of $C = AA'$ (the term-term matrix). It's a $V \times V$ matrix.

  • $V$, as returned by numpy, is actually $V^T$: its rows are the eigenvectors of $B = A'A$ (the document-document matrix). It's a $D \times D$ matrix.

  • $s$ is the vector of singular values, obtained as the square roots of the (nonzero) eigenvalues of $B$ (equivalently, of $C$).

More info can be found in the python SVD documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html
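
As a quick sanity check (a small sketch reusing A, U, s, and V from the cell above), the factors multiply back to the original matrix:

# with full_matrices=1, U is V x V and the returned V (i.e., V^T) is D x D,
# so only the first len(s) columns of U are needed to rebuild A
A_check = np.dot(U[:, :s.size], np.dot(np.diag(s), V))
print np.allclose(A, A_check)  # True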

We now perform data reduction, i.e., transform documents from the $V$-dimensional term space to a lower-dimensional space. Let's take the number of dimensions $K = 2$, i.e., the number of semantic components in the corpus.

Using LSA, we can represent vocabulary terms in the semantic space.
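
Concretely, the rank-$K$ truncation keeps only the top $K$ singular values and singular vectors; the $K$-dimensional term and document representations computed below are read off from the truncated factors:

\begin{equation*} A \approx A_K = U_K \Sigma_K V_K^{T}, \qquad \text{terms\_rep} = U_K \Sigma_K \;\; (V \times K), \qquad \text{docs\_rep} = V_K \Sigma_K \;\; (D \times K) \end{equation*}

where $\Sigma_K$ is the diagonal matrix of the top $K$ singular values.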


In [15]:
K = 2 # number of components

A_reduced = np.dot(U[:,:K], np.dot(np.diag(s[:K]), V[:K, :])) # V x D matrix (rank-K approximation of A) 

docs_rep = np.dot(np.diag(s[:K]), V[:K, :]).T # D x K matrix 
terms_rep = np.dot(U[:,:K], np.diag(s[:K])) # V x K matrix 

print A_reduced
print docs_rep
print terms_rep


[[  5.73337731e-01   7.91760842e-01   7.57017378e-01   9.79647354e-02
   -1.02491148e-01]
 [  2.35199347e-01   3.01655677e-01   5.53326436e-01   9.88138201e-01
    3.35503364e-01]
 [ -3.47434636e-02  -6.81955731e-02   1.66160309e-01   8.21977893e-01
    3.35951252e-01]
 [  3.03394920e-01   4.21909591e-01   3.69851251e-01  -6.81955731e-02
   -1.02043259e-01]
 [  5.21818031e-01   7.25304512e-01   6.39794061e-01  -1.02939037e-01
   -1.69790944e-01]
 [ -6.77476844e-02  -1.02043259e-01  -4.47888701e-04   3.35951252e-01
    1.50523277e-01]
 [ -3.47434636e-02  -6.81955731e-02   1.66160309e-01   8.21977893e-01
    3.35951252e-01]
 [ -3.47434636e-02  -6.81955731e-02   1.66160309e-01   8.21977893e-01
    3.35951252e-01]
 [ -6.77476844e-02  -1.02043259e-01  -4.47888701e-04   3.35951252e-01
    1.50523277e-01]
 [ -1.02491148e-01  -1.70238832e-01   1.65712420e-01   1.15792915e+00
    4.86474529e-01]
 [  4.88365922e-01   6.73246171e-01   6.57108938e-01   1.31416845e-01
   -6.81955731e-02]]
[[ 0.38708704  0.92754923]
 [ 0.48751353  1.30434102]
 [ 1.0044869   0.97900911]
 [ 1.99263838 -0.80087035]
 [ 0.69808592 -0.54790248]]
[[ 0.61663275  1.08099134]
 [ 1.23868973  0.08433505]
 [ 0.82354272 -0.37915075]
 [ 0.20148574  0.61750554]
 [ 0.36146595  1.05662903]
 [ 0.28851375 -0.25938985]
 [ 0.82354272 -0.37915075]
 [ 0.82354272 -0.37915075]
 [ 0.28851375 -0.25938985]
 [ 1.11205647 -0.63854059]
 [ 0.57512722  0.9026093 ]]

Information Retrieval via LSA

Now we would like to represent the query in the LSA space. A natural choice is a vector at the centroid of the semantic vectors of its terms; below we simply take their sum, which points in the same direction as the centroid, so the cosine-based ranking is unaffected.

In our example, the keyword query is die dagger. We compute the query vector as follows:


In [16]:
key_word_indices = [vocabulary_terms.index(key_word) for key_word in key_words] # vocabulary indices 

key_words_rep = terms_rep[key_word_indices,:]     
query_rep = np.sum(key_words_rep, axis = 0)

print query_rep


[ 1.85532247  1.16532639]

We now solve the document ranking problem given the query die dagger as follows.


In [17]:
query_doc_cos_dist = [cosine(query_rep, doc_rep) for doc_rep in docs_rep]
query_doc_sort_index = np.argsort(np.array(query_doc_cos_dist))

for rank, sort_index in enumerate(query_doc_sort_index):
    print rank, query_doc_cos_dist[sort_index], corpus[sort_index]


0 0.0223310149191 Romeo died by dagger.
1 0.183008437788 Romeo and Juliet.
2 0.205301979656 Juliet: O happy dagger!
3 0.412621550076 'Live free or die', that's the New-Hampshire's motto.
4 0.66224728898 Did you know, New-Hampshire is in New-England.

In [18]:
# plot documents in the new space  

plt.scatter(docs_rep[:,0], docs_rep[:,1], c=query_doc_cos_dist) # all documents 
plt.scatter(query_rep[0], query_rep[1], marker='+', c='red') # the query 
plt.xlabel("Component 1")
plt.ylabel("Component 2")


Out[18]:
<matplotlib.text.Text at 0xa6740b8>

Homework $2$: Compare the ranking scores of the documents in this list with the ranking scores generated by the TF-IDF scheme that we discussed above.

