Exploratory Data Analysis

In this tutorial we focus on two popular methods for exploring high-dimensional datasets.

  1. Principal Component Analysis
  2. Latent Semantic Analysis

The first is a general-purpose dimensionality reduction method, while the second is used primarily in the text domain.


Principal Component Analysis (PCA)

PCA is a popular method for summarizing datasets. Suppose we have a dataset of different wine types. We describe each wine sample by its alcohol content, color, and so on (see this very nice visualization of wine properties taken from here). Some of these features measure related properties and so are redundant, which means we can summarize each wine sample with fewer features. PCA is one way to do this; it is also known as a dimensionality reduction method.

Here we have a scatter plot of different wine samples (synthetic), based on two wine characteristics: color intensity and alcohol content. We notice a correlation between these two features. We can construct a new feature that summarizes both by drawing a line through the center of the scatter plot and projecting all points onto this line. We construct these lines via linear combinations of the $x$ and $y$ coordinates, i.e., $w_1 x + w_2 y$. Each configuration of $(w_1, w_2)$ gives us a different line.

Now we will look at the projections. The animation below shows what the projections of the data points look like for different lines (red dots are projections of the blue dots):

PCA aims to find the best line according to the following two criteria.

  1. The variation of the projected values along the line should be maximal. Have a look at how the "variance" of the red dots changes as the line rotates...

  2. The line should give the lowest reconstruction error. By reconstruction, we mean recovering the original two characteristics (the position ($x$, $y$) of a blue dot) from the new one (the position of a red dot). This reconstruction error is proportional to the length of the connecting red line.

We will notice that the maximum variance and the minimum reconstruction error are achieved at the same time, namely when the line points to the magenta ticks. This line corresponds to the first principal component constructed by PCA.

PCA objective: Given the data covariance matrix $C$, we look for a vector $u$ having unit length ($\|u\| = 1$) such that $u^TCu$ is maximal. We will see that we can do this with the help of eigenvectors and eigenvalues of the covariance matrix.
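This is the classical Rayleigh-quotient problem: the maximum of $u^T C u$ over unit vectors equals the largest eigenvalue of $C$ and is attained at the corresponding eigenvector,

\begin{equation*} \max_{\|u\| = 1} u^{T} C u = \lambda_{1}, \quad \text{attained at } u = u_{1}, \end{equation*}

where $\lambda_1$ is the largest eigenvalue of $C$ and $u_1$ its eigenvector.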

We will look at the intuition behind this approach using the example above.

Let $C$ be an $n \times n$ matrix and $u$ an $n \times 1$ vector, so that the operation $C u$ is well-defined. An eigenvector of $C$ is, by definition, any vector $u$ such that $C u = \lambda u$ for some scalar $\lambda$. For the dataset $A$ (an $n \times 2$ matrix) above, the covariance matrix $C$ (a $2 \times 2$ matrix) is (we assume that the data is centered):

\begin{equation*} \begin{pmatrix} 1.07 & 0.63 \\ 0.63 & 0.64 \end{pmatrix} \end{equation*}

It's a square symmetric matrix. Thus, one can diagonalize it by choosing a new orthogonal coordinate system, given by its eigenvectors (spectral theorem):

\begin{equation*} C = U \Lambda U^{T} \end{equation*}

where $U$ is a matrix of eigenvectors $u_i$'s (each column is an eigenvector) and $\Lambda$ is a diagonal matrix with eigenvalues $\lambda_i$'s on the diagonal.

In the new (eigen) space, the covariance matrix is diagonal, as follows:

\begin{equation*} \begin{pmatrix} 1.52 & 0 \\ 0 & 0.18 \end{pmatrix} \end{equation*}

This means that there is no correlation between the coordinates in this new system. The maximum possible variance is $1.52$, and it is given by the first eigenvalue. We achieve this variance by projecting onto the first principal axis, whose direction is given by the first eigenvector of $C$.

This example/discussion is adapted from here.
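As a quick check of this example, here is a minimal NumPy sketch (using the rounded covariance values quoted above, so the numbers will match only up to rounding):

import numpy as np

# rounded covariance matrix from the example above
C = np.array([[1.07, 0.63],
              [0.63, 0.64]])

# eigendecomposition of the symmetric matrix: C = U diag(evals) U^T
evals, U = np.linalg.eigh(C)

# reorder so that the largest eigenvalue (the maximum variance) comes first
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

print(evals)    # eigenvalues in decreasing order, close to the diagonal entries shown above
print(U[:, 0])  # direction of the first principal axis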


PCA on a Real Dataset

For illustration, we will use the wine dataset. Each wine sample is described by a class label and 13 features, as follows:


In [1]:
# We will first read the wine data header (the first line lists the column names)
with open("wine.data") as f:
    header = f.readline().strip()
print(header)

Let's first look at two wine characteristics: Alcohol Content and Color Intensity.

We can draw a scatter plot:


In [2]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg as la

# Read the data file (text format): wine.data, delimiter=',', use columns 0, 1, 10, skip the header    

wine_class, wine_alc, wine_col = np.loadtxt("wine.data", delimiter=',', usecols=(0, 1, 10), unpack=True, skiprows=1)

In [3]:
# draw a scatter plot of wine color intensity and alcohol content
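# a minimal sketch, assuming wine_col, wine_alc, and wine_class from the previous cell
plt.scatter(wine_col, wine_alc, c=wine_class)
plt.xlabel("Color intensity")
plt.ylabel("Alcohol content")
plt.show()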

PCA on a Subset of the Wine Data


In [4]:
# Perform PCA on two wine characteristics: **Alcohol Content** and **Color Intensity**

col_alc = np.column_stack((wine_col, wine_alc))
m, n = col_alc.shape

# compute column means 
col_means = np.mean(col_alc, axis=0)

# center the data with column means 
col_alc_centered = col_alc - col_means

# calculate the covariance matrix
cov_mat = np.cov(col_alc_centered, rowvar=False)

# calculate eigenvectors & eigenvalues of the covariance matrix
eig_vals, eig_vecs = la.eigh(cov_mat)

# sort eigenvalues and eigenvectors in decreasing order
sort_idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[sort_idx]
eig_vecs = eig_vecs[:, sort_idx]

Let's visualize the centered data and its principal components.


In [5]:
# Create a scatter plot of the centered data 
# color intensity on the x-axis and alcohol content on the y-axis 
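# a minimal sketch, assuming col_alc_centered, eig_vals, and eig_vecs from the previous cell
plt.scatter(col_alc_centered[:, 0], col_alc_centered[:, 1])
plt.xlabel("Color intensity (centered)")
plt.ylabel("Alcohol content (centered)")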



# Plot the principal component line
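# draw the first principal axis through the origin, scaled to about 3 standard deviations along it
pc1 = eig_vecs[:, 0]
scale = 3 * np.sqrt(eig_vals[0])
plt.plot([-scale * pc1[0], scale * pc1[0]], [-scale * pc1[1], scale * pc1[1]], color='red')
plt.axis('equal')
plt.show()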

Let's transform the centered data into the principal component space.


In [6]:
# the PCA transformation
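# a minimal sketch, assuming col_alc_centered and eig_vecs from the cells above
transformed = col_alc_centered @ eig_vecs   # N x 2 coordinates in the principal component space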


# Plot the data points in the new space
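plt.scatter(transformed[:, 0], transformed[:, 1])
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.axis('equal')
plt.show()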

Homework $1$: Apply PCA on the whole set of features and analyze its principal components.


Exploratory Text Analysis

First, let's import numpy and a couple of other modules we'll need.


In [7]:
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cosine

We consider a toy document collection (corpus) and a query for this tutorial.


In [8]:
corpus = [
    "Romeo and Juliet.", # document 1 
    "Juliet: O happy dagger!", # document 2
    "Romeo died by dagger.", # document 3
    "'Live free or die', that's the New-Hampshire's motto.", # document 4
    "Did you know, New-Hampshire is in New-England." # document 5
]


key_words = [
    'die', 
    'dagger'
]

We now build a term frequency (TF) matrix from the corpus using the Python sklearn package.


In [9]:
# initialize the CountVectorizer class 

vectorizer = CountVectorizer(min_df=0, stop_words=None)


# transform the corpus based on the count vectorizer 
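# a minimal sketch using the vectorizer defined above
tf = vectorizer.fit_transform(corpus)   # D x V sparse term-frequency matrix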



# print the vocabulary
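print(vectorizer.vocabulary_)   # maps each vocabulary term to its column index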

Let's look at the corpus vocabulary terms.

Some of these terms are noninformative stopwords, e.g., a, an, the, and. One can use a standard or a custom stopword list to remove these terms.

The vocabulary also contains different forms of the same word, e.g., die, died. One can use methods such as stemming and lemmatization to get the root forms of words in a corpus.

There are several open-source libraries available that perform all of these steps for you, e.g., the Python Natural Language Toolkit (NLTK).


In [10]:
# A custom stopword list 
stop_words = ["a", "an", "the", "and", "in", "by", "or", "did", "you", "is", "that"]

# Here, we assume that we preprocessed the corpus  
preprocessed_corpus = [
    "Romeo and Juliet",
    "Juliet O happy dagger",
    "Romeo die by dagger",
    "Live free or die that the NewHampshire motto",
    "Did you know NewHampshire is in NewEngland"
]


# Customize the vectorizer class 
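# (a sketch) a vectorizer that removes terms in the custom stopword list
vectorizer = CountVectorizer(stop_words=stop_words)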


# transform the corpus based on the count vectorizer 
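tf = vectorizer.fit_transform(preprocessed_corpus)   # D x V sparse term-frequency matrix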


# print the vocabulary
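print(vectorizer.vocabulary_)   # the stopwords are gone from the vocabulary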

TF-IDF

Here, we compute the TF-IDF matrix for the preprocessed corpus and the sample query die dagger. We treat the query as an additional document in the corpus.


In [11]:
# query keywords 

key_words = ['die', 'dagger']

# To keep the development simple, we build a composite model for both the corpus and the query 

corpus = preprocessed_corpus + [' '.join(key_words)]

# transform the corpus based on the count vectorizer 
tf = vectorizer.fit_transform(corpus)

# TF-IDF transform using TfidfTransformer
tfidf_transformer = TfidfTransformer()

# transform the TF matrix to TF-IDF matrix 
tfidf = tfidf_transformer.fit_transform(tf).toarray()

# D x V document-term matrix 
doc_tfidf = tfidf[:-1, :]

# 1 x V query-term vector
query_tfidf = tfidf[-1, :]

Information Retrieval via TF-IDF

Now, we solve the document ranking problem for the given query: die dagger. We use the cosine distance to measure the similarity between each document vector and the query vector in the TF-IDF vector space. Once we have the distance scores, we can sort them to get a ranked list as follows.


In [12]:
# Find cosine distance b/w the TF-IDF vectors of every document and the query 
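# a minimal sketch, assuming doc_tfidf and query_tfidf from the previous cell
tfidf_dist = [cosine(doc_vec, query_tfidf) for doc_vec in doc_tfidf]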


# Sort them and create the rank list
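tfidf_rank = np.argsort(tfidf_dist)   # smaller cosine distance = more similar
for rank, doc_idx in enumerate(tfidf_rank, start=1):
    print(rank, round(tfidf_dist[doc_idx], 3), preprocessed_corpus[doc_idx])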

Latent Semantic Analysis (LSA)

We perform LSA using the well-known matrix factorization technique Singular Value Decomposition (SVD).

We consider the TF matrix for SVD. In practice, one can also perform SVD on the TF-IDF matrix.


In [ ]:
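# a minimal sketch, assuming the vectorizer and preprocessed_corpus from the cells above
tf_docs = vectorizer.fit_transform(preprocessed_corpus)

A = tf_docs.toarray().T                        # V x D term-document matrix
# note: np.linalg.svd returns the right singular vectors transposed (Vt)
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)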

Note that

  • $A$ is a $V \times D$ data matrix

  • $U$ is the matrix of the eigenvectors of $C = AA'$ (the term-term matrix). It's a $V \times V$ matrix.

  • $V$ is the matrix of the eigenvectors of $B = A'A$ (the document-document matrix). It's a $D \times D$ matrix

  • $s$ is the vector of singular values, obtained as square roots of the eigenvalues of $B$.

More info can be found in the NumPy SVD documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html

We now perform data reduction, i.e., we transform documents from a $V$-dimensional space to a lower-dimensional space. Let's take the number of dimensions to be $K = 2$, i.e., the number of semantic components in the corpus.

Using LSA, we can represent vocabulary terms in the semantic space.


In [13]:
K = 2 # number of components
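# a minimal sketch, assuming U and s from the SVD above; one common choice is to scale
# the first K left singular vectors by the singular values
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
term_vectors = U[:, :K] * s[:K]                # V x K term vectors in the semantic space
for term, vec in zip(terms, term_vectors):
    print(term, vec)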

Information Retrieval via LSA

Now we would like to represent the query in the LSA space. A natural choice is to compute a vector that is the centroid of the semantic vectors for its terms.

In our example, the keyword query is die dagger. We compute the query vector as


In [ ]:
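# a minimal sketch: the query vector as the centroid of its term vectors (assumes term_vectors above)
query_idx = [vectorizer.vocabulary_[t] for t in key_words]
query_vec = term_vectors[query_idx].mean(axis=0)   # centroid in the K-dimensional semantic space
print(query_vec)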

We now solve the document ranking problem given the query die dagger as follows.


In [14]:
# Find cosine distance b/w the LSA vectors of every document and the query 
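# a minimal sketch, assuming Vt, s, K, and query_vec from the cells above; documents are
# represented here by the first K right singular vectors scaled by the singular values
doc_lsa = Vt[:K, :].T * s[:K]                  # D x K document vectors in the semantic space
lsa_dist = [cosine(doc_vec, query_vec) for doc_vec in doc_lsa]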


# Sort them and create the rank list
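lsa_rank = np.argsort(lsa_dist)   # smaller cosine distance = more similar
for rank, doc_idx in enumerate(lsa_rank, start=1):
    print(rank, round(lsa_dist[doc_idx], 3), preprocessed_corpus[doc_idx])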

Homework $2$: Compare the ranking scores of documents in this list with the ranking scores generated by the TF-IDF scheme that we discussed above.


In [ ]: