Intro to TF-IDF and Doc Clustering

Lynn Cherny, arnicas@gmail



In [ ]:

    
%matplotlib inline
# You can ignore the pink warning that appears



In [ ]:

    
import itertools
import math
import nltk
import string



In [ ]:

    
nltk.data.path = ['../nltk_data']



In [ ]:

    
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram



In [ ]:

    
from sklearn.feature_extraction.text import TfidfVectorizer

TF-IDF (Term Frequency, Inverse Document Frequency)

Term Frequency: Number of appearances of a word in a document (the counts we saw already)

Document Frequency: Number of documents that contain a word in a set of docs

TF-IDF is Term Frequency / Document Frequency, with some extra fiddles.

Example from Manning, Raghavan, and Schuetze showing IDF of a rare term is high:

TF-IDF for a word and document is usually calculated as:

(Word t's frequency in the doc) * Log( Number of Docs / Number of docs that contain the word t)

However, it is usually done with a + 1 term or two. You can consider it an information measure for document words (or "features") in a bag-of-words style analysis, where the order of the words doesn't matter, just the set of words. It is a "weight" for a word. Some features of TF-IDF:

If a term is very frequent in the whole doc set, it's less interesting overall and gets a low TF-IDF. Note this tends to remove stopwords for you! However, you need a lot of documents for this to work well. Beware of effects of tf-idf on small doc sets, it may not work as you expect.
A term frequent in a few docs, but not in a lot, has a high score. It helps distinguish those docs.

See the discussion in Manning, Raghavan, and Schuetze, and even more math in Wikipedia. Depending on implementation, TF-IDF may or may not be normalized. Always check to see if the implementation you use cleans stopwords or not and decide if you like that.

Some more python references:

In other languages than Python, for instance:

A version in Processing by Nic Felton



In [ ]:

    
# code example from Building Machine Learning Systems with Python (Richert & Coelho) 
# - modified slightly by Lynn

import math

def tfidf(t, d, D):
    tf = float(d.count(t)) / sum(d.count(w) for w in set(d))  # normalized
    # Note his version doesn't use +1 in denominator.
    idf = math.log( float(len(D)) / (len([doc for doc in D if t in doc])))
    return tf * idf


a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]  # try adding another c to the last doc!
D = [a, abb, abc]

print(tfidf("a", a, D))   # a is in all of them
print(tfidf("a", abc, D)) # a is in all of them
print(tfidf("b", abc, D)) # b occurs only once here, but in 2 docs
print(tfidf("b", abb, D)) # b occurs more frequently in this doc
print(tfidf("c", abc, D)) # c is unique in the doc set

What if you change some of those docs, or add another one? Add another c in the last doc, e.g.



In [ ]:

    
filelist = !ls ../data/movie_reviews/positive/*



In [ ]:

    
filelist

Some Utilities to Make a File We Can Save



In [ ]:

    
from nltk.corpus import stopwords



In [ ]:

    
stopwords = stopwords.words('english')

import collections

def clean_tokens(tokens, stopwords):
    import string
    """ Lowercases, takes out punct and stopwords and short strings """
    return [token.lower() for token in tokens if (token not in string.punctuation)and (token.lower() not in stopwords) and len(token) > 2]

def makeText(filename, stopwords):
    from nltk import Text
    with open(filename) as handle:
        text = handle.read()
    return Text(clean_tokens(nltk.word_tokenize(text.decode('ascii', 'ignore')), stopwords))

def makeTextCollection(files, stopwords=stopwords):
    from nltk import TextCollection
    texts= [makeText(filename, stopwords) for filename in files]
    collection = TextCollection(texts)
    return collection, texts

# use the data for the vocab in a single doc for a wordcloud, for instance
def compute_tfidf_by_doc(coll, texts, filenames):
    tfidf_by_doc = collections.defaultdict(list)
    for i, text in enumerate(texts):
        for word in set(text.tokens):   # just use the words in this text
            tfidfscore = coll.tf_idf(word, text)
            tf = coll.tf(word, text) # is actually count / len(text)
            count = text.count(word)
            if tfidfscore:
                tfidf_by_doc[filenames[i]].append({
                    "word": word,
                    "tfidf": tfidfscore,
                    "tf": tf,
                    "count": count
                })
    return tfidf_by_doc



In [ ]:

    
# We need to make the text collection, then use it to compute the tf-idf for the words in the docs.
res = makeTextCollection(filelist)
coll = res[0]
texts = res[1]



In [ ]:

    
texts[0]



In [ ]:

    
coll.tf_idf("woman", texts[3])



In [ ]:

    
tfidfs = compute_tfidf_by_doc(coll, texts, filelist)



In [ ]:

    
tfidfs[tfidfs.keys()[0]]  # the first filename is the first key... it contains a list of words and scores



In [ ]:

    
import json
jsonified = json.dumps(tfidfs)
with open('../outputdata/pos_movies_tfidf.json', 'w') as handle:
    handle.write(jsonified)



In [ ]:

    
!ls -al ../outputdata/pos_movies_tfidf.json

Now we can look at these reviews as little wordclouds, using different measures to size our words. Let's work with word_clouds_tfidf.html and we can compare how our clouds look using regular word counts, term frequencies (which is count / length of the document), and tfidf across all the documents.

For movie cv681_tok-28559.txt, by counts is useless:

By tf (term frequency, normalized to document length), it's a bit better, in that some of the smaller words are getting larger:

It's still not stellar with tf-idf, but the more unique words are popping a bit better and the words shared across many docs, like "film" and "movie" are disappearing:

TF-IDF has started to creep into journalism now. Here are a couple of articles that use it:

Time to discuss vectors, and similarity measures!

This is where some more libraries start to be needed. NLTK is fine for some things, but can be very slow for stuff like making vectors of terms across large numbers of documents.

Each document is a collection of weighted words, which we'll call a vector. Vectors can be compared to each other, to compute similarity. A common metric is "cosine similarity." Image from Manning, Raghavan and Schuetze:

Reminder: Angles close to each other are near 1 in cosine, far apart are closer to 0. This means that in practice you may want to subtract from 1, so that a higher score = further away. You should think of it as cosine = distance, 1-cos = similarity. Pattern (the library) does this for you so that similarity = 1 - cos.

Another, perhaps simpler to understand, is euclidean distance (image from this article):

This is essentially the hypoteneuse between two sides of a vector triangle. Larger numbers = further apart vectors!

Links:

Reference in Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html
A good article focusing on queries in search -- same idea! https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-similarity/
Computing the angle between vectors for cosine similarity: http://www.mathsisfun.com/algebra/vectors-dot-product.html
An article on text similarities using scikit-learn: https://de.dariah.eu/tatom/working_with_text.html



In [ ]:

    
# Load in the docs... again.  We're going to make TF-IDF vectors with sklearn (scikit-learn) because it's faster.

def load_texts(filenames, dirpath):
    """ filenames are the leaves, dirpath is the path to them with the / """
    loaded_text = {}
    for filen in filenames:
        with open(dirpath + filen) as handle:
            loaded_text[filen] = handle.read()
    return loaded_text



In [ ]:

    
texts = load_texts(filelist, "")



In [ ]:

    
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer().fit_transform([text.decode('ascii', 'ignore') for text in texts.values()])



In [ ]:

    
vectors = tfidf.toarray()



In [ ]:

    
vectors[0]



In [ ]:

    
#If you haven't managed to load sklearn, you can try the slower version of the vectorizing:
def get_unique_terms_for_all(collection):
    #get a list of unique terms
    unique_terms = list(set(collection))
    print "Unique terms found: ", len(unique_terms)
    return unique_terms

# Function to create a TF*IDF vector for one document.  For each of
# our unique words, we have a feature which is the td*idf for that word
# in the current document
def TFIDF(document, unique_terms_in_all):
    word_tfidf = []
    for word in unique_terms_in_all:
        word_tfidf.append(collection.tf_idf(word,document))
    return word_tfidf

def tfidf_doc_vectors(texts, coll):
    uniques = get_unique_terms_for_all(coll)
    vectors = [numpy.array(TFIDF(f, uniques)) for f in texts]
    print "Vectors created."
    return vectors

This gets us the input we need for clustering and making dendrograms.

Graphing the Cluster Tree with SciPy



In [ ]:

    
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram



In [ ]:

    
vectors

Scipy's pdist is pairwise distance - see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html You can use cosine here as well! or a host of other options...



In [ ]:

    
dist = pdist(vectors, metric='cosine')  # look at the manpage and pick a different measure to try



In [ ]:

    
linkage(dist)



In [ ]:

    
# this is a base diagram, using defaults... 
dendrogram(linkage(dist))  # this plotting function has a ton of things you can manipulate if you look at the docs.



In [ ]:

    
texts[texts.keys()[14]]

Let's do this with a nicer layout now...



In [ ]:

    
def make_dend(data, labels=None, height=8):
    from pylab import rcParams
    dist = pdist(data, metric='cosine')
    link = linkage(dist, method='complete')
    rcParams['figure.figsize'] = 6, height
    rcParams['axes.labelsize'] = 5
    if not labels:
        dend = dendrogram(link, orientation='right') #labels=names)
    else:
        dend = dendrogram(link, orientation='right', labels=[str(i) + label for i, label in enumerate(labels)])
    return dist



In [ ]:

    
dist = make_dend(vectors, height=20, labels=texts.keys())

Let's inspect a pair that are grouped closely in the cosine-similarity tree -- 23, 4:



In [ ]:

    
texts.keys()[23]



In [ ]:

    
texts[texts.keys()[23]]



In [ ]:

    
texts[texts.keys()[4]]

What do you notice about them both?

This is an example of a heatmap based on similarity scores. Most of these are not very similar.



In [ ]:

    
# Code borrowed from: http://nbviewer.ipython.org/github/OxanaSachenkova/hclust-python/blob/master/hclust.ipynb

def make_heatmap_matrix(dist, method='complete'):
    """ Pass in the distance matrix; method options are complete or single """
    # Compute and plot first dendrogram.
    fig = plt.figure(figsize=(10,10))
    # x ywidth height
    ax1 = fig.add_axes([0.05,0.1,0.2,0.6])
    Y = linkage(dist, method=method)
    Z1 = dendrogram(Y, orientation='right') # adding/removing the axes
    ax1.set_xticks([])

    # Compute and plot second dendrogram.
    ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
    Z2 = dendrogram(Y)
    ax2.set_xticks([])
    ax2.set_yticks([])

    #Compute and plot the heatmap
    axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
    idx1 = Z1['leaves']
    idx2 = Z2['leaves']
    D = squareform(dist)
    D = D[idx1,:]
    D = D[:,idx2]
    im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])

    # Plot colorbar.
    axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
    plt.colorbar(im, cax=axcolor)



In [ ]:

    
make_heatmap_matrix(dist, method='complete')

Relevant links:

K-Means Clustering with NLTK Using Our TFIDF Vectors



In [ ]:

    
## clustering in NLTK:
import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance,cosine_distance
import nltk.corpus
import nltk.stem
stemmer_func = nltk.stem.snowball.SnowballStemmer("english").stem
stopwords = set(nltk.corpus.stopwords.words('english'))

cluster = KMeansClusterer(5, cosine_distance)
cluster.cluster(vectors)
classified_examples = [cluster.classify(vec) for vec in vectors]



In [ ]:

    
for i,val in enumerate(classified_examples):
    print val, texts.keys()[i]

Let's look at the items in cluster 0:



In [ ]:

    
texts['../data/movie_reviews/positive/cv673_tok-6552.txt']



In [ ]:

    
texts['../data/movie_reviews/positive/cv680_tok-18142.txt']

For another, more advanced tour of clustering, including plotting these, see the notebook cluster_analysis_by_brandon_rose.ipynb.



In [ ]: