The following exercise uses results from our parsing to calculate a term frequency-inverse document frequency (TF-IDF) metric and construct a feature vector for each document. First we'll load a stopword list, i.e., a list of common words to exclude from the analysis:
In [ ]:
import pynlp
stopwords = pynlp.load_stopwords("stop.txt")
print(stopwords)
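If you want a rough sense of what a stopword loader like this does, here's a minimal sketch. It assumes stop.txt holds one word per line; the actual pynlp.load_stopwords may behave differently, and load_stopwords_sketch is just an illustrative name:
In [ ]:
def load_stopwords_sketch(path):
    """Minimal stand-in for a stopword loader: one word per line, lowercased."""
    stopwords = set()

    with open(path) as f:
        for line in f:
            word = line.strip().lower()

            if word:
                stopwords.add(word)

    return stopwords

# stopwords = load_stopwords_sketch("stop.txt")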
Next, we'll use a function from our pynlp library to iterate through the keywords for one of the parsed HTML documents:
In [ ]:
%sx ls *.json
In [ ]:
json_file = "a1.json"
for lex in pynlp.lex_iter(json_file):
print(lex)
We need to initialize some data structures for counting keywords. BTW, if you've heard about how Big Data projects use word count programs to demonstrate their capabilities, here's a major use case for that. Even so, our examples are conceptually simple, built for relatively small files, and are not intended to scale:
In [ ]:
from collections import defaultdict

files = ["a4.json", "a3.json", "a2.json", "a1.json"]
files_tf = {}          # term frequency counts, keyed by file
d = len(files)         # total number of documents
df = defaultdict(int)  # document frequency for each term
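The reason we use defaultdict here is that a missing key starts at the int default of 0, so we can increment counts without checking whether the key exists first. A quick illustration, with made-up tokens:
In [ ]:
from collections import defaultdict

counts = defaultdict(int)

for token in ["data", "nlp", "data"]:
    counts[token] += 1   # no KeyError - missing keys start at 0

print(dict(counts))      # {'data': 2, 'nlp': 1}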
Iterate through each parsed file, tallying term frequency (tf) counts within each document while also tallying document frequency (df) counts across all documents:
In [ ]:
for json_file in files:
    tf = defaultdict(int)  # each file has its own term frequency

    for lex in pynlp.lex_iter(json_file):
        if (lex.pos != ".") and (lex.root not in stopwords):
            tf[lex.root] += 1  # increment for each word

    files_tf[json_file] = tf  # keep the file's counts in memory - it is small enough

    for word in tf.keys():
        df[word] += 1  # increment the document frequency for each unique word

## print results for just the last file in the sequence
print(json_file, files_tf[json_file])
Let's take a look at the df results overall. If there are low-information common words in the list that you'd like to filter out, move them to your stopword list.
In [ ]:
for word, count in sorted(df.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)  # show the document frequency for each word
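One way to grow the stopword list from this output is to append the offending words to the stopword file and reload it. This is only a sketch: it assumes stop.txt is a plain text file with one word per line, and the words in low_information are hypothetical picks from the df listing above:
In [ ]:
low_information = ["also", "within"]   # hypothetical low-information words

with open("stop.txt", "a") as f:
    for word in low_information:
        f.write(word + "\n")   # assumes the existing file already ends with a newline

stopwords = pynlp.load_stopwords("stop.txt")   # reload the expanded list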
Finally, we make a second pass through the data, using the df counts to normalize the tf counts and calculate a TF-IDF metric for each keyword:
In [ ]:
import math

# calculate TF-IDF for each document
for json_file in files:
    tf = files_tf[json_file]
    keywords = []

    for word, count in tf.items():
        # term frequency weighted by a smoothed inverse document frequency
        tfidf = float(count) * math.log((d + 1.0) / (df[word] + 1.0))
        keywords.append((json_file, tfidf, word,))
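To make the formula concrete: with d = 4 documents, a word that appears 3 times in one file and shows up in 2 of the 4 files scores 3 * log(5 / 3) ≈ 1.53, while a word appearing 3 times but present in all 4 files scores 3 * log(5 / 5) = 0. The +1 terms smooth the ratio so the log never goes negative. A quick check, with counts made up for illustration:
In [ ]:
import math

d_example = 4                     # total documents, matching len(files) above
count_example, df_example = 3, 2  # hypothetical tf and df counts

tfidf_example = float(count_example) * math.log((d_example + 1.0) / (df_example + 1.0))
print(round(tfidf_example, 4))    # 1.5325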
Let's take a look at the results for one of the files. Since keywords gets rebuilt on each pass through the loop, this shows the last file in the sequence:
In [ ]:
# check if TF-IDF is working for you - an important QA step
for json_file, tfidf, word in sorted(keywords, key=lambda x: x[1], reverse=True):
    print("%s\t%7.4f\t%s" % (json_file, tfidf, word))
Question: how does that vector of ranked keywords compare with your reading of the text from the HTML file?