The following exercise uses results from our parsing to calculate a term frequency-inverse document frequency (TF-IDF) metric and construct a feature vector for each document. First we'll load a stopword list, i.e., a list of common words to exclude from the analysis:
In [ ]:
import pynlp
stopwords = pynlp.load_stopwords("stop.txt")
print(stopwords)
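If you want a rough sense of what a stopword loader like this does, here's a minimal sketch. It assumes stop.txt holds one word per line; the actual pynlp.load_stopwords may behave differently, and load_stopwords_sketch is just an illustrative name:
In [ ]:
def load_stopwords_sketch(path):
    """Minimal stand-in for a stopword loader: one word per line, lowercased."""
    stopwords = set()

    with open(path) as f:
        for line in f:
            word = line.strip().lower()

            if word:
                stopwords.add(word)

    return stopwords

# stopwords = load_stopwords_sketch("stop.txt")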
Next, we'll use a function from our pynlp library to iterate through the keywords for one of the parsed HTML documents:
In [ ]:
%sx ls *.json
In [ ]:
json_file = "a1.json"
for lex in pynlp.lex_iter(json_file):
print(lex)
We need to initialize some data structures for counting keywords. BTW, if you've heard about how Big Data projects use word count programs to demonstrate their capabilities, here's a major use case for that. Even so, our examples are conceptually simple, built for relatively small files, and are not intended to scale:
In [ ]:
from collections import defaultdict

files = ["a4.json", "a3.json", "a2.json", "a1.json"]
files_tf = {}          # term frequency counts, keyed by file
d = len(files)         # total number of documents
df = defaultdict(int)  # document frequency for each term
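The reason we use defaultdict here is that a missing key starts at the int default of 0, so we can increment counts without checking whether the key exists first. A quick illustration, with made-up tokens:
In [ ]:
from collections import defaultdict

counts = defaultdict(int)

for token in ["data", "nlp", "data"]:
    counts[token] += 1   # no KeyError - missing keys start at 0

print(dict(counts))      # {'data': 2, 'nlp': 1}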
Iterate through each parsed file, tallying term frequency (tf) counts within each document while also tallying document frequency (df) counts across all documents:
In [ ]:
for json_file in files:
    tf = defaultdict(int)  # each file has its own term frequency

    for lex in pynlp.lex_iter(json_file):
        if (lex.pos != ".") and (lex.root not in stopwords):
            tf[lex.root] += 1  # increment for each word

    files_tf[json_file] = tf  # keep the file's counts in memory - it is small enough

    for word in tf.keys():
        df[word] += 1  # increment the document frequency for each unique word

## print results for just the last file in the sequence
print(json_file, files_tf[json_file])
Let's take a look at the df results overall. If there are low-information common words in the list that you'd like to filter out, move them to your stopword list.
In [ ]:
for word, count in sorted(df.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)  # show the document frequency for each word
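One way to grow the stopword list from this output is to append the offending words to the stopword file and reload it. This is only a sketch: it assumes stop.txt is a plain text file with one word per line, and the words in low_information are hypothetical picks from the df listing above:
In [ ]:
low_information = ["also", "within"]   # hypothetical low-information words

with open("stop.txt", "a") as f:
    for word in low_information:
        f.write(word + "\n")   # assumes the existing file already ends with a newline

stopwords = pynlp.load_stopwords("stop.txt")   # reload the expanded list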
Finally, we make a second pass through the data, using the df counts to normalize the tf counts and calculate a TF-IDF metric for each keyword:
In [ ]:
import math

# calculate TF-IDF for each document
for json_file in files:
    tf = files_tf[json_file]
    keywords = []

    for word, count in tf.items():
        # term frequency weighted by a smoothed inverse document frequency
        tfidf = float(count) * math.log((d + 1.0) / (df[word] + 1.0))
        keywords.append((json_file, tfidf, word,))
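To make the formula concrete: with d = 4 documents, a word that appears 3 times in one file and shows up in 2 of the 4 files scores 3 * log(5 / 3) ≈ 1.53, while a word appearing 3 times but present in all 4 files scores 3 * log(5 / 5) = 0. The +1 terms smooth the ratio so the log never goes negative. A quick check, with counts made up for illustration:
In [ ]:
import math

d_example = 4                     # total documents, matching len(files) above
count_example, df_example = 3, 2  # hypothetical tf and df counts

tfidf_example = float(count_example) * math.log((d_example + 1.0) / (df_example + 1.0))
print(round(tfidf_example, 4))    # 1.5325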
Let's take a look at the results for one of the files. Since keywords gets rebuilt on each pass through the loop, this shows the last file in the sequence:
In [ ]:
# check if TF-IDF is working for you - an important QA step
for json_file, tfidf, word in sorted(keywords, key=lambda x: x[1], reverse=True):
    print("%s\t%7.4f\t%s" % (json_file, tfidf, word))
Question: how does that vector of ranked keywords compare with your reading of the text from the HTML file?