The three algorithms I am presenting are:
1. Most common nouns - A simple look at noun unigrams to find the most common.
2. Most used concepts - Uses WordNet to identify common concepts from the most common nouns.
3. Proper noun extraction - Extracts proper nouns from text.
Each of these algorithms is described in more detail below.
My mystery.txt file comes from the most used concepts algorithm.
I chose these algorithms because they demonstrate different strategies and produce useful output. I really liked the output of the most used concepts algorithm for both my corpus and the mystery text, though I also found some weaknesses (discussed briefly in my guess about the mystery text below).
The Brown news corpus did stymie me a bit. I think I pulled out important concepts, but the results are scattered, so I'm not sure. I'm hoping that's because news is a very broad category with no clear topics to be found, but I look forward to seeing whether other students had algorithms that were more fruitful there.
I tried several other things, including:
1. Collocations - I tried both PMI and chi-squared measures. While some interesting entities were extracted, they did not provide much insight into the text (a rough sketch of this approach appears after this list).
2. Common verbs and their subjects - I found the most common verbs and then used chunking to find their most common subjects. This produced very confusing results; I think it was a case of a more complex algorithm not working very well.
3. General noun phrase extraction - Using chunking to find common noun phrases. This was useful, but surprisingly not much better than the simple unigram noun extraction. I decided to include the unigram method because it is foundational for the concept extraction method, which was my favorite.
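A minimal sketch of the collocation idea in item 1, assuming a flat token list like the `tokens` variable built below (this is an illustration of the approach using NLTK's bigram collocation finder, not the exact code from my experiments):

import nltk

# rank bigrams by PMI and by chi-squared and compare the top collocations
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                      # drop bigrams that appear fewer than 3 times
print finder.nbest(bigram_measures.pmi, 25)      # scored by pointwise mutual information
print finder.nbest(bigram_measures.chi_sq, 25)   # scored by chi-squared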
I thought it would be fun to record my guess of the topic of the mystery text based on all of my algorithms:
I'm certain it has to do with international commerce. I am going to guess that it's further narrowed to agriculture.
After taking a quick look at mystery.txt: Not bad! It's not completely focused on agriculture but there are a lot of articles about it. I want to figure out why other issues discussed in the corpus, such as fuel and labor, were missed.
I just checked in WordNet, and all of the synsets for "fuel" have a max depth of 3 or 4. I filtered concepts to have a depth of at least 7 to avoid having very broad concepts like organism jump in. A more intelligent way of handling this is something I'll be thinking about.
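A quick way to run that depth check (a small sketch only; the threshold of 7 matches the min_depth default used in fd_hypernyms below):

from nltk.corpus import wordnet as wn

# see how deep the "fuel" synsets sit in WordNet's hypernym hierarchy;
# anything shallower than the min_depth filter of 7 gets thrown away
for s in wn.synsets("fuel", pos=wn.NOUN):
    print s, s.max_depth()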
In [1]:
import re
from os import path
import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet as wn
import corpii
In [2]:
from urllib import urlopen
You can uncomment some lines below to use a different text source.
In [3]:
text = nltk.clean_html(corpii.load_pres_debates().raw())
# code to get text from alternate source
#text = urlopen("www.url.com").read()
Now let's tokenize, split into sentences, and tag the text.
In [4]:
# tokenize
token_regex = r"""(?x)
    # taken from nltk book example
    ([A-Z]\.)+                # abbreviations, e.g. U.S.A.
  | \w+(-\w+)*                # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?          # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                    # ellipsis
  | [][.,;"'?():-_`]          # these are separate tokens
"""
tokens = nltk.regexp_tokenize(text, token_regex)
In [5]:
# we're going to use frequency distributions a lot, so let's create a nice way of looking at those
DISPLAY_LIM = 25
def fd_view(fd, n=DISPLAY_LIM):
    """Prints a nice format of items in FreqDist fd[0:n]"""
    print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")
    print "========================================================="
    for i in fd.items()[0:n]:
        print "{:<16}|{:<16,d}|{:<16.3%}".format(i[0], i[1], fd.freq(i[0]))
In [6]:
# get sentences
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sents = list(sent_tokenizer.sentences_from_tokens(tokens))
In [7]:
#Create tagger
def build_backoff_tagger(train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2
tagger = build_backoff_tagger(brown.tagged_sents())
In [8]:
debate_sents = [tagger.tag(s) for s in sents]
debate_tags = [t for s in debate_sents for t in s]
In [9]:
brown_sents = brown.tagged_sents(categories=['news'])
brown_tags = brown.tagged_words(categories=['news'])
In [10]:
myst_file = open(path.join("text", "mystery.txt"), "r")
myst_text = myst_file.read()
myst_file.close()
In [11]:
myst_tokens = nltk.word_tokenize(myst_text)
myst_sents = [nltk.tag.pos_tag(s) for s in sent_tokenizer.sentences_from_tokens(myst_tokens)]
myst_tags = [t for s in myst_sents for t in s]
In [12]:
def common_nouns(tags, min_length=4, pos=r"N.*"):
    """Takes a list of (word, tag) pairs and returns a frequency distribution of the
    lowercased words that are at least min_length long and whose tag matches pos"""
    fd_nouns = nltk.FreqDist([t[0].lower() for t in tags if len(t[0]) >= min_length and re.match(pos, t[1])])
    return fd_nouns
In [13]:
debate_fd_nouns = common_nouns(debate_tags)
fd_view(debate_fd_nouns)
In [14]:
brown_fd_nouns = common_nouns(brown_tags)
fd_view(brown_fd_nouns)
In [15]:
myst_fd_nouns = common_nouns(myst_tags)
fd_view(myst_fd_nouns)
The simple approach does pretty well. For the debates, many of the top nouns have to do with governance. The Brown news corpus is a bit more scattered, although news is a pretty broad category itself. I would say it does a decent job with the mystery text, and from these results I would guess the mystery text has something to do with trade.
My next most successful experiment expanded on the unigram approach to find common concepts from the top nouns. Basically, I take the most common nouns and look at each of their hypernym paths to determine how often concepts are referred to in a text. This is filtered to only include hypernyms at a certain depth in WordNet's tree (after some experimenting I settled on a depth of 7).
I also played around with using the lowest common hypernym for this algorithm, but I got some weird results, like the lowest common hypernym of president and senator being organism even though they both had leader in their hypernym paths. Also, the way I was thinking of using that approach was O(n^2) in the number of nouns, which I wasn't excited about.
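For context, a sketch of the pairwise lowest-common-hypernym idea I decided against (not code from the rest of this notebook); top_nouns is hypothetical here, e.g. the first 50 keys of one of the noun frequency distributions, and comparing every pair of nouns is what makes it quadratic:

def pairwise_common_hypernyms(top_nouns):
    """Counts how often each synset appears as the lowest common hypernym
    of a pair of top nouns (first noun sense of each word only)."""
    counts = {}
    for i, w1 in enumerate(top_nouns):
        for w2 in top_nouns[i + 1:]:                 # every pair of nouns -> O(n^2)
            s1 = wn.synsets(w1, pos=wn.NOUN)
            s2 = wn.synsets(w2, pos=wn.NOUN)
            if s1 and s2:
                for h in s1[0].lowest_common_hypernyms(s2[0]):
                    counts[h] = counts.get(h, 0) + 1
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)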
In [16]:
def get_hypernyms(synsets):
    """
    Takes a list of synsets (as generated by wn.synsets) and returns a set of all of their hypernyms.
    """
    hypernyms = set()
    for synset in synsets:
        for path in synset.hypernym_paths():
            hypernyms.update([h for h in path if h != synset])
    return hypernyms
In [17]:
def fd_hypernyms(fd, depth=None, min_depth=7, pos=None):
    """
    Takes a frequency distribution and analyzes the hypernyms of the wordforms contained therein.
    Returns a list of (hypernym, weight) pairs sorted by weight, where a hypernym's weight is the
    summed frequency of the wordforms it covers.
    fd - frequency distribution
    depth - how far down fd to look (None means all wordforms)
    min_depth - a filter to only include synsets of a certain depth.
                Unintuitively, max_depth() is used to calculate the depth of a synset.
    pos - part of speech to limit synsets to
    """
    hypernyms = {}
    for wf in fd.keys()[0:depth]:
        freq = fd.freq(wf)
        hset = get_hypernyms(wn.synsets(wf, pos=pos))
        for h in hset:
            if h.max_depth() >= min_depth:
                if h in hypernyms:
                    hypernyms[h] += freq
                else:
                    hypernyms[h] = freq
    hlist = hypernyms.items()
    hlist.sort(key=lambda s: s[1], reverse=True)
    return hlist
In [18]:
def concept_printer(concepts, n=DISPLAY_LIM):
    "Prints the first n concepts in a concept list generated by fd_hypernyms"
    print "{:<20} | {:<12} | {}".format("Concept", "Concept Freq", "Definition")
    print "===================================================================="
    for s in concepts[0:n]:
        print "{:<20} | {:<12.3%} | {}".format(s[0].lemma_names[0], s[1], s[0].definition)
In [19]:
debate_concepts = fd_hypernyms(debate_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(debate_concepts)
In [20]:
brown_concepts = fd_hypernyms(brown_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(brown_concepts)
In [21]:
myst_concepts = fd_hypernyms(myst_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(myst_concepts)
I really like this one, partly because traversing concepts in WordNet just seems cool. What I like about it is that it adds insight to the simple noun counting from my first method and allows nouns that may not be common individually, but are linked in concept, to bubble up. So for the mystery text, I still see commerce as a high-ranking concept, but there is also a lot about plants and foodstuffs. So I'm guessing it has something to do with agricultural concepts.
In [22]:
def get_propnoun_fd(sents):
    """
    Finds proper nouns from tagged sentences and returns a frequency distribution of those nouns.
    """
    grammar = r"""
    NPROP: {<N+P>+|<N+P><IN.*|DT.*><N+P>+}
    # the pos tagger marks proper nouns NNP while in Brown they are NP, hence the N+P pattern
    """
    noun_parser = nltk.RegexpParser(grammar)
    trees = [t for s in sents for t in noun_parser.parse(s).subtrees() if t.node == "NPROP"]
    fd = nltk.FreqDist([" ".join([w[0] for w in t]) for t in trees])
    return fd
In [23]:
debate_fd_np = get_propnoun_fd(debate_sents)
fd_view(debate_fd_np)
In [24]:
brown_fd_np = get_propnoun_fd(brown_sents)
fd_view(brown_fd_np)
In [25]:
myst_fd_np = get_propnoun_fd(myst_sents)
fd_view(myst_fd_np)
Of my algorithms, this one is the most useful for my own corpus, because pulling out the names of the candidates and the places they talk about is very important. It does add some value to the other corpora: for Brown it lists some important subjects, and for the mystery text it makes me think the text is about Japan - U.S. relations over a period of time.
In [26]:
out_file = open("mystery.txt", "w")
In [27]:
# realizing it would have been smart to make my print functions return strings instead of just printing
def concept_csv(concepts, n=DISPLAY_LIM):
    "Creates a comma-separated string for the first n items in a concept list"
    out = []
    out.append("{},{}".format("Concept", "Concept_Freq"))
    for s in concepts[0:n]:
        out.append("{},{:.3}".format(s[0].lemma_names[0], s[1]))
    return "\n".join(out)
In [28]:
out_file.write(concept_csv(myst_concepts, 100))
out_file.close()