Keyphrase Identification

This notebook will try several methods for identifying keyphrases in text.

Setup

First, let's import modules, load text, and do the required tokenizing and tagging.


In [1]:
import re

import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

import corpii

In [6]:
from urllib import urlopen

You can uncomment some lines below to use a different text source.


In [7]:
text = nltk.clean_html(corpii.load_pres_debates().raw())

# code to get text from alternate source
#text = urlopen("www.url.com").read()

Now let's tokenize, split into sentences, and tag the text.


In [8]:
# tokenize
token_regex= """(?x)
    # taken from ntlk book example
    ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens
"""

tokens = nltk.regexp_tokenize(text, token_regex)

In [9]:
# get sentences
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

sents = list(sent_tokenizer.sentences_from_tokens(tokens))

In [10]:
#Create tagger

def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

tagger = build_backoff_tagger(brown.tagged_sents())

In [11]:
sent_tags = [tagger.tag(s) for s in sents]

In [12]:
tags = [t for s in sent_tags for t in s]

In [93]:
# we're going to use frequency distributions a lot, so let's create a nice way of looking at those

def fd_view(fd, n=10):
    """Prints a nice format of items in FreqDist fd[0:n]"""
    print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")
    print "========================================================="
    for i in fd.items()[0:n]:
        print "{:<16}|{:<16,d}|{:<16.3%}".format(i[0], i[1], fd.freq(i[0]))

Simple unigram approach

To start let's try just finding the most common nouns in the corpus and see how that does.


In [105]:
def common_nouns(tags, min_length=4, pos=r"N.*"):
    """Takes a tagset and returns a frequency distribution of the words
    that are at least min_length and whose tag matches pos"""
    fd_nouns = nltk.FreqDist([ t[0].lower() for t in tags if len(t[0]) >= min_length and re.match(pos, t[1])])
    return fd_nouns

In [106]:
# Let's look for common nouns
# I was getting some noise from very short tokens, so I'm excluding them.
fd_tokens = common_nouns(tags, 4)

In [107]:
#First let's see what some of these tokens are
fd_view(fd_tokens, 10)


Word            |Count           |Frequency       
=========================================================
president       |3,052           |2.409%          
people          |2,601           |2.053%          
senator         |1,265           |0.998%          
years           |1,202           |0.949%          
country         |1,173           |0.926%          
question        |1,153           |0.910%          
time            |1,046           |0.826%          
governor        |1,028           |0.811%          
america         |1,025           |0.809%          
bush            |937             |0.740%          

That's actually not too bad for a corpus of presidential debates.

Let's try the same thing on the brown news corpus.


In [108]:
fd_brown = common_nouns(brown.tagged_words(categories='news'), 4)
fd_view(fd_brown, 10)


Word            |Count           |Frequency       
=========================================================
mrs.            |253             |0.899%          
state           |151             |0.537%          
president       |142             |0.505%          
year            |142             |0.505%          
home            |132             |0.469%          
time            |103             |0.366%          
years           |102             |0.362%          
house           |96              |0.341%          
week            |94              |0.334%          
city            |93              |0.330%          

hmmmmm... a lot of things that sound newsy in there. A little scattered, but the news corpus probably covers a lot of ground.

A more complex approach with collocations

Let's see if doing something more complex can beat the simple approach. One weakness with unigrams is that it only catches topics that are one word long. Let's try using collocations and PMI to see if we can find some interesting word combinations.


In [18]:
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [19]:
def bigram_collocs(tokens, mfreq=3, measure=bigram_measures.pmi, n=10):
    """
    Look for collocations in token list:tokens
    
    args:
    mfreq - minimum frequency to be included
    measure - The measure to be used to find collocations
    n - The number of matches to show
    """
    finder = nltk.BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(mfreq)
    return finder.nbest(measure, 10)

In [20]:
bigram_collocs(tokens)


Out[20]:
[('A.Q.', 'Khan'),
 ('Achilles', 'heel'),
 ('Beta', 'Kappa'),
 ('Costa', 'Rica'),
 ('Dana', 'Crist'),
 ('EXECUTIVE', 'EDITOR'),
 ('Helping', 'Hand'),
 ('NEW', 'YORK'),
 ('Occupational', 'Safety'),
 ('Phi', 'Beta')]

In [21]:
bigram_collocs(brown.words(categories=['news']))


Out[21]:
[('Sterling', 'Township'),
 ('Duncan', 'Phyfe'),
 ('Milwaukee', 'Braves'),
 ('magnetic', 'tape'),
 ('Dolce', 'Vita'),
 ('Notre', 'Dame'),
 ('Scottish', 'Rite'),
 ('Thrift', 'Shop'),
 ('Adlai', 'Stevenson'),
 ('Lady', 'Jacqueline')]

That isn't too helpful, maybe the chi measure?


In [22]:
bigram_collocs(tokens, measure=bigram_measures.chi_sq)


Out[22]:
[('Left', 'Behind'),
 ('Los', 'Angeles'),
 ('Prime', 'Minister'),
 ('El', 'Salvador'),
 ('Chiang', 'Kai-shek'),
 ('Jong', 'Il'),
 ('Training', 'Partnership'),
 ('et', 'cetera'),
 ('PARTICIPATE', 'IN'),
 ('Planned', 'Parenthood')]

In [23]:
bigram_collocs(brown.words(categories=['news']), measure=bigram_measures.chi_sq)


Out[23]:
[('Viet', 'Nam'),
 ('Hong', 'Kong'),
 ('Dolce', 'Vita'),
 ('Notre', 'Dame'),
 ('Scottish', 'Rite'),
 ('Duncan', 'Phyfe'),
 ('Sterling', 'Township'),
 ('Los', 'Angeles'),
 ('per', 'cent'),
 ('Thrift', 'Shop')]

I think one problem with applying the collocations metric to my corpus is that it is very large and covers several debates where different topics are important, so it doesn't create a very coherent result. I will try running it against individual debates to see if that's more interesting.


In [24]:
debates = corpii.load_pres_debates()

In [25]:
for debate in debates.fileids():
    d = nltk.clean_html(debates.raw(fileids=[debate]))
    d_tokens = nltk.regexp_tokenize(debate, token_regex)
    collocs = bigram_collocs(d_tokens, n=5, mfreq=2)
    if collocs:
        print collocs
    else:
        print "No collocations found in {}".format(debate)


No collocations found in First_half_of_Debate.txt
No collocations found in October_11,_1984:_The_Bush-Ferraro_Vice_Presidential_Debate.txt
No collocations found in October_11,_2000:_The_Second_Gore-Bush_Presidential_Debate.txt
No collocations found in October_11,_2012:_The_Biden-Ryan_Vice_Presidential_Debate.txt
No collocations found in October_13,_1960:_The_Third_Kennedy-Nixon_Presidential_Debate.txt
No collocations found in October_13,_1988:_The_Second_Bush-Dukakis_Presidential_Debate.txt
No collocations found in October_13,_1992:_The_Gore-Quayle-Stockdale_Vice_Presidential_Debate.txt
No collocations found in October_13,_2004:_The_Third_Bush-Kerry_Presidential_Debate.txt
No collocations found in October_15,_2008:_The_Third_McCain-Obama_Presidential_Debate.txt
No collocations found in October_16,_1996:_The_Second_Clinton-Dole_Presidential_Debate.txt
No collocations found in October_16,_2012:_The_Second_Obama-Romney_Presidential_Debate.txt
No collocations found in October_17,_2000:_The_Third_Gore-Bush_Presidential_Debate.txt
No collocations found in October_19,_1992:_The_Third_Clinton-Bush-Perot_Presidential_Debate.txt
No collocations found in October_2,_2008:_The_Biden-Palin_Vice_Presidential_Debate.txt
No collocations found in October_21,_1960:_The_Fourth_Kennedy-Nixon_Presidential_Debate.txt
No collocations found in October_21,_1984:_The_Second_Reagan-Mondale_Presidential_Debate.txt
No collocations found in October_22,_1976:_The_Third_Carter-Ford_Presidential_Debate.txt
No collocations found in October_22,_2012:_The_Third_Obama-Romney_Presidential_Debate.txt
No collocations found in October_28,_1980:_The_Carter-Reagan_Presidential_Debate.txt
No collocations found in October_3,_2000:_The_First_Gore-Bush_Presidential_Debate.txt
No collocations found in October_3,_2012:_The_First_Obama-Romney_Presidential_Debate.txt
No collocations found in October_5,_1988:_The_Bentsen-Quayle_Vice_Presidential_Debate.txt
No collocations found in October_5,_2000:_The_Lieberman-Cheney_Vice_Presidential_Debate.txt
No collocations found in October_5,_2004:_The_Cheney-Edwards_Vice_Presidential_Debate.txt
No collocations found in October_6,_1976:_The_Second_Carter-Ford_Presidential_Debate.txt
No collocations found in October_6,_1996:_The_First_Clinton-Dole_Presidential_Debate.txt
No collocations found in October_7,_1960:_The_Second_Kennedy-Nixon_Presidential_Debate.txt
No collocations found in October_7,_1984:_The_First_Reagan-Mondale_Presidential_Debate.txt
No collocations found in October_7,_2008:_The_Second_McCain-Obama_Presidential_Debate.txt
No collocations found in October_8,_2004:_The_Second_Bush-Kerry_Presidential_Debate.txt
No collocations found in October_9,_1996:_The_Gore-Kemp_Vice_Presidential_Debate.txt
No collocations found in Second_half_of_Debate.txt
No collocations found in September_21,_1980:_The_Anderson-Reagan_Presidential_Debate.txt
No collocations found in September_23,_1976:_The_First_Carter-Ford_Presidential_Debate.txt
No collocations found in September_25,_1988:_The_First_Bush-Dukakis_Presidential_Debate.txt
No collocations found in September_26,_1960:_The_First_Kennedy-Nixon_Presidential_Debate.txt
No collocations found in September_26,_2008:_The_First_McCain-Obama_Presidential_Debate.txt
No collocations found in September_30,_2004:_The_First_Bush-Kerry_Presidential_Debate.txt

Well that didn't work, even with a lowered minimum frequency. (There are results when mfreq is lowered to 1, but they're pretty useless)

Gleaning topics from unigrams

Let's go back to the unigrams and see if we can use wordnet to abstract the most frequent nouns into concepts.


In [22]:
def get_hypernyms(synsets):
    """
    Takes a list of synsets (as generated by wn.synsets) and returns a list of all hypernyms. 
    """
    hypernyms = set()
    for synset in synsets:
        for path in synset.hypernym_paths():
            hypernyms.update([h for h in path if h != synset])
    return hypernyms

In [23]:
def common_hypernyms(wordforms, min_depth=3):
    """
    Takes a list of wordforms and extracts all hypernyms associated with the wordforms. 
    Returns a frequency distribution of of the sysnsets extracted. 
    arguments:
        wordforms - Wordforms to be processed
        min_depth - A filter to only include synsets of a certain depth.
                    Unintuitively, max_depth is used to calculate the depth of a synset. 
    """
    hypernyms = []
    for l in wordforms:
        hset = get_hypernyms(wn.synsets(l, pos=wn.NOUN))
        hypernyms.extend(h for h in hset if h.max_depth()>=min_depth)
    return nltk.FreqDist(hypernyms)

In [24]:
def fd_hypernyms(fd, depth=None, min_depth=3, pos=None):
    """
    Takes a frequency distribution and analyzes the hypernyms of the wordforms contained therein. 
    Returns a weighted 
    fd - frequency distribution
    depth - How far down fd to look
    min_depth - A filter to only include synsets of a certain depth.
                Unintuitively, max_depth is used to calculate the depth of a synset. 
    pos - part of speech to limit sysnsets to
    """
    hypernyms = {}
    for wf in fd.keys()[0:depth]:
        freq = fd.freq(wf)
        hset = get_hypernyms(wn.synsets(wf, pos=pos))
        for h in hset:
            if h.max_depth()>=min_depth:
                if h in hypernyms:
                    hypernyms[h] += freq
                else:
                    hypernyms[h] = freq
    
    hlist = hypernyms.items()
    hlist.sort(key=lambda s: s[1], reverse=True)
    return hlist

In [25]:
debate_concepts = fd_hypernyms(fd_tokens, pos=wn.NOUN, min_depth=7)

In [26]:
[s[0].definition for s in debate_concepts[0:10]]


Out[26]:
['a person who rules or guides or inspires others',
 'a person who communicates with others',
 'someone who negotiates (confers with others in order to reach a settlement)',
 'a person who represents others',
 'the chief public representative of a country who may also be the head of government',
 'a person who is in charge',
 'someone who administers a business',
 'the leader of a group meeting',
 'a person responsible for the administration of a business',
 'an administrative unit in government or business']

In [27]:
def concept_printer(concepts, n=10):
    "Prints first n concepts in concept list generated by fd_hypernyms"
    print "{:<20} | {:<10} | {}".format("Concept", "Noun Freq", "Definition")
    print "===================================================================="
    for s in debate_concepts[0:10]:
        print "{:<20} | {:<10.3%} |  {}".format(s[0].lemma_names[0], s[1], s[0].definition)

In [28]:
concept_printer(debate_concepts, 10)


Concept              | Noun Freq  | Definition
====================================================================
leader               | 7.989%     |  a person who rules or guides or inspires others
communicator         | 5.141%     |  a person who communicates with others
negotiator           | 4.295%     |  someone who negotiates (confers with others in order to reach a settlement)
representative       | 3.958%     |  a person who represents others
head_of_state        | 3.909%     |  the chief public representative of a country who may also be the head of government
head                 | 3.396%     |  a person who is in charge
administrator        | 3.119%     |  someone who administers a business
presiding_officer    | 3.073%     |  the leader of a group meeting
executive            | 3.050%     |  a person responsible for the administration of a business
division             | 2.969%     |  an administrative unit in government or business

In [29]:
syn = debate_concepts[2][0]

In [30]:
syn.lemma_names


Out[30]:
['negotiator', 'negotiant', 'treater']

In [31]:
brown_concepts = fd_hypernyms(fd_brown, pos=wn.NOUN, min_depth=7)

In [32]:
concept_printer(brown_concepts)


Concept              | Noun Freq  | Definition
====================================================================
leader               | 7.989%     |  a person who rules or guides or inspires others
communicator         | 5.141%     |  a person who communicates with others
negotiator           | 4.295%     |  someone who negotiates (confers with others in order to reach a settlement)
representative       | 3.958%     |  a person who represents others
head_of_state        | 3.909%     |  the chief public representative of a country who may also be the head of government
head                 | 3.396%     |  a person who is in charge
administrator        | 3.119%     |  someone who administers a business
presiding_officer    | 3.073%     |  the leader of a group meeting
executive            | 3.050%     |  a person responsible for the administration of a business
division             | 2.969%     |  an administrative unit in government or business

Chunking Noun Phrases

I want to try and pull out noun phrases, as I think that is pretty important for my corpus


In [101]:
def get_propnoun_fd(sents):
    """
    Finds proper nouns from tagged sentences and returns a frequency distribution of those nouns.
    """
    grammar = r"""
        NPROP: {<NP>+|<NP><IN.*|DT.*><NP>+}
    """

    noun_parser = nltk.RegexpParser(grammar)
    
    trees = [t for s in sents for t in noun_parser.parse(s).subtrees() if t.node == "NPROP"]
    fd = nltk.FreqDist([" ".join([w[0] for w in t]) for t in trees])
    return fd

In [117]:
fd_debate_np = get_propnoun_fd(sent_tags)

In [103]:
fd_view(fd_debate_np)


Word            |Count           |Frequency       
=========================================================
America         |847             |8.844%          
Congress        |420             |4.386%          
Bush            |332             |3.467%          
Iraq            |309             |3.226%          
John            |261             |2.725%          
Iran            |204             |2.130%          
Kennedy         |202             |2.109%          
Washington      |199             |2.078%          
Republican      |190             |1.984%          
Carter          |189             |1.973%          

In [104]:
fd_brown_np = get_propnoun_fd(brown.tagged_sents(categories=['news']))
fd_view(fd_brown_np)


Word            |Count           |Frequency       
=========================================================
Mr.             |51              |1.139%          
U.S.            |44              |0.983%          
Kennedy         |41              |0.916%          
Mantle          |41              |0.916%          
Dallas          |37              |0.827%          
Laos            |36              |0.804%          
Palmer          |36              |0.804%          
Maris           |33              |0.737%          
Congo           |30              |0.670%          
Washington      |29              |0.648%          

Looking at common verbs and their objects

I think it would be interesting to look at common verbs, then use chunking to see what their objects are generally.


In [110]:
porter = nltk.stem.PorterStemmer()

def common_verbs(tags, min_length=4, pos=r"V.*"):
    """Takes a tagset and returns a frequency distribution of the words
    that are at least min_length and whose tag matches pos"""
    fd_verbs = nltk.FreqDist([ porter.stem(t[0].lower()) for t in tags if len(t[0]) >= min_length and re.match(pos, t[1])])
    return fd_verbs

In [111]:
fd_debate_vb = common_verbs(tags)
fd_view(fd_debate_vb)


Word            |Count           |Frequency       
=========================================================
think           |2,344           |4.052%          
go              |2,117           |3.660%          
make            |1,700           |2.939%          
want            |1,697           |2.934%          
said            |1,401           |2.422%          
know            |1,375           |2.377%          
believ          |993             |1.717%          
take            |904             |1.563%          
work            |830             |1.435%          
unit            |828             |1.431%          

In [118]:
def get_verb_phrases(sent_tags):
    grammar = r"""
        NPROP: {<NP>+|<NP><IN.*|DT.*><NP>+}
        VPHRASE: {<V.*><DET>?<NPROP>}
    """
    
    parser = nltk.RegexpParser(grammar)
    trees = [t for s in sent_tags for t in parser.parse(s).subtrees() if t.node == "VPHRASE"]
    return trees

In [157]:
def get_verb_subjects_fd(vphrases, verb, stem=True):
    """
    Takes a list of verb phrases, as made by get_verb_phrases and a verb
    and returns a frequency distribution of the subjects of that verb. 
    """
       
    if stem:
        verb = porter.stem(verb)
    
    subjects = []
    
    for phrase in vphrases:
        v = phrase[0][0]
        if stem:
            v = porter.stem(v)
        
        if v == verb:
            print phrase
            subjects.append(" ".join([w[0] for w in phrase[1]]))
    
    return nltk.FreqDist(subjects)

In [158]:
vphrases = get_verb_phrases(sent_tags)

In [152]:
def get_top_verb_subjects(sent_tags, n=10):
    """
    Takes a list of tagged sentences, finds the most common verbs and
    creates a frequency distribution for the subjects of the most used verbs. 
    
    n - the number of verbs to test
    """
    
    vphrases = get_verb_phrases(sent_tags)
    verbs = common_verbs([t for s in sent_tags for t in s])
    
    v_subjects = {}
    
    for v in verbs.keys()[0:n]:
        v_subjects[v] = get_verb_subjects_fd(vphrases, v)
        
    return v_subjects

In [159]:
debates_vsubjs = get_top_verb_subjects(sent_tags)


(VPHRASE think/VB (NPROP Iran/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE think/VB (NPROP Ross/NP))
(VPHRASE think/VB (NPROP Ross/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE think/VB (NPROP George/NP W./NP Bush/NP))
(VPHRASE think/VB (NPROP Al/NP Gore/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE think/VB (NPROP John/NP))
(VPHRASE think/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE making/VBG (NPROP Illinois/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP John/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE making/VBG (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP Israel/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE make/VB (NPROP America/NP))
(VPHRASE want/VB (NPROP America/NP))
(VPHRASE want/VB (NPROP Joe/NP))
(VPHRASE want/VB (NPROP Roe/NP))
(VPHRASE want/VB (NPROP America/NP))
(VPHRASE want/VB (NPROP U.S./NP))
(VPHRASE said/VBD (NPROP Barry/NP Goldwater/NP))
(VPHRASE said/VBD (NPROP Bob/NP))
(VPHRASE said/VBD (NPROP Jim/NP Baker/NP))
(VPHRASE said/VBD (NPROP George/NP))
(VPHRASE said/VBD (NPROP Joey/NP))
(VPHRASE said/VBD (NPROP Russia/NP))
(VPHRASE said/VBD (NPROP Russia/NP))
(VPHRASE said/VBN (NPROP Iraq/NP))
(VPHRASE said/VBD (NPROP America/NP))
(VPHRASE said/VBN (NPROP America/NP))
(VPHRASE said/VBD (NPROP Bob/NP))
(VPHRASE know/VB (NPROP Barbara/NP))
(VPHRASE know/VB (NPROP Al/NP))
(VPHRASE know/VB (NPROP Sarah/NP))
(VPHRASE know/VB (NPROP America/NP))
(VPHRASE know/VB (NPROP Gwen/NP))
(VPHRASE know/VB (NPROP Donald/NP))
(VPHRASE knows/VBZ (NPROP Missouri/NP))
(VPHRASE believe/VB (NPROP Washington/NP))
(VPHRASE believe/VB (NPROP Bill/NP Clinton/NP))
(VPHRASE believe/VB (NPROP America/NP))
(VPHRASE believe/VB (NPROP Roe/NP))
(VPHRASE believe/VB (NPROP John/NP))
(VPHRASE believe/VB (NPROP John/NP))
(VPHRASE believe/VB (NPROP Castro/NP))
(VPHRASE believe/VB (NPROP China/NP))
(VPHRASE believe/VB (NPROP America/NP))
(VPHRASE believe/VB (NPROP America/NP))
(VPHRASE believe/VB (NPROP America/NP))
(VPHRASE believe/VB (NPROP America/NP))
(VPHRASE take/VB (NPROP Russia/NP))
(VPHRASE taking/VBG (NPROP Formosa/NP))
(VPHRASE take/VB (NPROP Joe/NP))
(VPHRASE take/VB (NPROP Joe/NP))
(VPHRASE take/VB (NPROP California/NP))
(VPHRASE take/VB (NPROP Detroit/NP))
(VPHRASE take/VB (NPROP Detroit/NP))
(VPHRASE take/VB (NPROP America/NP))
(VPHRASE take/VB (NPROP Israel/NP))
(VPHRASE take/VB (NPROP America/NP))
(VPHRASE take/VB (NPROP America/NP))

In [156]:
for k in debates_vsubjs.keys():
    print k
    fd_view(debates_vsubjs[k])
    print "********************************************"


said
Word            |Count           |Frequency       
=========================================================
America         |2               |18.182%         
Bob             |2               |18.182%         
Russia          |2               |18.182%         
Barry Goldwater |1               |9.091%          
George          |1               |9.091%          
Iraq            |1               |9.091%          
Jim Baker       |1               |9.091%          
Joey            |1               |9.091%          
********************************************
make
Word            |Count           |Frequency       
=========================================================
America         |15              |83.333%         
Illinois        |1               |5.556%          
Israel          |1               |5.556%          
John            |1               |5.556%          
********************************************
work
Word            |Count           |Frequency       
=========================================================
********************************************
know
Word            |Count           |Frequency       
=========================================================
Al              |1               |14.286%         
America         |1               |14.286%         
Barbara         |1               |14.286%         
Donald          |1               |14.286%         
Gwen            |1               |14.286%         
Missouri        |1               |14.286%         
Sarah           |1               |14.286%         
********************************************
want
Word            |Count           |Frequency       
=========================================================
America         |2               |40.000%         
Joe             |1               |20.000%         
Roe             |1               |20.000%         
U.S.            |1               |20.000%         
********************************************
go
Word            |Count           |Frequency       
=========================================================
********************************************
take
Word            |Count           |Frequency       
=========================================================
America         |3               |27.273%         
Detroit         |2               |18.182%         
Joe             |2               |18.182%         
California      |1               |9.091%          
Formosa         |1               |9.091%          
Israel          |1               |9.091%          
Russia          |1               |9.091%          
********************************************
think
Word            |Count           |Frequency       
=========================================================
America         |6               |50.000%         
Ross            |2               |16.667%         
Al Gore         |1               |8.333%          
George W. Bush  |1               |8.333%          
Iran            |1               |8.333%          
John            |1               |8.333%          
********************************************
unit
Word            |Count           |Frequency       
=========================================================
********************************************
believ
Word            |Count           |Frequency       
=========================================================
America         |5               |41.667%         
John            |2               |16.667%         
Bill Clinton    |1               |8.333%          
Castro          |1               |8.333%          
China           |1               |8.333%          
Roe             |1               |8.333%          
Washington      |1               |8.333%          
********************************************