Keyphrase Identification

This notebook will try several methods for identifying keyphrases in text.

Overview

Algorithms

The three algorithms I am presenting are:

1. Most common nouns - A simple look at noun unigrams to find the most common. 
2. Most used concepts - Uses WordNet to identify common concepts from the most common nouns. 
3. Proper noun extraction - Extracts proper nouns from text. 

There is more description of these algorithms below.

My mystery.txt file comes from the most used concepts algorithm.

I chose these algorithms because they show some different strategies and provide some useful output. I really liked the output of the most used concepts algorithm for both my corpus and the mystery text, though I also found some weaknesses (discussed a bit in my guess on the mystery text below).

I had a tough time interpreting results for the brown news corpus. I'm hoping it's because news is very broad, so there are no clear categories to be found but I look forward to seeing if other students had algorithms that were more fruitful there.

The brown news corpus did stimy me a bit. I think I pulled out important concepts but it's very scattered, so I'm not sure. I'm hoping it's because news is very broad, so there are no clear categories to be found but I look forward to seeing if other students had algorithms that were more fruitful there.

Other Experiments

I tried several other things including

1. Collocations - I tried both PMI and chi squared measures. While some interesting entities were extracted, they did not provide much insight onto the text. 
2. Common verbs and their subjects - I tried finding the most common verbs and then used chunking to find the most common subjects of the most common verbs. This produced very confusing results. I think this was a case of a more complex algorithm not working very well. 
3. General noun phrase extraction - Using chunking to find common noun phrases. This was useful but surprisingly not much better than the simple unigram noun extraction. I decided to include the unigram method because it is foundational for the concept extraction method, which was my favorite. 

Guess on Mystery Text

I thought it would be fun to record my guess of the topic of the mystery text based on all of my algorithms:

I'm certain it has to do with international commerce. I am going to guess that it's further narrowed to agriculture.

After taking a quick look at mystery.txt: Not bad! It's not completely focused on agriculture but there are a lot of articles about it. I want to figure out why other issues discussed in the corpus, such as fuel and labor, were missed.

I just checked in wordnet and all of the synsets from fuel have a max depth of 3 or 4. I filtered concepts to have a depth of at least 7, to avoid having very broad concepts like organism jump in. A more intelligent way of doing this is something I'll be thinking about.

Setup

First, let's import modules, load text, and do the required tokenizing and tagging.


In [1]:
import re
from os import path

import nltk
from nltk.corpus import brown
from nltk.corpus import wordnet as wn

import corpii

In [2]:
from urllib import urlopen

You can uncomment some lines below to use a different text source.


In [3]:
text = nltk.clean_html(corpii.load_pres_debates().raw())

# code to get text from alternate source
#text = urlopen("www.url.com").read()

Now let's tokenize, split into sentences, and tag the text.


In [4]:
# tokenize
token_regex= """(?x)
    # taken from ntlk book example
    ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
    | \w+(-\w+)*        # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
    | \.\.\.            # ellipsis
    | [][.,;"'?():-_`]  # these are separate tokens
"""

tokens = nltk.regexp_tokenize(text, token_regex)

In [5]:
# we're going to use frequency distributions a lot, so let's create a nice way of looking at those
DISPLAY_LIM = 25

def fd_view(fd, n=DISPLAY_LIM):
    """Prints a nice format of items in FreqDist fd[0:n]"""
    print "{:<16}|{:<16}|{:<16}".format("Word", "Count", "Frequency")
    print "========================================================="
    for i in fd.items()[0:n]:
        print "{:<16}|{:<16,d}|{:<16.3%}".format(i[0], i[1], fd.freq(i[0]))

In [6]:
# get sentences
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

sents = list(sent_tokenizer.sentences_from_tokens(tokens))

In [7]:
#Create tagger

def build_backoff_tagger (train_sents):
    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2

tagger = build_backoff_tagger(brown.tagged_sents())

In [8]:
debate_sents = [tagger.tag(s) for s in sents]
debate_tags = [t for s in debate_sents for t in s]

In [9]:
brown_sents = brown.tagged_sents(categories=['news'])
brown_tags = brown.tagged_words(categories=['news'])

In [10]:
myst_file = open(path.join("text", "mystery.txt"), "r")
myst_text = myst_file.read()
myst_file.close()

In [11]:
myst_tokens = nltk.word_tokenize(myst_text)
myst_sents = [nltk.tag.pos_tag(s) for s in sent_tokenizer.sentences_from_tokens(myst_tokens)]
myst_tags = [t for s in myst_sents for t in s]

1 - Simple unigram approach

To start let's try just finding the most common nouns in the corpus and see how that does.


In [12]:
def common_nouns(tags, min_length=4, pos=r"N.*"):
    """Takes a tagset and returns a frequency distribution of the words
    that are at least min_length and whose tag matches pos"""
    fd_nouns = nltk.FreqDist([ t[0].lower() for t in tags if len(t[0]) >= min_length and re.match(pos, t[1])])
    return fd_nouns

In [13]:
debate_fd_nouns = common_nouns(debate_tags)
fd_view(debate_fd_nouns)


Word            |Count           |Frequency       
=========================================================
president       |3,052           |2.409%          
people          |2,601           |2.053%          
senator         |1,265           |0.998%          
years           |1,202           |0.949%          
country         |1,173           |0.926%          
question        |1,153           |0.910%          
time            |1,046           |0.826%          
governor        |1,028           |0.811%          
america         |1,025           |0.809%          
bush            |937             |0.740%          
states          |886             |0.699%          
government      |856             |0.676%          
world           |849             |0.670%          
administration  |659             |0.520%          
plan            |651             |0.514%          
obama           |648             |0.511%          
jobs            |618             |0.488%          
year            |600             |0.474%          
security        |595             |0.470%          
money           |576             |0.455%          
things          |573             |0.452%          
percent         |571             |0.451%          
health          |552             |0.436%          
care            |532             |0.420%          
lehrer          |507             |0.400%          

In [14]:
brown_fd_nouns = common_nouns(brown_tags)
fd_view(brown_fd_nouns)


Word            |Count           |Frequency       
=========================================================
mrs.            |253             |0.899%          
state           |151             |0.537%          
president       |142             |0.505%          
year            |142             |0.505%          
home            |132             |0.469%          
time            |103             |0.366%          
years           |102             |0.362%          
house           |96              |0.341%          
week            |94              |0.334%          
city            |93              |0.330%          
school          |87              |0.309%          
committee       |75              |0.266%          
members         |74              |0.263%          
government      |73              |0.259%          
university      |70              |0.249%          
bill            |69              |0.245%          
kennedy         |66              |0.235%          
john            |65              |0.231%          
night           |65              |0.231%          
program         |65              |0.231%          
board           |64              |0.227%          
administration  |62              |0.220%          
county          |61              |0.217%          
states          |60              |0.213%          
meeting         |58              |0.206%          

In [15]:
myst_fd_nouns = common_nouns(myst_tags)
fd_view(myst_fd_nouns)


Word            |Count           |Frequency       
=========================================================
said.           |606             |2.244%          
tonnes          |472             |1.748%          
u.s.            |425             |1.573%          
dlrs            |320             |1.185%          
dollar          |283             |1.048%          
trade           |262             |0.970%          
wheat           |235             |0.870%          
japan           |231             |0.855%          
market          |197             |0.729%          
prices          |196             |0.726%          
year            |191             |0.707%          
coffee          |188             |0.696%          
bank            |181             |0.670%          
week            |159             |0.589%          
export          |142             |0.526%          
exports         |134             |0.496%          
gold            |132             |0.489%          
price           |132             |0.489%          
rice            |129             |0.478%          
stocks          |123             |0.455%          
grain           |115             |0.426%          
production      |114             |0.422%          
april           |111             |0.411%          
government      |111             |0.411%          
department      |109             |0.404%          

Analysis

The simple approach does pretty well. For the debates, many of the top nouns have to do with governance. The Brown news corpus is a bit more scattered, although news is a pretty broad category itself. I would say it does a decent job with mystery text and from these results would guess the mystery text has something to do with trade,

2 - Gleaning topics from unigrams

My next most succesffull experiment expanded on the unigram approach to find common concepts from the top nouns. Basically, I take the most common nouns and look at each of their hypernym paths and to determine how often concepts are referred to in a text. This is filtered to only include hypernyms at a certain depth in WordNet's tree (after some experimenting I settled on a depth of 7).

I also played around with using least common hypernym for this algorithm but I got some weird results, like the lowest hypernym of president and senator being organism when they both had leader in their hypernym paths. Also, the way I was thinking of using that algorithm had O(n)=n^2, which I wasn't excited about.


In [16]:
def get_hypernyms(synsets):
    """
    Takes a list of synsets (as generated by wn.synsets) and returns a list of all hypernyms. 
    """
    hypernyms = set()
    for synset in synsets:
        for path in synset.hypernym_paths():
            hypernyms.update([h for h in path if h != synset])
    return hypernyms

In [17]:
def fd_hypernyms(fd, depth=None, min_depth=7, pos=None):
    """
    Takes a frequency distribution and analyzes the hypernyms of the wordforms contained therein. 
    Returns a weighted 
    fd - frequency distribution
    depth - How far down fd to look
    min_depth - A filter to only include synsets of a certain depth.
                Unintuitively, max_depth is used to calculate the depth of a synset. 
    pos - part of speech to limit sysnsets to
    """
    hypernyms = {}
    for wf in fd.keys()[0:depth]:
        freq = fd.freq(wf)
        hset = get_hypernyms(wn.synsets(wf, pos=pos))
        for h in hset:
            if h.max_depth()>=min_depth:
                if h in hypernyms:
                    hypernyms[h] += freq
                else:
                    hypernyms[h] = freq
    
    hlist = hypernyms.items()
    hlist.sort(key=lambda s: s[1], reverse=True)
    return hlist

In [18]:
def concept_printer(concepts, n=DISPLAY_LIM):
    "Prints first n concepts in concept list generated by fd_hypernyms"
    print "{:<20} | {:<10} | {}".format("Concept", "Concept Freq", "Definition")
    print "===================================================================="
    for s in concepts[0:n]:
        print "{:<20} | {:<12.3%} |  {}".format(s[0].lemma_names[0], s[1], s[0].definition)

In [19]:
debate_concepts = fd_hypernyms(debate_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(debate_concepts)


Concept              | Concept Freq | Definition
====================================================================
leader               | 8.538%       |  a person who rules or guides or inspires others
communicator         | 6.751%       |  a person who communicates with others
negotiator           | 5.820%       |  someone who negotiates (confers with others in order to reach a settlement)
representative       | 5.516%       |  a person who represents others
head_of_state        | 5.472%       |  the chief public representative of a country who may also be the head of government
head                 | 3.204%       |  a person who is in charge
administrator        | 2.954%       |  someone who administers a business
executive            | 2.889%       |  a person responsible for the administration of a business
presiding_officer    | 2.770%       |  the leader of a group meeting
country              | 2.684%       |  the territory occupied by a nation
division             | 2.677%       |  an administrative unit in government or business
worker               | 2.602%       |  a person who works at a specific occupation
position             | 2.527%       |  a job in an organization
department           | 2.452%       |  a specialized division of a large organization
corporate_executive  | 2.451%       |  an executive in a business corporation
change_of_state      | 2.448%       |  the act of changing something into something different in essential characteristics
academic_administrator | 2.447%       |  an administrator in a college or university
presidency           | 2.444%       |  the office and function of president
President_of_the_United_States | 2.425%       |  the person who holds the office of head of state of the United States government
family               | 2.309%       |  people descended from a common ancestor
politician           | 2.293%       |  a leader engaged in civil administration
curiosity            | 2.292%       |  a state in which you want to learn more about something
interest             | 2.053%       |  a sense of concern with and curiosity about someone or something
concern              | 2.008%       |  something that interests you because it is important or affects you
government_department | 1.950%       |  a department of government

In [20]:
brown_concepts = fd_hypernyms(brown_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(brown_concepts)


Concept              | Concept Freq | Definition
====================================================================
communicator         | 5.010%       |  a person who communicates with others
leader               | 4.985%       |  a person who rules or guides or inspires others
worker               | 3.820%       |  a person who works at a specific occupation
skilled_worker       | 3.006%       |  a worker who has acquired special skills
municipality         | 2.935%       |  an urban district having corporate status and powers of self-government
adult                | 2.761%       |  a fully developed person from maturity onward
writer               | 2.512%       |  writes (books or stories or articles or the like) professionally (for pay)
contestant           | 2.405%       |  a person who participates in competitions
creator              | 2.398%       |  a person who grows or makes or invents things
negotiator           | 2.388%       |  someone who negotiates (confers with others in order to reach a settlement)
representative       | 2.331%       |  a person who represents others
entertainer          | 2.302%       |  a person who tries to please or amuse
performer            | 2.260%       |  an entertainer who performs a dramatic or musical work for an audience
city                 | 2.196%       |  a large and densely populated urban area; may include several independent administrative districts
head_of_state        | 2.150%       |  the chief public representative of a country who may also be the head of government
motion               | 2.075%       |  the act of changing location from one place to another
player               | 1.762%       |  a person who participates in or is skilled at some game
change_of_state      | 1.691%       |  the act of changing something into something different in essential characteristics
relative             | 1.670%       |  a person related by blood or marriage
division             | 1.631%       |  an administrative unit in government or business
commerce             | 1.528%       |  transactions (sales and purchases) having the objective of supplying commodities (goods and services)
scientist            | 1.425%       |  a person with advanced knowledge of one or more sciences
athlete              | 1.414%       |  a person trained to compete in sports
department           | 1.400%       |  a specialized division of a large organization
sports_equipment     | 1.322%       |  equipment needed to participate in a particular sport

In [21]:
myst_concepts = fd_hypernyms(myst_fd_nouns, pos=wn.NOUN, min_depth=7)
concept_printer(myst_concepts)


Concept              | Concept Freq | Definition
====================================================================
country              | 5.131%       |  the territory occupied by a nation
commerce             | 5.013%       |  transactions (sales and purchases) having the objective of supplying commodities (goods and services)
vascular_plant       | 4.532%       |  green plant having a vascular system: ferns, gymnosperms, angiosperms
herb                 | 3.047%       |  a plant lacking a permanent woody stem; many are flowering garden plants or potherbs; some having medicinal properties; some are pests
gramineous_plant     | 2.829%       |  cosmopolitan herbaceous or woody plants with hollow jointed stems and long narrow leaves
grass                | 2.810%       |  narrow-leaved green herbage: grown as lawns; used as pasture for grazing animals; cut and dried as hay
cereal               | 2.680%       |  grass whose starchy grains are used as food: wheat; rice; rye; oats; maize; buckwheat; millet
commercial_enterprise | 2.488%       |  the activity of providing goods and services involving financial and commercial and industrial aspects
financial_gain       | 2.477%       |  the amount of monetary gain
income               | 2.455%       |  the financial gain (earned or unearned) accruing over a given period of time
communicator         | 2.440%       |  a person who communicates with others
worker               | 2.077%       |  a person who works at a specific occupation
division             | 1.981%       |  an administrative unit in government or business
entertainer          | 1.947%       |  a person who tries to please or amuse
North_American_country | 1.936%       |  any country on the North American continent
net_income           | 1.910%       |  the excess of revenues over outlays in a given period of time (including depreciation and other non-cash expenses)
reproductive_structure | 1.896%       |  the parts of a plant involved in its reproduction
accumulation         | 1.892%       |  (finance) profits that are not paid out as dividends but are added to the capital base of the corporation
performer            | 1.881%       |  an entertainer who performs a dramatic or musical work for an audience
fruit                | 1.877%       |  the ripened reproductive body of a seed plant
motion               | 1.825%       |  the act of changing location from one place to another
seed                 | 1.777%       |  a small hard fruit
capitalist           | 1.736%       |  a person who invests capital in a business (especially a large business)
industry             | 1.714%       |  the organized action of making of goods and services for sale
federal_government   | 1.710%       |  a government with strong central powers

Analysis

I really like this one, partly because traversing concepts in wordnet just seems cool. What I like about this is how it adds some insight to the simple noun counting from my first method and allows nouns that may not be common individually but are linked in cocept to bubble up. So for the mystery text, I still see commerce as a high ranking concept but there is also a lot about plants and food stuffs. So I'm guessing it has something to do with agricultural concepts.

Finding Proper Nouns

For my corpus (and many others), pulling out noun phrases could be very important. I used chunking for this. This wasn't the most exciting algorithm but it's an important one and produced more useful results than my other experiments.


In [22]:
def get_propnoun_fd(sents):
    """
    Finds proper nouns from tagged sentences and returns a frequency distribution of those nouns.
    """
    grammar = r"""
        NPROP: {<N+P>+|<N+P><IN.*|DT.*><N+P>+}
        # realized that using the pos tagger made propper nouns NNP while in brown they are NP, hence the N+P
    """

    noun_parser = nltk.RegexpParser(grammar)
    
    trees = [t for s in sents for t in noun_parser.parse(s).subtrees() if t.node == "NPROP"]
    fd = nltk.FreqDist([" ".join([w[0] for w in t]) for t in trees])
    return fd

In [23]:
debate_fd_np = get_propnoun_fd(debate_sents)
fd_view(debate_fd_np)


Word            |Count           |Frequency       
=========================================================
America         |847             |8.844%          
Congress        |420             |4.386%          
Bush            |332             |3.467%          
Iraq            |309             |3.226%          
John            |261             |2.725%          
Iran            |204             |2.130%          
Kennedy         |202             |2.109%          
Washington      |199             |2.078%          
Republican      |190             |1.984%          
Carter          |189             |1.973%          
Ford            |172             |1.796%          
Jim             |139             |1.451%          
Clinton         |137             |1.431%          
Israel          |129             |1.347%          
Nixon           |126             |1.316%          
China           |123             |1.284%          
Bob             |121             |1.263%          
George Bush     |111             |1.159%          
Gore            |109             |1.138%          
Bill Clinton    |99              |1.034%          
October         |97              |1.013%          
Texas           |93              |0.971%          
U.S.            |93              |0.971%          
Russia          |87              |0.908%          
Al              |81              |0.846%          

In [24]:
brown_fd_np = get_propnoun_fd(brown_sents)
fd_view(brown_fd_np)


Word            |Count           |Frequency       
=========================================================
Mr.             |51              |1.139%          
U.S.            |44              |0.983%          
Kennedy         |41              |0.916%          
Mantle          |41              |0.916%          
Dallas          |37              |0.827%          
Laos            |36              |0.804%          
Palmer          |36              |0.804%          
Maris           |33              |0.737%          
Congo           |30              |0.670%          
Washington      |29              |0.648%          
Player          |28              |0.626%          
March           |27              |0.603%          
Texas           |25              |0.559%          
May             |24              |0.536%          
Portland        |21              |0.469%          
Congress        |19              |0.424%          
April           |18              |0.402%          
Jr.             |17              |0.380%          
Republican      |17              |0.380%          
Hughes          |16              |0.357%          
Khrushchev      |15              |0.335%          
Moscow          |15              |0.335%          
Houston         |14              |0.313%          
Mr. Kennedy     |14              |0.313%          
San Francisco   |14              |0.313%          

In [25]:
myst_fd_np = get_propnoun_fd(myst_sents)
fd_view(myst_fd_np)


Word            |Count           |Frequency       
=========================================================
said.           |447             |5.761%          
Japan           |180             |2.320%          
U.S.            |176             |2.268%          
month.          |109             |1.405%          
April           |95              |1.224%          
Fed             |75              |0.967%          
March           |72              |0.928%          
February        |71              |0.915%          
January         |55              |0.709%          
Bank            |51              |0.657%          
year.           |47              |0.606%          
May             |45              |0.580%          
added.          |43              |0.554%          
Brazil          |40              |0.516%          
Paris           |40              |0.516%          
Ecus            |39              |0.503%          
New York        |39              |0.503%          
Tokyo           |39              |0.503%          
West Germany    |39              |0.503%          
India           |36              |0.464%          
dlrs.           |34              |0.438%          
June            |33              |0.425%          
pct.            |33              |0.425%          
yen.            |33              |0.425%          
September       |30              |0.387%          

Analysis

Of these collections, this one is most useful for my corpus because pulling out the names of candidates and places they are talking about is very important. It does add some value to the other Corpora. For Brown it lists some important subjects. For the mystery text, it makes me think it is about Japan - U.S. relations over a period of time.

Generating output for mystery.txt

I am using the concept traverser to generate output for this. The file will contain the first lemma for each synset along with its frequency.


In [26]:
out_file = open("mystery.txt", "w")

In [27]:
# realizing it would have been smart to make my print functions return strings instead of just printing

def concept_csv(concepts, n=DISPLAY_LIM):
    "Creates a c separated string for n items in concept list"
    
    out = []
    out.append("{},{}".format("Concept", "Concept_Freq"))
    for s in concepts[0:n]:
        out.append("{},{:.3}".format(s[0].lemma_names[0], s[1]))
    
    return "\n".join(out)

In [28]:
out_file.write(concept_csv(myst_concepts, 100))
out_file.close()