Text mining

In this task we will use the nltk package to recognize and classify named entities in a given text (in this case the Wikipedia article about the American Revolution).

The nltk.ne_chunk function can be used for both recognition and classification of named entities. We will also implement a custom NER function to recognize entities, and a custom function to classify named entities using their Wikipedia articles.
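
As a quick illustration of what nltk.ne_chunk returns, here is a minimal sketch on an arbitrary example sentence (it assumes the usual nltk data packages, i.e. punkt, averaged_perceptron_tagger, maxent_ne_chunker and words, have been downloaded):

import nltk

sentence = "George Washington led the Continental Army."
tokens = nltk.word_tokenize(sentence)           # ['George', 'Washington', 'led', ...]
tagged = nltk.pos_tag(tokens)                   # [('George', 'NNP'), ('Washington', 'NNP'), ...]
chunked = nltk.ne_chunk(tagged, binary=False)   # nltk.tree.Tree; recognized entities become subtrees
print(chunked)                                  # subtrees carry labels such as PERSON, GPE or ORGANIZATION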


In [2]:
import nltk
import numpy as np
import wikipedia
import re

Suppress wikipedia package warnings.


In [3]:
import warnings
warnings.filterwarnings('ignore')

Helper functions to process the output of nltk.ne_chunk and to count the frequency of named entities in the text.


In [4]:
# counts occurrences of an entity string in the text;
# ne_chunk entities are (text, label) tuples, custom NER entities are plain strings
def count_entities(entity, text):
    s = entity
    
    if isinstance(entity, tuple):
        s = entity[0]
    
    # escape the entity so regex metacharacters in it are matched literally
    return len(re.findall(re.escape(s), text))

# returns the n most frequent entities in the text as (entity, count) pairs
def get_top_n(entities, text, n):
    a = [(e, count_entities(e, text)) for e in entities]
    a.sort(key=lambda x: x[1], reverse=True)
    return a[:n]

# For an element of the output of nltk.ne_chunk:
# returns (entity, label) if it is a single word, or
# concatenates a multi-word named entity into a single string
def get_entity(entity):
    if isinstance(entity, tuple) and entity[1][:2] == 'NE':
        return entity
    if isinstance(entity, nltk.tree.Tree):
        text = ' '.join([word for word, tag in entity.leaves()])
        return (text, entity.label())
    return None

Since nltk.ne_chunk tends to assign the same named entity to multiple classes (e.g. 'American' : 'ORGANIZATION' and 'American' : 'GPE'), we want to filter out these duplicates.


In [5]:
# returns list of named entities in a form [(entity_text, entity_label), ...]
def extract_entities(chunk):
    data = []

    for entity in chunk:
        d = get_entity(entity)
        if d is not None and d[0] not in [e[0] for e in data]:
            data.append(d)

    return data

Our custom NER function, adapted from an external example.


In [13]:
def custom_NER(tagged):
    entities = []
    
    entity = []
    for word in tagged:
        # collect consecutive nouns (NN*); allow prepositions (IN) inside an already started entity
        if word[1][:2] == 'NN' or (entity and word[1][:2] == 'IN'):
            entity.append(word)
        else:
            # drop a trailing preposition, e.g. 'Boston in' -> 'Boston'
            if entity and entity[-1][1].startswith('IN'):
                entity.pop()
            # keep the collected phrase if it is new, capitalized and longer than one character
            if entity:
                s = ' '.join(e[0] for e in entity)
                if s not in entities and s[0].isupper() and len(s) > 1:
                    entities.append(s)
            entity = []
    return entities

Load the processed article (approximately 500 sentences). The regex substitution removes reference links (e.g. [12]).


In [14]:
with open('text', 'r') as f:
    text = f.read()

# strip reference links such as [12]
text = re.sub(r'\[[0-9]*\]', '', text)

Now we try to recognize entities with both nltk.ne_chunk and our custom_NER function and print the 20 most frequent entities.

The results are fairly similar; the nltk.ne_chunk function also adds basic classification tags.


In [15]:
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)

ne_chunked = nltk.ne_chunk(tagged, binary=False)
ex = extract_entities(ne_chunked)
ex_custom = custom_NER(tagged)

top_ex = get_top_n(ex, text, 20)
top_ex_custom = get_top_n(ex_custom, text, 20)
print('ne_chunked:')
for e in top_ex:
    print('{} count: {}'.format(e[0], e[1]))
print()
print('custom NER:')
for e in top_ex_custom:
    print('{} count: {}'.format(e[0], e[1]))


ne_chunked:
('British', 'GPE') count: 154
('America', 'GPE') count: 145
('American', 'GPE') count: 130
('New', 'ORGANIZATION') count: 51
('Loyalist', 'GPE') count: 46
('Americans', 'GPE') count: 44
('Britain', 'GPE') count: 40
('Patriot', 'GPE') count: 38
('Revolution', 'ORGANIZATION') count: 38
('Loyalists', 'ORGANIZATION') count: 37
('Congress', 'ORGANIZATION') count: 35
('Patriots', 'GPE') count: 29
('Boston', 'GPE') count: 29
('New York', 'GPE') count: 29
('American Revolution', 'ORGANIZATION') count: 24
('Parliament', 'ORGANIZATION') count: 24
('United States', 'GPE') count: 20
('French', 'GPE') count: 19
('Washington', 'GPE') count: 19
('Continental', 'ORGANIZATION') count: 18

custom NER:
British count: 154
America count: 145
Loyalist count: 46
Americans count: 44
Britain count: 40
Revolution count: 38
Patriot count: 38
Loyalists count: 37
Congress count: 35
Patriots count: 29
Boston count: 29
New York count: 29
Parliament count: 24
American Revolution count: 24
United States count: 20
Washington count: 19
French count: 19
War count: 18
Act count: 18
King count: 17

Next we want to do our own classification, using the Wikipedia article for each named entity. The idea is to find an article matching the entity string (for example 'America') and then build a noun phrase from the first sentence of its summary. When no suitable article or description is found, the entity is classified as 'Thing'.


In [82]:
def get_noun_phrase(entity, sentence):
    # POS-tag the first sentence of the entity's Wikipedia summary
    t = nltk.pos_tag([word for word in nltk.word_tokenize(sentence)])
    phrase = []
    stage = 0
    for word in t:
        # stage 0: skip everything up to the first copular verb ('America is ...')
        if word[0] in ('is', 'was', 'were', 'are', 'refers') and stage == 0:
            stage = 1
            continue
        elif stage == 1:
            # stage 1: collect the words of the following noun phrase
            if word[1] in ('NN', 'JJ', 'VBD', 'CD', 'NNP', 'NNPS', 'RBS', 'IN', 'NNS'):
                phrase.append(word)
            elif word[1] in ('DT', ',', 'CC', 'TO', 'POS'):
                # skip determiners, conjunctions and similar glue words
                continue
            else:
                break
    
    # drop a trailing preposition
    if len(phrase) > 1 and phrase[-1][1] == 'IN':
        phrase.pop()
    
    phrase = ' '.join([word[0] for word in phrase])
    
    # fall back to the generic class 'Thing' when no phrase was found
    if phrase == '':
        phrase = 'Thing'
    
    return {entity : phrase}

def get_wiki_desc(entity, wiki='en'):
    wikipedia.set_lang(wiki)
    
    try:
        # first sentence of the Wikipedia summary for the entity
        fs = wikipedia.summary(entity, sentences=1)
    except wikipedia.DisambiguationError as e:
        # ambiguous title: pick the first suggested page
        fs = wikipedia.summary(e.options[0], sentences=1)
    except wikipedia.PageError:
        # no matching article at all
        return {entity : 'Thing'}
    
    return get_noun_phrase(entity, fs)

Obviously, this classification is much more specific than the tags used by nltk.ne_chunk. We can also see that both NER methods mistook common words for entities unrelated to the article (for example 'New').

Since the custom_NER function relies on uppercase letters to recognize entities, such false positives are commonly caused by the first words of sentences.

The lack of a description for the entity 'America' is caused by the simple way the get_noun_phrase function constructs descriptions. It looks for basic verbs like 'is', so more advanced language can throw it off. This could be fixed by searching the Simple English Wikipedia, or by using it as a fallback when no suitable phrase is found on the normal English Wikipedia (for example, compare the article about the Americas on the simple and normal wikis). A sketch of such a fallback follows.
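
A minimal sketch of this fallback, reusing get_wiki_desc as defined above (the wrapper name get_wiki_desc_with_fallback is only illustrative, not part of the original notebook):

def get_wiki_desc_with_fallback(entity):
    # try the normal English Wikipedia first
    desc = get_wiki_desc(entity, wiki='en')
    if desc[entity] == 'Thing':
        # no usable noun phrase found, retry on the Simple English Wikipedia
        desc = get_wiki_desc(entity, wiki='simple')
    return desc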

I also tried to search for a more general verb (present tense verb, tag 'VBZ'), but this yielded worse results. Another improvement could be simply expanding the verb list in get_noun_phrase with other suitable verbs, as sketched below.
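
For example, the hard-coded tuple of verbs in get_noun_phrase could be replaced with a larger set (the extra verbs here are only a guess at what might help):

# hypothetical extension of the verb list used in get_noun_phrase
COPULA_VERBS = ('is', 'was', 'were', 'are', 'refers', 'describes', 'denotes', 'means')

# ... and inside get_noun_phrase:
#     if word[0] in COPULA_VERBS and stage == 0: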

When no exact match for the (entity, article) pair is found, the wikipedia module raises a DisambiguationError, which (like a disambiguation page on Wikipedia) offers possible matching pages. When this happens, the first suggested page is picked. However, this does not have to be the best page for the given entity.


In [77]:
for entity in top_ex:
    print(get_wiki_desc(entity[0][0]))


{'British': 'sovereign country in western Europe'}
{'America': 'Thing'}
{'American': 'constitutional federal republic'}
{'New': 'South Korean single-place paraglider'}
{'Loyalist': 'individual allegiance toward established government political party sovereign'}
{'Americans': 'citizens of United States of America'}
{'Britain': 'sovereign country in western Europe'}
{'Patriot': 'conservative talk radio channel on Sirius Satellite Radio channel 125 XM Satellite Radio channel 125 [ 1 ]'}
{'Revolution': 'fundamental change in political power organizational structures'}
{'Loyalists': 'individual allegiance toward established government political party sovereign'}
{'Congress': 'formal meeting of representatives of different nations constituent states independent organizations'}
{'Patriots': '1994 American film'}
{'Boston': 'capital most populous city of Commonwealth of Massachusetts in United States'}
{'New York': 'state in northeastern United States'}
{'American Revolution': 'political upheaval'}
{'Parliament': 'legislative elected body of government'}
{'United States': 'constitutional federal republic'}
{'French': 'country with territory in western Europe several overseas regions territories'}
{'Washington': 'American politician soldier'}
{'Continental': 'one of several'}

In [73]:
for entity in top_ex_custom:
    print(get_wiki_desc(entity[0]))


{'British': 'sovereign country in western Europe'}
{'America': 'Thing'}
{'Loyalist': 'individual allegiance toward established government political party sovereign'}
{'Americans': 'citizens of United States of America'}
{'Britain': 'sovereign country in western Europe'}
{'Revolution': 'fundamental change in political power organizational structures'}
{'Patriot': 'conservative talk radio channel on Sirius Satellite Radio channel 125 XM Satellite Radio channel 125 [ 1 ]'}
{'Loyalists': 'individual allegiance toward established government political party sovereign'}
{'Congress': 'formal meeting of representatives of different nations constituent states independent organizations'}
{'Patriots': '1994 American film'}
{'Boston': 'capital most populous city of Commonwealth of Massachusetts in United States'}
{'New York': 'state in northeastern United States'}
{'Parliament': 'legislative elected body of government'}
{'American Revolution': 'political upheaval'}
{'United States': 'constitutional federal republic'}
{'Washington': 'American politician soldier'}
{'French': 'country with territory in western Europe several overseas regions territories'}
{'War': 'state of armed conflict between societies'}
{'Act': 'activity'}
{'King': 'electoral district in Australian state of New South Wales'}

When searching the simple wiki, the entity 'Americas' gets a fairly reasonable description. However, there seems to be an issue with handling DisambiguationError: in some cases, looking up the first page from DisambiguationError.options raises another DisambiguationError (even though pages from .options should be guaranteed hits). A possible workaround is sketched after the example below.


In [83]:
get_wiki_desc('Americas', wiki='simple')


Out[83]:
{'Americas': 'landmass'}
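
A possible workaround for the nested DisambiguationError issue (a sketch only, assuming the wikipedia module and get_noun_phrase defined above; the name get_wiki_desc_robust is hypothetical): iterate over the disambiguation options and skip any that raise another error.

def get_wiki_desc_robust(entity, wiki='en'):
    wikipedia.set_lang(wiki)
    
    try:
        fs = wikipedia.summary(entity, sentences=1)
    except wikipedia.DisambiguationError as e:
        fs = None
        for option in e.options:
            try:
                # try each suggested page until one resolves cleanly
                fs = wikipedia.summary(option, sentences=1)
                break
            except (wikipedia.DisambiguationError, wikipedia.PageError):
                continue
        if fs is None:
            return {entity : 'Thing'}
    except wikipedia.PageError:
        return {entity : 'Thing'}
    
    return get_noun_phrase(entity, fs)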

In [ ]: