Named Entity Tagger

The main task is to detect whether the text contains named entities, belonging to the following 7 classes:

  • location
  • organization
  • date
  • money
  • person
  • percent
  • time

Online demo: http://nlp.stanford.edu:8080/ner/process

In [2]:
import nltk 
# with open('sample.txt', 'r') as f:
#     sample = f.read()

sample = "in my own language.\
As a video uploader, this means you can reach\
to people all over the world,\
irrespective of language.\
[Hiroto, Bedhead]\
You can upload multiple tracks like English and French,\
and viewers can choose the track they like.\
[Toliver, Japanese Learner]\
For example, if you enjoy using YouTube in French,"

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)

def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label') and t.label():
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

entity_names = []
for tree in chunked_sentences:
    # Print results per sentence
    # print extract_entity_names(tree)

    entity_names.extend(extract_entity_names(tree))

# Print all entity names
#print entity_names

# Print unique entity names
print(set(entity_names))


set(['Bedhead', 'YouTube', 'French', 'English', 'Hiroto', 'Japanese Learner'])
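
With binary=True every entity is collapsed into a single generic NE label. To get per-class labels from NLTK's built-in chunker (PERSON, ORGANIZATION, GPE, ...), chunk with binary=False and collect the subtree labels instead. A minimal sketch, reusing tagged_sentences from the cell above; note these are NLTK's own classes, not the 7 MUC classes used by the Stanford model below:

chunked_typed = nltk.ne_chunk_sents(tagged_sentences, binary=False)

def extract_typed_entities(t):
    # Collect (label, entity text) pairs from an NLTK chunk tree.
    entities = []
    if hasattr(t, 'label'):
        if t.label() != 'S':   # an entity subtree such as (PERSON Hiroto)
            entities.append((t.label(), ' '.join(child[0] for child in t)))
        else:                  # the sentence root: recurse into its children
            for child in t:
                entities.extend(extract_typed_entities(child))
    return entities

typed_entities = []
for tree in chunked_typed:
    typed_entities.extend(extract_typed_entities(tree))
print(set(typed_entities))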

Stanford NER

Install Java 8, download Stanford NER, and set the required environment variables first.

NLTK only provides an interface to Stanford NER; the actual tagging runs in the Stanford Java process.
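
For reference, a minimal setup sketch; the directory below is simply the install location used in the next cell, so adjust it to wherever the distribution was unpacked. Recent NLTK versions let the Stanford wrappers find the jar and the model files through the CLASSPATH and STANFORD_MODELS environment variables, so the full paths would not have to be repeated in every call:

import os

# Assumed install location -- matches the path used in the cell below.
stanford_dir = '/vagrant/stanford-ner-2015-12-09'

# NLTK's Stanford wrappers can locate the jar and the .ser.gz models
# through these environment variables.
os.environ['CLASSPATH'] = os.path.join(stanford_dir, 'stanford-ner.jar')
os.environ['STANFORD_MODELS'] = os.path.join(stanford_dir, 'classifiers')

# If the Java 8 binary is not on the PATH, point NLTK at it explicitly, e.g.:
# os.environ['JAVAHOME'] = '/usr/lib/jvm/java-8-openjdk-amd64/bin/java'

With these variables set, StanfordNERTagger('english.muc.7class.distsim.crf.ser.gz') should be enough; the explicit absolute paths used in the next cell work either way.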


In [28]:
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger(
    '/vagrant/stanford-ner-2015-12-09/classifiers/english.muc.7class.distsim.crf.ser.gz',
    '/vagrant/stanford-ner-2015-12-09/stanford-ner.jar')
st.tag('Rami Eid is studying at Stony Brook University in NY'.split())


Out[28]:
[(u'Rami', u'O'),
 (u'Eid', u'O'),
 (u'is', u'O'),
 (u'studying', u'O'),
 (u'at', u'O'),
 (u'Stony', u'ORGANIZATION'),
 (u'Brook', u'ORGANIZATION'),
 (u'University', u'ORGANIZATION'),
 (u'in', u'O'),
 (u'NY', u'ORGANIZATION')]

In [27]:
sample = "in my own language. \
As a video uploader, this means you can reach\
to people all over the world,\
irrespective of language. \
[Hiroto, Bedhead]\
You can upload multiple tracks like English and French,\
and viewers can choose the track they like. \
[Toliver, Japanese Learner]\
For example, if you enjoy using YouTube in French, 1990, July"

from nltk.tokenize import word_tokenize
st.tag(word_tokenize(sample))


Out[27]:
[(u'in', u'O'),
 (u'my', u'O'),
 (u'own', u'O'),
 (u'language', u'O'),
 (u'.', u'O'),
 (u'As', u'O'),
 (u'a', u'O'),
 (u'video', u'O'),
 (u'uploader', u'O'),
 (u',', u'O'),
 (u'this', u'O'),
 (u'means', u'O'),
 (u'you', u'O'),
 (u'can', u'O'),
 (u'reachto', u'O'),
 (u'people', u'O'),
 (u'all', u'O'),
 (u'over', u'O'),
 (u'the', u'O'),
 (u'world', u'O'),
 (u',', u'O'),
 (u'irrespective', u'O'),
 (u'of', u'O'),
 (u'language', u'O'),
 (u'.', u'O'),
 (u'[', u'O'),
 (u'Hiroto', u'PERSON'),
 (u',', u'O'),
 (u'Bedhead', u'O'),
 (u']', u'O'),
 (u'You', u'O'),
 (u'can', u'O'),
 (u'upload', u'O'),
 (u'multiple', u'O'),
 (u'tracks', u'O'),
 (u'like', u'O'),
 (u'English', u'O'),
 (u'and', u'O'),
 (u'French', u'O'),
 (u',', u'O'),
 (u'and', u'O'),
 (u'viewers', u'O'),
 (u'can', u'O'),
 (u'choose', u'O'),
 (u'the', u'O'),
 (u'track', u'O'),
 (u'they', u'O'),
 (u'like', u'O'),
 (u'.', u'O'),
 (u'[', u'O'),
 (u'Toliver', u'PERSON'),
 (u',', u'O'),
 (u'Japanese', u'O'),
 (u'Learner', u'O'),
 (u']', u'O'),
 (u'For', u'O'),
 (u'example', u'O'),
 (u',', u'O'),
 (u'if', u'O'),
 (u'you', u'O'),
 (u'enjoy', u'O'),
 (u'using', u'O'),
 (u'YouTube', u'O'),
 (u'in', u'O'),
 (u'French', u'O'),
 (u',', u'O'),
 (u'1990', u'DATE'),
 (u',', u'O'),
 (u'July', u'DATE')]
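
The Stanford tagger returns one tag per token, so a multi-word entity such as Stony Brook University in Out[28] comes back as several consecutive ORGANIZATION tokens. A small post-processing sketch (group_entities is a hypothetical helper, not part of NLTK) that merges runs of identically tagged tokens into entity spans:

from itertools import groupby

def group_entities(tagged_tokens):
    # Merge consecutive tokens sharing the same non-'O' tag into (text, tag) spans.
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != 'O':
            entities.append((' '.join(token for token, _ in group), tag))
    return entities

group_entities(st.tag('Rami Eid is studying at Stony Brook University in NY'.split()))
# Expected, given the tags shown in Out[28]:
# [('Stony Brook University', 'ORGANIZATION'), ('NY', 'ORGANIZATION')]

Note that two different entities of the same class with no 'O' token between them would be merged into one span; separating those would need BIO-style tags, which this flat output does not carry.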