Named Entity Recognition

Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/

This is a tutorial on named entity recognition (NER). In this tutorial you will see

  • how to apply a pre-trained named entity recognition model to your text

No particular prior knowledge is required. You should be able to read texts, though ;-)

Prerequisites. We first need to install the Stanford NER tagger from here. Java has to be installed as well. You have to figure out

  • where the jar file stanford-ner.jar is located
  • where the pretrained models (e.g. english.all.3class.distsim.crf.ser.gz) are located; they are in the subdirectory classifiers
  • whether the right version of Java is installed. On a command line, type java -version to see the version. Refer to the documentation on the Stanford NLP page to see which version is needed.

You can also test the NER tagger online here.
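
The checks listed above can also be done from Python before creating the tagger. The following is only a minimal sketch: the function name check_ner_setup and the example paths are hypothetical, and it only tests whether a java executable is on the PATH, not whether its version is the right one.

```python
import shutil
from pathlib import Path


def check_ner_setup(jar_path, model_path):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not Path(jar_path).is_file():
        problems.append(f"jar file not found: {jar_path}")
    if not Path(model_path).is_file():
        problems.append(f"model file not found: {model_path}")
    # Only checks that some 'java' executable is reachable, not its version
    if shutil.which("java") is None:
        problems.append("no 'java' executable found on PATH")
    return problems


# Hypothetical locations; adapt them to your installation
for problem in check_ner_setup(
    "stanford-ner.jar",
    "classifiers/english.all.3class.distsim.crf.ser.gz",
):
    print("Problem:", problem)
```

If nothing is printed, the files were found and Java is reachable. The version still has to be compared by hand against the Stanford NLP documentation.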


In [7]:
from nltk.tag import StanfordNERTagger
# word_tokenize relies on NLTK's Punkt tokenizer data (installable via nltk.download)
from nltk.tokenize import word_tokenize

# Adapt these paths to your installation
jar_location = '/Users/sech/stanford-ner-2018-10-16/stanford-ner.jar'
model_location_3classes = '/Users/sech/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz'
model_location_7classes = '/Users/sech/stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz'

# One tagger per model: the model file comes first, then the jar file
st3 = StanfordNERTagger(model_location_3classes, jar_location, encoding='utf-8')
st7 = StanfordNERTagger(model_location_7classes, jar_location, encoding='utf-8')

print(st3)
print(st7)


<nltk.tag.stanford.StanfordNERTagger object at 0x1a1d98af60>
<nltk.tag.stanford.StanfordNERTagger object at 0x1a1d98af28>

Let's take a paragraph from the Wikipedia page of Ada Lovelace as an example. We need to put the text in triple quotes since the text itself contains both single and double quote characters.


In [10]:
text = '''Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens. She was presented at Court at the age of seventeen "and became a popular belle of the season" in part because of her "brilliant mind." By 1834 Ada was a regular at Court and started attending various events. She danced often and was able to charm many people, and was described by most people as being dainty, although John Hobhouse, Byron's friend, described her as "a large, coarse-skinned young woman but with something of my friend's features, particularly the mouth". This description followed their meeting on 24 February 1834 in which Ada made it clear to Hobhouse that she did not like him, probably because of the influence of her mother, which led her to dislike all of her father's friends. This first impression was not to last, and they later became friends.'''
print(text)


Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens. She was presented at Court at the age of seventeen "and became a popular belle of the season" in part because of her "brilliant mind." By 1834 Ada was a regular at Court and started attending various events. She danced often and was able to charm many people, and was described by most people as being dainty, although John Hobhouse, Byron's friend, described her as "a large, coarse-skinned young woman but with something of my friend's features, particularly the mouth". This description followed their meeting on 24 February 1834 in which Ada made it clear to Hobhouse that she did not like him, probably because of the influence of her mother, which led her to dislike all of her father's friends. This first impression was not to last, and they later became friends.

First we need to tokenize the text, and then we apply the NER tagger. Let's try both the 3-class and the 7-class version.


In [11]:
tokenized_text = word_tokenize(text)
text_ner3 = st3.tag(tokenized_text)
text_ner7 = st7.tag(tokenized_text)

print(text_ner3)
print(text_ner7)


[('Lovelace', 'PERSON'), ('became', 'O'), ('close', 'O'), ('friends', 'O'), ('with', 'O'), ('her', 'O'), ('tutor', 'O'), ('Mary', 'PERSON'), ('Somerville', 'PERSON'), (',', 'O'), ('who', 'O'), ('introduced', 'O'), ('her', 'O'), ('to', 'O'), ('Charles', 'PERSON'), ('Babbage', 'PERSON'), ('in', 'O'), ('1833', 'O'), ('.', 'O'), ('She', 'O'), ('had', 'O'), ('a', 'O'), ('strong', 'O'), ('respect', 'O'), ('and', 'O'), ('affection', 'O'), ('for', 'O'), ('Somerville', 'LOCATION'), (',', 'O'), ('and', 'O'), ('they', 'O'), ('corresponded', 'O'), ('for', 'O'), ('many', 'O'), ('years', 'O'), ('.', 'O'), ('Other', 'O'), ('acquaintances', 'O'), ('included', 'O'), ('the', 'O'), ('scientists', 'O'), ('Andrew', 'PERSON'), ('Crosse', 'PERSON'), (',', 'O'), ('Sir', 'O'), ('David', 'PERSON'), ('Brewster', 'PERSON'), (',', 'O'), ('Charles', 'PERSON'), ('Wheatstone', 'PERSON'), (',', 'O'), ('Michael', 'PERSON'), ('Faraday', 'PERSON'), ('and', 'O'), ('the', 'O'), ('author', 'O'), ('Charles', 'PERSON'), ('Dickens', 'PERSON'), ('.', 'O'), ('She', 'O'), ('was', 'O'), ('presented', 'O'), ('at', 'O'), ('Court', 'O'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('seventeen', 'O'), ('``', 'O'), ('and', 'O'), ('became', 'O'), ('a', 'O'), ('popular', 'O'), ('belle', 'O'), ('of', 'O'), ('the', 'O'), ('season', 'O'), ("''", 'O'), ('in', 'O'), ('part', 'O'), ('because', 'O'), ('of', 'O'), ('her', 'O'), ('``', 'O'), ('brilliant', 'O'), ('mind', 'O'), ('.', 'O'), ("''", 'O'), ('By', 'O'), ('1834', 'O'), ('Ada', 'PERSON'), ('was', 'O'), ('a', 'O'), ('regular', 'O'), ('at', 'O'), ('Court', 'O'), ('and', 'O'), ('started', 'O'), ('attending', 'O'), ('various', 'O'), ('events', 'O'), ('.', 'O'), ('She', 'O'), ('danced', 'O'), ('often', 'O'), ('and', 'O'), ('was', 'O'), ('able', 'O'), ('to', 'O'), ('charm', 'O'), ('many', 'O'), ('people', 'O'), (',', 'O'), ('and', 'O'), ('was', 'O'), ('described', 'O'), ('by', 'O'), ('most', 'O'), ('people', 'O'), ('as', 'O'), ('being', 'O'), ('dainty', 'O'), 
(',', 'O'), ('although', 'O'), ('John', 'PERSON'), ('Hobhouse', 'PERSON'), (',', 'O'), ('Byron', 'PERSON'), ("'s", 'O'), ('friend', 'O'), (',', 'O'), ('described', 'O'), ('her', 'O'), ('as', 'O'), ('``', 'O'), ('a', 'O'), ('large', 'O'), (',', 'O'), ('coarse-skinned', 'O'), ('young', 'O'), ('woman', 'O'), ('but', 'O'), ('with', 'O'), ('something', 'O'), ('of', 'O'), ('my', 'O'), ('friend', 'O'), ("'s", 'O'), ('features', 'O'), (',', 'O'), ('particularly', 'O'), ('the', 'O'), ('mouth', 'O'), ("''", 'O'), ('.', 'O'), ('This', 'O'), ('description', 'O'), ('followed', 'O'), ('their', 'O'), ('meeting', 'O'), ('on', 'O'), ('24', 'O'), ('February', 'O'), ('1834', 'O'), ('in', 'O'), ('which', 'O'), ('Ada', 'PERSON'), ('made', 'O'), ('it', 'O'), ('clear', 'O'), ('to', 'O'), ('Hobhouse', 'PERSON'), ('that', 'O'), ('she', 'O'), ('did', 'O'), ('not', 'O'), ('like', 'O'), ('him', 'O'), (',', 'O'), ('probably', 'O'), ('because', 'O'), ('of', 'O'), ('the', 'O'), ('influence', 'O'), ('of', 'O'), ('her', 'O'), ('mother', 'O'), (',', 'O'), ('which', 'O'), ('led', 'O'), ('her', 'O'), ('to', 'O'), ('dislike', 'O'), ('all', 'O'), ('of', 'O'), ('her', 'O'), ('father', 'O'), ("'s", 'O'), ('friends', 'O'), ('.', 'O'), ('This', 'O'), ('first', 'O'), ('impression', 'O'), ('was', 'O'), ('not', 'O'), ('to', 'O'), ('last', 'O'), (',', 'O'), ('and', 'O'), ('they', 'O'), ('later', 'O'), ('became', 'O'), ('friends', 'O'), ('.', 'O')]
[('Lovelace', 'O'), ('became', 'O'), ('close', 'O'), ('friends', 'O'), ('with', 'O'), ('her', 'O'), ('tutor', 'O'), ('Mary', 'PERSON'), ('Somerville', 'PERSON'), (',', 'O'), ('who', 'O'), ('introduced', 'O'), ('her', 'O'), ('to', 'O'), ('Charles', 'PERSON'), ('Babbage', 'PERSON'), ('in', 'O'), ('1833', 'DATE'), ('.', 'O'), ('She', 'O'), ('had', 'O'), ('a', 'O'), ('strong', 'O'), ('respect', 'O'), ('and', 'O'), ('affection', 'O'), ('for', 'O'), ('Somerville', 'LOCATION'), (',', 'O'), ('and', 'O'), ('they', 'O'), ('corresponded', 'O'), ('for', 'O'), ('many', 'O'), ('years', 'O'), ('.', 'O'), ('Other', 'O'), ('acquaintances', 'O'), ('included', 'O'), ('the', 'O'), ('scientists', 'O'), ('Andrew', 'PERSON'), ('Crosse', 'PERSON'), (',', 'O'), ('Sir', 'O'), ('David', 'PERSON'), ('Brewster', 'PERSON'), (',', 'O'), ('Charles', 'PERSON'), ('Wheatstone', 'PERSON'), (',', 'O'), ('Michael', 'PERSON'), ('Faraday', 'PERSON'), ('and', 'O'), ('the', 'O'), ('author', 'O'), ('Charles', 'PERSON'), ('Dickens', 'PERSON'), ('.', 'O'), ('She', 'O'), ('was', 'O'), ('presented', 'O'), ('at', 'O'), ('Court', 'O'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('seventeen', 'O'), ('``', 'O'), ('and', 'O'), ('became', 'O'), ('a', 'O'), ('popular', 'O'), ('belle', 'O'), ('of', 'O'), ('the', 'O'), ('season', 'O'), ("''", 'O'), ('in', 'O'), ('part', 'O'), ('because', 'O'), ('of', 'O'), ('her', 'O'), ('``', 'O'), ('brilliant', 'O'), ('mind', 'O'), ('.', 'O'), ("''", 'O'), ('By', 'O'), ('1834', 'DATE'), ('Ada', 'ORGANIZATION'), ('was', 'O'), ('a', 'O'), ('regular', 'O'), ('at', 'O'), ('Court', 'O'), ('and', 'O'), ('started', 'O'), ('attending', 'O'), ('various', 'O'), ('events', 'O'), ('.', 'O'), ('She', 'O'), ('danced', 'O'), ('often', 'O'), ('and', 'O'), ('was', 'O'), ('able', 'O'), ('to', 'O'), ('charm', 'O'), ('many', 'O'), ('people', 'O'), (',', 'O'), ('and', 'O'), ('was', 'O'), ('described', 'O'), ('by', 'O'), ('most', 'O'), ('people', 'O'), ('as', 'O'), ('being', 'O'), ('dainty', 
'O'), (',', 'O'), ('although', 'O'), ('John', 'PERSON'), ('Hobhouse', 'PERSON'), (',', 'O'), ('Byron', 'PERSON'), ("'s", 'O'), ('friend', 'O'), (',', 'O'), ('described', 'O'), ('her', 'O'), ('as', 'O'), ('``', 'O'), ('a', 'O'), ('large', 'O'), (',', 'O'), ('coarse-skinned', 'O'), ('young', 'O'), ('woman', 'O'), ('but', 'O'), ('with', 'O'), ('something', 'O'), ('of', 'O'), ('my', 'O'), ('friend', 'O'), ("'s", 'O'), ('features', 'O'), (',', 'O'), ('particularly', 'O'), ('the', 'O'), ('mouth', 'O'), ("''", 'O'), ('.', 'O'), ('This', 'O'), ('description', 'O'), ('followed', 'O'), ('their', 'O'), ('meeting', 'O'), ('on', 'O'), ('24', 'O'), ('February', 'DATE'), ('1834', 'DATE'), ('in', 'O'), ('which', 'O'), ('Ada', 'ORGANIZATION'), ('made', 'O'), ('it', 'O'), ('clear', 'O'), ('to', 'O'), ('Hobhouse', 'PERSON'), ('that', 'O'), ('she', 'O'), ('did', 'O'), ('not', 'O'), ('like', 'O'), ('him', 'O'), (',', 'O'), ('probably', 'O'), ('because', 'O'), ('of', 'O'), ('the', 'O'), ('influence', 'O'), ('of', 'O'), ('her', 'O'), ('mother', 'O'), (',', 'O'), ('which', 'O'), ('led', 'O'), ('her', 'O'), ('to', 'O'), ('dislike', 'O'), ('all', 'O'), ('of', 'O'), ('her', 'O'), ('father', 'O'), ("'s", 'O'), ('friends', 'O'), ('.', 'O'), ('This', 'O'), ('first', 'O'), ('impression', 'O'), ('was', 'O'), ('not', 'O'), ('to', 'O'), ('last', 'O'), (',', 'O'), ('and', 'O'), ('they', 'O'), ('later', 'O'), ('became', 'O'), ('friends', 'O'), ('.', 'O')]

We see that each token is tagged, including punctuation. Tags are, for instance, ORGANIZATION or PERSON. The O tag appears by far the most often. This is the "other" class (everything that is not an organization, person, etc.). But it is still an awful lot of text. Let's just have a look at the non-other entities detected. We do this assuming that adjacent words having the same tag should be collapsed into one named entity.


In [13]:
from itertools import groupby

# Collapse runs of adjacent tokens with the same tag and print every run that is not "O"
for title, tagged in [("3 classes", text_ner3), ("7 classes", text_ner7)]:
    print("**** %s ****" % title)
    for tag, chunk in groupby(tagged, lambda x: x[1]):
        if tag != "O":
            print("%-12s" % tag, " ".join(w for w, t in chunk))


**** 3 classes ****
PERSON       Lovelace
PERSON       Mary Somerville
PERSON       Charles Babbage
LOCATION     Somerville
PERSON       Andrew Crosse
PERSON       David Brewster
PERSON       Charles Wheatstone
PERSON       Michael Faraday
PERSON       Charles Dickens
PERSON       Ada
PERSON       John Hobhouse
PERSON       Byron
PERSON       Ada
PERSON       Hobhouse
**** 7 classes ****
PERSON       Mary Somerville
PERSON       Charles Babbage
DATE         1833
LOCATION     Somerville
PERSON       Andrew Crosse
PERSON       David Brewster
PERSON       Charles Wheatstone
PERSON       Michael Faraday
PERSON       Charles Dickens
DATE         1834
ORGANIZATION Ada
PERSON       John Hobhouse
PERSON       Byron
DATE         February 1834
ORGANIZATION Ada
PERSON       Hobhouse

We see that while this is pretty impressive, it still makes errors. For example, the 7-class model tags both occurrences of Ada as ORGANIZATION, and both models tag the second occurrence of Somerville as LOCATION. You should take this imperfection into account if you use those tags further in your NLP pipeline.
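
One possible way to take the errors into account is to compare the two models and flag the tokens on which they disagree. The sketch below is an illustration, not part of the Stanford tagger or NLTK: the helper names extract_entities and disagreements are hypothetical, and the sample lists are a short excerpt copied from the outputs above so that the code runs without Java.

```python
from itertools import groupby


def extract_entities(tagged_tokens):
    """Collapse adjacent tokens sharing a non-O tag into (tag, text) pairs."""
    return [
        (tag, " ".join(word for word, _ in chunk))
        for tag, chunk in groupby(tagged_tokens, key=lambda pair: pair[1])
        if tag != "O"
    ]


def disagreements(tagged_a, tagged_b):
    """Return (token, tag_a, tag_b) wherever two taggings of the same tokens differ."""
    return [
        (word, tag_a, tag_b)
        for (word, tag_a), (_, tag_b) in zip(tagged_a, tagged_b)
        if tag_a != tag_b
    ]


# Short excerpt copied from the 3-class and 7-class outputs above
sample_ner3 = [('By', 'O'), ('1834', 'O'), ('Ada', 'PERSON'), ('was', 'O')]
sample_ner7 = [('By', 'O'), ('1834', 'DATE'), ('Ada', 'ORGANIZATION'), ('was', 'O')]

print(extract_entities(sample_ner7))
# [('DATE', '1834'), ('ORGANIZATION', 'Ada')]
print(disagreements(sample_ner3, sample_ner7))
# [('1834', 'O', 'DATE'), ('Ada', 'PERSON', 'ORGANIZATION')]
```

A disagreement only marks a token as worth a second look. Some disagreements are expected because the 3-class model has no DATE tag, and agreement is no guarantee of correctness either: both models tag the second Somerville as LOCATION.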

That's all.

