Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/
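This is a tutorial for NER (named entity recognition). In this tutorial you will see how to tag a text with the Stanford NER tagger from within NLTK, and how to collapse the tagged words into named entities.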
It is assumed that you have some general knowledge on
Prerequisites. We first need to install the Stanford NER tagger from the Stanford NLP page. Java also has to be installed. You have to figure out where stanford-ner.jar is located and where the model (e.g. english.all.3class.distsim.crf.ser.gz) is located; the model is in the subdirectory classifiers. Run java -version to see the installed Java version. Refer back to the documentation on the Stanford NLP page to see which version is needed. You can also test the NER tagger online.
In [7]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
# Adapt those lines to your installation
jar_location = '/Users/sech/stanford-ner-2018-10-16/stanford-ner.jar'
model_location_3classes = '/Users/sech/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz'
model_location_7classes = '/Users/sech/stanford-ner-2018-10-16/classifiers/english.muc.7class.distsim.crf.ser.gz'
st3 = StanfordNERTagger(model_location_3classes,jar_location,encoding='utf-8')
st7 = StanfordNERTagger(model_location_7classes,jar_location,encoding='utf-8')
print(st3)
print(st7)
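Wrong paths are a common source of errors at this step. As a minimal sketch (the helper name `missing_prerequisites` is hypothetical and not part of NLTK), the following checks that a `java` executable is on the PATH and that the given files exist. The path in the example call is a placeholder.

```python
import os
import shutil

def missing_prerequisites(paths, executable="java"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # shutil.which returns None when the executable is not found on the PATH
    if shutil.which(executable) is None:
        problems.append("executable not found on PATH: " + executable)
    for path in paths:
        if not os.path.isfile(path):
            problems.append("file not found: " + path)
    return problems

# Placeholder path; pass jar_location and the model locations instead
print(missing_prerequisites(["/path/that/does/not/exist/stanford-ner.jar"]))
```

An empty list means that Java and all files were found. It does not guarantee that the Java version is recent enough.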
Let's take a paragraph from the Wikipedia page of Ada Lovelace as an example. We need to put the text in triple quotes since the text itself contains quoting characters.
In [10]:
text = '''Lovelace became close friends with her tutor Mary Somerville, who introduced her to Charles Babbage in 1833. She had a strong respect and affection for Somerville, and they corresponded for many years. Other acquaintances included the scientists Andrew Crosse, Sir David Brewster, Charles Wheatstone, Michael Faraday and the author Charles Dickens. She was presented at Court at the age of seventeen "and became a popular belle of the season" in part because of her "brilliant mind." By 1834 Ada was a regular at Court and started attending various events. She danced often and was able to charm many people, and was described by most people as being dainty, although John Hobhouse, Byron's friend, described her as "a large, coarse-skinned young woman but with something of my friend's features, particularly the mouth". This description followed their meeting on 24 February 1834 in which Ada made it clear to Hobhouse that she did not like him, probably because of the influence of her mother, which led her to dislike all of her father's friends. This first impression was not to last, and they later became friends.'''
print(text)
First we need to tokenize the text, and then we apply the NER tagger. Let's try both the 3-class and the 7-class version.
In [11]:
tokenized_text = word_tokenize(text)
text_ner3 = st3.tag(tokenized_text)
text_ner7 = st7.tag(tokenized_text)
print(text_ner3)
print(text_ner7)
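We see that each word is tagged. Tags are, for instance, ORGANIZATION or PERSON. The 3-class model distinguishes PERSON, ORGANIZATION and LOCATION; the 7-class model adds MONEY, PERCENT, DATE and TIME. The O tag appears very often. This is the "other" class (everything that is not an organization, person, etc.).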
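The tagger output is a list of (word, tag) tuples. To get a feel for how dominant the O tag is, the tags can be counted. A small sketch, using a hand-written sample in the same format rather than real tagger output:

```python
from collections import Counter

# Hand-written sample in the (word, tag) format; not actual tagger output
sample = [("Mary", "PERSON"), ("Somerville", "PERSON"), ("introduced", "O"),
          ("her", "O"), ("to", "O"), ("Charles", "PERSON"), ("Babbage", "PERSON"),
          ("in", "O"), ("1833", "O"), (".", "O")]

# Count how often each tag occurs, most frequent first
tag_counts = Counter(tag for word, tag in sample)
print(tag_counts.most_common())  # [('O', 6), ('PERSON', 4)]
```

The same two lines should work on text_ner3 or text_ner7 in place of the sample.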
But it is still an awful lot of text. Let's just have a look at the non-other entities detected. We do this assuming that adjacent words with the same tag should be collapsed into one named entity.
In [13]:
from itertools import groupby
print("**** 3 classes ****")
# Group runs of adjacent tokens that share the same tag
for tag, chunk in groupby(text_ner3, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))
print("**** 7 classes ****")
for tag, chunk in groupby(text_ner7, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))
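The collapsing assumption has a limitation worth keeping in mind: two different entities of the same type that directly follow each other are merged into one. A minimal sketch on hand-written tags (the function name `collapse_entities` is hypothetical, and the sample is not actual tagger output):

```python
from itertools import groupby

def collapse_entities(tagged, other_tag="O"):
    """Merge runs of adjacent tokens with the same tag into (tag, text) pairs."""
    entities = []
    for tag, chunk in groupby(tagged, key=lambda pair: pair[1]):
        if tag != other_tag:
            entities.append((tag, " ".join(word for word, _ in chunk)))
    return entities

# Hand-written sample: no O token separates "Michael Faraday" from "Ada Byron"
sample = [("Charles", "PERSON"), ("Wheatstone", "PERSON"), (",", "O"),
          ("Michael", "PERSON"), ("Faraday", "PERSON"),
          ("Ada", "PERSON"), ("Byron", "PERSON")]
print(collapse_entities(sample))
# [('PERSON', 'Charles Wheatstone'), ('PERSON', 'Michael Faraday Ada Byron')]
```

In running text, names are usually separated by commas or other O-tagged words, so the merge happens rarely, but it can occur.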
We see that while this is pretty impressive, the tagger still makes errors. For example, one occurrence of Ada is tagged as ORGANIZATION. You should take this imperfection into account if you use those tags further in your NLP pipeline.
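One simple way to find candidates for such errors is to compare the two models token by token and list where they disagree. This is a sketch under the assumption that both taggings come from the same token list, as is the case for text_ner3 and text_ner7. The function name `tag_disagreements` is hypothetical and the samples are hand-written, not actual tagger output.

```python
def tag_disagreements(tagged_a, tagged_b):
    """Return (index, word, tag_a, tag_b) for tokens where the two taggings differ."""
    if len(tagged_a) != len(tagged_b):
        raise ValueError("both taggings must cover the same tokens")
    return [(i, word, tag_a, tag_b)
            for i, ((word, tag_a), (_, tag_b)) in enumerate(zip(tagged_a, tagged_b))
            if tag_a != tag_b]

# Hand-written samples in the (word, tag) format
sample3 = [("By", "O"), ("1834", "O"), ("Ada", "PERSON"), ("was", "O")]
sample7 = [("By", "O"), ("1834", "DATE"), ("Ada", "ORGANIZATION"), ("was", "O")]
for row in tag_disagreements(sample3, sample7):
    print(row)
```

A disagreement is not necessarily an error. The 7-class model has tags such as DATE that the 3-class model lacks, so those tokens always differ. Disagreements between PERSON, ORGANIZATION and LOCATION are the ones worth a closer look.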
That's all.