The named entity extraction task aims to extract phrases from plain text that correspond to entities. Polyglot recognizes 3 categories of entities:
Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions ...
Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, ...
Persons (Tag: I-PER): politicians, scientists, artists, athletes ...
The models were trained on datasets extracted automatically from Wikipedia. Polyglot currently supports 40 major languages.
In [2]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))
In [3]:
%%bash
polyglot download embeddings2.en ner2.en
Entities inside a text object or a sentence are represented as chunks. Each chunk identifies the start and the end indices of the word subsequence within the text.
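To make the chunk representation concrete, here is a minimal sketch in plain Python. The `Chunk` class below is a hypothetical stand-in, not Polyglot's own class, and the token list and span are hand-picked for illustration. It assumes the end index is exclusive, which matches ordinary Python slicing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for an entity chunk: a tag plus a word span."""
    tag: str
    start: int  # index of the first word of the entity
    end: int    # index one past the last word (assumed exclusive)

    def words(self, tokens: List[str]) -> List[str]:
        # Slice the token list to recover the words covered by the chunk.
        return tokens[self.start:self.end]


# Hand-written tokens and span, for illustration only (not model output).
tokens = ["The", "Israeli", "Prime", "Minister",
          "Benjamin", "Netanyahu", "has", "warned"]
person = Chunk(tag="I-PER", start=4, end=6)
print(person.tag, person.words(tokens))
```

Storing only indices keeps a chunk small and lets the same span be mapped back onto the original word list whenever the surrounding context is needed.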
In [4]:
from polyglot.text import Text
In [5]:
blob = """The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a "threat to the entire world"."""
text = Text(blob)
# We can also specify the language of the text explicitly:
# text = Text(blob, hint_language_code='en')
We can query all entities mentioned in a text.
In [6]:
text.entities
Or, we can query entities per sentence.
In [7]:
for sent in text.sentences:
    print(sent, "\n")
    for entity in sent.entities:
        print(entity.tag, entity)
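Once each entity is available as a tag together with its words, a common next step is to group the mentions by category. The sketch below uses plain Python and hand-labelled `(tag, words)` pairs as an assumption; they are not model output, and `group_by_tag` is a hypothetical helper rather than part of Polyglot. Pairs of this shape could presumably be built from `entity.tag` and the words of each entity in the loop above.

```python
from collections import defaultdict

# Hand-labelled (tag, words) pairs, for illustration only (not model output).
labelled = [
    ("I-PER", ["Benjamin", "Netanyahu"]),
    ("I-LOC", ["Iran"]),
]


def group_by_tag(entities):
    """Group entity mentions by tag, joining each mention's words into a string."""
    grouped = defaultdict(list)
    for tag, words in entities:
        grouped[tag].append(" ".join(words))
    return dict(grouped)


by_tag = group_by_tag(labelled)
print(by_tag)
```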
By inspecting the second entity, Benjamin Netanyahu, more closely, we can locate its position within the sentence.
In [8]:
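# `sent` still refers to the last sentence visited by the loop above.
benjamin = sent.entities[1]
# start and end are word indices, so slicing recovers the entity's words.
sent.words[benjamin.start:benjamin.end]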
In [11]:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20
This work is a direct implementation of the research described in the paper Polyglot-NER: Massive Multilingual Named Entity Recognition. The author of this library strongly encourages you to cite the following paper if you use this software.
@article{polyglotner,
author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},
title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},
journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},
month = {April},
year = {2015},
publisher = {SIAM}
}