Part-of-speech tagging assigns every word/token in plain text a category that identifies the syntactic function of that word occurrence.
Polyglot recognizes 17 parts of speech; this set is called the universal part-of-speech tag set:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
The models were trained on a combination of:
Original CoNLL datasets, with the tags converted using the universal POS tables.
Universal Dependencies 1.0 corpora, wherever available.
In [1]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))
In [2]:
%%bash
polyglot download embeddings2.en pos2.en
We tag each word in the text with one part of speech.
In [3]:
from polyglot.text import Text
In [4]:
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)
# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')
We can query all of the tagged words:
In [5]:
text.pos_tags
Out[5]:
After the pos_tags property has been called once, the word objects will carry the POS tags.
In [6]:
text.words[0].pos_tag
Out[6]:
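Since pos_tags yields plain (word, tag) pairs, they can be filtered with ordinary Python. The sketch below uses a hand-written list of pairs for the example sentence (illustrative tags in the universal tag set, not polyglot's actual output) and keeps only the content words:

```python
# Hypothetical (word, tag) pairs, shaped like the result of text.pos_tags.
# The tags here are illustrative, not polyglot's verified output.
tagged = [("We", "PRON"), ("will", "AUX"), ("meet", "VERB"),
          ("at", "ADP"), ("eight", "NUM"), ("o'clock", "NOUN"),
          ("on", "ADP"), ("Thursday", "PROPN"), ("morning", "NOUN"),
          (".", "PUNCT")]

# Keep only content words: nouns, proper nouns, and verbs.
content = [word for word, tag in tagged if tag in {"NOUN", "PROPN", "VERB"}]
print(content)  # ['meet', "o'clock", 'Thursday', 'morning']
```

The same comprehension works unchanged on a real `text.pos_tags` result, since each element unpacks into a word and its tag.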
In [7]:
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en pos | tail -n 30
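Output like the above can be post-processed easily. A minimal sketch, assuming each line of the CLI output holds a token and its tag separated by whitespace (the sample lines below are invented for illustration), that counts how often each tag occurs:

```python
from collections import Counter

# Hypothetical lines in the shape produced by the polyglot pos pipeline:
# one "token<TAB>tag" pair per line.
lines = ["The\tDET", "batsman\tNOUN", "scored\tVERB", "a\tDET", "century\tNOUN"]

# Count tag frequencies across the output.
tag_counts = Counter(line.split()[1] for line in lines if line.strip())
print(tag_counts.most_common())
```

Piping the real command's output into such a script gives a quick per-tag profile of a corpus.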
This work is a direct implementation of the research described in the paper Polyglot: Distributed Word Representations for Multilingual NLP. The authors of this library strongly encourage you to cite the following paper if you use this software.
@InProceedings{polyglot:2013:ACL-CoNLL,
author = {Al-Rfou, Rami and Perozzi, Bryan and Skiena, Steven},
title = {Polyglot: Distributed Word Representations for Multilingual NLP},
booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
month = {August},
year = {2013},
address = {Sofia, Bulgaria},
publisher = {Association for Computational Linguistics},
pages = {183--192},
url = {http://www.aclweb.org/anthology/W13-3520}
}