Part of Speech Tagging

Part of speech (POS) tagging assigns every word or token in plain text a category that identifies the syntactic function of that occurrence.

Polyglot recognizes 17 parts of speech. This set is called the universal part of speech tag set:

  • ADJ: adjective
  • ADP: adposition
  • ADV: adverb
  • AUX: auxiliary verb
  • CONJ: coordinating conjunction
  • DET: determiner
  • INTJ: interjection
  • NOUN: noun
  • NUM: numeral
  • PART: particle
  • PRON: pronoun
  • PROPN: proper noun
  • PUNCT: punctuation
  • SCONJ: subordinating conjunction
  • SYM: symbol
  • VERB: verb
  • X: other
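
The tag set above can be kept as a lookup table when post-processing tagger output. The following is a minimal sketch; the names UNIVERSAL_POS_TAGS, describe and unknown_tags are illustrative and are not part of polyglot.

```python
# The 17 universal POS tags listed above, mapped to their descriptions.
# This table is an illustration; polyglot does not export it under this name.
UNIVERSAL_POS_TAGS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary verb",
    "CONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe(tag):
    """Return the description of a tag, or raise ValueError if it is unknown."""
    try:
        return UNIVERSAL_POS_TAGS[tag]
    except KeyError:
        raise ValueError("not a universal POS tag: %r" % (tag,))


def unknown_tags(tagged_words):
    """Return the sorted tags in (word, tag) pairs that are outside the set."""
    return sorted({tag for _, tag in tagged_words if tag not in UNIVERSAL_POS_TAGS})


print(describe("ADP"))
print(unknown_tags([("We", "PRON"), ("meet", "VB")]))
```

A check like unknown_tags is useful when mixing output from taggers that use other tag sets, such as Penn Treebank tags.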

Language Coverage

The models were trained on a combination of:

  • Original CoNLL datasets, after their tags were converted using the universal POS tables.

  • Universal Dependencies 1.0 corpora whenever they are available.


In [1]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))


  1. German                     2. Italian                    3. Danish                   
  4. Czech                      5. Slovene                    6. French                   
  7. English                    8. Swedish                    9. Bulgarian                
 10. Spanish; Castilian        11. Indonesian                12. Portuguese               
 13. Finnish                   14. Irish                     15. Hungarian                
 16. Dutch                    

Download Necessary Models


In [2]:
%%bash
polyglot download embeddings2.en pos2.en


[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package pos2.en is already up-to-date!

Example

We tag each word in the text with one part of speech.


In [3]:
from polyglot.text import Text

In [4]:
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

# We can also specify the language of the text explicitly:
# text = Text(blob, hint_language_code='en')

We can query all the tagged words:


In [5]:
text.pos_tags


Out[5]:
[(u'We', u'PRON'),
 (u'will', u'AUX'),
 (u'meet', u'VERB'),
 (u'at', u'ADP'),
 (u'eight', u'NUM'),
 (u"o'clock", u'NOUN'),
 (u'on', u'ADP'),
 (u'Thursday', u'PROPN'),
 (u'morning', u'NOUN'),
 (u'.', u'PUNCT')]
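
The result is a plain list of (word, tag) pairs, so it can be summarized with standard-library tools. The sketch below copies the pairs from the output above instead of calling polyglot, so it runs without the models installed; tag_counts and words_by_tag are illustrative names.

```python
from collections import Counter, defaultdict

# Pairs copied from the output above; in a live session this would be
# list(text.pos_tags).
pos_tags = [
    ("We", "PRON"),
    ("will", "AUX"),
    ("meet", "VERB"),
    ("at", "ADP"),
    ("eight", "NUM"),
    ("o'clock", "NOUN"),
    ("on", "ADP"),
    ("Thursday", "PROPN"),
    ("morning", "NOUN"),
    (".", "PUNCT"),
]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in pos_tags)

# Words grouped by tag, in their original order.
words_by_tag = defaultdict(list)
for word, tag in pos_tags:
    words_by_tag[tag].append(word)

print(tag_counts["ADP"])
print(words_by_tag["NOUN"])
```

Grouping by tag is a common first step for tasks such as extracting all nouns or dropping punctuation before further processing.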

After the pos_tags property has been accessed once, the word objects carry their POS tags.


In [6]:
text.words[0].pos_tag


Out[6]:
u'PRON'

Command Line Interface


In [7]:
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en pos | tail -n 30


which           DET  
India           PROPN
beat            VERB 
Bermuda         PROPN
in              ADP  
Port            PROPN
of              ADP  
Spain           PROPN
in              ADP  
2007            NUM  
,               PUNCT
which           DET  
was             AUX  
equalled        VERB 
five            NUM  
days            NOUN 
ago             ADV  
by              ADP  
South           PROPN
Africa          PROPN
in              ADP  
their           PRON 
victory         NOUN 
over            ADP  
West            PROPN
Indies          PROPN
in              ADP  
Sydney          PROPN
.               PUNCT
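
The command prints one token per line, followed by its tag. The sketch below reads that output back into (token, tag) pairs. It assumes that every non-empty line has the two-column, whitespace-separated layout shown above; parse_pos_output is a hypothetical helper, not part of polyglot.

```python
def parse_pos_output(lines):
    """Turn 'token   TAG' lines into a list of (token, tag) pairs.

    Lines that do not split into exactly two fields (for example blank
    lines) are skipped. This is an assumption about the output format.
    """
    pairs = []
    for line in lines:
        # Split on the last run of whitespace: the tag is the final field.
        fields = line.strip().rsplit(None, 1)
        if len(fields) != 2:
            continue
        token, tag = fields
        pairs.append((token, tag))
    return pairs


# A few lines copied from the output above, including trailing padding.
sample = [
    "which           DET  ",
    "India           PROPN",
    "beat            VERB ",
    "",
    ".               PUNCT",
]

pairs = parse_pos_output(sample)
print(pairs)
```

In practice the lines could come from a file the command output was redirected to, or from sys.stdin when the script sits at the end of the pipeline.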

Citation

This work is a direct implementation of the research described in the paper Polyglot: Distributed Word Representations for Multilingual NLP. The author of this library strongly encourages you to cite the following paper if you are using this software.

   @InProceedings{polyglot:2013:ACL-CoNLL,
     author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},
     title     = {Polyglot: Distributed Word Representations for Multilingual NLP},
     booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},
     month     = {August},
     year      = {2013},
     address   = {Sofia, Bulgaria},
     publisher = {Association for Computational Linguistics},
     pages     = {183--192}, 
     url       = {http://www.aclweb.org/anthology/W13-3520}
   }