Part of Speech Tagging

Part of speech tagging task aims to assign every word/token in plain text a category that identifies the syntactic functionality of the word occurrence.

Polyglot recognizes 17 parts of speech, this set is called the universal part of speech tag set:

ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

Languages Coverage

The models were trained on a combination of:

Original CONLL datasets after the tags were converted using the universal POS tables.
Universal Dependencies 1.0 corpora whenever they are available.



In [1]:

    
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))









    



  1. German                     2. Italian                    3. Danish                   
  4. Czech                      5. Slovene                    6. French                   
  7. English                    8. Swedish                    9. Bulgarian                
 10. Spanish; Castilian        11. Indonesian                12. Portuguese               
 13. Finnish                   14. Irish                     15. Hungarian                
 16. Dutch

Download Necessary Models



In [2]:

    
%%bash
polyglot download embeddings2.en pos2.en









    



[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package pos2.en is already up-to-date!

Example

We tag each word in the text with one part of speech.



In [3]:

    
from polyglot.text import Text



In [4]:

    
blob = """We will meet at eight o'clock on Thursday morning."""
text = Text(blob)

# We can also specify language of that text by using
# text = Text(blob, hint_language_code='en')

We can query all the tagged words



In [5]:

    
text.pos_tags









    Out[5]:





[(u'We', u'PRON'),
 (u'will', u'AUX'),
 (u'meet', u'VERB'),
 (u'at', u'ADP'),
 (u'eight', u'NUM'),
 (u"o'clock", u'NOUN'),
 (u'on', u'ADP'),
 (u'Thursday', u'PROPN'),
 (u'morning', u'NOUN'),
 (u'.', u'PUNCT')]

After calling the pos_tags property once, the words objects will carry the POS tags.



In [6]:

    
text.words[0].pos_tag









    Out[6]:





u'PRON'

Command Line Interface



In [7]:

    
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en pos | tail -n 30









    



which           DET  
India           PROPN
beat            VERB 
Bermuda         PROPN
in              ADP  
Port            PROPN
of              ADP  
Spain           PROPN
in              ADP  
2007            NUM  
,               PUNCT
which           DET  
was             AUX  
equalled        VERB 
five            NUM  
days            NOUN 
ago             ADV  
by              ADP  
South           PROPN
Africa          PROPN
in              ADP  
their           PRON 
victory         NOUN 
over            ADP  
West            PROPN
Indies          PROPN
in              ADP  
Sydney          PROPN
.               PUNCT

Citation

This work is a direct implementation of the research being described in the Polyglot: Distributed Word Representations for Multilingual NLP paper. The author of this library strongly encourage you to cite the following paper if you are using this software.