In this notebook, we learn more about part-of-speech (POS) tags.
Universal tagset: (thanks to http://www.tablesgenerator.com/markdown_tables)
Tag | Meaning | English Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADP | adposition | on, of, at, with, by, into, under |
ADV | adverb | really, already, still, early, now |
CONJ | conjunction | and, or, but, if, while, although |
DET | determiner, article | the, a, some, most, every, no, which |
NOUN | noun | year, home, costs, time, Africa |
NUM | numeral | twenty-four, fourth, 1991, 14:24 |
PRT | particle | at, on, out, over, per, that, up, with |
PRON | pronoun | he, their, her, its, my, I, us |
VERB | verb | is, say, told, given, playing, would |
. | punctuation marks | . , ; ! |
X | other | ersatz, esprit, dunno, gr8, univeristy |
We list the `upenn` (a.k.a. `treebank`) tagset below. In addition, NLTK also provides `nltk.help.brown_tagset()` and `nltk.help.claws5_tagset()`.
In [1]:
import nltk
In [2]:
nltk.help.upenn_tagset()
In [3]:
nltk.help.upenn_tagset('WP$')
In [4]:
nltk.help.upenn_tagset('PDT')
In [5]:
nltk.help.upenn_tagset('DT')
In [6]:
nltk.help.upenn_tagset('POS')
In [7]:
nltk.help.upenn_tagset('RBR')
In [8]:
nltk.help.upenn_tagset('RBS')
In [9]:
nltk.help.upenn_tagset('MD')
Alternatively, see this summary table (cf. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):
Tag | Meaning | Tag | Meaning | Tag | Meaning |
---|---|---|---|---|---|
CC | Coordinating conjunction | NNP | Proper noun, singular | VB | Verb, base form |
CD | Cardinal number | NNPS | Proper noun, plural | VBD | Verb, past tense |
DT | Determiner | PDT | Predeterminer | VBG | Verb, gerund or present participle |
EX | Existential there | POS | Possessive ending | VBN | Verb, past participle |
FW | Foreign word | PRP | Personal pronoun | VBP | Verb, non-3rd person singular present |
IN | Preposition or subordinating conjunction | PRP\$ | Possessive pronoun | VBZ | Verb, 3rd person singular present |
JJ | Adjective | RB | Adverb | WDT | Wh-determiner |
JJR | Adjective, comparative | RBR | Adverb, comparative | WP | Wh-pronoun |
JJS | Adjective, superlative | RBS | Adverb, superlative | WP\$ | Possessive wh-pronoun |
LS | List item marker | RP | Particle | WRB | Wh-adverb |
MD | Modal | SYM | Symbol | | |
NN | Noun, singular or mass | TO | to | | |
NNS | Noun, plural | UH | Interjection | | |
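The Penn Treebank tags in this table are finer-grained than the universal tagset at the top of the notebook, and each Penn tag collapses into one universal tag. The sketch below illustrates this with a small hand-written dictionary. `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical names, and the dictionary is a partial approximation of the usual convention (for example, modals counted as VERB), not a mapping loaded from NLTK.

```python
# Hand-written, partial mapping from Penn Treebank tags to universal tags.
# Assumption: follows the usual convention (e.g. MD -> VERB, TO -> PRT);
# it is not loaded from NLTK and does not cover every tag.
PTB_TO_UNIVERSAL = {
    'NN': 'NOUN', 'NNS': 'NOUN', 'NNP': 'NOUN', 'NNPS': 'NOUN',
    'VB': 'VERB', 'VBD': 'VERB', 'VBG': 'VERB', 'VBN': 'VERB',
    'VBP': 'VERB', 'VBZ': 'VERB', 'MD': 'VERB',
    'JJ': 'ADJ', 'JJR': 'ADJ', 'JJS': 'ADJ',
    'RB': 'ADV', 'RBR': 'ADV', 'RBS': 'ADV', 'WRB': 'ADV',
    'DT': 'DET', 'PDT': 'DET', 'WDT': 'DET',
    'PRP': 'PRON', 'PRP$': 'PRON', 'WP': 'PRON', 'WP$': 'PRON',
    'IN': 'ADP', 'CC': 'CONJ', 'CD': 'NUM',
    'RP': 'PRT', 'TO': 'PRT', 'POS': 'PRT',
    '.': '.', ',': '.', ':': '.',
}


def to_universal(tagged_tokens):
    """Map (token, penn_tag) pairs to (token, universal_tag) pairs.

    Tags missing from the dictionary fall back to 'X' (other).
    """
    return [(tok, PTB_TO_UNIVERSAL.get(tag, 'X')) for tok, tag in tagged_tokens]


example = [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'),
           ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), ('.', '.')]
print(to_universal(example))
```

If NLTK's `universal_tagset` resource is installed, calling `nltk.pos_tag(tokens, tagset='universal')` should return universal tags directly, without a hand-written table.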
In [10]:
from pprint import pprint
sent = 'Beautiful is better than ugly.'
tokens = nltk.tokenize.word_tokenize(sent)
pos_tags = nltk.pos_tag(tokens)
pprint(pos_tags)
Various algorithms can be used to perform POS tagging. In general, accuracy is high: state-of-the-art taggers reach approximately 97% token-level accuracy. However, taggers still assign some incorrect tags, as we demonstrate below.
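To make "various algorithms" concrete, here is a minimal sketch of one of the simplest approaches: a most-frequent-tag baseline that assigns each token the tag it carried most often in training data. This is not the method behind `nltk.pos_tag`. The function names and the toy training sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sents):
    """Return a dict mapping each token to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for token, tag in sent:
            counts[token][tag] += 1
    return {token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()}


def tag_with_baseline(tokens, model, default_tag='NN'):
    """Tag each token with its most frequent training tag; unseen tokens get default_tag."""
    return [(tok, model.get(tok, default_tag)) for tok in tokens]


# Toy training data, invented for illustration only.
train = [
    [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('join', 'VB'), ('.', '.')],
    [('a', 'DT'), ('director', 'NN'), ('will', 'MD'), ('join', 'VB'),
     ('the', 'DT'), ('group', 'NN'), ('.', '.')],
]
model = train_unigram_tagger(train)
print(tag_with_baseline(['the', 'director', 'will', 'join', 'Elsevier', '.'], model))
```

Because this baseline ignores context, the unseen token `Elsevier` simply receives the default tag `NN` rather than `NNP`. Errors of this kind are one reason practical taggers also use surrounding words and word shape. The cells that follow return to `nltk.pos_tag` and compare its output with reference tags.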
In [11]:
truths = [[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'),
(u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'),
(u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'),
(u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'),
(u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')],
[(u'Mr.', u'NNP'), (u'Vinken', u'NNP'), (u'is', u'VBZ'), (u'chairman', u'NN'),
(u'of', u'IN'), (u'Elsevier', u'NNP'), (u'N.V.', u'NNP'), (u',', u','),
(u'the', u'DT'), (u'Dutch', u'NNP'), (u'publishing', u'VBG'),
(u'group', u'NN'), (u'.', u'.'), (u'Rudolph', u'NNP'), (u'Agnew', u'NNP'),
(u',', u','), (u'55', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'),
(u'and', u'CC'), (u'former', u'JJ'), (u'chairman', u'NN'), (u'of', u'IN'),
(u'Consolidated', u'NNP'), (u'Gold', u'NNP'), (u'Fields', u'NNP'),
(u'PLC', u'NNP'), (u',', u','), (u'was', u'VBD'), (u'named', u'VBN'),
(u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'of', u'IN'),
(u'this', u'DT'), (u'British', u'JJ'), (u'industrial', u'JJ'),
(u'conglomerate', u'NN'), (u'.', u'.')],
[(u'A', u'DT'), (u'form', u'NN'),
(u'of', u'IN'), (u'asbestos', u'NN'), (u'once', u'RB'), (u'used', u'VBN'),
(u'to', u'TO'), (u'make', u'VB'), (u'Kent', u'NNP'), (u'cigarette', u'NN'),
(u'filters', u'NNS'), (u'has', u'VBZ'), (u'caused', u'VBN'), (u'a', u'DT'),
(u'high', u'JJ'), (u'percentage', u'NN'), (u'of', u'IN'),
(u'cancer', u'NN'), (u'deaths', u'NNS'),
(u'among', u'IN'), (u'a', u'DT'), (u'group', u'NN'), (u'of', u'IN'),
(u'workers', u'NNS'), (u'exposed', u'VBN'), (u'to', u'TO'), (u'it', u'PRP'),
(u'more', u'RBR'), (u'than', u'IN'), (u'30', u'CD'), (u'years', u'NNS'),
(u'ago', u'IN'), (u',', u','), (u'researchers', u'NNS'),
(u'reported', u'VBD'), (u'.', u'.')]]
In [12]:
import pandas as pd
def proj(pair_list, idx):
    """Return the idx-th element of every pair in pair_list."""
    return [p[idx] for p in pair_list]
data = []
for truth in truths:
    sent_toks = proj(truth, 0)   # tokens of the sentence
    true_tags = proj(truth, 1)   # reference (gold) tags
    nltk_tags = nltk.pos_tag(sent_toks)
    for i in range(len(sent_toks)):
        # print('{}\t{}\t{}'.format(sent_toks[i], true_tags[i], nltk_tags[i][1]))  # if you do not want to use DataFrame
        data.append((sent_toks[i], true_tags[i], nltk_tags[i][1]))
headers = ['token', 'true_tag', 'nltk_tag']
df = pd.DataFrame(data, columns=headers)
df
Out[12]:
In [13]:
# Select the tokens whose true_tag and nltk_tag differ.
df[df.true_tag != df.nltk_tag]
Out[13]:
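Beyond listing the mismatched rows, it can help to summarize them as a single accuracy figure and a count of which tag pairs get confused. The sketch below does this in plain Python. `tag_accuracy` and `confusion_counts` are hypothetical helper names, and the sample tags are made up for illustration, not actual NLTK output.

```python
from collections import Counter


def tag_accuracy(true_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the reference tag."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError('tag sequences must have the same length')
    if not true_tags:
        return 0.0
    correct = sum(t == p for t, p in zip(true_tags, predicted_tags))
    return correct / len(true_tags)


def confusion_counts(true_tags, predicted_tags):
    """Count each (true_tag, predicted_tag) pair that disagrees."""
    return Counter((t, p) for t, p in zip(true_tags, predicted_tags) if t != p)


# Hypothetical tags for illustration; these are not actual NLTK output.
true_tags = ['DT', 'NNP', 'VBG', 'NN', '.']
pred_tags = ['DT', 'JJ', 'NN', 'NN', '.']
print(tag_accuracy(true_tags, pred_tags))
print(confusion_counts(true_tags, pred_tags).most_common())
```

With the DataFrame built earlier, the same helpers could be applied as `tag_accuracy(list(df.true_tag), list(df.nltk_tag))`.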