Part of Speech Tags

In this notebook, we learn more about part-of-speech (POS) tags.

Tagsets and Examples

Universal tagset: (thanks to http://www.tablesgenerator.com/markdown_tables)

Tag	Meaning	English Examples
ADJ	adjective	new, good, high, special, big, local
ADP	adposition	on, of, at, with, by, into, under
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner, article	the, a, some, most, every, no, which
NOUN	noun	year, home, costs, time, Africa
NUM	numeral	twenty-four, fourth, 1991, 14:24
PRT	particle	at, on, out, over, per, that, up, with
PRON	pronoun	he, their, her, its, my, I, us
VERB	verb	is, say, told, given, playing, would
.	punctuation marks	. , ; !
X	other	ersatz, esprit, dunno, gr8, univeristy

We list the upenn (a.k.a. Penn Treebank) tagset below. NLTK also documents two other tagsets:

brown: use nltk.help.brown_tagset()
claws5: use nltk.help.claws5_tagset()



In [1]:

    
import nltk



In [2]:

    
nltk.help.upenn_tagset()









    



$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
JJR: adjective, comparative
    bleaker braver breezier briefer brighter brisker broader bumper busier
    calmer cheaper choosier cleaner clearer closer colder commoner costlier
    cozier creamier crunchier cuter ...
JJS: adjective, superlative
    calmest cheapest choicest classiest cleanest clearest closest commonest
    corniest costliest crassest creepiest crudest cutest darkest deadliest
    dearest deepest densest dinkiest ...
LS: list item marker
    A A. B B. C C. D E F First G H I J K One SP-44001 SP-44002 SP-44005
    SP-44007 Second Third Three Two * a b c d first five four one six three
    two
MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
PDT: pre-determiner
    all both half many quite such sure this
POS: genitive marker
    ' 's
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
PRP$: pronoun, possessive
    her his mine my our ours their thy your
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst
RP: particle
    aboard about across along apart around aside at away back before behind
    by crop down ever fast for forth from go high i.e. in into just later
    low more off on open out over per pie raising start teeth that through
    under unto up up-pp upon whole with you
SYM: symbol
    % & ' '' ''. ) ). * + ,. < = > @ A[fj] U.S U.S.S.R * ** ***
TO: "to" as preposition or infinitive marker
    to
UH: interjection
    Goodbye Goody Gosh Wow Jeepers Jee-sus Hubba Hey Kee-reist Oops amen
    huh howdy uh dammit whammo shucks heck anyways whodunnit honey golly
    man baby diddle hush sonuvabitch ...
VB: verb, base form
    ask assemble assess assign assume atone attention avoid bake balkanize
    bank begin behold believe bend benefit bevel beware bless boil bomb
    boost brace break bring broil brush build ...
VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
WDT: WH-determiner
    that what whatever which whichever
WP: WH-pronoun
    that what whatever whatsoever which who whom whosoever
WP$: WH-pronoun, possessive
    whose
WRB: Wh-adverb
    how however whence whenever where whereby whereever wherein whereof why
``: opening quotation mark
    ` ``



In [3]:

    
nltk.help.upenn_tagset('WP$')









    



WP$: WH-pronoun, possessive
    whose



In [4]:

    
nltk.help.upenn_tagset('PDT')









    



PDT: pre-determiner
    all both half many quite such sure this



In [5]:

    
nltk.help.upenn_tagset('DT')









    



DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those



In [6]:

    
nltk.help.upenn_tagset('POS')









    



POS: genitive marker
    ' 's



In [7]:

    
nltk.help.upenn_tagset('RBR')









    



RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...



In [8]:

    
nltk.help.upenn_tagset('RBS')









    



RBS: adverb, superlative
    best biggest bluntest earliest farthest first furthest hardest
    heartiest highest largest least less most nearest second tightest worst



In [9]:

    
nltk.help.upenn_tagset('MD')









    



MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would
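
The lookups above each pass one exact tag. `nltk.help.upenn_tagset` also accepts a regular expression (for example `'NN.*'`), which avoids repeating the call for related tags. The sketch below imitates that kind of filtering in plain Python so that it runs without the NLTK data files. The `TAG_MEANINGS` dictionary and the `lookup` helper are hypothetical names introduced for illustration, the dictionary is only a hand-copied excerpt of the listing above, and the matching rule (exact tag first, then `re.match`) is an assumption of this sketch.

```python
import re

# Hand-copied excerpt of the Penn Treebank listing above (not the full tagset).
TAG_MEANINGS = {
    'NN': 'noun, common, singular or mass',
    'NNP': 'noun, proper, singular',
    'NNPS': 'noun, proper, plural',
    'NNS': 'noun, common, plural',
    'RB': 'adverb',
    'RBR': 'adverb, comparative',
    'RBS': 'adverb, superlative',
    'WP$': 'WH-pronoun, possessive',
}

def lookup(pattern):
    """Return (tag, meaning) pairs for an exact tag, else for a regex match."""
    # Try the exact tag first: 'WP$' is a valid tag, but as a regex the '$'
    # would mean "end of string" and the tag would never be found.
    if pattern in TAG_MEANINGS:
        return [(pattern, TAG_MEANINGS[pattern])]
    regex = re.compile(pattern)
    return [(tag, TAG_MEANINGS[tag])
            for tag in sorted(TAG_MEANINGS) if regex.match(tag)]

for tag, meaning in lookup('RB.*'):
    print('{}: {}'.format(tag, meaning))
```

With this excerpt, `lookup('RB.*')` returns the three adverb tags, while `lookup('WP$')` returns the single possessive WH-pronoun entry.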

The same tags as a summary table (see also https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html):

Tag	Meaning	Tag	Meaning	Tag	Meaning
CC	Coordinating conjunction	NNP	Proper noun, singular	VB	Verb, base form
CD	Cardinal number	NNPS	Proper noun, plural	VBD	Verb, past tense
DT	Determiner	PDT	Predeterminer	VBG	Verb, gerund or present participle
EX	Existential there	POS	Possessive ending	VBN	Verb, past participle
FW	Foreign word	PRP	Personal pronoun	VBP	Verb, non-3rd person singular present
IN	Preposition or subordinating conjunction	PRP$	Possessive pronoun	VBZ	Verb, 3rd person singular present
JJ	Adjective	RB	Adverb	WDT	Wh-determiner
JJR	Adjective, comparative	RBR	Adverb, comparative	WP	Wh-pronoun
JJS	Adjective, superlative	RBS	Adverb, superlative	WP$	Possessive wh-pronoun
LS	List item marker	RP	Particle	WRB	Wh-adverb
MD	Modal	SYM	Symbol
NN	Noun, singular or mass	TO	to
NNS	Noun, plural	UH	Interjection
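
To relate the two tagsets shown so far, the sketch below collapses each Penn Treebank tag into one of the coarse universal tags. The grouping is hand-written for illustration and is an assumption of this sketch, not a copy of NLTK's own mapping table; `GROUPS`, `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical names. In practice, `nltk.pos_tag(tokens, tagset='universal')` returns universal tags directly.

```python
# Hand-written grouping of the 36 tags in the summary table above.
# This is an illustrative assumption, not NLTK's official mapping.
GROUPS = {
    'NOUN': 'NN NNS NNP NNPS',
    'VERB': 'VB VBD VBG VBN VBP VBZ MD',
    'ADJ': 'JJ JJR JJS',
    'ADV': 'RB RBR RBS WRB',
    'ADP': 'IN',
    'DET': 'DT PDT WDT EX',
    'CONJ': 'CC',
    'NUM': 'CD',
    'PRON': 'PRP PRP$ WP WP$',
    'PRT': 'RP TO POS',
    'X': 'FW LS SYM UH',
}

# Invert the grouping into a flat {penn_tag: universal_tag} dictionary.
PTB_TO_UNIVERSAL = {ptb: uni
                    for uni, tags in GROUPS.items()
                    for ptb in tags.split()}

def to_universal(ptb_tag):
    """Map a Penn Treebank tag to a universal tag.

    Tags missing from the dictionary are treated as punctuation ('.'),
    which covers the remaining entries of the listing ($ '' ( ) , -- . : ``).
    """
    return PTB_TO_UNIVERSAL.get(ptb_tag, '.')

# A small hand-tagged example (invented for illustration).
tagged = [('the', 'DT'), ('board', 'NN'), ('will', 'MD'), ('join', 'VB'), ('.', '.')]
print([(word, to_universal(tag)) for word, tag in tagged])
```

A flat dictionary is used instead of prefix matching so that look-alike tags such as `RP` and `PRP` cannot be confused.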

Tagging a sentence



In [10]:

    
from pprint import pprint

sent = 'Beautiful is better than ugly.'
tokens = nltk.tokenize.word_tokenize(sent)
pos_tags = nltk.pos_tag(tokens)
pprint(pos_tags)









    



[('Beautiful', 'NNP'),
 ('is', 'VBZ'),
 ('better', 'JJR'),
 ('than', 'IN'),
 ('ugly', 'RB'),
 ('.', '.')]

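Various algorithms can be used to perform POS tagging. Accuracy is generally high (state-of-the-art taggers reach approximately 97% per token), but some tags are still wrong: in the output above, for example, the adjectives "Beautiful" and "ugly" come out as NNP and RB. We demonstrate this on a larger sample below.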
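
Accuracy figures like the one quoted above are usually computed per token: the share of tokens whose predicted tag equals the reference ("gold") tag. A minimal sketch of that calculation follows; `tag_accuracy` is a hypothetical helper, and the two short tag lists are toy data invented for illustration.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError('tag sequences must have the same length')
    if not gold_tags:
        raise ValueError('cannot score an empty sequence')
    # Count the positions where both sequences agree.
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return float(correct) / len(gold_tags)

# Toy data, invented for illustration: one of the five tags is wrong.
gold = ['DT', 'NN', 'VBZ', 'JJ', '.']
pred = ['DT', 'NN', 'VBZ', 'RB', '.']
print(tag_accuracy(gold, pred))  # 0.8
```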



In [11]:

    
truths = [[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'),
            (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'),
            (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'),
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'),
            (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')],
        [(u'Mr.', u'NNP'), (u'Vinken', u'NNP'), (u'is', u'VBZ'), (u'chairman', u'NN'),
            (u'of', u'IN'), (u'Elsevier', u'NNP'), (u'N.V.', u'NNP'), (u',', u','),
            (u'the', u'DT'), (u'Dutch', u'NNP'), (u'publishing', u'VBG'),
            (u'group', u'NN'), (u'.', u'.'), (u'Rudolph', u'NNP'), (u'Agnew', u'NNP'),
            (u',', u','), (u'55', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'),
            (u'and', u'CC'), (u'former', u'JJ'), (u'chairman', u'NN'), (u'of', u'IN'),
            (u'Consolidated', u'NNP'), (u'Gold', u'NNP'), (u'Fields', u'NNP'),
            (u'PLC', u'NNP'), (u',', u','), (u'was', u'VBD'), (u'named', u'VBN'),
            (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'of', u'IN'),
            (u'this', u'DT'), (u'British', u'JJ'), (u'industrial', u'JJ'),
            (u'conglomerate', u'NN'), (u'.', u'.')],
        [(u'A', u'DT'), (u'form', u'NN'),
            (u'of', u'IN'), (u'asbestos', u'NN'), (u'once', u'RB'), (u'used', u'VBN'),
            (u'to', u'TO'), (u'make', u'VB'), (u'Kent', u'NNP'), (u'cigarette', u'NN'),
            (u'filters', u'NNS'), (u'has', u'VBZ'), (u'caused', u'VBN'), (u'a', u'DT'),
            (u'high', u'JJ'), (u'percentage', u'NN'), (u'of', u'IN'),
            (u'cancer', u'NN'), (u'deaths', u'NNS'),
            (u'among', u'IN'), (u'a', u'DT'), (u'group', u'NN'), (u'of', u'IN'),
            (u'workers', u'NNS'), (u'exposed', u'VBN'), (u'to', u'TO'), (u'it', u'PRP'),
            (u'more', u'RBR'), (u'than', u'IN'), (u'30', u'CD'), (u'years', u'NNS'),
            (u'ago', u'IN'), (u',', u','), (u'researchers', u'NNS'),
            (u'reported', u'VBD'), (u'.', u'.')]]



In [12]:

    
import pandas as pd

def proj(pair_list, idx):
    """Return the idx-th element of every tuple in pair_list."""
    return [p[idx] for p in pair_list]

data = []
for truth in truths:
    sent_toks = proj(truth, 0)                    # tokens of the sentence
    true_tags = proj(truth, 1)                    # reference tags
    nltk_tags = proj(nltk.pos_tag(sent_toks), 1)  # tags predicted by NLTK
    for tok, true_tag, nltk_tag in zip(sent_toks, true_tags, nltk_tags):
        # print('{}\t{}\t{}'.format(tok, true_tag, nltk_tag))  # if you do not want to use DataFrame
        data.append((tok, true_tag, nltk_tag))

headers = ['token', 'true_tag', 'nltk_tag']
df = pd.DataFrame(data, columns=headers)
df









    Out[12]:

	token	true_tag	nltk_tag
0	Pierre	NNP	NNP
1	Vinken	NNP	NNP
2	,	,	,
3	61	CD	CD
4	years	NNS	NNS
5	old	JJ	JJ
6	,	,	,
7	will	MD	MD
8	join	VB	VB
9	the	DT	DT
10	board	NN	NN
11	as	IN	IN
12	a	DT	DT
13	nonexecutive	JJ	JJ
14	director	NN	NN
15	Nov.	NNP	NNP
16	29	CD	CD
17	.	.	.
18	Mr.	NNP	NNP
19	Vinken	NNP	NNP
20	is	VBZ	VBZ
21	chairman	NN	NN
22	of	IN	IN
23	Elsevier	NNP	NNP
24	N.V.	NNP	NNP
25	,	,	,
26	the	DT	DT
27	Dutch	NNP	NNP
28	publishing	VBG	NN
29	group	NN	NN
...	...	...	...
63	to	TO	TO
64	make	VB	VB
65	Kent	NNP	NNP
66	cigarette	NN	NN
67	filters	NNS	NNS
68	has	VBZ	VBZ
69	caused	VBN	VBN
70	a	DT	DT
71	high	JJ	JJ
72	percentage	NN	NN
73	of	IN	IN
74	cancer	NN	NN
75	deaths	NNS	NNS
76	among	IN	IN
77	a	DT	DT
78	group	NN	NN
79	of	IN	IN
80	workers	NNS	NNS
81	exposed	VBN	VBN
82	to	TO	TO
83	it	PRP	PRP
84	more	RBR	JJR
85	than	IN	IN
86	30	CD	CD
87	years	NNS	NNS
88	ago	IN	RB
89	,	,	,
90	researchers	NNS	NNS
91	reported	VBD	VBD
92	.	.	.

93 rows × 3 columns



In [13]:

    
# Select the tokens whose true_tag and nltk_tag differ.
df[df.true_tag != df.nltk_tag]









    Out[13]:

	token	true_tag	nltk_tag
28	publishing	VBG	NN
62	used	VBN	VBD
84	more	RBR	JJR
88	ago	IN	RB
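
To see which tags get confused with which, the disagreements can be counted as (true tag, predicted tag) pairs. The sketch below does this for the four rows shown above, copied by hand so that it runs without NLTK or pandas; `mismatches`, `total_tokens` and `confusions` are hypothetical names introduced for illustration.

```python
from collections import Counter

# The four disagreements shown above, copied as (token, true_tag, nltk_tag).
mismatches = [
    ('publishing', 'VBG', 'NN'),
    ('used', 'VBN', 'VBD'),
    ('more', 'RBR', 'JJR'),
    ('ago', 'IN', 'RB'),
]
total_tokens = 93  # number of rows in the full comparison table

# Count how often each (true tag, predicted tag) confusion occurs.
confusions = Counter((true, pred) for _, true, pred in mismatches)

# Token-level accuracy on this sample: 89 of the 93 tags agree.
accuracy = 1 - len(mismatches) / float(total_tokens)

for (true, pred), count in sorted(confusions.items()):
    print('{} tagged as {}: {}'.format(true, pred, count))
print('accuracy: {:.1%}'.format(accuracy))  # accuracy: 95.7%
```

On the DataFrame itself, `(df.true_tag == df.nltk_tag).mean()` should give the same accuracy. With only three sentences the figure is a rough indication rather than a benchmark.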


