Part of Speech Basics

The challenge of correctly identifying parts of speech is summed up nicely in the spaCy docs:

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a **Doc** object, that comes with a variety of annotations.

In this section we'll take a closer look at coarse POS tags (noun, verb, adjective) and fine-grained tags (plural noun, past-tense verb, superlative adjective).


In [1]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Create a simple Doc object
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

View token tags

Recall that you can obtain a particular token by its index position.

  • To view the coarse POS tag use token.pos_
  • To view the fine-grained tag use token.tag_
  • To view the description of either type of tag use spacy.explain(tag)

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.

In [3]:
# Print the full text:
print(doc.text)


The quick brown fox jumped over the lazy dog's back.

In [4]:
# Print the fifth word and associated tags:
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))


jumped VERB VBD verb, past tense

We can apply this technique to the entire Doc object:


In [5]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')


The        DET      DT     determiner
quick      ADJ      JJ     adjective
brown      ADJ      JJ     adjective
fox        NOUN     NN     noun, singular or mass
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective
dog        NOUN     NN     noun, singular or mass
's         PART     POS    possessive ending
back       NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer
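The per-token output above can also be regrouped by coarse tag. The sketch below is plain Python and hard-codes the (text, POS, tag) triples printed above rather than calling spaCy, so the `rows` list and the `group_by_pos` helper are illustrative assumptions, not part of the spaCy API.

```python
from collections import defaultdict

# (text, coarse POS, fine-grained tag) triples copied from the output above
rows = [
    ("The", "DET", "DT"), ("quick", "ADJ", "JJ"), ("brown", "ADJ", "JJ"),
    ("fox", "NOUN", "NN"), ("jumped", "VERB", "VBD"), ("over", "ADP", "IN"),
    ("the", "DET", "DT"), ("lazy", "ADJ", "JJ"), ("dog", "NOUN", "NN"),
    ("'s", "PART", "POS"), ("back", "NOUN", "NN"), (".", "PUNCT", "."),
]

def group_by_pos(rows):
    """Map each coarse POS tag to the token texts that carry it, in order."""
    groups = defaultdict(list)
    for text, pos, _tag in rows:
        groups[pos].append(text)
    return dict(groups)

groups = group_by_pos(rows)
for pos, words in groups.items():
    print(f"{pos:{8}} {words}")
```

With a live Doc, the same grouping could iterate over `(token.text, token.pos_, token.tag_)` for each token instead of the hard-coded list.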

Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

| POS | DESCRIPTION | EXAMPLES |
|---|---|---|
| ADJ | adjective | *big, old, green, incomprehensible, first* |
| ADP | adposition | *in, to, during* |
| ADV | adverb | *very, tomorrow, down, where, there* |
| AUX | auxiliary | *is, has (done), will (do), should (do)* |
| CONJ | conjunction | *and, or, but* |
| CCONJ | coordinating conjunction | *and, or, but* |
| DET | determiner | *a, an, the* |
| INTJ | interjection | *psst, ouch, bravo, hello* |
| NOUN | noun | *girl, cat, tree, air, beauty* |
| NUM | numeral | *1, 2017, one, seventy-seven, IV, MMXIV* |
| PART | particle | *'s, not* |
| PRON | pronoun | *I, you, he, she, myself, themselves, somebody* |
| PROPN | proper noun | *Mary, John, London, NATO, HBO* |
| PUNCT | punctuation | *., (, ), ?* |
| SCONJ | subordinating conjunction | *if, while, that* |
| SYM | symbol | *$, %, §, ©, +, −, ×, ÷, =, :), 😝* |
| VERB | verb | *run, runs, running, eat, ate, eating* |
| X | other | *sfpksdpsxmsa* |
| SPACE | space | |

Fine-grained Part-of-speech Tags

Tokens are subsequently given a fine-grained tag as determined by morphology:

| POS | Description | Fine-grained Tag | Description | Morphology |
|---|---|---|---|---|
| ADJ | adjective | AFX | affix | Hyph=yes |
| ADJ | | JJ | adjective | Degree=pos |
| ADJ | | JJR | adjective, comparative | Degree=comp |
| ADJ | | JJS | adjective, superlative | Degree=sup |
| ADJ | | PDT | predeterminer | AdjType=pdt PronType=prn |
| ADJ | | PRP\$ | pronoun, possessive | PronType=prs Poss=yes |
| ADJ | | WDT | wh-determiner | PronType=int rel |
| ADJ | | WP\$ | wh-pronoun, possessive | Poss=yes PronType=int rel |
| ADP | adposition | IN | conjunction, subordinating or preposition | |
| ADV | adverb | EX | existential there | AdvType=ex |
| ADV | | RB | adverb | Degree=pos |
| ADV | | RBR | adverb, comparative | Degree=comp |
| ADV | | RBS | adverb, superlative | Degree=sup |
| ADV | | WRB | wh-adverb | PronType=int rel |
| CONJ | conjunction | CC | conjunction, coordinating | ConjType=coor |
| DET | determiner | DT | determiner | |
| INTJ | interjection | UH | interjection | |
| NOUN | noun | NN | noun, singular or mass | Number=sing |
| NOUN | | NNS | noun, plural | Number=plur |
| NOUN | | WP | wh-pronoun, personal | PronType=int rel |
| NUM | numeral | CD | cardinal number | NumType=card |
| PART | particle | POS | possessive ending | Poss=yes |
| PART | | RP | adverb, particle | |
| PART | | TO | infinitival to | PartType=inf VerbForm=inf |
| PRON | pronoun | PRP | pronoun, personal | PronType=prs |
| PROPN | proper noun | NNP | noun, proper singular | NounType=prop Number=sing |
| PROPN | | NNPS | noun, proper plural | NounType=prop Number=plur |
| PUNCT | punctuation | -LRB- | left round bracket | PunctType=brck PunctSide=ini |
| PUNCT | | -RRB- | right round bracket | PunctType=brck PunctSide=fin |
| PUNCT | | , | punctuation mark, comma | PunctType=comm |
| PUNCT | | : | punctuation mark, colon or ellipsis | |
| PUNCT | | . | punctuation mark, sentence closer | PunctType=peri |
| PUNCT | | '' | closing quotation mark | PunctType=quot PunctSide=fin |
| PUNCT | | "" | closing quotation mark | PunctType=quot PunctSide=fin |
| PUNCT | | `` | opening quotation mark | PunctType=quot PunctSide=ini |
| PUNCT | | HYPH | punctuation mark, hyphen | PunctType=dash |
| PUNCT | | LS | list item marker | NumType=ord |
| PUNCT | | NFP | superfluous punctuation | |
| SYM | symbol | # | symbol, number sign | SymType=numbersign |
| SYM | | \$ | symbol, currency | SymType=currency |
| SYM | | SYM | symbol | |
| VERB | verb | BES | auxiliary "be" | |
| VERB | | HVS | forms of "have" | |
| VERB | | MD | verb, modal auxiliary | VerbType=mod |
| VERB | | VB | verb, base form | VerbForm=inf |
| VERB | | VBD | verb, past tense | VerbForm=fin Tense=past |
| VERB | | VBG | verb, gerund or present participle | VerbForm=part Tense=pres Aspect=prog |
| VERB | | VBN | verb, past participle | VerbForm=part Tense=past Aspect=perf |
| VERB | | VBP | verb, non-3rd person singular present | VerbForm=fin Tense=pres |
| VERB | | VBZ | verb, 3rd person singular present | VerbForm=fin Tense=pres Number=sing Person=3 |
| X | other | ADD | email | |
| X | | FW | foreign word | Foreign=yes |
| X | | GW | additional word in multi-word expression | |
| X | | XX | unknown | |
| SPACE | space | _SP | space | |
| | | NIL | missing tag | |

For a current list of tags for all languages visit https://spacy.io/api/annotation#pos-tagging
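The Morphology column in the table above packs features as space-separated `Feature=value` pairs. A minimal parsing sketch follows. It assumes that a bare word such as `rel` in `PronType=int rel` is a second value of the preceding feature, which is our reading of the table rather than documented behavior. `parse_morphology` is a hypothetical helper, not a spaCy function.

```python
def parse_morphology(text):
    """Parse 'Feat=val Feat2=val2 extra' into {feature: [values]}."""
    features = {}
    last_key = None
    for part in text.split():
        if "=" in part:
            key, value = part.split("=", 1)
            features.setdefault(key, []).append(value)
            last_key = key
        elif last_key is not None:
            # Assumption: a bare word is another value of the previous feature
            features[last_key].append(part)
    return features

print(parse_morphology("VerbForm=fin Tense=pres Number=sing Person=3"))
print(parse_morphology("Poss=yes PronType=int rel"))
```

Storing values as lists keeps the multi-valued case (`int`, `rel`) and the single-valued case in the same shape.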

Working with POS Tags

In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is "I read books on NLP" present or past tense? Is wind a verb or a noun?


In [6]:
doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')


read       VERB     VBP    verb, non-3rd person singular present

In [7]:
doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')


read       VERB     VBD    verb, past tense

In the first example, with no other cues to work from, spaCy assumed that "read" was present tense.
In the second example the present tense form would be "I am reading a book", so spaCy assigned the past tense.

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.


In [8]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts


Out[8]:
{83: 3, 84: 1, 89: 2, 91: 3, 93: 1, 96: 1, 99: 1}

This isn't very helpful until you decode the attribute ID:


In [9]:
doc.vocab[83].text


Out[9]:
'ADJ'
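An alternative that sidesteps the decoding step is to count the string labels directly. The sketch below uses `collections.Counter` on the coarse tags printed earlier for this sentence; the `pos_labels` list is copied by hand so the block runs without a model. With a live Doc, the same idea would be `Counter(token.pos_ for token in doc)`.

```python
from collections import Counter

# Coarse POS labels for "The quick brown fox jumped over the lazy dog's back."
# copied from the per-token output earlier in this section
pos_labels = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP",
              "DET", "ADJ", "NOUN", "PART", "NOUN", "PUNCT"]

label_counts = Counter(pos_labels)

# Print the labels alphabetically with their counts
for label, count in sorted(label_counts.items()):
    print(f"{label:{5}}: {count}")
```

The trade-off is that the keys are now strings rather than attribute IDs, so they no longer match the dictionary returned by Doc.count_by().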

Create a frequency list of POS tags from the entire document

Since POS_counts is a dictionary, POS_counts.items() gives us its (key, value) pairs.
Sorting those pairs orders them by attribute ID, so we can print each tag alongside its count.


In [10]:
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')


83. ADJ  : 3
84. ADP  : 1
89. DET  : 2
91. NOUN : 3
93. PART : 1
96. PUNCT: 1
99. VERB : 1
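The listing above is ordered by attribute ID. To rank tags by how often they occur instead, one option is to sort on the count. The sketch below reuses the dictionary from Out[8] and a hand-written `id_to_label` map standing in for `doc.vocab[k].text`; both are copied from the outputs above, and the integer IDs may differ between spaCy versions.

```python
# Counts copied from Out[8]; the integer IDs may differ between spaCy versions
POS_counts = {83: 3, 84: 1, 89: 2, 91: 3, 93: 1, 96: 1, 99: 1}

# Hand-written stand-in for doc.vocab[k].text, copied from the listing above
id_to_label = {83: "ADJ", 84: "ADP", 89: "DET", 91: "NOUN",
               93: "PART", 96: "PUNCT", 99: "VERB"}

# Sort by count (descending), breaking ties by ID (ascending)
by_frequency = sorted(POS_counts.items(), key=lambda kv: (-kv[1], kv[0]))

for k, v in by_frequency:
    print(f"{id_to_label[k]:{5}}: {v}")
```

Negating the count in the sort key gives descending counts while keeping ties in ascending ID order, which makes the result deterministic.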

In [11]:
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')


74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1

**Why did the ID numbers get so big?** In spaCy, certain text values are hardcoded into `Doc.vocab` and take up the first several hundred ID numbers. Strings like 'NOUN' and 'VERB' are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.

**Why don't SPACE tags appear?** In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.

In [12]:
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')


399. amod: 3
412. det : 2
426. nsubj: 1
436. pobj: 1
437. poss: 1
440. prep: 1
442. punct: 1
8110129090154140942. case: 1
8206900633647566924. ROOT: 1
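The columns in the listing above drift because labels such as nsubj and punct are wider than the 4-character field. One possible fix is to size the field from the longest label. The sketch hard-codes the (ID, label, count) rows from the output above; `dep_rows` and `format_counts` are hypothetical names, and the IDs are left out of the printed lines because their widths vary so much.

```python
# (ID, label, count) rows copied from the output above
dep_rows = [
    (399, "amod", 3), (412, "det", 2), (426, "nsubj", 1),
    (436, "pobj", 1), (437, "poss", 1), (440, "prep", 1),
    (442, "punct", 1), (8110129090154140942, "case", 1),
    (8206900633647566924, "ROOT", 1),
]

def format_counts(rows):
    """Return one aligned line per row, padding labels to the longest one."""
    width = max(len(label) for _id, label, _count in rows)
    return [f"{label:{width}}: {count}" for _id, label, count in rows]

lines = format_counts(dep_rows)
print("\n".join(lines))
```

This relies on the same nested-field f-string syntax used in the cells above, with the width computed rather than fixed.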

Here we've shown spacy.attrs.POS, spacy.attrs.TAG and spacy.attrs.DEP.
Refer back to the Vocabulary and Matching lecture from the previous section for a table of other token attributes.


Fine-grained POS Tag Examples

These are some grammatical examples (shown in bold) of specific fine-grained tags. We've removed punctuation and rarely used tags:

| POS | TAG | DESCRIPTION | EXAMPLE |
|---|---|---|---|
| ADJ | AFX | affix | The Flintstones were a **pre**-historic family. |
| ADJ | JJ | adjective | This is a **good** sentence. |
| ADJ | JJR | adjective, comparative | This is a **better** sentence. |
| ADJ | JJS | adjective, superlative | This is the **best** sentence. |
| ADJ | PDT | predeterminer | Waking up is **half** the battle. |
| ADJ | PRP\$ | pronoun, possessive | **His** arm hurts. |
| ADJ | WDT | wh-determiner | It's blue, **which** is odd. |
| ADJ | WP\$ | wh-pronoun, possessive | We don't know **whose** it is. |
| ADP | IN | conjunction, subordinating or preposition | It arrived **in** a box. |
| ADV | EX | existential there | **There** is cake. |
| ADV | RB | adverb | He ran **quickly**. |
| ADV | RBR | adverb, comparative | He ran **quicker**. |
| ADV | RBS | adverb, superlative | He ran **fastest**. |
| ADV | WRB | wh-adverb | **When** was that? |
| CONJ | CC | conjunction, coordinating | The balloon popped **and** everyone jumped. |
| DET | DT | determiner | **This** is **a** sentence. |
| INTJ | UH | interjection | **Um**, I don't know. |
| NOUN | NN | noun, singular or mass | This is a **sentence**. |
| NOUN | NNS | noun, plural | These are **words**. |
| NOUN | WP | wh-pronoun, personal | **Who** was that? |
| NUM | CD | cardinal number | I want **three** things. |
| PART | POS | possessive ending | Fred**'s** name is short. |
| PART | RP | adverb, particle | Put it **back**! |
| PART | TO | infinitival to | I want **to** go. |
| PRON | PRP | pronoun, personal | **I** want **you** to go. |
| PROPN | NNP | noun, proper singular | **Kilroy** was here. |
| PROPN | NNPS | noun, proper plural | The **Flintstones** were a pre-historic family. |
| VERB | MD | verb, modal auxiliary | This **could** work. |
| VERB | VB | verb, base form | I want to **go**. |
| VERB | VBD | verb, past tense | This **was** a sentence. |
| VERB | VBG | verb, gerund or present participle | I am **going**. |
| VERB | VBN | verb, past participle | The treasure was **lost**. |
| VERB | VBP | verb, non-3rd person singular present | I **want** to go. |
| VERB | VBZ | verb, 3rd person singular present | He **wants** to go. |

Up Next: Visualizing POS