Part of Speech Basics

The challenge of correctly identifying parts of speech is summed up nicely in the spaCy docs:

Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a **Doc** object, that comes with a variety of annotations.

In this section we'll take a closer look at coarse POS tags (noun, verb, adjective) and fine-grained tags (plural noun, past-tense verb, superlative adjective).



In [1]:

    
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')



In [2]:

    
# Create a simple Doc object
doc = nlp("The quick brown fox jumped over the lazy dog's back.")

View token tags

Recall that you can obtain a particular token by its index position.

To view the coarse POS tag use `token.pos_`
To view the fine-grained tag use `token.tag_`
To view the description of either type of tag use `spacy.explain(tag)`

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.



In [3]:

    
# Print the full text:
print(doc.text)









    



The quick brown fox jumped over the lazy dog's back.



In [4]:

    
# Print the fifth word and associated tags:
print(doc[4].text, doc[4].pos_, doc[4].tag_, spacy.explain(doc[4].tag_))









    



jumped VERB VBD verb, past tense

We can apply this technique to the entire Doc object:



In [5]:

    
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')









    



The        DET      DT     determiner
quick      ADJ      JJ     adjective
brown      ADJ      JJ     adjective
fox        NOUN     NN     noun, singular or mass
jumped     VERB     VBD    verb, past tense
over       ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
lazy       ADJ      JJ     adjective
dog        NOUN     NN     noun, singular or mass
's         PART     POS    possessive ending
back       NOUN     NN     noun, singular or mass
.          PUNCT    .      punctuation mark, sentence closer
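The nested braces in the f-string above, such as `{token.text:{10}}`, set a minimum field width, which is what lines the columns up. Here is a minimal sketch of the same formatting, using a few rows hard-coded from the output above so that it runs without spaCy. The `rows` list and the `format_row` helper are illustrative names, not part of spaCy:

```python
# Rows copied from the output above: (text, coarse POS, fine-grained tag, description)
rows = [
    ("The", "DET", "DT", "determiner"),
    ("fox", "NOUN", "NN", "noun, singular or mass"),
    ("jumped", "VERB", "VBD", "verb, past tense"),
]

def format_row(text, pos, tag, description):
    """Pad the first three fields to fixed widths (strings are left-aligned by default)."""
    return f"{text:{10}} {pos:{8}} {tag:{6}} {description}"

lines = [format_row(*row) for row in rows]
for line in lines:
    print(line)
```

Writing `{text:{10}}` is equivalent to `{text:10}`; the inner braces are only needed when the width comes from a variable.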

Coarse-grained Part-of-speech Tags

Every token is assigned a POS Tag from the following list:

POS	DESCRIPTION	EXAMPLES
ADJ	adjective	*big, old, green, incomprehensible, first*
ADP	adposition	*in, to, during*
ADV	adverb	*very, tomorrow, down, where, there*
AUX	auxiliary	*is, has (done), will (do), should (do)*
CONJ	conjunction	*and, or, but*
CCONJ	coordinating conjunction	*and, or, but*
DET	determiner	*a, an, the*
INTJ	interjection	*psst, ouch, bravo, hello*
NOUN	noun	*girl, cat, tree, air, beauty*
NUM	numeral	*1, 2017, one, seventy-seven, IV, MMXIV*
PART	particle	*'s, not*
PRON	pronoun	*I, you, he, she, myself, themselves, somebody*
PROPN	proper noun	*Mary, John, London, NATO, HBO*
PUNCT	punctuation	*., (, ), ?*
SCONJ	subordinating conjunction	*if, while, that*
SYM	symbol	*$, %, §, ©, +, −, ×, ÷, =, :), 😝*
VERB	verb	*run, runs, running, eat, ate, eating*
X	other	*sfpksdpsxmsa*
SPACE	space

Fine-grained Part-of-speech Tags

Tokens are subsequently given a fine-grained tag as determined by morphology:

POS	Description	Fine-grained Tag	Description	Morphology
ADJ	adjective	AFX	affix	Hyph=yes
ADJ		JJ	adjective	Degree=pos
ADJ		JJR	adjective, comparative	Degree=comp
ADJ		JJS	adjective, superlative	Degree=sup
ADJ		PDT	predeterminer	AdjType=pdt PronType=prn
ADJ		PRP\$	pronoun, possessive	PronType=prs Poss=yes
ADJ		WDT	wh-determiner	PronType=int rel
ADJ		WP\$	wh-pronoun, possessive	Poss=yes PronType=int rel
ADP	adposition	IN	conjunction, subordinating or preposition
ADV	adverb	EX	existential there	AdvType=ex
ADV		RB	adverb	Degree=pos
ADV		RBR	adverb, comparative	Degree=comp
ADV		RBS	adverb, superlative	Degree=sup
ADV		WRB	wh-adverb	PronType=int rel
CONJ	conjunction	CC	conjunction, coordinating	ConjType=coor
DET	determiner	DT	determiner
INTJ	interjection	UH	interjection
NOUN	noun	NN	noun, singular or mass	Number=sing
NOUN		NNS	noun, plural	Number=plur
NOUN		WP	wh-pronoun, personal	PronType=int rel
NUM	numeral	CD	cardinal number	NumType=card
PART	particle	POS	possessive ending	Poss=yes
PART		RP	adverb, particle
PART		TO	infinitival to	PartType=inf VerbForm=inf
PRON	pronoun	PRP	pronoun, personal	PronType=prs
PROPN	proper noun	NNP	noun, proper singular	NounType=prop Number=sing
PROPN		NNPS	noun, proper plural	NounType=prop Number=plur
PUNCT	punctuation	-LRB-	left round bracket	PunctType=brck PunctSide=ini
PUNCT		-RRB-	right round bracket	PunctType=brck PunctSide=fin
PUNCT		,	punctuation mark, comma	PunctType=comm
PUNCT		:	punctuation mark, colon or ellipsis
PUNCT		.	punctuation mark, sentence closer	PunctType=peri
PUNCT		''	closing quotation mark	PunctType=quot PunctSide=fin
PUNCT		""	closing quotation mark	PunctType=quot PunctSide=fin
PUNCT		``	opening quotation mark	PunctType=quot PunctSide=ini
PUNCT		HYPH	punctuation mark, hyphen	PunctType=dash
PUNCT		LS	list item marker	NumType=ord
PUNCT		NFP	superfluous punctuation
SYM	symbol	#	symbol, number sign	SymType=numbersign
SYM		\$	symbol, currency	SymType=currency
SYM		SYM	symbol
VERB	verb	BES	auxiliary "be"
VERB		HVS	forms of "have"
VERB		MD	verb, modal auxiliary	VerbType=mod
VERB		VB	verb, base form	VerbForm=inf
VERB		VBD	verb, past tense	VerbForm=fin Tense=past
VERB		VBG	verb, gerund or present participle	VerbForm=part Tense=pres Aspect=prog
VERB		VBN	verb, past participle	VerbForm=part Tense=past Aspect=perf
VERB		VBP	verb, non-3rd person singular present	VerbForm=fin Tense=pres
VERB		VBZ	verb, 3rd person singular present	VerbForm=fin Tense=pres Number=sing Person=3
X	other	ADD	email
X		FW	foreign word	Foreign=yes
X		GW	additional word in multi-word expression
X		XX	unknown
SPACE	space	_SP	space
		NIL	missing tag

For a current list of tags for all languages visit https://spacy.io/api/annotation#pos-tagging

Working with POS Tags

In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence. Is "I read books on NLP" present or past tense? Is wind a verb or a noun?



In [6]:

    
doc = nlp('I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')









    



read       VERB     VBP    verb, non-3rd person singular present



In [7]:

    
doc = nlp('I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')









    



read       VERB     VBD    verb, past tense

In the first example, with no other cues to work from, spaCy assumed that "read" was present tense.
In the second example, the natural present-tense form would be "I am reading a book", so spaCy assigned the past tense.

Counting POS Tags

The Doc.count_by() method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.



In [8]:

    
doc = nlp("The quick brown fox jumped over the lazy dog's back.")

# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts









    Out[8]:





{83: 3, 84: 1, 89: 2, 91: 3, 93: 1, 96: 1, 99: 1}

This isn't very helpful until you decode the attribute ID:



In [9]:

    
doc.vocab[83].text









    Out[9]:





'ADJ'
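If readable keys matter more than integer IDs, one alternative is to count the string labels directly with `collections.Counter`. The sketch below hard-codes the coarse tags from the output printed earlier so that it runs on its own; with a live Doc the list could presumably be built as `[token.pos_ for token in doc]` instead. The names `pos_labels` and `label_counts` are illustrative:

```python
from collections import Counter

# Coarse POS tags of "The quick brown fox jumped over the lazy dog's back.",
# copied in token order from the output shown earlier.
pos_labels = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP",
              "DET", "ADJ", "NOUN", "PART", "NOUN", "PUNCT"]

# Counter maps each distinct label to the number of times it occurs
label_counts = Counter(pos_labels)
print(dict(label_counts))
```

Like the dictionary returned by `Doc.count_by()`, the counter only stores labels that actually occur in the text.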

Create a frequency list of POS tags from the entire document

Since POS_counts is a dictionary, POS_counts.items() gives us its (key, value) pairs.
Sorting those pairs orders them by key, so we get each tag ID and its count in ascending ID order.



In [10]:

    
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')









    



83. ADJ  : 3
84. ADP  : 1
89. DET  : 2
91. NOUN : 3
93. PART : 1
96. PUNCT: 1
99. VERB : 1
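Sorting the items by key orders them by attribute ID, which is rarely the order of interest. Here is a small sketch that sorts by frequency instead, using the counts and labels from the output above as literal dictionaries so that it runs without spaCy. `id_to_label` is a hypothetical stand-in for the `doc.vocab[k].text` lookup, and the integer IDs are the ones printed in this notebook, which may differ in other spaCy versions:

```python
# Counts as returned by doc.count_by(spacy.attrs.POS) earlier in this notebook
POS_counts = {83: 3, 84: 1, 89: 2, 91: 3, 93: 1, 96: 1, 99: 1}

# Stand-in for doc.vocab[k].text, copied from the decoded output above
id_to_label = {83: "ADJ", 84: "ADP", 89: "DET", 91: "NOUN",
               93: "PART", 96: "PUNCT", 99: "VERB"}

# Most frequent first; ties are broken by the integer ID
by_frequency = sorted(POS_counts.items(), key=lambda kv: (-kv[1], kv[0]))

for k, v in by_frequency:
    print(f"{id_to_label[k]:{5}}: {v}")
```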



In [11]:

    
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')









    



74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1

**Why did the ID numbers get so big?** In spaCy, certain text values are hardcoded into `Doc.vocab` and take up the first several hundred ID numbers. Strings like 'NOUN' and 'VERB' are used frequently by internal operations. Others, like fine-grained tags, are assigned hash values as needed.

**Why don't SPACE tags appear?** In spaCy, only strings of spaces (two or more) are assigned tokens. Single spaces are not.



In [12]:

    
# Count the different dependencies:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')









    



399. amod: 3
412. det : 2
426. nsubj: 1
436. pobj: 1
437. poss: 1
440. prep: 1
442. punct: 1
8110129090154140942. case: 1
8206900633647566924. ROOT: 1

Here we've shown spacy.attrs.POS, spacy.attrs.TAG and spacy.attrs.DEP.
Refer back to the Vocabulary and Matching lecture from the previous section for a table of other token attributes.

Fine-grained POS Tag Examples

These are some grammatical examples (shown in bold) of specific fine-grained tags. We've removed punctuation and rarely used tags:

POS	TAG	DESCRIPTION	EXAMPLE
ADJ	AFX	affix	The Flintstones were a **pre**-historic family.
ADJ	JJ	adjective	This is a **good** sentence.
ADJ	JJR	adjective, comparative	This is a **better** sentence.
ADJ	JJS	adjective, superlative	This is the **best** sentence.
ADJ	PDT	predeterminer	Waking up is **half** the battle.
ADJ	PRP\$	pronoun, possessive	**His** arm hurts.
ADJ	WDT	wh-determiner	It's blue, **which** is odd.
ADJ	WP\$	wh-pronoun, possessive	We don't know **whose** it is.
ADP	IN	conjunction, subordinating or preposition	It arrived **in** a box.
ADV	EX	existential there	**There** is cake.
ADV	RB	adverb	He ran **quickly**.
ADV	RBR	adverb, comparative	He ran **quicker**.
ADV	RBS	adverb, superlative	He ran **fastest**.
ADV	WRB	wh-adverb	**When** was that?
CONJ	CC	conjunction, coordinating	The balloon popped **and** everyone jumped.
DET	DT	determiner	**This** is **a** sentence.
INTJ	UH	interjection	**Um**, I don't know.
NOUN	NN	noun, singular or mass	This is a **sentence**.
NOUN	NNS	noun, plural	These are **words**.
NOUN	WP	wh-pronoun, personal	**Who** was that?
NUM	CD	cardinal number	I want **three** things.
PART	POS	possessive ending	Fred**'s** name is short.
PART	RP	adverb, particle	Put it **back**!
PART	TO	infinitival to	I want **to** go.
PRON	PRP	pronoun, personal	**I** want **you** to go.
PROPN	NNP	noun, proper singular	**Kilroy** was here.
PROPN	NNPS	noun, proper plural	The **Flintstones** were a pre-historic family.
VERB	MD	verb, modal auxiliary	This **could** work.
VERB	VB	verb, base form	I want to **go**.
VERB	VBD	verb, past tense	This **was** a sentence.
VERB	VBG	verb, gerund or present participle	I am **going**.
VERB	VBN	verb, past participle	The treasure was **lost**.
VERB	VBP	verb, non-3rd person singular present	I **want** to go.
VERB	VBZ	verb, 3rd person singular present	He **wants** to go.