Parts of Speech Assessment

For this assessment we'll be using the short story The Tale of Peter Rabbit by Beatrix Potter (1902).
The story is in the public domain; the text file was obtained from Project Gutenberg.


In [3]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

1. Create a Doc object from the file peterrabbit.txt

HINT: Use with open('../TextFiles/peterrabbit.txt') as f:


In [4]:
with open('../TextFiles/peterrabbit.txt') as f:
    doc = nlp(f.read())

2. For every token in the third sentence, print the token text, the POS tag, the fine-grained TAG tag, and the description of the fine-grained tag.


In [16]:
# Enter your code here:

# Zero-based indexing: [3] selects the fourth sentence spaCy detected, which is the one printed below.
for token in list(doc.sents)[3]:
    print(f"{token.text:{15}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_)} ")


They            PRON       PRP        pronoun, personal 
lived           VERB       VBD        verb, past tense 
with            ADP        IN         conjunction, subordinating or preposition 
their           DET        PRP$       pronoun, possessive 
Mother          PROPN      NNP        noun, proper singular 
in              ADP        IN         conjunction, subordinating or preposition 
a               DET        DT         determiner 
sand            NOUN       NN         noun, singular or mass 
-               PUNCT      HYPH       punctuation mark, hyphen 
bank            NOUN       NN         noun, singular or mass 
,               PUNCT      ,          punctuation mark, comma 
underneath      ADP        IN         conjunction, subordinating or preposition 
the             DET        DT         determiner 
root            NOUN       NN         noun, singular or mass 
of              ADP        IN         conjunction, subordinating or preposition 
a               DET        DT         determiner 

               SPACE      _SP        None 
very            ADV        RB         adverb 
big             ADJ        JJ         adjective 
fir             NOUN       NN         noun, singular or mass 
-               PUNCT      HYPH       punctuation mark, hyphen 
tree            NOUN       NN         noun, singular or mass 
.               PUNCT      .          punctuation mark, sentence closer 


              SPACE      _SP        None 
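
The column alignment in the output above comes from the nested width specifiers in the f-string. As a minimal sketch of how that formatting works without spaCy, the snippet below formats a few rows copied from the output; the `rows` and `lines` names are illustrative only and are not part of the notebook.

```python
# Rows copied from the output above: (token text, coarse POS, fine-grained tag).
rows = [
    ("They", "PRON", "PRP"),
    ("lived", "VERB", "VBD"),
    ("sand", "NOUN", "NN"),
]

# {value:{15}} pads the value to a minimum width of 15 characters.
# Strings are left-aligned by default, so the padding goes on the right.
lines = [f"{text:{15}} {pos:{10}} {tag}" for text, pos, tag in rows]

for line in lines:
    print(line)
```

A token longer than the requested width is not truncated; it simply pushes the later columns to the right. Whitespace tokens such as a newline are printed literally, which is why the SPACE rows above break across lines.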

3. Provide a frequency list of POS tags from the entire document


In [22]:
POS_counts = doc.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{10}} {v}')


84. ADJ        50
85. ADP        123
86. ADV        67
87. AUX        48
89. CCONJ      61
90. DET        118
92. NOUN       171
93. NUM        8
94. PART       29
95. PRON       81
96. PROPN      73
97. PUNCT      174
98. SCONJ      20
100. VERB       136
103. SPACE      99
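
The list above is ordered by attribute ID rather than by count. As a sketch of how the same numbers could be ranked by frequency, the snippet below copies the printed counts into a `collections.Counter`; the `pos_counts` and `ranked` names are illustrative, and the values are taken from the output above rather than recomputed from the text.

```python
from collections import Counter

# Counts copied from the frequency list printed above (tag name -> count).
pos_counts = Counter({
    "ADJ": 50, "ADP": 123, "ADV": 67, "AUX": 48, "CCONJ": 61,
    "DET": 118, "NOUN": 171, "NUM": 8, "PART": 29, "PRON": 81,
    "PROPN": 73, "PUNCT": 174, "SCONJ": 20, "VERB": 136, "SPACE": 99,
})

# most_common() returns (tag, count) pairs ordered from highest to lowest count.
ranked = pos_counts.most_common()

for tag, count in ranked:
    print(f"{tag:{10}} {count}")
```

In the notebook itself, an equivalent dictionary could presumably be built from `POS_counts` by mapping each key through `doc.vocab[k].text`, as the cell above already does when printing.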

4. CHALLENGE: What percentage of tokens are nouns?
HINT: the attribute ID for 'NOUN' is 92 (see the frequency list above)


In [27]:
total_tokens = len(doc)  # the length of a Doc is its token count
noun_tokens = len([token for token in doc if token.pos_ == 'NOUN'])


(noun_tokens / total_tokens) * 100


Out[27]:
13.593004769475359
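
The result can be cross-checked against the frequency list from problem 3. The sketch below uses the counts printed there (copied by hand, not recomputed) and also shows how the figure shifts if whitespace tokens are left out of the denominator. Whether SPACE tokens should count toward the total is a judgment call that the assessment does not settle, so the second figure is offered only for comparison.

```python
# Counts copied from the frequency list printed in problem 3.
pos_counts = {
    "ADJ": 50, "ADP": 123, "ADV": 67, "AUX": 48, "CCONJ": 61,
    "DET": 118, "NOUN": 171, "NUM": 8, "PART": 29, "PRON": 81,
    "PROPN": 73, "PUNCT": 174, "SCONJ": 20, "VERB": 136, "SPACE": 99,
}

total_tokens = sum(pos_counts.values())
noun_tokens = pos_counts["NOUN"]

# Share of nouns among all tokens, whitespace tokens included.
pct_all = noun_tokens / total_tokens * 100

# Share of nouns when whitespace (SPACE) tokens are excluded from the total.
pct_without_space = noun_tokens / (total_tokens - pos_counts["SPACE"]) * 100

print(f"{pct_all:.2f}% of all tokens are nouns")
print(f"{pct_without_space:.2f}% of non-whitespace tokens are nouns")
```

The first figure agrees with the value in Out[27], which confirms that the per-tag counts sum to the document's token count.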

5. Display the Dependency Parse for the third sentence


In [32]:
displacy.render(list(doc.sents)[3],style='dep', jupyter=True, options={'distance':50})


[displaCy dependency parse of: "They lived with their Mother in a sand-bank, underneath the root of a very big fir-tree." Arc labels shown: nsubj, prep, poss, pobj, det, compound, advmod, amod, punct.]

6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit

In [34]:
for ent in doc.ents[:2]:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))


Peter Rabbit - PERSON - People, including fictional
Beatrix Potter - PERSON - People, including fictional

7. How many sentences are contained in The Tale of Peter Rabbit?


In [35]:
len([s for s in doc.sents])


Out[35]:
68

8. CHALLENGE: How many sentences contain named entities?


In [36]:
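# Re-parse each sentence as its own Doc so entities can be checked per sentence
list_of_sents = [nlp(sent.text) for sent in doc.sents]
# Keep only the sentence Docs that contain at least one named entity
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]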
len(list_of_ners)


Out[36]:
40

9. CHALLENGE: Display the named entity visualization for list_of_sents[0] from the previous problem


In [37]:
displacy.render(list_of_sents[0], style='ent', jupyter=True)


The Tale of [Peter Rabbit PERSON], by [Beatrix Potter PERSON] ([1902 DATE]).

Great Job!