In [1]:
# https://spacy.io/usage/linguistic-features
# Importing spacy
import spacy

In [2]:
# Loading the English language model ('en' is the spaCy v2 shortcut link; newer versions use the full name 'en_core_web_sm')
nlp = spacy.load('en')

In [3]:
# Noun Chunks

doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')

# Text: The original noun chunk text.
# Root text: The original text of the word connecting the noun chunk to the rest of the parse.
# Root dep: Dependency relation connecting the root to its head.
# Root head text: The text of the root token's head.

for chunk in doc.noun_chunks:
    # Printing the attributes of chunk
    print('Text -', chunk.text, '|', 'Root Text -', chunk.root.text, '|',
          'Root Dep -', chunk.root.dep_, '|', 'Root Head Text -', chunk.root.head.text)
    # Printing the explanation of root dep of chunk
    print('Dep Explanation -', chunk.root.dep_ ,'-', spacy.explain(chunk.root.dep_), '\n')


Text - Autonomous cars | Root Text - cars | Root Dep - nsubj | Root Head Text - shift
Dep Explanation - nsubj - nominal subject 

Text - insurance liability | Root Text - liability | Root Dep - dobj | Root Head Text - shift
Dep Explanation - dobj - direct object 

Text - manufacturers | Root Text - manufacturers | Root Dep - pobj | Root Head Text - toward
Dep Explanation - pobj - object of preposition 


In [4]:
# Navigating Parse Tree. More here - https://spacy.io/usage/linguistic-features#navigating
# spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree.

# Text: The original token text.
# Dep: The syntactic relation connecting child to head.
# Head text: The original text of the token head.
# Head POS: The part-of-speech tag of the token head.
# Children: The immediate syntactic dependents of the token.
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for token in doc:
    print('Text -', token.text, '|', 'Dep - ', token.dep_, '|', 'Head Text -', token.head.text
          , '|', 'Head POS -', token.head.pos_, '|', 'Children - ', [child for child in token.children])


Text - Autonomous | Dep -  amod | Head Text - cars | Head POS - NOUN | Children -  []
Text - cars | Dep -  nsubj | Head Text - shift | Head POS - VERB | Children -  [Autonomous]
Text - shift | Dep -  ROOT | Head Text - shift | Head POS - VERB | Children -  [cars, liability, toward]
Text - insurance | Dep -  compound | Head Text - liability | Head POS - NOUN | Children -  []
Text - liability | Dep -  dobj | Head Text - shift | Head POS - VERB | Children -  [insurance]
Text - toward | Dep -  prep | Head Text - shift | Head POS - VERB | Children -  [manufacturers]
Text - manufacturers | Dep -  pobj | Head Text - toward | Head POS - ADP | Children -  []

In [5]:
# Visualizing the parse with displacy. The dependency structure is a tree (a special kind of graph):
# every token has exactly one head.

# Importing displacy from spacy
from spacy import displacy

# Note: the rendered parse may differ from the one shown in the documentation,
# since parses depend on the model and its version.

# Rendering with support for jupyter
displacy.render(doc, jupyter=True)


[displacy rendering of the dependency tree: Autonomous/ADJ cars/NOUN shift/VERB insurance/NOUN liability/NOUN toward/ADP manufacturers/NOUN, with arcs amod, nsubj, compound, dobj, prep, pobj]

Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest: iterate from below, starting at the child and checking its head.


In [6]:
# Importing the necessary symbols
from spacy.symbols import nsubj, VERB

# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
# Printing the verbs
print(verbs)


{shift}

More on iterating over the local tree

Children that occur before and after the token:

  • Token.lefts, Token.rights
  • Token.n_lefts, Token.n_rights

Get a whole phrase by its syntactic head with Token.subtree; walk upward with:

  • Token.ancestors, Token.is_ancestor()

In [7]:
# Creating a new document from the sentence
doc = nlp(u'Credit and mortgage account holders must submit their requests')
# Finding out the root token
root = [token for token in doc if token.head == token][0]

print('Root word token text - ', root.text)

# An alternative: scan for the token whose dependency label is 'ROOT'

"""
for token in doc:
    # Instead of comparing against the string 'ROOT', the label can be compared
    # numerically: token.dep == doc.vocab.strings['ROOT']
    if token.dep_ == 'ROOT':
        root = token  # keep the Token itself, not just its label
        break
"""
# Getting the subject: the nominal subject sits among the root's left children;
# in this sentence it is the first (and only) left child.
# subject is also a 'Token' instance
subject = list(root.lefts)[0]
print('Subject word token text -', subject.text, '\n')

# Iterating over the subtree
print('Descendant Text | Dependency | No of Lefts | No of Rights | All Ancestors')
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, '|', descendant.dep_, '|', descendant.n_lefts, '|', descendant.n_rights,
          '|', [ancestor.text for ancestor in descendant.ancestors])


Root word token text -  submit
Subject word token text - holders 

Descendant Text | Dependency | No of Lefts | No of Rights | All Ancestors
Credit | nmod | 0 | 2 | ['holders', 'submit']
and | cc | 0 | 0 | ['Credit', 'holders', 'submit']
mortgage | compound | 0 | 0 | ['account', 'Credit', 'holders', 'submit']
account | conj | 1 | 0 | ['Credit', 'holders', 'submit']
holders | nsubj | 1 | 0 | ['submit']

Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree, so if you use it as the end point of a slice, don't forget to +1!


In [8]:
# Creating new document from sentence
doc = nlp(u'Credit and mortgage account holders must submit their request')

# Left and right edges of a token's subtree, and their indices within the doc; note the .i of right_edge
print('left edge(LE) - ', doc[4].left_edge, ' LE Index -', doc[4].left_edge.i)
print('Right edge(RE) - ', doc[4].right_edge, ' RE Index -', doc[4].right_edge.i, '\n')

span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]

# Merging the span into a single token; this modifies the doc in place
span.merge()
print('Text | POS Tag | Dep | Head Text')
for token in doc:
    print(token.text, '|', token.pos_, '|', token.dep_, '|', token.head.text)


left edge(LE) -  Credit  LE Index - 0
Right edge(RE) -  holders  RE Index - 4 

Text | POS Tag | Dep | Head Text
Credit and mortgage account holders | NOUN | nsubj | submit
must | VERB | aux | submit
submit | VERB | ROOT | submit
their | ADJ | poss | request
request | NOUN | dobj | submit

Disabling Parser

In the default models, the parser is loaded and enabled as part of the standard processing pipeline. We can disable it on demand, and doing so makes spaCy load and run much faster.

We can disable the parser altogether while loading the model, or disable it only for a specific document.


In [9]:
en_model_disabled_parser_nlp = spacy.load('en', disable=['parser'])

doc1 = en_model_disabled_parser_nlp(u"This sentence shouldn't be parsed")

# Note: "shouldn't" still comes out as two tokens ("should" + "n't") because the
# tokenizer splits contractions; disabling the parser does not disable tokenization.
for token in doc1:
    print(token)

doc2 = en_model_disabled_parser_nlp(u'Same like above sentence.')

for token in doc2:
    print(token)

normal_model = spacy.load('en')

# As above: "don't" is split into "do" + "n't" by the tokenizer, independently of the parser
doc = normal_model(u"I don't want this to be parsed", disable=['parser'])

for token in doc:
    print(token)


This
sentence
should
n't
be
parsed
Same
like
above
sentence
.
I
do
n't
want
this
to
be
parsed

In [ ]:
# Continued to Linguistic_features_entity_recognition.ipynb