In [1]:
# https://spacy.io/usage/linguistic-features
# Importing spacy
import spacy
In [2]:
# Loading language model
nlp = spacy.load('en')
In [3]:
# Noun Chunks
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
# Text: The original noun chunk text.
# Root text: The original text of the word connecting the noun chunk to the rest of the parse.
# Root dep: Dependency relation connecting the root to its head.
# Root head text: The text of the root token's head.
for chunk in doc.noun_chunks:
    # Printing the attributes of the chunk
    print('Text -', chunk.text, '|', 'Root Text -', chunk.root.text, '|',
          'Root Dep -', chunk.root.dep_, '|', 'Root Head Text -', chunk.root.head.text)
    # Printing the explanation of the chunk's root dependency
    print('Dep Explanation -', chunk.root.dep_, '-', spacy.explain(chunk.root.dep_), '\n')
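These attributes make it easy to filter chunks by their grammatical role. A minimal sketch, assuming the parser assigns the 'nsubj' label to the subject chunk as in the output above:
In [ ]:
# A hedged sketch: pick out the noun chunk acting as the nominal subject
subject_chunks = [chunk for chunk in doc.noun_chunks if chunk.root.dep_ == 'nsubj']
print(subject_chunks)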
In [4]:
# Navigating Parse Tree. More here - https://spacy.io/usage/linguistic-features#navigating
# spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree.
# Text: The original token text.
# Dep: The syntactic relation connecting child to head.
# Head text: The original text of the token head.
# Head POS: The part-of-speech tag of the token head.
# Children: The immediate syntactic dependents of the token.
doc = nlp(u'Autonomous cars shift insurance liability toward manufacturers')
for token in doc:
    print('Text -', token.text, '|', 'Dep -', token.dep_, '|', 'Head Text -', token.head.text,
          '|', 'Head POS -', token.head.pos_, '|', 'Children -', [child for child in token.children])
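Since every token records its head, the whole tree can be rebuilt from these attributes alone. A minimal sketch (the dict layout here is our own choice for illustration, not a spaCy API):
In [ ]:
# Collecting each head's immediate dependents into a plain dict
tree = {}
for token in doc:
    if token.head is not token:  # skip the root, whose head is itself
        tree.setdefault(token.head.text, []).append(token.text)
print(tree)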
In [5]:
# Visualizing the dependency parse (it is a tree: every token has exactly one head)
# Importing displacy from spacy
from spacy import displacy
# TODO: Why does the rendered graph differ from the one shown in the documentation? (possibly a model/version difference)
# Rendering with support for jupyter
displacy.render(doc, jupyter=True)
Because the syntactic relations form a tree, every word has exactly one head. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of interest — from below:
In [6]:
# Importing the necessary symbols
from spacy.symbols import nsubj, VERB
# Finding a verb with a subject from below — good
verbs = set()
for possible_subject in doc:
    if possible_subject.dep == nsubj and possible_subject.head.pos == VERB:
        verbs.add(possible_subject.head)
# Printing the verbs
print(verbs)
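For contrast, the documentation also shows the same search "from above", which needs a nested loop over each verb's children and is therefore less direct:
In [ ]:
# Finding a verb with a subject from above (less good)
verbs = []
for possible_verb in doc:
    if possible_verb.pos == VERB:
        for possible_subject in possible_verb.children:
            if possible_subject.dep == nsubj:
                verbs.append(possible_verb)
                break
print(verbs)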
More on iterating over the local tree:
- Children that occur before and after the token (Token.lefts and Token.rights; see the sketch below)
- Getting a whole phrase by its syntactic head using Token.subtree (demonstrated in the next cell)
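A minimal sketch of the lefts/rights attributes, using the example sentence from the spaCy documentation; the expected outputs in the comments assume the standard English model's parse:
In [ ]:
# Children that occur before (lefts) and after (rights) a token
doc_lr = nlp(u'bright red apples on the tree')
# doc_lr[2] is 'apples'
print([token.text for token in doc_lr[2].lefts])   # expected: ['bright', 'red']
print([token.text for token in doc_lr[2].rights])  # expected: ['on']
print(doc_lr[2].n_lefts, doc_lr[2].n_rights)       # expected: 2 1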
In [7]:
# Creating a new document from the sentence
doc = nlp(u'Credit and mortgage account holders must submit their requests')
# Finding out the root token
root = [token for token in doc if token.head == token][0]
print('Root word token text - ', root.text)
# An alternative way:
"""
for token in doc:
    # TODO: Rather than comparing against the literal string 'ROOT', can we
    # get the label from spacy.symbols?
    if token.dep_ == 'ROOT':
        root = token
        break
"""
# Getting the subject of the doc; the subject sits to the left of the ROOT,
# and here it is the root's first left child
# The subject is also a Token instance
subject = list(root.lefts)[0]
print('Subject word token text -', subject.text, '\n')
# Iterating over the subtree
print('Descendant Text | Dependency | No of Lefts | No of Rights | All Ancestors')
for descendant in subject.subtree:
    assert subject is descendant or subject.is_ancestor(descendant)
    print(descendant.text, '|', descendant.dep_, '|', descendant.n_lefts, '|', descendant.n_rights,
          '|', [ancestor.text for ancestor in descendant.ancestors])
Finally, the .left_edge and .right_edge attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a Span object for a syntactic phrase. Note that .right_edge gives a token within the subtree, so if you use it as the end-point of a range (e.g. a Python slice), don't forget to +1!
In [8]:
# Creating new document from sentence
doc = nlp(u'Credit and mortgage account holders must submit their request')
# Left and right edges, and the indexes of tokens in the doc; observe the .i of right_edge
print('left edge(LE) - ', doc[4].left_edge, ' LE Index -', doc[4].left_edge.i)
print('Right edge(RE) - ', doc[4].right_edge, ' RE Index -', doc[4].right_edge.i, '\n')
span = doc[doc[4].left_edge.i : doc[4].right_edge.i+1]
# Merging the span into a single token (modifies the doc in place)
span.merge()
print('Text | POS Tag | Dep | Head Text')
for token in doc:
    print(token.text, '|', token.pos_, '|', token.dep_, '|', token.head.text)
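Note that Span.merge() is deprecated in newer spaCy releases in favour of the Doc.retokenize context manager. A minimal sketch of the equivalent merge, assuming spaCy >= 2.1:
In [ ]:
# Equivalent merge with the retokenizer (spaCy >= 2.1)
doc = nlp(u'Credit and mortgage account holders must submit their request')
span = doc[doc[4].left_edge.i : doc[4].right_edge.i + 1]
with doc.retokenize() as retokenizer:
    retokenizer.merge(span)
print([token.text for token in doc])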
In the default models, the parser is loaded and enabled as part of the standard processing pipeline. We can disable it on demand; disabling the parser makes spaCy load and run much faster.
We can disable it entirely while loading the model, or disable it for a specific document.
In [9]:
en_model_disabled_parser_nlp = spacy.load('en', disable=['parser'])
doc1 = en_model_disabled_parser_nlp(u"This sentence shouldn't be parsed")
# Note: "shouldn't" is still split into "should" and "n't" because that is
# tokenization, which runs independently of the parser
for token in doc1:
    print(token)
doc2 = en_model_disabled_parser_nlp(u'Same like above sentence.')
for token in doc2:
    print(token)
normal_model = spacy.load('en')
# As above, "don't" is still split into "do" and "n't" by the tokenizer,
# even though the parser is disabled for this document
doc = normal_model(u"I don't want this to be parsed", disable=['parser'])
for token in doc:
    print(token)
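A third option is to disable the parser temporarily with the nlp.disable_pipes context manager (available in spaCy 2.x); a minimal sketch:
In [ ]:
# Temporarily disabling the parser for a block of code
with normal_model.disable_pipes('parser'):
    doc = normal_model(u'This sentence is tokenized but not parsed')
    print(doc.is_parsed)  # expected: False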
In [ ]:
# Continued in Linguistic_features_entity_recognition.ipynb