The default model identifies a variety of named and numeric entities, including companies, locations, organizations and products.
You can add arbitrary classes to the entity recognition system, and update the model with new examples.
Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
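As a rough illustration of adding a new class and updating the model with a few examples, here is a minimal sketch using the spaCy 2.x training API; the "GADGET" label, the two training sentences, and their character offsets are invented purely for demonstration, and nlp.resume_training() assumes spaCy 2.1 or later.
In [ ]:
# Sketch only: add a made-up 'GADGET' label and update the NER with two invented examples
import random
import spacy
nlp = spacy.load('en')
ner = nlp.get_pipe('ner')
ner.add_label(u'GADGET')
TRAIN_DATA = [
    (u'I just bought a new iPhone', {'entities': [(20, 26, u'GADGET')]}),
    (u'The iPhone X has a great camera', {'entities': [(4, 12, u'GADGET')]}),
]
optimizer = nlp.resume_training()  # keeps the existing weights (spaCy >= 2.1)
for _ in range(10):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, drop=0.3)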
In [47]:
# Importing spacy
import spacy
In [48]:
# Loading language model
nlp = spacy.load('en')
In [49]:
# Creating the document from the sentence
doc = nlp(u'Netflix is hiring a new VP of global policy')
# TODO: From the documentation - https://spacy.io/usage/linguistic-features#setting-entities
# the example there says the model didn't recognize "Netflix" as an entity, but here the model does recognize some entities
# TODO: If we load the large model ("en_core_web_lg"), does it recognize these correctly?
for e in doc.ents:
    print('text - ', e.text, ' | label - ', e.label_)
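To follow up on the TODO above, one way to check is to run the same sentence through the larger English model and compare the results; a rough sketch, assuming "en_core_web_lg" has been downloaded (python -m spacy download en_core_web_lg):
In [ ]:
# Sketch: compare the entities predicted by the large English model (assumes it is installed)
nlp_lg = spacy.load('en_core_web_lg')
doc_lg = nlp_lg(u'Netflix is hiring a new VP of global policy')
for e in doc_lg.ents:
    print('text - ', e.text, ' | label - ', e.label_)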
In [50]:
from spacy import displacy
# Observations
# 1. Netflix is identified as a PERSON entity, but it should be an ORG (a company)
# 2. VP is identified as ORG, but it should be a PERSON
# Now let's override these entities and assert the corrections below
displacy.render(doc, style='ent', jupyter=True)
In [51]:
# Importing Span from spacy
from spacy.tokens import Span
doc = nlp(u'Netflix is hiring a new VP of global policy')
# get hash values of entity labels
ORG = doc.vocab.strings[u'ORG']
PERSON = doc.vocab.strings[u'PERSON']
# creating Span(s) for the new entities
# Note: Span takes the start and end token indices, not character offsets within the document.
netflix_ent = Span(doc, 0, 1, label=ORG) # 'Netflix' is the first token, so the span is 0:1
vp_ent = Span(doc, 5, 6, label=PERSON) # 'VP' is the sixth token (index 5), so the span is 5:6
# Overriding original entities
doc.ents = [netflix_ent, vp_ent]
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Modified Entities - ', ents)
assert ents == [(u'Netflix', 0, 7, u'ORG'), (u'VP', 24, 26, u'PERSON')] # assertion should pass
# Now let's visualize the new entities
# You can see that "Netflix" is now labelled ORG and "VP" is labelled PERSON
displacy.render(doc, style='ent', jupyter=True)
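As a side note, newer spaCy releases (2.1 and later) also accept the label as a plain string when constructing a Span, which avoids the hash lookup above; a small sketch, assuming such a version:
In [ ]:
# Sketch (spaCy >= 2.1): the Span label can be passed as a string instead of a hash value
doc2 = nlp(u'Netflix is hiring a new VP of global policy')
doc2.ents = [Span(doc2, 0, 1, label=u'ORG'), Span(doc2, 5, 6, label=u'PERSON')]
print([(e.text, e.label_) for e in doc2.ents])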
In [52]:
# Setting entity annotations from array
import numpy
from spacy.attrs import ENT_IOB, ENT_TYPE
# make_doc only tokenizes; it does not run the pipeline, so no entities are tagged
doc = nlp.make_doc(u'London is a big city in the United Kingdom.')
# Entities are empty
assert list(doc.ents) == []
# Columns of the attribute array: entity IOB code and entity type
header = [ENT_IOB, ENT_TYPE]
# Initializing the array with zeros; dtype uint64 so ENT_TYPE can hold the label's hash value
attr_array = numpy.zeros((len(doc), len(header)), dtype='uint64')
# ENT_IOB codes: 0 = missing, 1 = I (inside), 2 = O (outside), 3 = B (beginning)
attr_array[0, 0] = 3 # B - 'London' begins an entity
attr_array[0, 1] = doc.vocab.strings[u'GPE'] # entity type is GPE (geopolitical entity)
doc.from_array(header, attr_array)
assert list(doc.ents)[0].text == u'London'
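To see exactly what the array set, the per-token IOB codes and entity types can be inspected:
In [ ]:
# Inspect per-token entity annotations: 'B' marks the beginning of an entity;
# the remaining tokens stay unset ('') here because the array was zero-initialized
for t in doc:
    print(t.text, '|', t.ent_iob_, '|', t.ent_type_)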
The built-in entity types are listed in the spaCy documentation.
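A quick way to check what a built-in label means is spacy.explain:
In [ ]:
# spacy.explain returns a short description for a built-in label
print(spacy.explain(u'GPE'))     # Countries, cities, states
print(spacy.explain(u'ORG'))     # Companies, agencies, institutions, etc.
print(spacy.explain(u'PERSON'))  # People, including fictional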
In [53]:
# Adding special case Tokenization rules
# https://spacy.io/usage/linguistic-features#special-cases
# Importing necessary symbols from spacy
from spacy.symbols import ORTH, LEMMA, POS, TAG
# New text we want to tokenize
doc = nlp(u'gimme that')
assert [w.text for w in doc] == [u'gimme', u'that'] # current tokenization has only 2 tokens
# add special case rule
special_case = [{ORTH: u'gim', LEMMA: u'give', POS: u'VERB'}, {ORTH: u'me'}]
# Adding the special case to the tokenizer; it takes effect for texts processed afterwards
nlp.tokenizer.add_special_case(u'gimme', special_case)
assert [w.text for w in nlp(u'gimme that')] == [u'gim', u'me', u'that'] # after adding the special case we get 3 tokens
# Pronoun lemma is returned as -PRON-!
assert [w.lemma_ for w in nlp(u'gimme that')] == [u'give', u'-PRON-', u'that']
In [55]:
# The special case doesn't have to match an entire whitespace-delimited substring.
# The tokenizer will incrementally split off punctuation, and keep looking up the remaining substring
# 'gimme!' is split into 3 tokens: gim, me, ! with lemmas give, -PRON-, !
assert 'gimme' not in [w.text for w in nlp(u'gimme!')] # gimme should not be there in token texts
assert [w.lemma_ for w in nlp(u'gimme!')] == [u'give', u'-PRON-', u'!'] # lemmas should be give, -PRON-, !
# The special case still applies when the string is surrounded by periods and other punctuation
assert 'gimme' not in [w.text for w in nlp(u'("...gimme...?")')]
# Asserting lemmas for '...gimme...?'
assert [w.lemma_ for w in nlp(u'...gimme...?')] == [u'...', u'give', u'-PRON-', u'...', u'?']
# Adding another special case that matches the whole "...gimme...?" as a single token
special_case = [{ORTH: u'...gimme...?', LEMMA: u'give', TAG: u'VB'}]
nlp.tokenizer.add_special_case(u'...gimme...?', special_case)
# the length of tokens should be one
assert len(nlp(u'...gimme...?')) == 1
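For debugging how the tokenizer arrived at a particular split (special cases plus the incremental punctuation handling shown above), newer spaCy versions (2.3 and later) expose nlp.tokenizer.explain; a small sketch, assuming such a version is installed:
In [ ]:
# Sketch (spaCy >= 2.3): show which tokenizer rule or pattern produced each token
for rule, token_text in nlp.tokenizer.explain(u'("...gimme...?")'):
    print(rule, '->', token_text)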