In [1]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
In [2]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')
In [3]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')
show_ents(doc)
Here we see tokens combine to form the entities *Washington, DC*, *next May* and *the Washington Monument*.

`Doc.ents` are token spans with their own set of annotations.
ATTRIBUTE | DESCRIPTION |
---|---|
`ent.text` | The original entity text |
`ent.label` | The entity type's hash value |
`ent.label_` | The entity type's string description |
`ent.start` | The token span's *start* index position in the Doc |
`ent.end` | The token span's *stop* index position (exclusive) in the Doc |
`ent.start_char` | The entity text's *start* character offset in the Doc's text |
`ent.end_char` | The entity text's *stop* character offset in the Doc's text |
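The difference between token indices and character offsets can be sketched in plain Python. The offsets below are hand-counted, illustrative values for the sentence used in the next cell; in practice spaCy supplies them via `ent.start_char` and `ent.end_char`:

```python
# ent.start/ent.end count tokens, while ent.start_char/ent.end_char
# are character offsets into doc.text, usable for ordinary slicing:
text = 'Can I please borrow 500 dollars from you to buy some Microsoft stock?'
start_char, end_char = 20, 31  # hand-counted offsets for '500 dollars'
print(text[start_char:end_char])
```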
In [4]:
doc = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')
for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)
Tags are accessible through the `.label_` property of an entity.
TYPE | DESCRIPTION | EXAMPLE |
---|---|---|
`PERSON` | People, including fictional. | *Fred Flintstone* |
`NORP` | Nationalities or religious or political groups. | *The Republican Party* |
`FAC` | Buildings, airports, highways, bridges, etc. | *Logan International Airport, The Golden Gate* |
`ORG` | Companies, agencies, institutions, etc. | *Microsoft, FBI, MIT* |
`GPE` | Countries, cities, states. | *France, UAR, Chicago, Idaho* |
`LOC` | Non-GPE locations, mountain ranges, bodies of water. | *Europe, Nile River, Midwest* |
`PRODUCT` | Objects, vehicles, foods, etc. (Not services.) | *Formula 1* |
`EVENT` | Named hurricanes, battles, wars, sports events, etc. | *Olympic Games* |
`WORK_OF_ART` | Titles of books, songs, etc. | *The Mona Lisa* |
`LAW` | Named documents made into laws. | *Roe v. Wade* |
`LANGUAGE` | Any named language. | *English* |
`DATE` | Absolute or relative dates or periods. | *20 July 1969* |
`TIME` | Times smaller than a day. | *Four hours* |
`PERCENT` | Percentage, including "%". | *Eighty percent* |
`MONEY` | Monetary values, including unit. | *Twenty Cents* |
`QUANTITY` | Measurements, as of weight or distance. | *Several kilometers, 55kg* |
`ORDINAL` | "first", "second", etc. | *9th, Ninth* |
`CARDINAL` | Numerals that do not fall under another type. | *2, Two, Fifty-two* |
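Since every entity carries one label from this table, a quick frequency count of labels is often a useful first look at a document. A minimal sketch using a hypothetical list of labels; on a real Doc you would build the list with `[ent.label_ for ent in doc.ents]`:

```python
from collections import Counter

# Hypothetical label list; in practice: [ent.label_ for ent in doc.ents]
labels = ['MONEY', 'ORG', 'MONEY', 'GPE']
print(Counter(labels))
```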
In [5]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')
show_ents(doc)
Right now, spaCy does not recognize "Tesla" as a company.
In [6]:
from spacy.tokens import Span
# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']
# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)
# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]
In the code above, the arguments passed to `Span()` are:

- `doc` - the name of the Doc object
- `0` - the *start* index position of the span
- `1` - the *stop* index position (exclusive)
- `label=ORG` - the label assigned to our entity
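Note that the start/stop positions follow Python slice semantics: start is inclusive, stop is exclusive, so `Span(doc, 0, 1)` covers exactly the first token. A plain-list sketch of the same rule:

```python
# Start is inclusive, stop is exclusive - the same rule as list slicing:
tokens = ['Tesla', 'to', 'build', 'a', 'U.K.', 'factory']
print(tokens[0:1])
```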
In [7]:
show_ents(doc)
In [8]:
doc = nlp(u'Our company plans to introduce a new vacuum cleaner. '
u'If successful, the vacuum cleaner will be our first product.')
show_ents(doc)
In [9]:
# Import PhraseMatcher and create a matcher object:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
In [10]:
# Create the desired phrase patterns:
phrase_list = ['vacuum cleaner', 'vacuum-cleaner']
phrase_patterns = [nlp(text) for text in phrase_list]
In [11]:
# Apply the patterns to our matcher object:
matcher.add('newproduct', None, *phrase_patterns)
# Apply the matcher to our Doc object:
matches = matcher(doc)
# See what matches occur:
matches
Out[11]:
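Each match is a `(match_id, start, end)` tuple, where `match_id` is the hash of the pattern name and `start`/`end` are token index positions into the Doc. A plain-Python sketch of unpacking them (the hash and positions below are hypothetical values, not actual matcher output):

```python
# Hypothetical matcher output: (match_id, start, end) tuples
matches = [(2689272359382549672, 7, 9), (2689272359382549672, 14, 16)]
for match_id, start, end in matches:
    print(start, end)
```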
In [12]:
# Here we create Spans from each match, and create named entities from them:
from spacy.tokens import Span
PROD = doc.vocab.strings[u'PRODUCT']
new_ents = [Span(doc, match[1], match[2], label=PROD) for match in matches]
doc.ents = list(doc.ents) + new_ents
In [13]:
show_ents(doc)
In [14]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
show_ents(doc)
In [15]:
len([ent for ent in doc.ents if ent.label_=='MONEY'])
Out[15]:
In [16]:
spacy.__version__
Out[16]:
In [17]:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
In [18]:
# Quick function to remove ents formed on whitespace:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc
# Insert this into the pipeline AFTER the ner component:
nlp.add_pipe(remove_whitespace_entities, after='ner')
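The filter hinges on `str.isspace()`: an entity whose text is pure whitespace (like a stray `'\n'` token tagged by the model) is dropped. A quick plain-Python check with illustrative sample texts:

```python
# Entity texts that are pure whitespace are removed; everything else kept:
texts = ['$29.50', '\n', 'five dollars']
kept = [t for t in texts if not t.isspace()]
print(kept)
```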
In [19]:
# Rerun nlp on the text above, and show ents:
doc = nlp(u'Originally priced at $29.50,\nthe sweater was marked down to five dollars.')
show_ents(doc)
For more on Named Entity Recognition visit https://spacy.io/usage/linguistic-features#101
`Doc.noun_chunks` are base noun phrases: token spans that include the noun and the words describing the noun. Noun chunks cannot be nested, cannot overlap, and do not involve prepositional phrases or relative clauses.

Where `Doc.ents` rely on the **ner** pipeline component, `Doc.noun_chunks` are provided by the **parser**.
In [20]:
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for chunk in doc.noun_chunks:
    print(chunk.text+' - '+chunk.root.text+' - '+chunk.root.dep_+' - '+chunk.root.head.text)
In [21]:
# noun_chunks is a generator, so calling len() on it directly raises a TypeError:
len(doc.noun_chunks)
In [22]:
len(list(doc.noun_chunks))
Out[22]:
For more on noun_chunks visit https://spacy.io/usage/linguistic-features#noun-chunks