spaCy offers a rule-matching tool called `Matcher` that allows you to build a library of token patterns, then match those patterns against a `Doc` object to return a list of found matches. You can match on any part of the token, including text and annotations, and you can add multiple patterns to the same matcher.
In [1]:
# Perform standard imports
import spacy
nlp = spacy.load('en_core_web_sm')
In [2]:
# Import the Matcher class
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
Here `matcher` is an object tied to the current `Vocab` object. We can add and remove named sets of patterns to and from `matcher` as needed.
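As a quick aside (purely illustrative), `matcher` supports `len()` and the `in` operator, so we can inspect which named rule sets it currently holds:

# Illustrative check of the matcher's current state:
print(len(matcher))              # number of named rule sets added so far (0 at this point)
print('SolarPower' in matcher)   # False until we add the 'SolarPower' patterns below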
In [3]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)
Let's break this down:

* `pattern1` looks for a single token whose lowercase text reads 'solarpower'
* `pattern2` looks for two adjacent tokens that read 'solar' and 'power' in that order
* `pattern3` looks for three adjacent tokens, with a middle token that can be any punctuation\*

\*Remember that single spaces are not tokenized, so they don't count as punctuation.

Once we define our patterns, we pass them into `matcher` with the name 'SolarPower', and set callbacks to `None` (more on callbacks later).
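As a quick preview, a callback (here passed as `None`) is just a function that spaCy calls with the matcher, the doc, the position of the current match, and the full list of matches. A minimal sketch, with a purely illustrative function name:

def show_match(matcher, doc, i, matches):
    # Runs once per match; i indexes into the matches list
    match_id, start, end = matches[i]
    print('Matched:', doc[start:end].text)

# Hypothetical registration with a callback instead of None:
# matcher.add('SolarPower', show_match, pattern1, pattern2, pattern3)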
In [4]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')
In [5]:
found_matches = matcher(doc)
print(found_matches)
`matcher` returns a list of tuples. Each tuple contains an ID for the match, along with start and end token positions that map to the span `doc[start:end]`.
In [6]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text)
The `match_id` is simply the hash value of the string ID 'SolarPower'.
In [7]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)
In [8]:
found_matches = matcher(doc)
print(found_matches)
This found both two-word patterns, with and without the hyphen!
The following quantifiers can be passed to the `'OP'` key:

OP | Description |
---|---|
`!` | Negate the pattern, by requiring it to match exactly 0 times |
`?` | Make the pattern optional, by allowing it to match 0 or 1 times |
`+` | Require the pattern to match 1 or more times |
`*` | Allow the pattern to match zero or more times |
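As an illustration (not used below), swapping `'*'` for `'?'` would allow at most one punctuation token between the two words instead of any number of them:

# Hypothetical variant: '?' permits zero or one punctuation token between 'solar' and 'power'
pattern_opt = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '?'}, {'LOWER': 'power'}]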
In [9]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}] # note: LEMMA instead of LOWER on the last token
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)
In [10]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')
In [11]:
found_matches = matcher(doc2)
print(found_matches)
The matcher found the first occurrence but not the second: the lemmatizer treated the first 'solar-powered' as a verb (lemmatizing 'powered' to 'power'), while it tagged the second as an adjective, whose lemma stays 'powered'. In cases like this it may be better to set explicit token patterns.
In [12]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)
In [13]:
found_matches = matcher(doc2)
print(found_matches)
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
Attribute | Description |
---|---|
`ORTH` | The exact verbatim text of a token |
`LOWER` | The lowercase form of the token text |
`LENGTH` | The length of the token text |
`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits |
`IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase |
`IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word |
`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email |
`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, dependency label, lemma, shape |
`ENT_TYPE` | The token's entity label |
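As a small illustration (not used in the rest of this section), a pattern could combine `LIKE_NUM` and `LOWER` to match expressions like '30 percent' or 'thirty percent':

# Illustrative only: a number-like token followed by the word 'percent'
pattern_pct = [{'LIKE_NUM': True}, {'LOWER': 'percent'}]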
In [14]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')
In [15]:
# Import the PhraseMatcher class
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
For this exercise we're going to import a Wikipedia article on Reaganomics.
Source: https://en.wikipedia.org/wiki/Reaganomics
In [16]:
with open('../TextFiles/reaganomics.txt', encoding='utf8') as f:
    doc3 = nlp(f.read())
In [17]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']
# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
# Build a list of matches:
matches = matcher(doc3)
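As a side note, for a long phrase list the same pattern Docs could be built with `nlp.pipe`, which processes the texts as a stream; this is an optional alternative to the list comprehension above:

# Optional alternative: build the phrase Docs as a stream
phrase_patterns = list(nlp.pipe(phrase_list))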
In [18]:
# (match_id, start, end)
matches
Out[18]:
The first four matches are where these terms are used in the definition of Reaganomics:
In [19]:
doc3[:70]
Out[19]:
In [20]:
doc3[665:685] # Note that the fifth match starts at doc3[673]
Out[20]:
In [21]:
doc3[2975:2995] # The sixth match starts at doc3[2985]
Out[21]:
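Rather than hard-coding slice positions as above, a hypothetical helper can print a fixed window of tokens around every match (the window size of 10 is arbitrary):

# Show each match with up to 10 tokens of context on either side
for match_id, start, end in matches:
    print(doc3[max(start - 10, 0):end + 10].text, '\n')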
Another way to see a match in context is to use the Doc's sentence segmentation: iterate through the sentences until we reach the one that contains the match point:
In [22]:
# Build a list of sentences
sents = [sent for sent in doc3.sents]
# In the next section we'll see that sentences contain start and end token values:
print(sents[0].start, sents[0].end)
In [23]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if matches[4][1] < sent.end:  # this is the fifth match, that starts at doc3[673]
        print(sent)
        break
For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching