One of the things I learned early on about scraping web pages (often referred to as "screen scraping") is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:
The aim of the scrape in these cases might be as simple as pulling a table from the page and representing it as a dataframe, or it might involve reverse engineering the HTML template that converted the data to HTML, so that we can extract each row of data back out of the HTML and into a corresponding data table.
In the latter case, the scrape may proceed in a couple of ways. For example:
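One way is to run the HTML template "in reverse"; here is a minimal sketch using the parse package (the template string and HTML row are made up for illustration):
In [ ]:
from parse import parse

#A made-up HTML template of the sort that might be used to render each database row
row_template = '<tr><td>{name}</td><td>{company}</td><td>{amount}</td></tr>'

#A scraped HTML fragment (also made up) generated from that template
row_html = '<tr><td>J. Smith</td><td>Head of Zeus Publishing</td><td>£13,000</td></tr>'

#Running the template "in reverse" recovers the original data record as a dict
parse(row_template, row_html).named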
In more general cases, however, such as when trying to abstract meaningful information from arbitrary natural language texts, we need to up our game and start to analyse the texts as natural language.
As an example, consider the following text:
From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)
As human readers, we can identify various structural patterns in this text, as well as parse the natural language sentences themselves.
Let's start with some of the structural patterns:
In [1]:
from parse import parse
In [2]:
bigtext = '''\
From February 2016, as an author, payments from Head of Zeus Publishing; \
a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. \
London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment \
of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. \
Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)'''
In [3]:
#Extract the sentence containing the update dates
parse('{}(Updated {updated})', bigtext)['updated']
Out[3]:
In [4]:
#Extract the phrase describing the hours
parse('{}Hours: {hours}.{}', bigtext)['hours']
Out[4]:
There also appear to be some standard, boilerplate sentences, such as Any additional payments are listed below.
Within the text are things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe "things", that is, entities, as well as identifying what sorts of "thing", or entity, they are.
One powerful Python natural language processing package, spacy, has an entity recognition capability. Let's see how to use it and what sort of output it produces:
In [5]:
#Import the spacy package
import spacy
#The package parses language according to different statistically trained models
#Let's load in the basic English model:
nlp = spacy.load('en')
In [6]:
#Generate a version of the text annotated using features detected by the model
doc = nlp(bigtext)
The parsed text is annotated in a variety of ways.
For example, we can directly access all the sentences in the original text:
In [7]:
list(doc.sents)
Out[7]:
In [49]:
ents = list(doc.ents)
entTypes = []
for entity in ents:
    entTypes.append(entity.label_)
    print(entity, '::', entity.label_)
In [9]:
for entType in set(entTypes):
    print(entType, spacy.explain(entType))
We can also look at each of the tokens in the text and identify whether it is part of an entity and, if so, what sort. The .ent_iob_ attribute identifies O as not part of an entity, B as the first token of an entity, and I as a token continuing an entity.
In [65]:
for token in doc[:15]:
    print('::'.join([token.text, token.ent_type_, token.ent_iob_]))
Looking at the extracted entities, we see we get some good hits: Averbrook Ltd. is an ORG; 20 January 2016 and 14 October 2016 are both instances of a DATE.
Some near misses: Zeus Publishing isn't a PERSON, although we might see why it has been recognised as such. (Could we overlay the model with an additional mapping of the form if PERSON and the text ends with one of ['Publishing', 'Holdings'] -> ORG?)
And some things are mis-categorised: 52 Doughty Street isn't really meaningful as a QUANTITY.
Several things we might usefully want to categorise - such as a UK postcode, for example, which might be useful in and of itself, or when helping us to identify an address - are not recognised as entities at all.
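Here is a rough sketch of that overlay idea as a post-processing step; the relabel_person_orgs() helper and its suffix list are made up for illustration, not part of spacy:
In [ ]:
from spacy.tokens import Span

ORG_SUFFIXES = ('Publishing', 'Holdings')  #hypothetical suffix list

def relabel_person_orgs(doc):
    #Re-label PERSON entities that end with a company-style suffix as ORG
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.text.endswith(ORG_SUFFIXES):
            new_ents.append(Span(doc, ent.start, ent.end, label=doc.vocab.strings['ORG']))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

relabel_person_orgs(doc)
print([(e, e.label_) for e in doc.ents])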
Things recognised as dates we might want to then further parse as date object types:
In [10]:
from dateutil import parser as dtparser
[(d, dtparser.parse(d.text)) for d in ents if d.label_ == 'DATE']
Out[10]:
In [11]:
#see also https://github.com/akoumjian/datefinder
#datefinder - Find dates inside text using Python and get back datetime objects
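As a rough sketch of that approach (assuming the datefinder package is installed), find_dates() scans free text and yields datetime objects for anything date-like it finds:
In [ ]:
import datefinder

#Scan a fragment of the text for date-like strings and return them as datetimes
list(datefinder.find_dates('Updated 20 January 2016, 14 October 2016 and 2 March 2018'))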
As well as identifying entities, spacy analyses texts at several other levels. One such level of abstraction is the "shape" of each token. This identifies whether each character is an upper or lower case alphabetic character, a digit, or a punctuation character (which appears as itself):
In [64]:
for token in doc[:15]:
    print(token, '::', token.shape_)
The "shape" of a token provides an additional structural item that we might be able to make use of in scrapers of the raw text.
For example, writing an efficient regular expression to identify a UK postcode can be a difficult task, but we can start to cobble one together from the shapes of different postcodes written in "standard" postcode form:
In [13]:
[pc.shape_ for pc in nlp('MK7 6AA, SW1A 1AA, N7 6BB')]
Out[13]:
We can define a matcher function that will identify the tokens in a document that match a particular ordered combination of shape patterns.
For example, the postcode like things described above have the shapes:
XXd dXX
XXdX dXX
Xd dXX
We can use these structural patterns to identify token pairs as possible postcodes.
In [66]:
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('POSTCODE', None,
            [{'SHAPE': 'XXdX'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'XXd'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'Xd'}, {'SHAPE': 'dXX'}])
Let's test that:
In [15]:
pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons.')
matches = matcher(pcdoc)
#See what we matched, and let's see what entities we have detected
print('Matches: {}\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))
The matcher seems to have matched the postcodes, but is not identifying them as entities. (We also note that the entity matcher has missed the "Sir" title. In some cases, it might also match a postcode as a person.)
To add the matched items to the entity list, we need to add a callback function to the matcher.
In [71]:
##Define a POSTCODE as a new entity type by adding matched postcodes to the doc.ents
#https://stackoverflow.com/a/47799669
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
def add_entity_label(matcher, doc, i, matches):
    #Callback: add the matched span to the doc's entities, typed by the match label
    match_id, start, end = matches[i]
    doc.ents += ((match_id, start, end),)
#Recognise postcodes from different shapes
matcher.add('POSTCODE', add_entity_label, [{'SHAPE': 'XXdX'},{'SHAPE':'dXX'}], [{'SHAPE':'XXd'},{'SHAPE':'dXX'}])
pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and James Smith is presumably a person')
matches = matcher(pcdoc)
print('Matches: {}\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))
Let's put those pieces together more succinctly:
In [52]:
bigtext
Out[52]:
In [53]:
#Generate base tagged doc
doc = nlp(bigtext)
#Run postcode tagger over the doc
_ = matcher(doc)
The tagged document should now include POSTCODE entities. One of the easiest ways to check the effectiveness of a new entity tagger is to check the document with the recognised entities visualised within it. The displacy module has a Jupyter enabled visualiser for doing just that.
In [54]:
from spacy import displacy
displacy.render(doc, jupyter=True, style='ent')
In [67]:
import pandas as pd
mpdata=pd.read_csv('members_mar18.csv')
mpdata.head(5)
Out[67]:
From this, we can extract a list of MP names, albeit in reverse word order.
In [130]:
term_list = mpdata['list_name'].tolist()
term_list[:5]
Out[130]:
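As an aside, if we ever need those names in natural order, a quick sketch of a hypothetical helper for flipping "Surname, Firstname" round might look like this:
In [ ]:
#Hypothetical helper: flip "Surname, Firstname" into "Firstname Surname"
def natural_order(name):
    parts = [p.strip() for p in name.split(',', 1)]
    return ' '.join(reversed(parts)) if len(parts) == 2 else name

natural_order('Abbott, Ms Diane')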
If we wanted to match those names as "MP" entities, we could use the following recipe to add an MP entity type that will be returned if any of the MP names are matched:
In [75]:
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in term_list]
matcher.add('MP', add_entity_label, *patterns)
Let's test that new entity on a test string:
In [74]:
doc = nlp("The MPs were Adams, Nigel, Afolami, Bim and Abbott, Ms Diane.")
matches = matcher(doc)
displacy.render(doc, jupyter=True, style='ent')
In [181]:
import re
#https://stackoverflow.com/a/164994/454773
regex_ukpc = r'([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\s?[0-9][A-Za-z]{2})'
In [198]:
#Based on https://spacy.io/usage/linguistic-features
nlp = spacy.load('en')
doc = nlp("The postcodes were MK1 6AA and W1A 1AA.")
for match in re.finditer(regex_ukpc, doc.text):
    start, end = match.span()  # get matched indices
    entity = doc.char_span(start, end, label='POSTCODE')  # create Span from indices
    doc.ents = list(doc.ents) + [entity]
    entity.merge()
displacy.render(doc, jupyter=True, style='ent')
In [19]:
nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents
Out[19]:
Let's see if we can update the training of the model so that it does recognise the "Sir" title as part of a person's name.
We can do that by creating some new training data and using it to update the model. The entities dict identifies the index values in the training string that delimit the entity we want to extract.
In [20]:
# training data
TRAIN_DATA = [
    ('Received from Sir John Smith last week.', {
        'entities': [(14, 28, 'PERSON')]
    }),
    ('Sir Richard Jones is another person', {
        'entities': [(0, 17, 'PERSON')]
    })
]
In this case, we are going to let spacy learn its own patterns, as a statistical model, that will - if the learning pays off correctly - identify things like "Sir Bimble Bobs" as a PERSON entity.
In [21]:
import random
#model='en' #'en_core_web_sm'
#nlp = spacy.load(model)
cycles = 20
optimizer = nlp.begin_training()
for i in range(cycles):
    random.shuffle(TRAIN_DATA)
    for txt, annotations in TRAIN_DATA:
        nlp.update([txt], [annotations], sgd=optimizer)
In [22]:
nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents
Out[22]:
One of the things that can be a bit fiddly is generating the training strings. We can produce a little utility function that will help us create a training pattern by identifying the index values, within a text string, of a particular substring that we want to label as an example of a particular entity type.
The first thing we need to do is find the index values within a string that show where a particular substring can be found. The Python find() and index() string methods will find the first location of a substring in a string. However, where a substring appears several times in a string, we need a new function to identify all the locations. There are several ways of doing this...
In [23]:
#Find multiple matches using .find()
#https://stackoverflow.com/a/4665027/454773
def _find_all(string, substring):
    #Generator to return index of each string match
    start = 0
    while True:
        start = string.find(substring, start)
        if start == -1: return
        yield start
        start += len(substring)

def find_all(string, substring):
    return list(_find_all(string, substring))

#Find multiple matches using a regular expression
#https://stackoverflow.com/a/4664889/454773
import re

def refind_all(string, substring):
    return [m.start() for m in re.finditer(substring, string)]
In [24]:
txt = 'This is a string.'
substring = 'is'
print( find_all(txt, substring) )
print( refind_all(txt, substring) )
We can use either of these functions to find the location of a substring in a string, and then use these index values to help us create our training data.
In [25]:
def trainingTupleBuilder(string, substring, typ, entities=None):
    ixs = refind_all(string, substring)
    offset = len(substring)
    if entities is None: entities = {'entities': []}
    for ix in ixs:
        entities['entities'].append((ix, ix + offset, typ))
    return (string, entities)

#('Received from Sir John Smith last week.', {'entities': [(14, 28, 'PERSON')]})
trainingTupleBuilder('Received from Sir John Smith last week.', 'Sir John Smith', 'PERSON')
Out[25]:
In [26]:
TRAIN_DATA = []
TRAIN_DATA.append(trainingTupleBuilder("He lives at 27, Oswaldtwistle Way, Birmingham",'27, Oswaldtwistle Way, Birmingham','B-ADDRESS'))
TRAIN_DATA.append(trainingTupleBuilder("Payments from Boondoggle Limited, 377, Hope Street, Little Village, Halifax. Received: October, 2017",'377, Hope Street, Little Village, Halifax','B-ADDRESS'))
TRAIN_DATA
Out[26]:
The B- prefix identifies the entity as a multi-token entity.
In [89]:
#https://spacy.io/usage/training
from pathlib import Path

def spacytrainer(model=None, output_dir=None, n_iter=100, debug=False):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        if isinstance(model, str):
            nlp = spacy.load(model)  # load existing spaCy model
            print("Loaded model '%s'" % model)
        #Else we assume we have passed in an nlp model
        else: nlp = model
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            if debug: print(losses)

    # test the trained model
    if debug:
        for text, _ in TRAIN_DATA:
            doc = nlp(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    return nlp
Let's update the en model to include a really crude address parser based on the two lines of training data described above.
In [96]:
nlp = spacytrainer('en')
In [97]:
#See if we can identify the address
addr_doc = nlp(bigtext)
displacy.render(addr_doc, jupyter=True, style='ent')
As well as recognising different types of entity, which may be identified across several different words, the spacy parser also marks up each separate word (or token) as a particular "part-of-speech" (POS), such as a noun, verb, or adjective. Parts of speech are identified via the .pos_ and .tag_ token attributes.
In [31]:
tags = []
for token in doc[:15]:
    print(token, '::', token.pos_, '::', token.tag_)
    tags.append(token.tag_)
An explain() function describes each POS type in natural language terms:
In [32]:
for tag in set(tags):
    print(tag, '::', spacy.explain(tag))
We can also get a list of "noun chunks" identified in the text, as well as other words they relate to in a sentence:
In [46]:
for chunk in doc.noun_chunks:
    print(' :: '.join([chunk.text, chunk.root.text, chunk.root.dep_,
                       chunk.root.head.text]))
textacy
As well as the basic spacy functionality, other packages build on spacy to provide further tools for working with the abstractions it identifies.
For example, the textacy package provides a way of parsing sentences using regular expressions defined over (Ontonotes5?) POS tags:
In [33]:
import textacy
list(textacy.extract.pos_regex_matches(nlp(bigtext), r'<NOUN> <ADP> <PROPN|ADP>+'))
Out[33]:
In [34]:
textacy.constants.POS_REGEX_PATTERNS
Out[34]:
In [35]:
xx='A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'
for t in nlp(xx):
    print(t, t.tag_, t.pos_)
In [36]:
for e in nlp(xx).ents:
    print(e, e.label_)
In [37]:
list(textacy.extract.pos_regex_matches(nlp('A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'),r'<SYM><NUM><SYM>?<NUM>?'))
Out[37]:
If we can define an appropriate POS pattern, we can extract terms from an arbitrary text based on that pattern, an approach that is far more general than trying to write a regular expression pattern matcher over just the raw text.
In [38]:
#define approx amount eg £10,000-£15,000 or £10,000-15,000
parse('{}£{a}-£{b:g}{}','eg £10,000-£15,000 or £14,000-£16,000'.replace(',',''))
Out[38]:
Matchers can be created over a wide range of attributes (docs), including POS tags and entity labels.
For example, we can start trying to build an address tagger by looking for things that end with a postcode.
In [156]:
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

matcher.add('POSTCODE', add_entity_label,
            [{'SHAPE': 'XXdX'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'XXd'}, {'SHAPE': 'dXX'}])
matcher.add('ADDRESS', add_entity_label,
            [{'POS': 'NUM', 'OP': '+'}, {'POS': 'PROPN', 'OP': '+'}, {'ENT_TYPE': 'POSTCODE', 'OP': '+'}],
            [{'ENT_TYPE': 'GPE', 'OP': '+'}, {'ENT_TYPE': 'POSTCODE', 'OP': '+'}])

addr_doc = nlp(bigtext)
matcher(addr_doc)

displacy.render(addr_doc, jupyter=True, style='ent')

for m in matcher(addr_doc):
    print(addr_doc[m[1]:m[2]])

print([(e, e.label_) for e in addr_doc.ents])
In this case, we note that the visualiser cannot cope with rendering multiple entity types over one or more words. In the above example, the POSTCODE entities are highlighted, but we note from the matcher that ADDRESS ranges are also identified that extend across entities defined over fewer terms.
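We can at least report the widest matches ourselves by filtering the matcher output to keep only the longest non-overlapping spans; a minimal sketch, using a hypothetical longest_matches() helper:
In [ ]:
def longest_matches(doc, matches):
    #Keep only the longest non-overlapping matches (hypothetical helper)
    keep = []
    covered = set()
    #Consider longer spans first so they win over shorter, overlapping ones
    for match_id, start, end in sorted(matches, key=lambda m: m[2] - m[1], reverse=True):
        if not covered.intersection(range(start, end)):
            keep.append(doc[start:end])
            covered.update(range(start, end))
    return keep

longest_matches(addr_doc, matcher(addr_doc))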
We can look at the structure of a text by printing out the child elements associated with each token in a sentence:
In [135]:
for sent in nlp(bigtext).sents:
    print(sent, '\n')
    for token in sent:
        print(token, ': ', str(list(token.children)))
    print()
However, the displaCy toolset, included as part of spacy, provides a more appealing way of visualising parsed documents, in two different ways:
The dependency graph identifies POS tags as well as how tokens are related in natural language grammatical phrases:
In [39]:
from spacy import displacy
In [47]:
displacy.render(doc, jupyter=True,style='dep')
In [48]:
displacy.render(doc, jupyter=True,style='dep',options={'distance':85, 'compact':True})
We can also use displaCy
to highlight, inline, the entities extracted from a text.
In [42]:
displacy.render(pcdoc, jupyter=True,style='ent')
In [43]:
displacy.render(doc, jupyter=True,style='ent')
In [151]:
mpdata=pd.read_csv('members_mar18.csv')
tmp = mpdata.to_dict(orient='records')
mpdatadict = {k['list_name']:k for k in tmp }
In [148]:
#via https://spacy.io/usage/processing-pipelines
mpdata = pd.read_csv('members_mar18.csv')

"""Example of a spaCy v2.0 pipeline component to annotate MP record with MNIS data"""
from spacy.tokens import Doc, Span, Token

class RESTMPComponent(object):
    """spaCy v2.0 pipeline component that annotates MP entities with MP data."""
    name = 'mp_annotator'  # component name, will show up in the pipeline

    def __init__(self, nlp, label='MP'):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        # Get MP data
        mpdata = pd.read_csv('members_mar18.csv')
        mpdatadict = mpdata.to_dict(orient='records')
        # Convert MP data to a dict keyed by MP name
        self.mpdata = {k['list_name']: k for k in mpdatadict}
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher with Doc patterns for each MP name
        patterns = [nlp(c) for c in self.mpdata.keys()]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('MPS', None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        # If no default value is set, it defaults to None.
        Token.set_extension('is_mp', default=False)
        Token.set_extension('mnis_id')
        Token.set_extension('constituency')
        Token.set_extension('party')

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_mp == True.
        Doc.set_extension('is_mp', getter=self.is_mp)
        Span.set_extension('is_mp', getter=self.is_mp)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            # Can be extended with other data associated with the MP
            for token in entity:
                token._.set('is_mp', True)
                token._.set('mnis_id', self.mpdata[entity.text]['member_id'])
                token._.set('constituency', self.mpdata[entity.text]['constituency'])
                token._.set('party', self.mpdata[entity.text]['party'])
            # Overwrite doc.ents and add entity – be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities – otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def is_mp(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is an MP."""
        return any([t._.get('is_mp') for t in tokens])
In [150]:
# Load the basic English model and add the MP annotator component to its pipeline
nlp = spacy.load('en')

rest_mp = RESTMPComponent(nlp)  # initialise component
nlp.add_pipe(rest_mp)  # add it to the pipeline

doc = nlp(u"Some text about MPs Abbott, Ms Diane and Afriyie, Adam")

print('Pipeline', nlp.pipe_names)  # pipeline contains component name
print('Doc has MPs', doc._.is_mp)  # Doc contains MPs

for token in doc:
    if token._.is_mp:
        print(token.text, '::', token._.constituency, '::', token._.party,
              '::', token._.mnis_id)  # MP data

print('Entities', [(e.text, e.label_) for e in doc.ents])  # entities
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
It may be worth producing other exemplar pipelines based around UK government registers, or updating the above component so that it builds its lookup data directly from the UK Parliament members API rather than from a CSV file.
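As a rough sketch of that second idea, the lookup used by the component above might be built from the Members' Names Information Service (MNIS) API; the query URL and XML element names here are recalled from memory rather than checked against the API documentation, so treat them as assumptions:
In [ ]:
import requests
import xml.etree.ElementTree as ET

#Assumed MNIS query for current, eligible Commons members (check against the MNIS docs)
MNIS_URL = 'http://data.parliament.uk/membersdataplatform/services/mnis/members/query/House=Commons|IsEligible=true/'

members_xml = ET.fromstring(requests.get(MNIS_URL).content)

#Build a lookup keyed by "Surname, Firstname", mirroring the CSV columns used above
#The element/attribute names (Member_Id, ListAs, MemberFrom, Party) are assumptions
mpdatadict = {}
for member in members_xml.findall('Member'):
    mpdatadict[member.findtext('ListAs')] = {
        'member_id': member.get('Member_Id'),
        'constituency': member.findtext('MemberFrom'),
        'party': member.findtext('Party'),
    }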
[Loosely related (relevant to registers): https://github.com/frankieroberto/govuk-government-organisations-autocomplete]