One of the things I learned early on about scraping web pages (often referred to as "screen scraping") is that it often amounts to trying to recreate databases that have been re-presented as web pages using HTML templates. For example:
The aim of the scrape in these cases might be as simple as pulling a table from the page and representing it as a dataframe, or it might involve reverse engineering the HTML template that converted the data to HTML, so that we can extract each row of data back out of the HTML and into a corresponding data table.
In the latter case, the scrape may proceed in a couple of ways. For example:
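One way is to run the HTML template "in reverse"; here is a minimal sketch using the parse package (the template string and HTML row are made up for illustration):
In [ ]:
from parse import parse

#A made-up HTML template of the sort that might be used to render each database row
row_template = '<tr><td>{name}</td><td>{company}</td><td>{amount}</td></tr>'

#A scraped HTML fragment (also made up) generated from that template
row_html = '<tr><td>J. Smith</td><td>Head of Zeus Publishing</td><td>£13,000</td></tr>'

#Running the template "in reverse" recovers the original data record as a dict
parse(row_template, row_html).named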
In more general cases, however, such as when trying to abstract meaningful information from arbitrary natural language texts, we need to up our game and start to analyse the texts as natural language.
As an example, consider the following text:
From February 2016, as an author, payments from Head of Zeus Publishing; a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)
As human readers, we can identify various structural patterns in this text, as well as parse the natural language sentences themselves.
Let's start with some of the structural patterns:
In [1]:
from parse import parse
In [2]:
bigtext = '''\
From February 2016, as an author, payments from Head of Zeus Publishing; \
a client of Averbrook Ltd. Address: 45-47 Clerkenwell Green London EC1R 0HT, via Sheil Land, 52 Doughty Street. \
London WC1N 2LS. From October 2016 until July 2018, I will receive a regular payment \
of £13,000 per month (previously £11,000). Hours: 12 non-consecutive hrs per week. \
Any additional payments are listed below. (Updated 20 January 2016, 14 October 2016 and 2 March 2018)'''
In [3]:
#Extract the sentence containing the update dates
parse('{}(Updated {updated})', bigtext)['updated']
Out[3]:
In [4]:
#Extract the phrase describing the hours
parse('{}Hours: {hours}.{}', bigtext)['hours']
Out[4]:
There also appear to be some standard, boilerplate sentences, such as Any additional payments are listed below.
Within the text are things that we might recognise as company names, dates, or addresses. Entity recognition refers to a natural language processing technique that attempts to extract words that describe "things", that is, entities, as well as identifying what sorts of "thing", or entity, they are.
One powerful Python natural language processing package, spacy, has an entity recognition capability. Let's see how to use it and what sort of output it produces:
In [5]:
#Import the spacy package
import spacy
#The package parses language according to different statistically trained models
#Let's load in the basic English model:
nlp = spacy.load('en')
In [6]:
#Generate a version of the text annotated using features detected by the model
doc = nlp(bigtext)
The parsed text is annotated in a variety of ways.
For example, we can directly access all the sentences in the original text:
In [7]:
list(doc.sents)
Out[7]:
In [49]:
ents = list(doc.ents)
entTypes = []
for entity in ents:
    entTypes.append(entity.label_)
    print(entity, '::', entity.label_)
In [9]:
for entType in set(entTypes):
    print(entType, spacy.explain(entType))
We can also look at each of the tokens in the text and identify whether it is part of an entity and, if so, what sort. The .ent_iob_ attribute identifies O as not part of an entity, B as the first token of an entity, and I as a token continuing an entity.
In [65]:
for token in doc[:15]:
    print('::'.join([token.text, token.ent_type_, token.ent_iob_]))
Looking at the extracted entities, we see we get some good hits: Averbrook Ltd. is an ORG; 20 January 2016 and 14 October 2016 are both instances of a DATE.
Some near misses: Zeus Publishing isn't a PERSON, although we might see why it has been recognised as such. (Could we overlay the model with an additional mapping of the form if PERSON and the text ends with one of ['Publishing', 'Holdings'] -> ORG?)
And some things are mis-categorised: 52 Doughty Street isn't really meaningful as a QUANTITY.
Several things we might usefully want to categorise - such as a UK postcode, for example, which might be useful in and of itself, or when helping us to identify an address - are not recognised as entities at all.
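Here is a rough sketch of that overlay idea as a post-processing step; the relabel_person_orgs() helper and its suffix list are made up for illustration, not part of spacy:
In [ ]:
from spacy.tokens import Span

ORG_SUFFIXES = ('Publishing', 'Holdings')  #hypothetical suffix list

def relabel_person_orgs(doc):
    #Re-label PERSON entities that end with a company-style suffix as ORG
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.text.endswith(ORG_SUFFIXES):
            new_ents.append(Span(doc, ent.start, ent.end, label=doc.vocab.strings['ORG']))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

relabel_person_orgs(doc)
print([(e, e.label_) for e in doc.ents])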
Things recognised as dates we might want to then further parse as date object types:
In [10]:
from dateutil import parser as dtparser
[(d, dtparser.parse(d.text)) for d in ents if d.label_ == 'DATE']
Out[10]:
In [11]:
#see also https://github.com/akoumjian/datefinder
#datefinder - Find dates inside text using Python and get back datetime objects
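As a rough sketch of that approach (assuming the datefinder package is installed), find_dates() scans free text and yields datetime objects for anything date-like it finds:
In [ ]:
import datefinder

#Scan a fragment of the text for date-like strings and return them as datetimes
list(datefinder.find_dates('Updated 20 January 2016, 14 October 2016 and 2 March 2018'))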
As well as identifying entities, spacy analyses texts at several other levels. One such level of abstraction is the "shape" of each token. This identifies whether each character is an upper or lower case alphabetic character, a digit, or a punctuation character (which appears as itself):
In [64]:
for token in doc[:15]:
    print(token, '::', token.shape_)
The "shape" of a token provides an additional structural item that we might be able to make use of in scrapers of the raw text.
For example, writing an efficient regular expression to identify a UK postcode can be a difficult task, but we can start to cobble one together from the shapes of different postcodes written in "standard" postcode form:
In [13]:
[pc.shape_ for pc in nlp('MK7 6AA, SW1A 1AA, N7 6BB')]
Out[13]:
We can define a matcher function that will identify the tokens in a document that match a particular ordered combination of shape patterns.
For example, the postcode like things described above have the shapes:
XXd dXX
XXdX dXX
Xd dXX
We can use these structural patterns to identify token pairs as possible postcodes.
In [66]:
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('POSTCODE', None,
            [{'SHAPE': 'XXdX'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'XXd'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'Xd'}, {'SHAPE': 'dXX'}])
Let's test that:
In [15]:
pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons.')
matches = matcher(pcdoc)
#See what we matched, and let's see what entities we have detected
print('Matches: {}\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))
The matcher seems to have matched the postcodes, but is not identifying them as entities. (We also note that the entity matcher has missed the "Sir" title. In some cases, it might also match a postcode as a person.)
To add the matched items to the entity list, we need to add a callback function to the matcher.
In [71]:
##Define a POSTCODE as a new entity type by adding matched postcodes to the doc.ents
#https://stackoverflow.com/a/47799669
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
def add_entity_label(matcher, doc, i, matches):
    #Callback: add the matched span to the doc's entities, typed by the match label
    match_id, start, end = matches[i]
    doc.ents += ((match_id, start, end),)
#Recognise postcodes from different shapes
matcher.add('POSTCODE', add_entity_label, [{'SHAPE': 'XXdX'},{'SHAPE':'dXX'}], [{'SHAPE':'XXd'},{'SHAPE':'dXX'}])
pcdoc = nlp('pc is WC1N 4CC okay, as is MK7 4AA and James Smith is presumably a person')
matches = matcher(pcdoc)
print('Matches: {}\nEntities: {}'.format([pcdoc[m[1]:m[2]] for m in matches], [(m,m.label_) for m in pcdoc.ents]))
Let's put those pieces together more succinctly:
In [52]:
bigtext
Out[52]:
In [53]:
#Generate base tagged doc
doc = nlp(bigtext)
#Run postcode tagger over the doc
_ = matcher(doc)
The tagged document should now include POSTCODE entities. One of the easiest ways to check the effectiveness of a new entity tagger is to check the document with the recognised entities visualised within it. The displacy module has a Jupyter enabled visualiser for doing just that.
In [54]:
from spacy import displacy
displacy.render(doc, jupyter=True, style='ent')
In [67]:
import pandas as pd
mpdata=pd.read_csv('members_mar18.csv')
mpdata.head(5)
Out[67]:
From this, we can extract a list of MP names, albeit in reverse word order.
In [130]:
term_list = mpdata['list_name'].tolist()
term_list[:5]
Out[130]:
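As an aside, if we ever need those names in natural order, a quick sketch of a hypothetical helper for flipping "Surname, Firstname" round might look like this:
In [ ]:
#Hypothetical helper: flip "Surname, Firstname" into "Firstname Surname"
def natural_order(name):
    parts = [p.strip() for p in name.split(',', 1)]
    return ' '.join(reversed(parts)) if len(parts) == 2 else name

natural_order('Abbott, Ms Diane')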
If we wanted to match those names as "MP" entities, we could use the following recipe to add an MP entity type that will be returned if any of the MP names are matched:
In [75]:
from spacy.matcher import PhraseMatcher
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp(text) for text in term_list]
matcher.add('MP', add_entity_label, *patterns)
Let's test that new entity on a test string:
In [74]:
doc = nlp("The MPs were Adams, Nigel, Afolami, Bim and Abbott, Ms Diane.")
matches = matcher(doc)
displacy.render(doc, jupyter=True, style='ent')
In [181]:
import re
#https://stackoverflow.com/a/164994/454773
regex_ukpc = r'([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))\s?[0-9][A-Za-z]{2})'
In [198]:
#Based on https://spacy.io/usage/linguistic-features
nlp = spacy.load('en')
doc = nlp("The postcodes were MK1 6AA and W1A 1AA.")
for match in re.finditer(regex_ukpc, doc.text):
    start, end = match.span()  # get matched indices
    entity = doc.char_span(start, end, label='POSTCODE')  # create Span from indices
    doc.ents = list(doc.ents) + [entity]
    entity.merge()
displacy.render(doc, jupyter=True, style='ent')
In [19]:
nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents
Out[19]:
Let's see if we can update the training of the model so that it does recognise the "Sir" title as part of a person's name.
We can do that by creating some new training data and using it to update the model. The entities dict identifies the index values in the training string that delimit the entity we want to extract.
In [20]:
# training data
TRAIN_DATA = [
    ('Received from Sir John Smith last week.', {
        'entities': [(14, 28, 'PERSON')]
    }),
    ('Sir Richard Jones is another person', {
        'entities': [(0, 17, 'PERSON')]
    })
]
In this case, we are going to let spacy learn its own patterns, as a statistical model, that will - if the learning pays off correctly - identify things like "Sir Bimble Bobs" as a PERSON entity.
In [21]:
import random
#model='en' #'en_core_web_sm'
#nlp = spacy.load(model)
cycles = 20
optimizer = nlp.begin_training()
for i in range(cycles):
    random.shuffle(TRAIN_DATA)
    for txt, annotations in TRAIN_DATA:
        nlp.update([txt], [annotations], sgd=optimizer)
In [22]:
nlp('pc is WC1N 4CC okay, as is MK7 4AA and Sir James Smith and Lady Jane Grey are presumably persons').ents
Out[22]:
One of the things that can be a bit fiddly is generating the training strings. We can produce a little utility function that will help us create a training pattern by identifying the index values, within a text string, of a particular substring that we want to label as an example of a particular entity type.
The first thing we need to do is find the index values within a string that show where a particular substring can be found. The Python find() and index() string methods will find the first location of a substring in a string. However, where a substring appears several times in a string, we need a new function to identify all the locations. There are several ways of doing this...
In [23]:
#Find multiple matches using .find()
#https://stackoverflow.com/a/4665027/454773
def _find_all(string, substring):
    #Generator to return index of each string match
    start = 0
    while True:
        start = string.find(substring, start)
        if start == -1: return
        yield start
        start += len(substring)

def find_all(string, substring):
    return list(_find_all(string, substring))

#Find multiple matches using a regular expression
#https://stackoverflow.com/a/4664889/454773
import re

def refind_all(string, substring):
    return [m.start() for m in re.finditer(substring, string)]
In [24]:
txt = 'This is a string.'
substring = 'is'
print( find_all(txt, substring) )
print( refind_all(txt, substring) )
We can use either of these functions to find the location of a substring in a string, and then use these index values to help us create our training data.
In [25]:
def trainingTupleBuilder(string, substring, typ, entities=None):
    ixs = refind_all(string, substring)
    offset = len(substring)
    if entities is None: entities = {'entities': []}
    for ix in ixs:
        entities['entities'].append((ix, ix + offset, typ))
    return (string, entities)

#('Received from Sir John Smith last week.', {'entities': [(14, 28, 'PERSON')]})
trainingTupleBuilder('Received from Sir John Smith last week.', 'Sir John Smith', 'PERSON')
Out[25]:
In [26]:
TRAIN_DATA = []
TRAIN_DATA.append(trainingTupleBuilder("He lives at 27, Oswaldtwistle Way, Birmingham",'27, Oswaldtwistle Way, Birmingham','B-ADDRESS'))
TRAIN_DATA.append(trainingTupleBuilder("Payments from Boondoggle Limited, 377, Hope Street, Little Village, Halifax. Received: October, 2017",'377, Hope Street, Little Village, Halifax','B-ADDRESS'))
TRAIN_DATA
Out[26]:
The B- prefix identifies the entity as a multi-token entity.
In [89]:
#https://spacy.io/usage/training
from pathlib import Path

def spacytrainer(model=None, output_dir=None, n_iter=100, debug=False):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        if isinstance(model, str):
            nlp = spacy.load(model)  # load existing spaCy model
            print("Loaded model '%s'" % model)
        #Else we assume we have passed in an nlp model
        else: nlp = model
    else:
        nlp = spacy.blank('en')  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe('ner')

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            if debug: print(losses)

    # test the trained model
    if debug:
        for text, _ in TRAIN_DATA:
            doc = nlp(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print('Entities', [(ent.text, ent.label_) for ent in doc.ents])
            print('Tokens', [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    return nlp
Let's update the en model to include a really crude address parser based on the two lines of training data described above.
In [96]:
nlp = spacytrainer('en')
In [97]:
#See if we can identify the address
addr_doc = nlp(bigtext)
displacy.render(addr_doc, jupyter=True, style='ent')
As well as recognising different types of entity, which may be identified across several different words, the spacy parser also marks up each separate word (or token) as a particular "part-of-speech" (POS), such as a noun, verb, or adjective. Parts of speech are identified via the .pos_ and .tag_ token attributes.
In [31]:
tags = []
for token in doc[:15]:
    print(token, '::', token.pos_, '::', token.tag_)
    tags.append(token.tag_)
An explain() function describes each POS type in natural language terms:
In [32]:
for tag in set(tags):
    print(tag, '::', spacy.explain(tag))
We can also get a list of "noun chunks" identified in the text, as well as other words they relate to in a sentence:
In [46]:
for chunk in doc.noun_chunks:
    print(' :: '.join([chunk.text, chunk.root.text, chunk.root.dep_,
                       chunk.root.head.text]))
textacy
As well as the basic spacy functionality, other packages build on spacy to provide further tools for working with the abstractions it identifies.
For example, the textacy package provides a way of parsing sentences using regular expressions defined over (Ontonotes5?) POS tags:
In [33]:
import textacy
list(textacy.extract.pos_regex_matches(nlp(bigtext), r'<NOUN> <ADP> <PROPN|ADP>+'))
Out[33]:
In [34]:
textacy.constants.POS_REGEX_PATTERNS
Out[34]:
In [35]:
xx='A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'
for t in nlp(xx):
    print(t, t.tag_, t.pos_)
In [36]:
for e in nlp(xx).ents:
    print(e, e.label_)
In [37]:
list(textacy.extract.pos_regex_matches(nlp('A sum of £2000-3000 last or £2,000 or £2000-£3000 or £2,000-£3,000 year'),r'<SYM><NUM><SYM>?<NUM>?'))
Out[37]:
If we can define an appropriate POS pattern, we can extract terms from an arbitrary text based on that pattern, an approach that is far more general than trying to write a regular expression pattern matcher over just the raw text.
In [38]:
#define approx amount eg £10,000-£15,000 or £10,000-15,000
parse('{}£{a}-£{b:g}{}','eg £10,000-£15,000 or £14,000-£16,000'.replace(',',''))
Out[38]:
Matchers can be created over a wide range of attributes (docs), including POS tags and entity labels.
For example, we can start trying to build an address tagger by looking for things that end with a postcode.
In [156]:
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)

matcher.add('POSTCODE', add_entity_label,
            [{'SHAPE': 'XXdX'}, {'SHAPE': 'dXX'}],
            [{'SHAPE': 'XXd'}, {'SHAPE': 'dXX'}])
matcher.add('ADDRESS', add_entity_label,
            [{'POS': 'NUM', 'OP': '+'}, {'POS': 'PROPN', 'OP': '+'}, {'ENT_TYPE': 'POSTCODE', 'OP': '+'}],
            [{'ENT_TYPE': 'GPE', 'OP': '+'}, {'ENT_TYPE': 'POSTCODE', 'OP': '+'}])

addr_doc = nlp(bigtext)
matcher(addr_doc)

displacy.render(addr_doc, jupyter=True, style='ent')

for m in matcher(addr_doc):
    print(addr_doc[m[1]:m[2]])

print([(e, e.label_) for e in addr_doc.ents])
In this case, we note that the visualiser cannot cope with rendering multiple entity types over one or more words. In the above example, the POSTCODE entities are highlighted, but we note from the matcher that ADDRESS ranges are also identified that extend across entities defined over fewer terms.
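We can at least report the widest matches ourselves by filtering the matcher output to keep only the longest non-overlapping spans; a minimal sketch, using a hypothetical longest_matches() helper:
In [ ]:
def longest_matches(doc, matches):
    #Keep only the longest non-overlapping matches (hypothetical helper)
    keep = []
    covered = set()
    #Consider longer spans first so they win over shorter, overlapping ones
    for match_id, start, end in sorted(matches, key=lambda m: m[2] - m[1], reverse=True):
        if not covered.intersection(range(start, end)):
            keep.append(doc[start:end])
            covered.update(range(start, end))
    return keep

longest_matches(addr_doc, matcher(addr_doc))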
We can look at the structure of a text by printing out the child elements associated with each token in a sentence:
In [135]:
for sent in nlp(bigtext).sents:
    print(sent, '\n')
    for token in sent:
        print(token, ': ', str(list(token.children)))
    print()
However, the displaCy toolset, included as part of spacy, provides a more appealing way of visualising parsed documents, in two different ways:
The dependency graph identifies POS tags as well as how tokens are related in natural language grammatical phrases:
In [39]:
from spacy import displacy
In [47]:
displacy.render(doc, jupyter=True,style='dep')
In [48]:
displacy.render(doc, jupyter=True,style='dep',options={'distance':85, 'compact':True})
We can also use displaCy
to highlight, inline, the entities extracted from a text.
In [42]:
displacy.render(pcdoc, jupyter=True,style='ent')
In [43]:
displacy.render(doc, jupyter=True,style='ent')
In [151]:
mpdata=pd.read_csv('members_mar18.csv')
tmp = mpdata.to_dict(orient='records')
mpdatadict = {k['list_name']:k for k in tmp }
In [148]:
#via https://spacy.io/usage/processing-pipelines
mpdata = pd.read_csv('members_mar18.csv')

"""Example of a spaCy v2.0 pipeline component to annotate MP record with MNIS data"""
from spacy.tokens import Doc, Span, Token

class RESTMPComponent(object):
    """spaCy v2.0 pipeline component that annotates MP entities with MP data."""
    name = 'mp_annotator'  # component name, will show up in the pipeline

    def __init__(self, nlp, label='MP'):
        """Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the shared vocab, get the label ID and
        generate Doc objects as phrase match patterns.
        """
        # Get MP data
        mpdata = pd.read_csv('members_mar18.csv')
        mpdatadict = mpdata.to_dict(orient='records')
        # Convert MP data to a dict keyed by MP name
        self.mpdata = {k['list_name']: k for k in mpdatadict}
        self.label = nlp.vocab.strings[label]  # get entity label ID

        # Set up the PhraseMatcher with Doc patterns for each MP name
        patterns = [nlp(c) for c in self.mpdata.keys()]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('MPS', None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        # If no default value is set, it defaults to None.
        Token.set_extension('is_mp', default=False)
        Token.set_extension('mnis_id')
        Token.set_extension('constituency')
        Token.set_extension('party')

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_mp == True.
        Doc.set_extension('is_mp', getter=self.is_mp)
        Span.set_extension('is_mp', getter=self.is_mp)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity & set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            # Can be extended with other data associated with the MP
            for token in entity:
                token._.set('is_mp', True)
                token._.set('mnis_id', self.mpdata[entity.text]['member_id'])
                token._.set('constituency', self.mpdata[entity.text]['constituency'])
                token._.set('party', self.mpdata[entity.text]['party'])
            # Overwrite doc.ents and add entity – be careful not to replace!
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token. This is done
            # after setting the entities – otherwise, it would cause mismatched
            # indices!
            span.merge()
        return doc  # don't forget to return the Doc!

    def is_mp(self, tokens):
        """Getter for Doc and Span attributes. Returns True if one of the tokens
        is an MP."""
        return any([t._.get('is_mp') for t in tokens])
In [150]:
# Load the basic English model and add the MP annotator component to its pipeline
nlp = spacy.load('en')

rest_mp = RESTMPComponent(nlp)  # initialise component
nlp.add_pipe(rest_mp)  # add it to the pipeline

doc = nlp(u"Some text about MPs Abbott, Ms Diane and Afriyie, Adam")

print('Pipeline', nlp.pipe_names)  # pipeline contains component name
print('Doc has MPs', doc._.is_mp)  # Doc contains MPs

for token in doc:
    if token._.is_mp:
        print(token.text, '::', token._.constituency, '::', token._.party,
              '::', token._.mnis_id)  # MP data

print('Entities', [(e.text, e.label_) for e in doc.ents])  # entities
print('Entities', [(e.text, e.label_) for e in doc.ents]) # entities
It may be worth producing other exemplar pipelines based around UK government registers, or updating the above component so that it builds its lookup data directly from the UK Parliament members API rather than from a CSV file.
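As a rough sketch of that second idea, the lookup used by the component above might be built from the Members' Names Information Service (MNIS) API; the query URL and XML element names here are recalled from memory rather than checked against the API documentation, so treat them as assumptions:
In [ ]:
import requests
import xml.etree.ElementTree as ET

#Assumed MNIS query for current, eligible Commons members (check against the MNIS docs)
MNIS_URL = 'http://data.parliament.uk/membersdataplatform/services/mnis/members/query/House=Commons|IsEligible=true/'

members_xml = ET.fromstring(requests.get(MNIS_URL).content)

#Build a lookup keyed by "Surname, Firstname", mirroring the CSV columns used above
#The element/attribute names (Member_Id, ListAs, MemberFrom, Party) are assumptions
mpdatadict = {}
for member in members_xml.findall('Member'):
    mpdatadict[member.findtext('ListAs')] = {
        'member_id': member.get('Member_Id'),
        'constituency': member.findtext('MemberFrom'),
        'party': member.findtext('Party'),
    }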
[Loosely related (relevant to registers): https://github.com/frankieroberto/govuk-government-organisations-autocomplete]