Scraping Low Hanging Fruit on the UK Register of Members' Financial Interests

This notebook describes a recipe for starting to scrape the UK MP register of interests.

[See also: benscott/mp-financial-interests] The register entries for each member contain semi-structured text data, which is to say that there are some recognisable patterns in the text that makes up the register entries.

The entries are made in different sections, and have a form that repeats, ish...

We can use these repeating structures as the basis of a scraper that will extract the information from the page and put it into a form we can work with, such as a spreadsheet or simple database.
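By way of a sketch of where we're heading: once each entry has been reduced to a dict of extracted fields, a list of such dicts can be written straight out as a CSV file using the Python standard library. (The records and field names here are purely illustrative, not real scraped data.)

```python
#Write a list of extracted records out as a simple CSV file
#(the records and field names are illustrative, not real scraped data)
import csv

records = [
    {'name': 'Example Donor Ltd', 'date': '04 May 2016', 'amount': 1800.0},
    {'name': 'Another Donor Ltd', 'date': '09 December 2016', 'amount': 2000.0},
]

with open('register_test.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'date', 'amount'])
    writer.writeheader()
    writer.writerows(records)
```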

The Structure of a Single Register Entry Page

The page you see in your web browser is a rendering of a structured HTML document. You can look at the "code" that defines the page using browser developer tools.

In Chrome, you can view the source code of a web page from the View -> Developer -> View Source menu.

Looking at the raw HTML of a page can be confusing, but many browsers have built in tools to make it easier to inspect the source code of a web page.

In Chrome, you can launch the developer tools from the View -> Developer -> Developer Tools menu option.

Exploring the page

In Chrome developer tools, if you click on the arrow / pointer icon in the top left corner of the tools panel, you can use it to highlight areas of the rendered web page; the HTML used to define that block is then highlighted.

One of the tricks to scraping is to try to identify structural elements or patterns in the HTML that identify the things we are interested in and that we can grab hold of and use as the basis for our scrape.

In the MPs' register of interest pages, we notice that the div tag with the id mainTextBlock contains all the elements that describe the register entries. In particular, we also notice that each separate entry is contained within its own <p> tag.

This gives us one strategy for scraping the page:

  • grab the parent div tag - and its contents - that contains all the separate register entries;
  • look at each <p> tag - and its contents - in turn and try to pull out the member interests.

What structured or semi-structured data can we see?

Some of the register entries have a largely unstructured form. For example, the entry:

13 December 2016, received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training. Hours: 16 hrs including travel and preparation. (Registered 19 December 2016)

is largely free text. We can see some structure in there (a date, followed by an amount, then a name and an address) but we get the feeling that this entry could be made up of arbitrary text.

An entry such as:

Name of donor: VGC Group
Address of donor: Cardinal House, Bury Street, Ruislip HA4 7GD
Amount of donation or nature and value if donation in kind: £1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.
Donor status: company, registration 5741473
(Registered 04 May 2016)

is semi-structured, in that we have structural items of the form attribute: value where the value term may or may not itself be structured.
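As a quick illustration of how much that attribute: value structure gives us, even a plain string split gets us a long way. A minimal sketch, using a fragment of the entry above:

```python
#Minimal sketch: split a semi-structured entry into attribute/value pairs
#by breaking on newlines and then on the first ': ' in each line
entry = ('Name of donor: VGC Group\n'
         'Address of donor: Cardinal House, Bury Street, Ruislip HA4 7GD\n'
         'Donor status: company, registration 5741473')

fields = {}
for line in entry.split('\n'):
    if ': ' in line:
        attribute, value = line.split(': ', 1)
        fields[attribute] = value

print(fields['Name of donor'])
```

The parse package used later in this recipe gives us a more flexible way of doing the same sort of thing.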

For example, the Name of donor attribute is simply that - a name - which explicitly represents the name of the donor. But the text associated with the Amount of donation or nature and value if donation in kind attribute is more unstructured. For sure, we can see recognisable things in the description, but they are presented largely as free text. Which is to say, the way it's presented is arbitrary, which makes it harder to extract information from in a reliable way.

What this means is that there is some low hanging fruit in this register that we can extract reasonably reliably (the name of a donor, for example), but there is also information in there that we may have to parse by hand if we want to do it reliably.

Getting Started with the Scrape

There are many tools available to help you scrape a web page or set of web pages, but I tend to use code because it gives me the most control over the scrape, albeit at the cost of added complexity compared to point and click style applications.

A couple of Python packages that crossed my radar recently provide a relatively easy way in to getting started with Python based scrapers, so we'll use those in this recipe.

kennethreitz/requests-html

The first package is Kenneth Reitz's requests-html, straplined "Pythonic HTML Parsing for Humans™". This package helps us grab an HTML page and extract the text from it.


In [1]:
#https://github.com/kennethreitz/requests-html
#!~/anaconda3/bin/pip install requests_html

r1chardj0n3s/parse

The second package is Richard Jones' parse package (r1chardj0n3s/parse).

This package provides a set of tools that make it relatively easy to extract the rendered / visually structured information, such as the attributes/values identified in the MPs' register above.


In [2]:
#https://github.com/r1chardj0n3s/parse
#Install the package
#!~/anaconda3/bin/pip install parse

Grabbing a test page

We'll start by working with a single test page: https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm

This was selected from the last register of the 2015 Parliament. For now, I'm just hoping that the register for the current 2017 Parliament has the same structural form!

Disclaimer: the page I picked from the register was picked because it had a range of content structures, not for any reason relating to the member it relates to or the actual content of an entry. Which is to say: nothing is implied by the selection of the test page, etc etc.


In [3]:
#Set the url of the test page
url='https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm'

The first thing to do is get hold of the page HTML. The requests_html package makes this easy:


In [4]:
#Import the package - we only need to do this once
from requests_html import HTMLSession

#Create a session - we only need to do this once
session = HTMLSession()

In [5]:
#Grab the page
r = session.get(url)

To find the p tags within an HTML block with a given id, such as mainTextBlock, we can apply the .html.find() method to the page and get a list of items in return.


In [6]:
ptags = r.html.find('#mainTextBlock > p')

#View the contents of the 7th p tag in the list (the index starts at 0)
ptags[6].text


Out[6]:
'Name of donor: VGC Group\nAddress of donor: Cardinal House, Bury Street, Ruislip HA4 7GD\nAmount of donation or nature and value if donation in kind: £1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.\nDonor status: company, registration 5741473\n(Registered 04 May 2016)'

Now we can start to extract some information as data using the parse package.


In [7]:
#Import everything from the package - this is not best practice!
from parse import *

The parse package encourages you to split a text string into recognisable components. The text we want to extract is wrapped using braces ({}). The contents of the braces may be a name we want to assign to the extracted text, and / or a pattern that describes the text we want to extract "into" that pair of braces.

The expression we need to use has the form:

parse(stringExtractionPattern, stringWeWantToParse)

In [8]:
pattern = '''Name of donor: {name}\nAddress of donor: {addr}\nAmount of donation or nature and value if donation in kind: {txt}\nDonor status: {status}\n(Registered {date})'''

pr = parse(pattern, ptags[6].text)

#View the results of parsing the string
pr


Out[8]:
<Result () {'name': 'VGC Group', 'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD', 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'status': 'company, registration 5741473', 'date': '04 May 2016'}>

If the pattern matcher doesn't match the string, nothing is returned.


In [9]:
pr = parse(pattern,"Some arbitrary text that is unlikely to match...")
print('This returns >>', pr, '<<')


This returns >> None <<

We can make use of this in a conditional statement to take an action depending on whether or not we get a response:


In [10]:
if pr:
    print('There was a match')
else:
    print('No match')


No match

We can pass the "named" items we have extracted from the text string into their own Python dict.

The Python expression used to do this is known as a "dict comprehension" (a close relative of the "list comprehension"). Essentially what it does is take the contents of one dict and use them to create another. Don't worry about it: it's voodoo magic...
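If you do want a peek behind the voodoo, here's a trivial dict comprehension that builds one dict from another (the example values are made up):

```python
#Build a new dict from an existing one, upper-casing each value
original = {'name': 'VGC Group', 'status': 'company'}
shouty = {k: v.upper() for k, v in original.items()}
#shouty is {'name': 'VGC GROUP', 'status': 'COMPANY'}
```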


In [11]:
pr = parse(pattern, ptags[6].text)
extractedItems = {k:pr[k] for k in pr.named}

#Preview items
extractedItems


Out[11]:
{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

We can mask the complexity of the dict comprehension by creating a function that deploys it for us:


In [12]:
def todict(result):
    #If there's no match, return an empty dict
    if not result: return {}
    #If there is a match, add the named results items to the returned dict
    return {k:result[k] for k in result.named}

In [13]:
todict(pr)


Out[13]:
{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

If the parse expression does not match the string, then it won't return anything.

Looking at the extracted text, we can see that there are some other elements in there that we might be able to extract.

For example, if the status is a company, it looks like we might be able to extract out the company number:


In [14]:
#View status
extractedItems['status']


Out[14]:
'company, registration 5741473'

We can try to extract the fact the entity is a company, along with the company number:


In [15]:
parse('{status2}, registration {cn}', extractedItems['status'])


Out[15]:
<Result () {'status2': 'company', 'cn': '5741473'}>

We could directly create a dict from that:


In [16]:
todict( parse('{status2}, registration {cn}', extractedItems['status']) )


Out[16]:
{'cn': '5741473', 'status2': 'company'}

Or we could write a new function that adds any newly extracted items to the extractedItems dict, and does the parsing:


In [17]:
def todict2(pattern, string, extracted=None):
    if extracted is None: extracted = {}
    newextract = todict( parse(pattern, string) )
    
    #Add the contents of newextract to the original dict and return
    #Note that this updates any dict passed in via the extracted argument and doesn't strictly need to return it 
    extracted.update(newextract)
    return extracted

In [18]:
todict2('{status2}, registration {cn}', extractedItems['status'], extractedItems)
extractedItems


Out[18]:
{'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD',
 'cn': '5741473',
 'date': '04 May 2016',
 'name': 'VGC Group',
 'status': 'company, registration 5741473',
 'status2': 'company',
 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.'}

Alternatively, we might write a function to handle the company number extraction and return the company number (and extracted status) as a dict.


In [19]:
def companynumber(string):
    return todict( parse('{status2}, registration {cn}', string) )

In [20]:
print( companynumber(extractedItems['status']) )
print( companynumber('Arbitrary text.') )
#Check to see if it works with a company number that starts with a leading 0
print( companynumber('company, registration 05741473') )


{'status2': 'company', 'cn': '5741473'}
{}
{'status2': 'company', 'cn': '05741473'}

Let's look at another entry - if we also print it, the end of line (\n) characters will be rendered, making things easier to read:


In [21]:
print(ptags[11].text)

ptags[11].text


Name of donor: Balmoral Tanks Ltd
Address of donor: Balmoral Park, Aberdeen AB12 3GY
Amount of donation or nature and value if donation in kind: £2,000 to support my Primary School Christmas Card Competition
Date received: 8 December 2016
Date accepted: 8 December 2016
Donor status: company, registration 300656
(Registered 09 December 2016)
Out[21]:
'Name of donor: Balmoral Tanks Ltd\nAddress of donor: Balmoral Park, Aberdeen AB12 3GY\nAmount of donation or nature and value if donation in kind: £2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016\nDonor status: company, registration 300656\n(Registered 09 December 2016)'

This entry has Date received and Date accepted fields that were not in the entry we scraped first. After parsing, they form part of the txt item:


In [22]:
#If we don't pass a dict in to todict2(), one will be created for us
extracteditems = todict2(pattern, ptags[11].text)

extracteditems


Out[22]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'date': '09 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

So let's parse that item and grab the dates.

(The reason we don't add them to the pattern we used earlier is because that pattern would then not match the entries that do not contain the received and accepted dates.)


In [23]:
#Here's what we're going to parse
extracteditems['txt']


Out[23]:
'£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'

In [24]:
datepattern = '{}\nDate received: {dateRxd}\nDate accepted: {dateAccd}'

#This updates extracteditems
todict2(datepattern, extracteditems['txt'], extracteditems)
extracteditems


Out[24]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

We could also clean the text field a bit by splitting the string on \nDate fragments and just retaining the first part:


In [25]:
extracteditems['txt'].split('\nDate')


Out[25]:
['£2,000 to support my Primary School Christmas Card Competition',
 ' received: 8 December 2016',
 ' accepted: 8 December 2016']

In [26]:
#The split returns a list of items - just grab the first one with index 0
extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
extracteditems


Out[26]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

One thing we might notice at this point is that we have some dates. These are represented as text strings, but we can also parse them into a "date-timey" computational thing that identifies the date as a date and lets us do datey things to it.


In [27]:
#The dateutil package extends the standard Python datetime  module and helps us parse dates
#~/anaconda3/bin/pip install python-dateutil
from dateutil import parser as dtparser

In [28]:
def parsedate(string):
    #This is not best practice - if the parse fails, return None
    try:
        dt = dtparser.parse(string)
    except:
        dt = None
    return dt

In [29]:
parsedate('8 December 2016')


Out[29]:
datetime.datetime(2016, 12, 8, 0, 0)

Having things in datetime format lets us work with them as such. For example, we can display them in a variety of ways:


In [30]:
print( parsedate('8 December 2016').strftime("%d/%m/%y") )
print( parsedate('8 December 2016').strftime("%B %d, %Y") )
print( parsedate('8 December 2016').strftime("%A, %B %d, %Y") )
print( parsedate('8 December 2016').isoformat() )


08/12/16
December 08, 2016
Thursday, December 08, 2016
2016-12-08T00:00:00

There is a reference card for strftime modifiers / formatters here: http://strftime.org/

We can use the formatter to format our dates for us:


In [31]:
extracteditems['date_f'] = parsedate(extracteditems['date'])
extracteditems


Out[31]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateRxd': '8 December 2016',
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

We can also automate things a bit for date items whose names start with 'date' and that haven't already been parsed (identified using the _f suffix as part of their name).

Part of the automation requires creating dict attributes, named after date attributes but with the additional _f suffix as part of the name. We can use a Python string formatter to help us do this:


In [32]:
'{}_f'.format('date')


Out[32]:
'date_f'

In [33]:
def parsedates(record):
    #This looks complicated but what it basically does is look for attributes called date* and not ending _f
    for k in [k for k in record.keys() if k.lower().startswith('date') and not k.lower().endswith('_f') ]:
        #The record is a dict and is mutable - that is, the dict we passed in is changed by the function
        record['{}_f'.format(k)] = parsedate( record[k] )

In [34]:
#Remember - this automatically updates the dict we pass to it
parsedates(extracteditems)

#Show dict updated with parsed dates
extracteditems


Out[34]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateAccd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'dateRxd': '8 December 2016',
 'dateRxd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'name': 'Balmoral Tanks Ltd',
 'status': 'company, registration 300656',
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

Where's the money?

We've now grabbed quite a lot of the low hanging fruit from the page, but what about the money?

Ideally, there should only be one monetary amount specified in an entry (this is not always the case), but for now we'll just make an attempt at grabbing the first. We're also going to assume amounts are converted to, and given as, £ equivalents. To make life easier for the parser, we remove any commas (which is to say, commas used as thousands separators) from the parsed string by replacing them with an empty string.


In [35]:
print( '£1,250,000.00 #loadsamoney'.replace(',','') )

moneystring='£1,250,000.00 #loadsamoney'
print( moneystring.replace(',','') )


£1250000.00 #loadsamoney
£1250000.00 #loadsamoney

In [36]:
#Define a simple helper function
def commaclean(string):
    return string.replace(',','')

In [37]:
#Demo the helper function
commaclean( moneystring )


Out[37]:
'£1250000.00 #loadsamoney'

Let's test a cash amount detecting pattern to see how it works with different strings.


In [38]:
gbppattern = '{?}£{gbp:g}{}'

def testcashparse(string):
    print('{}: {}'.format(string, todict2(gbppattern, commaclean(string)) ))
    

testcashparse('I got £5,000, okay?')
testcashparse('£5,000')
testcashparse('£5001')
testcashparse('They paid me, in 2015, £5000')
testcashparse('£1,250,000.00. #loadsamoney')
testcashparse('The sum of £2.75 for a coffee which cost £2.75')

testcashparse('I got £1000. In two lots: £200 and £800')
testcashparse('Gaming the system: I got £1.25 then £50,000 on top')
testcashparse("I got 50000 you won't pick up on")


I got £5,000, okay?: {'gbp': 5000.0}
£5,000: {'gbp': 500.0}
£5001: {'gbp': 500.0}
They paid me, in 2015, £5000: {'gbp': 2015.0}
£1,250,000.00. #loadsamoney: {'gbp': 1250000.0}
The sum of £2.75 for a coffee which cost £2.75: {'gbp': 2.75}
I got £1000. In two lots: £200 and £800: {'gbp': 1000.0}
Gaming the system: I got £1.25 then £50,000 on top: {'gbp': 1.25}
I got 50000 you won't pick up on: {'gbp': 50000.0}

There are some issues with this:

  • where the cash item is at the end of the string, it looks like we can lose the last digit because of the final match requirement. We can get round this by adding whitespace at the end (and perhaps also the start) of the string to provide a match opportunity;
  • things other than financial amounts - numbers that are not preceded by a £ - may also be returned (the year 2015, for example). However, one of the things the parse function returns is an index of where the match took place in the parsed string. So if we check the character before a match, and it's a £, we know we're quids in...;
  • in some cases there may be multiple amounts, so rather than use the parse() function let's use the findall() function to see if we can find all the sterling amounts.

When we manage to parse a string, the parser also returns an index that shows where a match took place:


In [39]:
string = 'A sum of £1000'

locator = parse(gbppattern, string)
locator, locator.spans


Out[39]:
(<Result ('A sum of £', '0') {'gbp': 100.0}>,
 {0: (0, 10), 1: (13, 14), 'gbp': (10, 13)})

We can retrieve a particular character from a string by referencing its index value, counting up from index 0 at the beginning of the string, or using an increasingly negative count from the end of the string, starting with index value -1.


In [40]:
'abc DEF'[0], 'abc DEF'[4], 'abc DEF'[-1]


Out[40]:
('a', 'D', 'F')

So we can find the character before a match string as follows:


In [41]:
ix = locator.spans['gbp'][0]-1
string[ ix ]


Out[41]:
'£'

We can handle the end-of-line character match requirement by adding whitespace around the string before we parse it, also adding in the ability to check for the presence of a currency symbol.


In [42]:
def moneyfudge(string):
    return ' {} '.format( commaclean(string))

In [43]:
moneystring = 'Gaming the system: I got £1.25 then £50,000 on top in 2015'

items=findall(gbppattern, moneyfudge(moneystring))
for i in items:
    print(i, moneyfudge(moneystring)[i.spans['gbp'][0]-1] )


<Result (' Gaming the system: I got £', ' ') {'gbp': 1.25}> £
<Result ('then £', ' ') {'gbp': 50000.0}> £
<Result ('on top in', ' ') {'gbp': 2015.0}> n

Let's create a function to get the financial amounts from a string identified by a preceding currency symbol such as a £ sign.

Preemptively, we can also build in a check for amounts in other currencies, using a simple dict to map currency symbols onto currency labels.


In [44]:
currencies={'£': 'GBP',
            '$': 'USD',
            '€': 'EUR'}

for unit in ['£', '€', '$']:
    print( currencies[unit] )


GBP
EUR
USD

Here are the currency symbols we know about:


In [45]:
currencies.keys()


Out[45]:
dict_keys(['£', '$', '€'])

In [46]:
def getamounts(string, numpattern = '{}{num:g}{}'):
    response = {}
    response['amounts'] = []
    response['currency'] = []
    for amount in findall(numpattern, moneyfudge(string) ):
        currency = moneyfudge(string)[amount.spans['num'][0]-1]
        if currency in currencies.keys():
            response['amounts'].append(amount['num'])
            response['currency'].append(currencies[currency])
    return response

In [47]:
getamounts(moneystring)


Out[47]:
{'amounts': [1.25, 50000.0], 'currency': ['GBP', 'GBP']}

In [48]:
print( getamounts('I got £5,000, okay?') )
print( getamounts('£5,000') )
print( getamounts('£5001') )
print( getamounts('They paid me, in 2015, £5000') )
print( getamounts('£1,250,000.00. #loadsamoney') )
print( getamounts('The sum of £2.75 for a coffee which cost £2.75') )

print( getamounts('I got £1000. In two lots: £200 and £800') )
print( getamounts('Gaming the system: I got £1.25 then £50,000 on top') )
print( getamounts("I got 50000 you won't pick up on") )


{'amounts': [5000.0], 'currency': ['GBP']}
{'amounts': [5000.0], 'currency': ['GBP']}
{'amounts': [5001.0], 'currency': ['GBP']}
{'amounts': [5000.0], 'currency': ['GBP']}
{'amounts': [1250000.0], 'currency': ['GBP']}
{'amounts': [2.75, 2.75], 'currency': ['GBP', 'GBP']}
{'amounts': [1000.0, 200.0, 800.0], 'currency': ['GBP', 'GBP', 'GBP']}
{'amounts': [1.25, 50000.0], 'currency': ['GBP', 'GBP']}
{'amounts': [], 'currency': []}

Right - so we can pull out amounts, if preceded by a currency symbol. (We could perhaps also catch other numbers that don't look like years around the reporting period into an "other numbers" list for further investigation?) Let's start to think about generating some new items for our extracteditems record. We'll create a function that returns the maximum, summed and itemised amounts, which we can use variously if we want to go digging in the data.

We can also try to be clever, and where more than two items are listed, calculate the difference between the total sum and the maximum amount. If these are equal (or if we are being more elaborate, nearly equal) then we might want to check the text to see if the larger amount is specifying the sum of the smaller amounts (in which case, the total sum item is meaningless).

Where there are multiple amounts, we can create a serialised version of the list.


In [49]:
amounts = [100, 250.25, 1000]
#serialise the amounts - making sure they are represented as strings first
'::'.join([str(amount) for amount in amounts])


Out[49]:
'100::250.25::1000'

When doing the sums, we need to make sure that the sums are calculated on the same currencies. It's easier to do this using a tabular data representation built for doing spreadsheet-like operations. The pandas package is the go-to package for working with tabular datasets, conventionally imported using the name pd, so let's load it in.


In [50]:
#!~/anaconda3/bin/pip install pandas
import pandas as pd

We can generate a dataframe directly from a dict of lists, such as the one returned by getamounts(): the dict keys specify the column names, and the list items provide the column values, one row per item.


In [51]:
#Generate a test example dataframe
df = pd.DataFrame( getamounts('I got £1000. In two lots: £200 and £800. And then $500') )
df


Out[51]:
amounts currency
0 1000.0 GBP
1 200.0 GBP
2 800.0 GBP
3 500.0 USD

We can also filter the rows to show just the rows where one or more columns have specified values:


In [52]:
df[ df['currency']=='GBP' ]


Out[52]:
amounts currency
0 1000.0 GBP
1 200.0 GBP
2 800.0 GBP

We can check whether a dataframe, df, has at least one row by testing whether it is empty or not:


In [53]:
print( df.empty )
print( pd.DataFrame([]).empty )


False
True

We can also return a dataframe as a dict in a variety of orientations:


In [54]:
df.to_dict(orient = 'list')


Out[54]:
{'amounts': [1000.0, 200.0, 800.0, 500.0],
 'currency': ['GBP', 'GBP', 'GBP', 'USD']}

Now let's build a function to generate some amount related measures on our extracted data, by currency.


In [55]:
def getcheckamounts(string, numpattern = '{}{num:g}{?D}'):
    amounts = getamounts(string, numpattern)
    
    #Generate a dataframe of the amounts
    df_amounts = pd.DataFrame( amounts )
    
    #If there aren't any amounts, we can bail out now
    if df_amounts.empty: return {}

    response = {}
    
    #Generate a dataframe containing just the ukp amounts
    df_ukpamounts  = df_amounts[df_amounts['currency']=='GBP']
    
    response['maxamountGBP'] = max(df_ukpamounts['amounts']) if not df_ukpamounts.empty else 0
    response['numamountsGBP'] = len(df_ukpamounts)
    response['sumamountsGBP'] = sum(df_ukpamounts['amounts']) if not df_ukpamounts.empty else 0
    response['sumlessmaxGBP'] = response['sumamountsGBP'] - response['maxamountGBP'] if len(df_ukpamounts['amounts'])>2 else 0
    
    response['numamountsOther'] = len(df_amounts) - len(df_ukpamounts)
    
    #Record amounts for all currency items
    response['amounts'] = '::'.join([str(amount) for amount in df_amounts['amounts']])
    response['currencies'] = '::'.join([str(currency) for currency in df_amounts['currency']])

    return response

In [56]:
tests = ['I got £1.25 then £50,000 on top in 2015',
         '£1.25',
         'No money',
         'Four payments, £100, £200 and £400 to give £701 total, and then $5000 more' ]

for test in tests:
    print( test, getcheckamounts(test) , '\n')


I got £1.25 then £50,000 on top in 2015 {'maxamountGBP': 50000.0, 'numamountsGBP': 2, 'sumamountsGBP': 50001.25, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '1.25::50000.0', 'currencies': 'GBP::GBP'} 

£1.25 {'maxamountGBP': 1.25, 'numamountsGBP': 1, 'sumamountsGBP': 1.25, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '1.25', 'currencies': 'GBP'} 

No money {} 

Four payments, £100, £200 and £400 to give £701 total, and then $5000 more {'maxamountGBP': 701.0, 'numamountsGBP': 4, 'sumamountsGBP': 1401.0, 'sumlessmaxGBP': 700.0, 'numamountsOther': 1, 'amounts': '100.0::200.0::400.0::701.0::5000.0', 'currencies': 'GBP::GBP::GBP::GBP::USD'} 

This may look overly complex, but it is just a beginning if we are to try to build in logic that will detect amounts submitted, for whatever reason, in arbitrary ways.

An Aside - Foreign Currency Exchange Rates

If amounts are submitted in other currencies, we can use date information to calculate equivalent sterling amounts using historical exchange rate data.

The forex-python package provides one way of doing this.


In [57]:
#!~/anaconda3/bin/pip install forex-python

In [58]:
testdate = parsedate('February 10th 2015')
testdate


Out[58]:
datetime.datetime(2015, 2, 10, 0, 0)

In [59]:
from forex_python.converter import CurrencyRates
c = CurrencyRates()
#Convert $100 to UKP using exchange rate as of February, 2015
c.convert('USD', 'GBP', 100, testdate)


Out[59]:
65.681

Adding Amounts Data to the extracteditems record

Let's have a go at adding that information to the extracteditems record. The text we want to parse is in the txt field, and perhaps also in simplified form in the cleanertxt field.


In [60]:
extracteditems['txt']


Out[60]:
'£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'

In [61]:
# The .update() method updates the contents of the dict directly
extracteditems.update( getcheckamounts( extracteditems['txt']) )
extracteditems


Out[61]:
{'addr': 'Balmoral Park, Aberdeen AB12 3GY',
 'amounts': '2000.0',
 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition',
 'currencies': 'GBP',
 'date': '09 December 2016',
 'dateAccd': '8 December 2016',
 'dateAccd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'dateRxd': '8 December 2016',
 'dateRxd_f': datetime.datetime(2016, 12, 8, 0, 0),
 'date_f': datetime.datetime(2016, 12, 9, 0, 0),
 'maxamountGBP': 2000.0,
 'name': 'Balmoral Tanks Ltd',
 'numamountsGBP': 1,
 'numamountsOther': 0,
 'status': 'company, registration 300656',
 'sumamountsGBP': 2000.0,
 'sumlessmaxGBP': 0,
 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016'}

Finding More Structure From the Original Page

Looking back at the original page, we notice that the entries are grouped according to different sorts of interest, such as 2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation or 3. Gifts, benefits and hospitality from UK sources.

We can detect when we enter a section by detecting a paragraph that starts with one of these headings:


In [62]:
for p in ptags:
    #The .startswith() method accepts a tuple (enumeration of things in a pair of brackets) to check against
    #If the text startswith any of the strings in the tuple, the condition evaluates true for that text
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        print(p.text)


1. Employment and earnings
2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation
2. (b) Any other support not included in Category 2(a)
3. Gifts, benefits and hospitality from UK sources
4. Visits outside the UK
8. Miscellaneous

A Semi-Structured Parser for Employment Information

Looking through some of the member pages, there are other structural elements that we should be able to exploit. For example, many pages have content of the following form in the Employment and earnings section, which we should be able to pull out:

1. Employment and earnings

Payments from ComRes, 4 Millbank, London SW1P 3JA:
26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
4 April 2016, £75 for participating in Parliamentary Panel Survey. Hours: 20 mins. (Registered 03 August 2016)
20 May 2016, £100 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)

Payments from YouGov, 50 Featherstone St, London EC1Y 8RT:
13 Jan 2016, £50 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
4 Mar 2016, £30 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)

In [63]:
#Example text
ptags[2].text


Out[63]:
'13 December 2016, received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training. Hours: 16 hrs including travel and preparation. (Registered 19 December 2016)'

In [64]:
employmentpattern1 = '''{empdate},{txt2} Hours: {hours}(Registered {date})'''

In [65]:
todict2( employmentpattern1, ptags[2].text )


Out[65]:
{'date': '19 December 2016',
 'empdate': '13 December 2016',
 'hours': '16 hrs including travel and preparation. ',
 'txt2': ' received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training.'}

In [66]:
string = '26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)'
todict2( employmentpattern1,string )


Out[66]:
{'date': '03 August 2016',
 'empdate': '26 January 2016',
 'hours': '15 mins. ',
 'txt2': ' £75 for participating in Parliamentary Panel Survey.'}

In [67]:
employmentsubpattern1 = '''{emptxt} for {empfor} from {empfrom}'''
employmentsubpattern2 = '''{emptxt} from {empfrom} for {empfor}'''
employmentsubpattern3 = '''{emptxt} for {empfor}'''
employmentsubpattern4 = '''{emptxt} from {empfrom}'''

In [68]:
teststrings = ['£75 for participating in Parliamentary Panel Survey',
               '£75 from this or that org for participating in Parliamentary Panel Survey',
               '£75 for participating in Parliamentary Panel Survey from this or that org',
               '£75 from this or that org',
               '£75 from the Society for Whatever for doing a thing',
               '£75 from the Society for Whatever',
               '£75 for doing whatever from the Society for Whatever',
               '£75 from the Society for Whatever and for Whenever for doing whatever ']

for string in teststrings:
    test = todict2( employmentsubpattern1, string)
    if not test:
        test = todict2( employmentsubpattern2, string)
    if not test:
        test = todict2( employmentsubpattern3, string)
    if not test:
        test = todict2( employmentsubpattern4, string)
    print(string, test,'\n')


£75 for participating in Parliamentary Panel Survey {'emptxt': '£75', 'empfor': 'participating in Parliamentary Panel Survey'} 

£75 from this or that org for participating in Parliamentary Panel Survey {'emptxt': '£75', 'empfrom': 'this or that org', 'empfor': 'participating in Parliamentary Panel Survey'} 

£75 for participating in Parliamentary Panel Survey from this or that org {'emptxt': '£75', 'empfor': 'participating in Parliamentary Panel Survey', 'empfrom': 'this or that org'} 

£75 from this or that org {'emptxt': '£75', 'empfrom': 'this or that org'} 

£75 from the Society for Whatever for doing a thing {'emptxt': '£75', 'empfrom': 'the Society', 'empfor': 'Whatever for doing a thing'} 

£75 from the Society for Whatever {'emptxt': '£75', 'empfrom': 'the Society', 'empfor': 'Whatever'} 

£75 for doing whatever from the Society for Whatever {'emptxt': '£75', 'empfor': 'doing whatever', 'empfrom': 'the Society for Whatever'} 

£75 from the Society for Whatever and for Whenever for doing whatever  {'emptxt': '£75', 'empfrom': 'the Society', 'empfor': 'Whatever and for Whenever for doing whatever '} 

One of the issues we have here is identifying whether or not a for is part of the name.

If a for is extracted into a "for" attribute, we can perhaps assume that it needs pushing back into the "from" attribute. But things can get messy, as the above examples show.

So this is largely parked for now, other than to bludgeon the handling of a "for in for".


In [69]:
def forcatcher(record):
    #Try to catch the "Society for Whatever"
    #Also things like the Society for This and for That
    #Note that the parsed records use the keys 'empfor' and 'empfrom'
    if 'empfor' in record and 'empfrom' in record and ' for ' in record['empfor']:
        record['empfrom'] = ' for '.join([record['empfrom']] + record['empfor'].split('for')[:-1]).strip()
        record['empfor'] = record['empfor'].split('for')[-1].strip()
        
        #We can get rid of any double spaces by splitting and rejoining a string on a space
        record['empfrom'] = ' '.join(record['empfrom'].split())
        record['empfor'] = ' '.join(record['empfor'].split())

    return record

In [70]:
def getemploymentsubdetails(string):
    empdetail = todict2( employmentsubpattern1, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern2, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern3, string)
    if not empdetail:
        empdetail = todict2( employmentsubpattern4, string)
    return empdetail
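The try-one-pattern-then-the-next cascade above recurs for each family of patterns, so it could be factored into a small helper. A minimal sketch (the `first_match` name and the toy regex parser standing in for todict2 are illustrative assumptions, not part of the notebook's code):

```python
import re

def first_match(string, patterns, parser):
    """Try each pattern in order; return the first non-empty parse, else {}."""
    for pattern in patterns:
        result = parser(pattern, string)
        if result:
            return result
    return {}

#Toy parser: a regex with named groups standing in for todict2
def regex_parser(pattern, string):
    m = re.fullmatch(pattern, string)
    return m.groupdict() if m else {}

patterns = [r'(?P<amt>£\d+) from (?P<src>.+)',
            r'(?P<amt>£\d+) for (?P<why>.+)']
first_match('£75 for an online survey', patterns, regex_parser)
# → {'amt': '£75', 'why': 'an online survey'}
```

The pattern list is tried in declaration order, so the most specific patterns should come first, just as in the hand-rolled cascades.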

In [71]:
for string in teststrings:
    test = getemploymentsubdetails(string)
    print(string, forcatcher(test),'\n')


£75 for participating in Parliamentary Panel Survey {'emptxt': '£75', 'empfor': 'participating in Parliamentary Panel Survey'} 

£75 from this or that org for participating in Parliamentary Panel Survey {'emptxt': '£75', 'empfrom': 'this or that org', 'empfor': 'participating in Parliamentary Panel Survey'} 

£75 for participating in Parliamentary Panel Survey from this or that org {'emptxt': '£75', 'empfor': 'participating in Parliamentary Panel Survey', 'empfrom': 'this or that org'} 

£75 from this or that org {'emptxt': '£75', 'empfrom': 'this or that org'} 

£75 from the Society for Whatever for doing a thing {'emptxt': '£75', 'empfrom': 'the Society for Whatever', 'empfor': 'doing a thing'} 

£75 from the Society for Whatever {'emptxt': '£75', 'empfrom': 'the Society', 'empfor': 'Whatever'} 

£75 for doing whatever from the Society for Whatever {'emptxt': '£75', 'empfor': 'doing whatever', 'empfrom': 'the Society for Whatever'} 

£75 from the Society for Whatever and for Whenever for doing whatever  {'emptxt': '£75', 'empfrom': 'the Society for Whatever and for Whenever', 'empfor': 'doing whatever'} 


In [72]:
def getemploymentdetails(string):
    employmentdetails = todict2( employmentpattern1, string )
    #Enrich with further extractions
    if employmentdetails:
        employmentdetails.update( getemploymentsubdetails(employmentdetails['txt2']) )
        employmentdetails.update( getcheckamounts(employmentdetails['txt2']) )
        employmentdetails = forcatcher(employmentdetails)
    return employmentdetails

In [73]:
getemploymentdetails(ptags[2].text)


Out[73]:
{'amounts': '2500.0',
 'currencies': 'GBP',
 'date': '19 December 2016',
 'empdate': '13 December 2016',
 'empfor': 'media and communications training.',
 'empfrom': 'Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH,',
 'emptxt': ' received £2,500',
 'hours': '16 hrs including travel and preparation. ',
 'maxamountGBP': 2500.0,
 'numamountsGBP': 1,
 'numamountsOther': 0,
 'sumamountsGBP': 2500.0,
 'sumlessmaxGBP': 0,
 'txt2': ' received £2,500 from Hampshire Cricket, The Ageas Bowl, Botley Road, West End, Southampton SO30 3XH, for media and communications training.'}

Inspection of some other records suggests that sometimes a nested structure might be used to represent payments from the same source:


In [74]:
url2='https://publications.parliament.uk/pa/cm/cmregmem/170502/gray_james.htm'
r2 = session.get(url2)
ptags2 = r2.html.find('#mainTextBlock > p')

In [75]:
for i in range(2,15):
    print(ptags2[i].text)


Payments from ComRes, 4 Millbank, London SW1P 3JA:
26 January 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
4 April 2016, £75 for participating in Parliamentary Panel Survey. Hours: 20 mins. (Registered 03 August 2016)
20 May 2016, £100 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
20 July 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 03 August 2016)
31 October 2016, £75 for participating in Parliamentary Panel Survey. Hours: 15 mins. (Registered 31 October 2016)
16 December 2016, payment of £75 for taking part in an online survey. Hours: 15 mins. (Registered 16 December 2016)
Payments from YouGov, 50 Featherstone St, London EC1Y 8RT:
13 Jan 2016, £50 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
4 Mar 2016, £30 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)
4 Apr 2016, £20 for participating in an online survey. Hours: 10 mins. (Registered 05 December 2016)
17 May 2016, £80 for participating in an online survey. Hours: 20 mins. (Registered 05 December 2016)
28 Jun 2016, £40 for participating in an online survey. Hours: 15 mins. (Registered 05 December 2016)

In [76]:
df = pd.DataFrame()
for i in range(18,24):
    empdetails = getemploymentdetails(ptags2[i].text)
    if empdetails:
        df = pd.concat([df,pd.DataFrame([empdetails])])
    else:
        #Add a row containing unparsed text
        dummy = {'txt':ptags2[i].text}
        df = pd.concat([df,pd.DataFrame([dummy])])
    
#Reset the index
df.reset_index(drop=True, inplace=True)
df[['txt','txt2']]


Out[76]:
txt txt2
0 16 December 2016, £60 for taking part in an on... NaN
1 Payments from Ipsos MORI, 79-81 Borough Road, ... NaN
2 NaN payment of £200 for participating in Ipsos MO...
3 NaN payment of £200 for participating in Ipsos MO...
4 Payments from Populus, 10 Northburgh Street, L... NaN
5 NaN £50 for participating in a research project.

Inspecting the dataframe, we notice that we can "fill down" on the dataframe to generate a string that might provide information about who made a payment.


In [77]:
#Create a new working column
df['txt4'] = df['txt']
#Fill down
df['txt4'] = df['txt4'].ffill()
df['txt4']


Out[77]:
0    16 December 2016, £60 for taking part in an on...
1    Payments from Ipsos MORI, 79-81 Borough Road, ...
2    Payments from Ipsos MORI, 79-81 Borough Road, ...
3    Payments from Ipsos MORI, 79-81 Borough Road, ...
4    Payments from Populus, 10 Northburgh Street, L...
5    Payments from Populus, 10 Northburgh Street, L...
Name: txt4, dtype: object

In [78]:
emppaymentsubpattern1 = '''{} from {empfromsub}, {empaddr}'''
emppaymentsubpattern2 = '''{} via {empfromsub}, {empaddr}'''
emppaymentsubpattern3 = '''{} from {empfromsub}'''
emppaymentsubpattern4 = '''{} via {empfromsub}'''

In [79]:
def getemppaymentsubdetails(string):
    #Check that we have a valid string to parse, else return an empty dict
    if pd.isnull(string): return {}
    
    emppaymentdetail = todict2( emppaymentsubpattern1, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern2, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern3, string)
    if not emppaymentdetail:
        emppaymentdetail = todict2( emppaymentsubpattern4, string)
    return emppaymentdetail

In [80]:
df['txt4'].apply(getemppaymentsubdetails)


Out[80]:
0                                                   {}
1    {'empfromsub': 'Ipsos MORI', 'empaddr': '79-81...
2    {'empfromsub': 'Ipsos MORI', 'empaddr': '79-81...
3    {'empfromsub': 'Ipsos MORI', 'empaddr': '79-81...
4    {'empfromsub': 'Populus', 'empaddr': '10 North...
5    {'empfromsub': 'Populus', 'empaddr': '10 North...
Name: txt4, dtype: object

We can then expand the column that contains the extracted dicts across new columns.


In [81]:
df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series)


Out[81]:
empaddr empfromsub
0 NaN NaN
1 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
2 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
3 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
4 10 Northburgh Street, London EC1V 0AT: Populus
5 10 Northburgh Street, London EC1V 0AT: Populus

In [82]:
df = df.join( df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series))
df


Out[82]:
amounts currencies date empdate empfor emptxt hours maxamountGBP numamountsGBP numamountsOther sumamountsGBP sumlessmaxGBP txt txt2 txt4 empaddr empfromsub
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 16 December 2016, £60 for taking part in an on... NaN 16 December 2016, £60 for taking part in an on... NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Payments from Ipsos MORI, 79-81 Borough Road, ... NaN Payments from Ipsos MORI, 79-81 Borough Road, ... 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
2 200.0 GBP 02 August 2016 2 August 2016 participating in Ipsos MORI's summer survey. payment of £200 30 mins. 200.0 1.0 0.0 200.0 0.0 NaN payment of £200 for participating in Ipsos MO... Payments from Ipsos MORI, 79-81 Borough Road, ... 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
3 200.0 GBP 20 March 2017 20 March 2017 participating in Ipsos MORI’s winter survey. payment of £200 30 mins. 200.0 1.0 0.0 200.0 0.0 NaN payment of £200 for participating in Ipsos MO... Payments from Ipsos MORI, 79-81 Borough Road, ... 79-81 Borough Road, London SE1 1FY for opinion... Ipsos MORI
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Payments from Populus, 10 Northburgh Street, L... NaN Payments from Populus, 10 Northburgh Street, L... 10 Northburgh Street, London EC1V 0AT: Populus
5 50.0 GBP 03 August 2016 23 January 2016 participating in a research project. £50 20 mins. 50.0 1.0 0.0 50.0 0.0 NaN £50 for participating in a research project. Payments from Populus, 10 Northburgh Street, L... 10 Northburgh Street, London EC1V 0AT: Populus

We'll be able to make use of this later as a post-processor on a dataframe generated from entries on the whole page.
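One way that post-processing step might be packaged for reuse is as a small function that fills down a sparse "header" column and then expands the extracted dicts (a sketch; `fill_down_and_extract` is a hypothetical helper name, and the toy extractor stands in for getemppaymentsubdetails):

```python
import pandas as pd

def fill_down_and_extract(df, textcol, extractor):
    """Fill a sparse 'header' column down over the rows beneath it,
    then expand the dicts returned by the extractor into new columns."""
    out = df.copy()
    out[textcol + '_filled'] = out[textcol].ffill()
    details = out[textcol + '_filled'].apply(extractor).apply(pd.Series)
    return out.join(details)

#Toy extractor standing in for getemppaymentsubdetails
def toy_extractor(s):
    if pd.isnull(s) or ' from ' not in s:
        return {}
    return {'src': s.split(' from ')[-1].rstrip(':')}

demo = pd.DataFrame({'txt': ['Payments from ComRes:', None, None]})
fill_down_and_extract(demo, 'txt', toy_extractor)
```

Keeping the filled column separate from the original means we can always tell which rows actually carried the header text and which inherited it.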

A Structured Parser for Visits outside the UK

The section recording visits outside the UK looks as if it may have a regular structure:

4. Visits outside the UK

Name of donor: The Mamont Foundation
Address of donor: c/o Rothschild Trust Guernsey Ltd, PO Box 472, St Julian’s Court, St Julian’s Avenue, St Peter Port GY1 6AX, Guernsey, Channel Islands
Estimate of the probable value (or amount of any donation): transport, accommodation, food and drink at an estimated cost of £1,500
Destination of visit: Arkhangelsk
Date of visit: 28 – 31 March 2017
Purpose of visit: To attend the ‘Arctic: Territory of Dialogue’ International Arctic Forum.
(Registered 20 April 2017)

In [83]:
#Example text
ptags[23].text


Out[83]:
'Name of donor: HM Government of Gibraltar\nAddress of donor: Gibraltar House, 150 Strand, London WC2R 1JA\nAmount of donation (or estimate of the probable value): airfares and accommodation for a member of staff £1,296 and myself £870, total value of £2,166\nDestination of visit: Gibraltar\nDate of visit: 8-10 September 2016\nPurpose of visit: As part of the Gibraltar APPG, invited by HM Government of Gibraltar for National Day celebrations.\n(Registered 03 October 2016; updated 04 October 2016)'

By inspection of unparsed strings, there may be several formats for recording the overseas visit entries. We can capture these using a fallback hierarchy that tests each in turn.


In [84]:
#Note the {} fudge to account for Date/Dates
outsideukvisitpattern1 = '''Name of donor: {name}\nAddress of donor: {addr}\nEstimate of the probable value (or amount of any donation): {visitestimate}\nDestination of visit: {visitDest}\nDat{} of visit: {visitdates}\nPurpose of visit: {visitpurpose}\n(Registered {date})'''
outsideukvisitpattern2 = '''Name of donor: {name}\nAddress of donor: {addr}\nAmount of donation (or estimate of the probable value): {visitestimate}\nDestination of visit: {visitDest}\nDat{} of visit: {visitdates}\nPurpose of visit: {visitpurpose}\n(Registered {date})'''

def getoutsideUKvisit(string):
    visit = todict2(outsideukvisitpattern1, string)
    if not visit:
         visit = todict2(outsideukvisitpattern2, string)
    if visit:
        visit.update( getcheckamounts(visit['visitestimate']))
    return visit

In [85]:
for i in [23, 24, 25]:
    string = ptags[i].text
    print(string, getoutsideUKvisit(string), '\n')


Name of donor: HM Government of Gibraltar
Address of donor: Gibraltar House, 150 Strand, London WC2R 1JA
Amount of donation (or estimate of the probable value): airfares and accommodation for a member of staff £1,296 and myself £870, total value of £2,166
Destination of visit: Gibraltar
Date of visit: 8-10 September 2016
Purpose of visit: As part of the Gibraltar APPG, invited by HM Government of Gibraltar for National Day celebrations.
(Registered 03 October 2016; updated 04 October 2016) {'name': 'HM Government of Gibraltar', 'addr': 'Gibraltar House, 150 Strand, London WC2R 1JA', 'visitestimate': 'airfares and accommodation for a member of staff £1,296 and myself £870, total value of £2,166', 'visitDest': 'Gibraltar', 'visitdates': '8-10 September 2016', 'visitpurpose': 'As part of the Gibraltar APPG, invited by HM Government of Gibraltar for National Day celebrations.', 'date': '03 October 2016; updated 04 October 2016', 'maxamountGBP': 2166.0, 'numamountsGBP': 3, 'sumamountsGBP': 4332.0, 'sumlessmaxGBP': 2166.0, 'numamountsOther': 0, 'amounts': '1296.0::870.0::2166.0', 'currencies': 'GBP::GBP::GBP'} 

Name of donor: (1) Labour Friends of Israel (2) Israeli Ministry of Foreign Affairs
Address of donor: (1) BM LFI, London WC1N 3XX (2) 9 Yitzhak Rabin Blvd., Kiryat Ben-Gurion, Jerusalem 9103001
Estimate of the probable value (or amount of any donation): (1) Flights, accommodation and subsistence with an estimated value of £1,250 (2) Hospitality with a value of £125.
Destination of visit: Israel and the Palestinian Territories
Dates of visit: 14-16 November 2016
Purpose of visit: Fact finding visit. Meeting with Israeli and Palestinian co-existence campaigners, politicians, academics, journalists and diplomats.
(Registered 09 December 2016) {'name': '(1) Labour Friends of Israel (2) Israeli Ministry of Foreign Affairs', 'addr': '(1) BM LFI, London WC1N 3XX (2) 9 Yitzhak Rabin Blvd., Kiryat Ben-Gurion, Jerusalem 9103001', 'visitestimate': '(1) Flights, accommodation and subsistence with an estimated value of £1,250 (2) Hospitality with a value of £125.', 'visitDest': 'Israel and the Palestinian Territories', 'visitdates': '14-16 November 2016', 'visitpurpose': 'Fact finding visit. Meeting with Israeli and Palestinian co-existence campaigners, politicians, academics, journalists and diplomats.', 'date': '09 December 2016', 'maxamountGBP': 1250.0, 'numamountsGBP': 2, 'sumamountsGBP': 1375.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '1250.0::125.0', 'currencies': 'GBP::GBP'} 

8. Miscellaneous {} 

Making a tabular dataset

Well it looks like we're scraping something!

Let's see if we can now tidy that up a bit and put it into a tabular dataframe from the pandas package.

What we're going to do is add each record as a row to a dataframe using the pandas concat() function.
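As a minimal illustration of that row-appending pattern (the records here are made up for the example):

```python
import pandas as pd

df = pd.DataFrame()
for record in [{'name': 'A', 'amount': 100}, {'name': 'B', 'amount': 250}]:
    #Wrap each dict in a one-row dataframe and append it to the accumulator
    df = pd.concat([df, pd.DataFrame([record])], ignore_index=True)
df
#   name  amount
# 0    A     100
# 1    B     250
```

Passing ignore_index=True keeps the row index running 0, 1, 2, ...; without it, each appended row keeps its own index of 0, which is why we reset the index elsewhere in this notebook.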

Putting the Bits Together

So now we pretty much have all the bits we need in order to scrape a register page for a particular member. Let's start to put them together and see if we can make a dataset, for that member, from the pieces we've assembled above.


In [87]:
section = ''

#iterate through the paragraphs
for p in ptags[:15]:
    #Check to see if we're in a new section. If so, capture the section
    if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.')):
        section = p.text
        
    #Do the preliminary parsing of a paragraph
    extracteditems = todict2(pattern, p.text)
    
    #Identify the section
    extracteditems['section'] = section
    
    #Look for the data - checking first there's a txt tag that's been extracted...
    if extracteditems and 'txt' in extracteditems:
        #Dates
        todict2(datepattern, extracteditems['txt'], extracteditems)
        #Get a simmplified version of the text string, without dates, to potentially make life easier in the future
        extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
        #Dateify any dates
        parsedates(extracteditems)
        
        #Extract any company numbers that are declared
        extracteditems.update( companynumber(extracteditems['status']) )
        
        #Money
        extracteditems.update( getcheckamounts( extracteditems['txt']) )
    
        print(extracteditems)


{'name': 'VGC Group', 'addr': 'Cardinal House, Bury Street, Ruislip HA4 7GD', 'txt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'status': 'company, registration 5741473', 'date': '04 May 2016', 'section': '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'cleanertxt': '£1,800 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'date_f': datetime.datetime(2016, 5, 4, 0, 0), 'status2': 'company', 'cn': '5741473', 'maxamountGBP': 1800.0, 'numamountsGBP': 1, 'sumamountsGBP': 1800.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '1800.0', 'currencies': 'GBP'}
{'name': 'Edward Maurice Watkins', 'addr': 'private', 'txt': '£755 in purchasing tickets for a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally. (Registered with donation below.)', 'status': 'individual', 'date': '23 May 2016', 'section': '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'cleanertxt': '£755 in purchasing tickets for a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally. (Registered with donation below.)', 'date_f': datetime.datetime(2016, 5, 23, 0, 0), 'maxamountGBP': 755.0, 'numamountsGBP': 1, 'sumamountsGBP': 755.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '755.0', 'currencies': 'GBP'}
{'name': 'Edward Maurice Watkins', 'addr': 'private', 'txt': '£2,000 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally. (Registered with donation above.)', 'status': 'individual', 'date': '23 May 2016', 'section': '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'cleanertxt': '£2,000 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally. (Registered with donation above.)', 'date_f': datetime.datetime(2016, 5, 23, 0, 0), 'maxamountGBP': 2000.0, 'numamountsGBP': 1, 'sumamountsGBP': 2000.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '2000.0', 'currencies': 'GBP'}
{'name': 'Chris Chenn', 'addr': 'private', 'txt': '£1,900 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'status': 'individual', 'date': '22 September 2016', 'section': '2. (a) Support linked to an MP but received by a local party organisation or indirectly via a central party organisation', 'cleanertxt': '£1,900 in a successful auction bid at a fundraising dinner for Barnsley East CLP and the office of another MP, the profits from which will be divided equally.', 'date_f': datetime.datetime(2016, 9, 22, 0, 0), 'maxamountGBP': 1900.0, 'numamountsGBP': 1, 'sumamountsGBP': 1900.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '1900.0', 'currencies': 'GBP'}
{'name': 'Balmoral Tanks Ltd', 'addr': 'Balmoral Park, Aberdeen AB12 3GY', 'txt': '£2,000 to support my Primary School Christmas Card Competition\nDate received: 8 December 2016\nDate accepted: 8 December 2016', 'status': 'company, registration 300656', 'date': '09 December 2016', 'section': '2. (b) Any other support not included in Category 2(a)', 'dateRxd': '8 December 2016', 'dateAccd': '8 December 2016', 'cleanertxt': '£2,000 to support my Primary School Christmas Card Competition', 'date_f': datetime.datetime(2016, 12, 9, 0, 0), 'dateRxd_f': datetime.datetime(2016, 12, 8, 0, 0), 'dateAccd_f': datetime.datetime(2016, 12, 8, 0, 0), 'status2': 'company', 'cn': '300656', 'maxamountGBP': 2000.0, 'numamountsGBP': 1, 'sumamountsGBP': 2000.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '2000.0', 'currencies': 'GBP'}
{'name': 'UK Music', 'addr': '4th Floor, 49 Whitehall, London SW1A 2BX', 'txt': 'ticket and hospitality at the Ivor Novello Awards, value £444\nDate received: 19 May 2016\nDate accepted: 19 May 2016', 'status': 'company, registration no 3245288', 'date': '27 May 2016', 'section': '3. Gifts, benefits and hospitality from UK sources', 'dateRxd': '19 May 2016', 'dateAccd': '19 May 2016', 'cleanertxt': 'ticket and hospitality at the Ivor Novello Awards, value £444', 'date_f': datetime.datetime(2016, 5, 27, 0, 0), 'dateRxd_f': datetime.datetime(2016, 5, 19, 0, 0), 'dateAccd_f': datetime.datetime(2016, 5, 19, 0, 0), 'status2': 'company', 'cn': 'no 3245288', 'maxamountGBP': 444.0, 'numamountsGBP': 1, 'sumamountsGBP': 444.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '444.0', 'currencies': 'GBP'}
{'name': 'The Yorkshire County Cricket Club', 'addr': 'Headingley Cricket Ground, Leeds LS6 3DP', 'txt': 'two tickets and hospitality at Yorkshire County Cricket Club to the value of £200, a total of £400\nDate received: 21 May 2016\nDate accepted: 21 May 2016', 'status': 'company, registration IP28929R', 'date': '07 June 2016', 'section': '3. Gifts, benefits and hospitality from UK sources', 'dateRxd': '21 May 2016', 'dateAccd': '21 May 2016', 'cleanertxt': 'two tickets and hospitality at Yorkshire County Cricket Club to the value of £200, a total of £400', 'date_f': datetime.datetime(2016, 6, 7, 0, 0), 'dateRxd_f': datetime.datetime(2016, 5, 21, 0, 0), 'dateAccd_f': datetime.datetime(2016, 5, 21, 0, 0), 'status2': 'company', 'cn': 'IP28929R', 'maxamountGBP': 400.0, 'numamountsGBP': 2, 'sumamountsGBP': 600.0, 'sumlessmaxGBP': 0, 'numamountsOther': 0, 'amounts': '200.0::400.0', 'currencies': 'GBP::GBP'}

We can build in some optimisation by moving straight on to the next paragraph, using the continue statement, once we have completely parsed the current one.


In [88]:
for i in ['a','b','c','d']:
    if i=='b' : continue
    print(i)


a
c
d

One of the things our parsing to date has failed to do is capture the paragraphs / entries that we have not been able to parse. Keeping a list of these is important because it shows us where our parsers fall short. We will build something into our processor to capture these items, grabbing any financial amounts, if we can detect any, as before.


In [89]:
#To make the code reusable, let's wrap it in a function:

def scrapeData(ptags, omitFirstRow=True, saveFullText=True):

    #The code is pretty much as it was before...
    #...except that at first we create an empty dataframe
    df = pd.DataFrame()

    section = ''
    
    start = 1 if omitFirstRow else 0
    
    mpname = ptags[0].text
    
    #iterate through the paragraphs, omitting the first one, which is the member name
    for p in ptags[start:]:
        #Check to see if we're in a new section. If so, capture the section
        if p.text and p.text.startswith(('1.','2.','3.','4.','5.','6.','7.','8.', '9.', '10.')):
            section = p.text
            continue
        
        #Do the preliminary parsing of a paragraph
        extracteditems = todict2(pattern, p.text)

        if extracteditems:
            #Dates
            todict2(datepattern, extracteditems['txt'], extracteditems)
            #Get a simplified version of the text string, without dates,
            #to potentially make life easier in the future
            extracteditems['cleanertxt'] = extracteditems['txt'].split('\nDate')[0]
            #Dateify any dates
            parsedates(extracteditems)

            #Extract any company numbers that are declared
            extracteditems.update( companynumber(extracteditems['status']) )
        
            #Money
            extracteditems.update( getcheckamounts( extracteditems['txt']) )

        #Visits outside UK
        if not extracteditems:
            extracteditems = getoutsideUKvisit(p.text)
        
        #Employment
        if not extracteditems:
            extracteditems = getemploymentdetails(p.text)
        
        if extracteditems:
            #Add a column to say we have structurally parsed the row
            extracteditems['parsed'] = True
        elif p.text:
            #This is the catch all for unparsed rows that contain content
            #Grab any amounts
            extracteditems = getcheckamounts( p.text )
            
            #Keep a record of the text
            extracteditems['txt'] = p.text
            
            #Record that we haven't run this through a structured parser
            extracteditems['parsed'] = False
        else:
            #Nothing to record for empty paragraphs
            continue
        
        #Identify the section
        extracteditems['section'] = section
        
        #Optionally keep the full, unparsed text of the paragraph
        if saveFullText:
            extracteditems['fulltext'] = p.text
        
        #...and then we add each record to the dataframe
        df = pd.concat([df,pd.DataFrame([extracteditems])])

    #We need to record which MP it was...
    df['mpname'] = mpname
    
    #Return the dataframe... resetting the index for the whole dataframe
    #We can ignore (drop) the original index:
    #  it contains "dummy" values created when the single row dataframe was made from each record
    return df.reset_index(drop=True)
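The repeated "try a parser, then continue" blocks in scrapeData could be collapsed into a single loop over a list of parser functions. A minimal sketch of that pattern follows; parse_visit and parse_employment here are illustrative stand-ins, not the notebook's real getoutsideUKvisit and getemploymentdetails, though the convention (return a dict of extracted fields on a match, None otherwise) is the same.

```python
import pandas as pd

# Stand-in parsers: each returns a dict of extracted fields on a match,
# or None if the text doesn't fit its pattern.
def parse_visit(text):
    if 'Visit' in text:
        return {'visitDest': text.split('Visit to ')[-1]}
    return None

def parse_employment(text):
    if 'employ' in text.lower():
        return {'empdetail': text}
    return None

PARSERS = [parse_visit, parse_employment]

def parse_entry(text, section):
    """Try each structured parser in turn; fall back to an unparsed record."""
    for parser in PARSERS:
        extracted = parser(text)
        if extracted:
            extracted.update(section=section, parsed=True)
            return extracted
    # Catch-all for rows that no structured parser recognised
    return {'section': section, 'txt': text, 'parsed': False}

rows = [parse_entry(t, '4. Visits outside the UK')
        for t in ['Visit to Gibraltar', 'Some unmatched text']]
df = pd.DataFrame(rows)
```

Adding support for a new entry type then just means appending one function to the PARSERS list, rather than copying another if/continue block.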

In [90]:
#Let's see if it works...
tmp = scrapeData(ptags)
tmp.head(), tmp.tail()


Out[90]:
(                                           addr amounts  \
 0                                           NaN  2500.0   
 1  Cardinal House, Bury Street, Ruislip HA4 7GD  1800.0   
 2                                       private   755.0   
 3                                       private  2000.0   
 4                                       private  1900.0   
 
                                           cleanertxt       cn currencies  \
 0                                                NaN      NaN        GBP   
 1  £1,800 in a successful auction bid at a fundra...  5741473        GBP   
 2  £755 in purchasing tickets for a fundraising d...      NaN        GBP   
 3  £2,000 in a successful auction bid at a fundra...      NaN        GBP   
 4  £1,900 in a successful auction bid at a fundra...      NaN        GBP   
 
                 date dateAccd dateAccd_f dateRxd dateRxd_f  \
 0   19 December 2016      NaN        NaT     NaN       NaT   
 1        04 May 2016      NaN        NaT     NaN       NaT   
 2        23 May 2016      NaN        NaT     NaN       NaT   
 3        23 May 2016      NaN        NaT     NaN       NaT   
 4  22 September 2016      NaN        NaT     NaN       NaT   
 
                 ...                 status2 sumamountsGBP sumlessmaxGBP  \
 0               ...                     NaN        2500.0           0.0   
 1               ...                 company        1800.0           0.0   
 2               ...                     NaN         755.0           0.0   
 3               ...                     NaN        2000.0           0.0   
 4               ...                     NaN        1900.0           0.0   
 
                                                  txt  \
 0                                                NaN   
 1  £1,800 in a successful auction bid at a fundra...   
 2  £755 in purchasing tickets for a fundraising d...   
 3  £2,000 in a successful auction bid at a fundra...   
 4  £1,900 in a successful auction bid at a fundra...   
 
                                                 txt2 visitDest visitdates  \
 0   received £2,500 from Hampshire Cricket, The A...       NaN        NaN   
 1                                                NaN       NaN        NaN   
 2                                                NaN       NaN        NaN   
 3                                                NaN       NaN        NaN   
 4                                                NaN       NaN        NaN   
 
    visitestimate visitpurpose                           mpname  
 0            NaN          NaN  Dugher, Michael (Barnsley East)  
 1            NaN          NaN  Dugher, Michael (Barnsley East)  
 2            NaN          NaN  Dugher, Michael (Barnsley East)  
 3            NaN          NaN  Dugher, Michael (Barnsley East)  
 4            NaN          NaN  Dugher, Michael (Barnsley East)  
 
 [5 rows x 34 columns],
                                                  addr                amounts  \
 14                                                NaN                  350.0   
 15       Gibraltar House, 150 Strand, London WC2R 1JA  1296.0::870.0::2166.0   
 16  (1) BM LFI, London WC1N 3XX (2) 9 Yitzhak Rabi...          1250.0::125.0   
 17                                                NaN                    NaN   
 18                                                NaN                    NaN   
 
    cleanertxt   cn     currencies                                      date  \
 14        NaN  NaN            GBP                                       NaN   
 15        NaN  NaN  GBP::GBP::GBP  03 October 2016; updated 04 October 2016   
 16        NaN  NaN       GBP::GBP                          09 December 2016   
 17        NaN  NaN            NaN                                       NaN   
 18        NaN  NaN            NaN                                       NaN   
 
    dateAccd dateAccd_f dateRxd dateRxd_f               ...                 \
 14      NaN        NaT     NaN       NaT               ...                  
 15      NaN        NaT     NaN       NaT               ...                  
 16      NaN        NaT     NaN       NaT               ...                  
 17      NaN        NaT     NaN       NaT               ...                  
 18      NaN        NaT     NaN       NaT               ...                  
 
    status2 sumamountsGBP sumlessmaxGBP  \
 14     NaN         350.0           0.0   
 15     NaN        4332.0        2166.0   
 16     NaN        1375.0           0.0   
 17     NaN           NaN           NaN   
 18     NaN           NaN           NaN   
 
                                                   txt txt2  \
 14  Name of donor: Ladbrokes Coral Group plc\nAddr...  NaN   
 15                                                NaN  NaN   
 16                                                NaN  NaN   
 17  Received on 22 March 2017, payment from Associ...  NaN   
 18  I employ my wife, Joanna Dugher, as part-time ...  NaN   
 
                                  visitDest           visitdates  \
 14                                     NaN                  NaN   
 15                               Gibraltar  8-10 September 2016   
 16  Israel and the Palestinian Territories  14-16 November 2016   
 17                                     NaN                  NaN   
 18                                     NaN                  NaN   
 
                                         visitestimate  \
 14                                                NaN   
 15  airfares and accommodation for a member of sta...   
 16  (1) Flights, accommodation and subsistence wit...   
 17                                                NaN   
 18                                                NaN   
 
                                          visitpurpose  \
 14                                                NaN   
 15  As part of the Gibraltar APPG, invited by HM G...   
 16  Fact finding visit. Meeting with Israeli and P...   
 17                                                NaN   
 18                                                NaN   
 
                              mpname  
 14  Dugher, Michael (Barnsley East)  
 15  Dugher, Michael (Barnsley East)  
 16  Dugher, Michael (Barnsley East)  
 17  Dugher, Michael (Barnsley East)  
 18  Dugher, Michael (Barnsley East)  
 
 [5 rows x 34 columns])

If everything has gone to plan, we should be able to scrape the data from another page...


In [91]:
scrapeData(ptags2).head()


Out[91]:
addr amounts cleanertxt cn currencies date dateAccd dateAccd_f dateRxd dateRxd_f ... status2 sumamountsGBP sumlessmaxGBP txt txt2 visitDest visitdates visitestimate visitpurpose mpname
0 NaN NaN NaN NaN NaN NaN NaN NaT NaN NaT ... NaN NaN NaN Payments from ComRes, 4 Millbank, London SW1P ... NaN NaN NaN NaN NaN Gray, James (North Wiltshire)
1 NaN 75.0 NaN NaN GBP 03 August 2016 NaN NaT NaN NaT ... NaN 75.0 0.0 NaN £75 for participating in Parliamentary Panel ... NaN NaN NaN NaN Gray, James (North Wiltshire)
2 NaN 75.0 NaN NaN GBP 03 August 2016 NaN NaT NaN NaT ... NaN 75.0 0.0 NaN £75 for participating in Parliamentary Panel ... NaN NaN NaN NaN Gray, James (North Wiltshire)
3 NaN 100.0 NaN NaN GBP 03 August 2016 NaN NaT NaN NaT ... NaN 100.0 0.0 NaN £100 for participating in Parliamentary Panel... NaN NaN NaN NaN Gray, James (North Wiltshire)
4 NaN 75.0 NaN NaN GBP 03 August 2016 NaN NaT NaN NaT ... NaN 75.0 0.0 NaN £75 for participating in Parliamentary Panel ... NaN NaN NaN NaN Gray, James (North Wiltshire)

5 rows × 33 columns

In this case, we see how we can use the table as a whole to extract more information, as for example with the employment details that were extracted as a nested list.


In [92]:
# It might be tidier to run this on the dataframe as we build it up
#That is, as soon as we have processed the employment section, section 1.

def scrapeData2(ptags, omitFirstRow=True):
    df = scrapeData(ptags, omitFirstRow)
    
    #Postprocessor - add in the payment info
    if not df.empty and 'txt' in df.columns:
        df['txt4'] = df['txt']
        #Fill down
        df['txt4'] = df.loc[df['section'].str.startswith('1.')]['txt4'].fillna(method='ffill')
        tmp= df['txt4'].apply(getemppaymentsubdetails).apply(pd.Series)
        #This is getting Frankensteinian now.... :-(
        if 'empfromsub' in tmp.columns:
            tmp['empfromsubfilldown'] = tmp['empfromsub'].fillna(method='ffill')
        #The parsing can be a bit flaky, so we need to trap cases where we don't extract anything
        if not tmp.empty:
            df = df.join(tmp)
    
    return df

scrapeData2(ptags2)[['txt4', 'txt2', 'txt','empaddr','empfromsub']].head()


Out[92]:
txt4 txt2 txt empaddr empfromsub
0 Payments from ComRes, 4 Millbank, London SW1P ... NaN Payments from ComRes, 4 Millbank, London SW1P ... 4 Millbank, London SW1P 3JA: ComRes
1 Payments from ComRes, 4 Millbank, London SW1P ... £75 for participating in Parliamentary Panel ... NaN 4 Millbank, London SW1P 3JA: ComRes
2 Payments from ComRes, 4 Millbank, London SW1P ... £75 for participating in Parliamentary Panel ... NaN 4 Millbank, London SW1P 3JA: ComRes
3 Payments from ComRes, 4 Millbank, London SW1P ... £100 for participating in Parliamentary Panel... NaN 4 Millbank, London SW1P 3JA: ComRes
4 Payments from ComRes, 4 Millbank, London SW1P ... £75 for participating in Parliamentary Panel ... NaN 4 Millbank, London SW1P 3JA: ComRes

Further Work on the Single Page Scrape

As we scrape more pages, we may be able to identify more common structures in different sections of the register and construct additional structured parsers for them.
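New structured parsers can follow the same convention as the existing ones: return a dict of extracted fields if the text matches, None otherwise. As a sketch of what one might look like, here is a hypothetical parser for a shareholding-style entry; the regular expression and field names are guesses at one possible entry shape, not a transcription of the register's actual format.

```python
import re

def getshareholding(text):
    """Illustrative parser for a shareholding-style register entry.

    The pattern and field names here are assumptions for the sake of
    example, not taken from the register rules; a real parser would be
    built by inspecting actual entries in the relevant section.
    """
    m = re.match(r'Shareholding in (?P<company>.+?), (?P<desc>.+)', text)
    if m:
        return {'shareCompany': m.group('company'),
                'shareDesc': m.group('desc')}
    return None
```

A parser like this could then be slotted into scrapeData alongside getoutsideUKvisit and getemploymentdetails.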

It would probably have made sense to look at the regulations before building the scraper, to see whether they describe required fields: the House of Commons Guide to the Rules relating to the Conduct of Members. An FOI request about which items are held in structured form might also be useful.

Doing a Big Scrape

Now let's move on to stage 2 - getting a list of URLs for all the MPs so we can do a scrape of the whole register.

We can find the list of links to individual MPs on a register page such as https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm.

Inspecting the page using developer tools again suggests that the mainTextBlock is a good place to start. However, the HTML parser also provides a handy property that grabs all the links on the page for us: r.html.absolute_links.

The URLs to the corresponding member pages have a similar path, such as:

Full register URL: https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm 
MP register URL: https://publications.parliament.uk/pa/cm/cmregmem/170502/dugher_michael.htm

This means that we can filter all the links on the page to just the ones that start with the path to that register.


In [93]:
registerListurl = 'https://publications.parliament.uk/pa/cm/cmregmem/170502/contents.htm'
registerListurl = 'https://publications.parliament.uk/pa/cm/cmregmem/180305/contents.htm'

In [94]:
registerPath = '/'.join(registerListurl.split('/')[:-1])
registerPath


Out[94]:
'https://publications.parliament.uk/pa/cm/cmregmem/180305'

In [95]:
r3 = session.get(registerListurl)
links = r3.html.absolute_links


mplinks = [link for link in links if link.startswith(registerPath)]
mplinks[:3]


Out[95]:
['https://publications.parliament.uk/pa/cm/cmregmem/180305/williamson_chris.htm',
 'https://publications.parliament.uk/pa/cm/cmregmem/180305/docherty_leo.htm',
 'https://publications.parliament.uk/pa/cm/cmregmem/180305/ford_vicky.htm']

Now we can use this set of links to scrape the whole register. We do this by concatenating separate dataframes, one scraped from each member's page, into a single dataframe.

In case the big scrape throws up errors in the single page scraper, we can cache (that is, keep hold of a local copy of) the individual pages so that we don't have to keep hitting the Parliament website. It's also generally considered good practice to put in a small delay between page requests. Making this delay random means that the server we are hitting is perhaps less likely to detect we're scraping it.

Caching Page Requests

We can use the requests-cache package to cache page loads for a specified time, either in memory, or persistently to a sqlite database.


In [96]:
#!~/anaconda3/bin/pip install requests-cache
import requests_cache
requests_cache.install_cache()

The time library lets us introduce a delay, in seconds. The random library can generate this time for us.


In [97]:
import random
import time

delay = random.uniform(4.1,4.5)

print('Wait for {}s...'.format(delay))
time.sleep(delay) 
print('...done')


Wait for 4.153079536697653s...
...done

Doing the Scrape

Okay - let's go for it...


In [98]:
df_full = pd.DataFrame()

for mpurl in mplinks:
    rr = session.get(mpurl)
    mptags = rr.html.find('#mainTextBlock > p')
    #print(mpurl)
    df_mp = scrapeData2(mptags)
    df_full = pd.concat([df_full,df_mp])
    
    #Be nice - although there should be no need to do this if the cache is working, after first run?
    #time.sleep(random.uniform(0.1,0.5))
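One wrinkle with a bare loop like this is that a single page that breaks the parser aborts the whole run, losing everything scraped so far. A more defensive variant is sketched below; fetch_and_scrape is a toy stand-in for the real session.get / find / scrapeData2 steps, so the shape of the loop is the point, not the fetch itself.

```python
import pandas as pd

def fetch_and_scrape(url):
    # Stand-in for: session.get(url), rr.html.find('#mainTextBlock > p'),
    # then scrapeData2(...) - here we just fake a one-row result per URL
    if 'bad' in url:
        raise ValueError('parser choked on ' + url)
    return pd.DataFrame([{'mpname': url.rsplit('/', 1)[-1], 'parsed': True}])

frames, failures = [], []
for mpurl in ['https://example.org/a.htm',
              'https://example.org/bad.htm',
              'https://example.org/c.htm']:
    try:
        frames.append(fetch_and_scrape(mpurl))
    except Exception as e:
        # Log the failing URL and carry on rather than abort the whole run
        failures.append((mpurl, str(e)))

df_full = pd.concat(frames, ignore_index=True)
```

The failures list then gives us a worklist of pages to debug the parser against, without having to re-scrape the pages that worked.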

Save the Data to a Persistent SQLite3 Database

We can now save the data to a persistent sqlite database.


In [99]:
dbname = 'mpregister.sqlite'
dbname = 'mpregisterLatest.sqlite'

In [100]:
#Delete any previous copy of the database file; -f suppresses the error if there isn't one
!rm -f {dbname}

In [101]:
import sqlite3
from pandas.io import sql

# Create a connection to the database
conn = sqlite3.connect(dbname)

In [102]:
#Save the df to the database

#To do this, we need to set date types to a standard text format
tablename = 'mpregfinint'

for c in df_full.columns:
    if c.endswith('_f'): df_full[c] = df_full[c].dt.strftime('%Y-%m-%d')

df_full.to_sql(tablename, conn, index = False, if_exists='replace')
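A quick sanity check that the save worked is to read the table straight back and run a simple query over it. The sketch below uses an in-memory database and toy data so it stands alone; against the real file you would connect to dbname and query the mpregfinint table in the same way.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the mpregisterLatest.sqlite file
conn = sqlite3.connect(':memory:')

# Toy rows in place of the scraped df_full
toy = pd.DataFrame({'mpname': ['A', 'A', 'B'],
                    'sumamountsGBP': [100.0, 50.0, 25.0]})
toy.to_sql('mpregfinint', conn, index=False, if_exists='replace')

# Read back, e.g. totalling declared amounts per member
check = pd.read_sql_query(
    'SELECT mpname, SUM(sumamountsGBP) AS total '
    'FROM mpregfinint GROUP BY mpname ORDER BY mpname', conn)
```

If the row counts and totals look plausible, the round trip through the database is working.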

In [ ]:

Summary

This notebook has walked through the creation of a recipe for scraping the contents of the register of financial interests, first for a single MP and then for all the members listed in a particular register.

In the next installment - possibly! - we'll look at how to query the data once we have scraped it, as well as how to enrich the dataset and make it "linkable" to other datasets using common identifiers, such as MNIS / MP ids and company numbers.