UK MPs - Register of Interests - Quick Sketch

A couple of hours' hack around the register of interests data...

Get Data

Seems we can find some from http://www.membersinterests.org.uk/.


In [8]:
url='http://downloads.membersinterests.org.uk/register/170707.zip'
!mkdir -p tmp/
!mkdir -p data/
!wget {url} -O tmp/temp.zip; unzip tmp/temp.zip -d data/ ; rm tmp/temp.zip


--2017-07-31 13:13:55--  http://downloads.membersinterests.org.uk/register/170707.zip
Resolving downloads.membersinterests.org.uk... 191.239.203.8
Connecting to downloads.membersinterests.org.uk|191.239.203.8|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 265657 (259K) [application/octet-stream]
Saving to: 'tmp/temp.zip'

tmp/temp.zip        100%[=====================>] 259.43K  --.-KB/s   in 0.1s   

2017-07-31 13:13:57 (2.43 MB/s) - 'tmp/temp.zip' saved [265657/265657]

Archive:  tmp/temp.zip
  inflating: data/170707.csv         

In [13]:
#Preview the data
!head -n 3 data/170707.csv





In [15]:
#View data in datatable
import pandas as pd
df=pd.read_csv('data/170707.csv',header=None)
df.columns=['Name','Constituency','Party','URL','Item']
df.head()


Out[15]:
Name Constituency Party URL Item
0 Diane Abbott Hackney North and Stoke Newington Labour http://www.publications.parliament.uk/pa/cm/cm... Fees received for articles written for The Gua...
1 Diane Abbott Hackney North and Stoke Newington Labour http://www.publications.parliament.uk/pa/cm/cm... 3 November 2016, received £60. Hours: 30 mins....
2 Diane Abbott Hackney North and Stoke Newington Labour http://www.publications.parliament.uk/pa/cm/cm... 10 November 2016, received £100. Hours: 1 hr. ...
3 Diane Abbott Hackney North and Stoke Newington Labour http://www.publications.parliament.uk/pa/cm/cm... 22 December 2016, received £285. Hours: 2.5 hr...
4 Diane Abbott Hackney North and Stoke Newington Labour http://www.publications.parliament.uk/pa/cm/cm... 23 February 2017, received £410. Hours: 6.5 hr...

Simple entity extraction

A quick pass at extracting entities locally using a simple natural language parser.

This is not necessarily that sophisticated - but it's a start...


In [16]:
#!pip3 install spacy
#!python3 -m spacy download en

#Note: this is the spaCy 1.x API; in spaCy 2+ you would use spacy.load() instead
from spacy.en import English
parser = English()

In [17]:
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can access the parsed example's "ents" property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if term not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)

In [117]:
#Get a single register line item to play with
txt=df.iloc[0]['Item']
txt


Out[117]:
'Fees received for articles written for The Guardian. Address: Guardian News & Media, Kings Place, 90 York Way, London N1 9GU: '

In [25]:
entities(txt, True)


Fees received for articles written for The Guardian. Address: Guardian News & Media, Kings Place, 90 York Way, London N1 9GU: 
-------------- entities only ---------------
{'The Guardian': [(385, 'WORK_OF_ART')], 'Guardian News & Media': [(380, 'ORG')], 'Kings Place': [(381, 'GPE')], '90': [(393, 'CARDINAL')], 'York Way': [(377, 'PERSON')], 'London': [(381, 'GPE')], '9GU': [(393, 'CARDINAL')]}

We might then try to reconcile items tagged as an ORG using something like the OpenCorporates reconciliation API.


In [116]:
import requests
ocrecURL='http://opencorporates.com/reconcile/gb'
rq=requests.get(ocrecURL,params={'query':'Guardian News & Media'})
rq.json()


Out[116]:
{'duration': 145.190966,
 'result': [{'id': '/companies/gb/00908396',
   'match': False,
   'name': 'GUARDIAN NEWS & MEDIA LIMITED',
   'score': 69.0,
   'type': [{'id': '/organization/organization', 'name': 'Organization'}],
   'uri': 'http://opencorporates.com/companies/gb/00908396'},
  {'id': '/companies/gb/03673142',
   'match': False,
   'name': 'GUARDIAN NEWS & MEDIA (HOLDINGS) LIMITED',
   'score': 58.0,
   'type': [{'id': '/organization/organization', 'name': 'Organization'}],
   'uri': 'http://opencorporates.com/companies/gb/03673142'}]}
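From a response like the one above, we could pick out the top-scoring candidate automatically. A minimal sketch, assuming the result shape shown above (the `min_score` threshold is an arbitrary illustrative choice, not part of the API):

```python
def best_match(rec, min_score=50):
    """Return the highest-scoring candidate from an OpenCorporates
    reconciliation response, or None if nothing clears min_score."""
    results = rec.get('result', [])
    if not results:
        return None
    top = max(results, key=lambda r: r.get('score', 0))
    return top if top.get('score', 0) >= min_score else None

#Example using the candidates returned above
sample = {'result': [
    {'id': '/companies/gb/00908396',
     'name': 'GUARDIAN NEWS & MEDIA LIMITED', 'score': 69.0},
    {'id': '/companies/gb/03673142',
     'name': 'GUARDIAN NEWS & MEDIA (HOLDINGS) LIMITED', 'score': 58.0},
]}
best_match(sample)['name']
```

Note that the API itself returned `'match': False` for both candidates, so a score threshold like this is only a heuristic - borderline matches would still want a manual check.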

Third Party Taggers

Examples of using third party taggers.

Thomson Reuters OpenCalais


In [22]:
CALAIS_KEY=""

In [118]:
import requests
import json

def calais(text, calaisKey=CALAIS_KEY):
    calais_url = 'https://api.thomsonreuters.com/permid/calais'

    headers = {'X-AG-Access-Token' : calaisKey, 'Content-Type' : 'text/raw', 'outputformat' : 'application/json'}
    
    response = requests.post(calais_url, files={'file':text}, headers=headers, timeout=80)
    return response.json()

In [119]:
def cleaner(txt):
    txt=txt.replace('Address of','. Address of')
    return txt

In [120]:
oc=calais( cleaner(txt) )

In [121]:
def ocQuickView(oc):
    items={}

    for k in oc.keys():
        if '_typeGroup' in oc[k] and oc[k]['_typeGroup'] in ['entities','relations','socialTag','topics']:
            k2=oc[k]['_typeGroup']
            if k2 not in items: items[k2]=[]
            record={}

            #if '_type' in oc[k]:
            #    record['typ']=oc[k]['_type']
            if 'instances' in oc[k]:
                record['instances']=[i['exact'] for i in oc[k]['instances'] if 'exact' in i]

            for k3 in ['name','address','_type']:
                if k3 in oc[k]: record[k3] = oc[k][k3]

            items[k2].append(record)
    return items

ocQuickView(oc)


Out[121]:
{'entities': [{'_type': 'PublishedMedium',
   'instances': ['The Guardian'],
   'name': 'The Guardian'}],
 'relations': [{'_type': 'ContactDetails',
   'address': '90 York Way, London N1 9GU',
   'instances': ['The Guardian. Address: Guardian News & Media, Kings Place, 90 York Way, London N1 9GU:']}],
 'socialTag': [{'name': 'Republicanism in the United Kingdom'},
  {'name': 'The Guardian'},
  {'name': 'Computer file'},
  {'name': 'Filename'},
  {'name': 'National Guardian'},
  {'name': 'N1 road'},
  {'name': 'Journalism'},
  {'name': 'News media'},
  {'name': 'Computing'}],
 'topics': [{'name': 'Technology_Internet'}]}

In [122]:
ix=155
txt=cleaner(df.iloc[ix]['Item'])
print('{}\n---\n{}'.format(txt, ocQuickView(calais(txt))))


Name of donor: (1) Professor Magdy Ishak; (2) Egyptian Ministry of Foreign Affairs. Address of donor: (1) private; (2) Nile Corniche, Boulaq, Cairo Governate, EgyptEstimate of the probable value (or amount of any donation): (1) Flights to a value of £1,386; (2) accommodation, food and transport to a value of £596Destination of visit: EgyptDates of visit: 16-20 March 2017Purpose of visit: Conservative Middle East Council parliamentary fact finding delegation.(Registered 31 March 2017) 
---
{'socialTag': [{'name': 'Bulaq'}, {'name': 'Magdy'}, {'name': 'Cairo'}, {'name': 'Donor'}, {'name': 'Geography of Egypt'}, {'name': 'Geography of Africa'}, {'name': 'Nile'}], 'entities': [{'instances': ['Egyptian Ministry of Foreign Affairs'], 'name': 'Egyptian Ministry of Foreign Affairs', '_type': 'Organization'}, {'instances': ['Professor'], 'name': 'Professor', '_type': 'Position'}, {'instances': ['Conservative Middle East Council'], 'name': 'Conservative Middle East Council', '_type': 'Organization'}, {'instances': ['Professor Magdy Ishak'], 'name': 'Magdy Ishak', '_type': 'Person'}, {'instances': ['food'], 'name': 'food', '_type': 'IndustryTerm'}], 'relations': [{'instances': ['Professor Magdy Ishak'], '_type': 'PersonCareer'}]}

In [123]:
ix=299
txt=cleaner(df.iloc[ix]['Item'])
print('{}\n---\n{}'.format(txt, ocQuickView(calais(txt))))


Received £75 on 25 October 2016 for survey completed on 25 August 2016. Hours: 30 mins. (Registered 01 November 2016) 
---
{'topics': [{'name': 'Social Issues'}, {'name': 'Technology_Internet'}], 'socialTag': [{'name': 'Filenames'}, {'name': 'Records management'}]}

In [124]:
ix=863
txt=cleaner(df.iloc[ix]['Item'])
print('{}\n---\n{}'.format(txt, ocQuickView(calais(txt))))


Name of donor: Brian Griffiths. Address of donor: privateAmount of donation or nature and value if donation in kind: £2,000Date received: 25 May 2017Date accepted: 25 May 2017Donor status: individual(Registered 29 June 2017) 
---
{'topics': [{'name': 'Health_Medical_Pharma'}], 'socialTag': [{'name': 'Organ donation'}, {'name': 'Medicine'}, {'name': 'Donor'}, {'name': 'Donation'}, {'name': 'Fertility medicine'}, {'name': 'Health care'}, {'name': 'Transfusion medicine'}, {'name': 'Sperm donation'}], 'entities': [{'instances': ['Brian Griffiths'], 'name': 'Brian Griffiths', '_type': 'Person'}]}

Observations

The free text has items that can be parsed out - e.g. Name of donor:, Amount of donation or nature and value if donation in kind:, etc.


In [97]:
txt="Name of donor: Nael FarargyAddress of donor: privateAmount of donation or nature and value of donation in kind: £20,000 to hire a part time member of staff and meet office and staff expensesDate received: 12 April 2017Date accepted: 12 April 2017Donor status: individual(Registered 18 April 2017) "
txt


Out[97]:
'Name of donor: Nael FarargyAddress of donor: privateAmount of donation or nature and value of donation in kind: £20,000 to hire a part time member of staff and meet office and staff expensesDate received: 12 April 2017Date accepted: 12 April 2017Donor status: individual(Registered 18 April 2017) '

Define a regular expression to pull out the data in structured form if the text conforms to a conventional format.


In [105]:
extractor1='Name of donor:(?P<name>.*)Address of donor:(?P<address>.*)Amount of donation or nature and value of donation in kind:(?P<amount>.*)Date received:(?P<rxd>.*)Date accepted:(?P<accptd>.*)Donor status(?P<status>.*)'

In [106]:
import re
r=re.compile(extractor1)

In [113]:
r.match(txt).groupdict()
#Looking at the response values, we could catch for whitespace in the regex or do a cleaning pass to strip whitespace


Out[113]:
{'accptd': ' 12 April 2017',
 'address': ' private',
 'amount': ' £20,000 to hire a part time member of staff and meet office and staff expenses',
 'name': ' Nael Farargy',
 'rxd': ' 12 April 2017',
 'status': ': individual(Registered 18 April 2017) '}

We could also add in further parsing to try to identify the actual amount and the rationale for amount items, as well as further structuring the status field. Casting dates to datetimes would also make sense.
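As a sketch of that post-processing step - stripping the whitespace, pulling a numeric value out of the amount field, and casting the dates - assuming the groupdict shape shown above (the `tidy` helper name is made up for illustration):

```python
import re
from datetime import datetime

def tidy(groups):
    """Tidy a groupdict from the donor regex: strip stray whitespace
    and colons, extract a numeric amount, and cast dates to datetimes."""
    out = {k: v.strip(' :') for k, v in groups.items()}
    #Pull the first £-prefixed figure out of the amount field
    m = re.search(r'£([\d,]+)', out.get('amount', ''))
    if m:
        out['amount_value'] = float(m.group(1).replace(',', ''))
    #Cast the received/accepted dates, leaving them as strings if they don't parse
    for k in ['rxd', 'accptd']:
        try:
            out[k] = datetime.strptime(out[k], '%d %B %Y')
        except (KeyError, ValueError):
            pass
    return out

#Example using the groupdict values returned above
parsed = tidy({'name': ' Nael Farargy', 'address': ' private',
               'amount': ' £20,000 to hire a part time member of staff and meet office and staff expenses',
               'rxd': ' 12 April 2017', 'accptd': ' 12 April 2017',
               'status': ': individual(Registered 18 April 2017) '})
parsed
```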

There may be other conventional forms in register entries, for which alternative regular expressions could be defined.

Having got structured data out, we could start to put it into a database and then make queries over it.
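A minimal sketch of that last step, using an in-memory SQLite database via pandas' `to_sql` (the `donations` frame here is hypothetical illustrative data, though the donor names are taken from the register entries above):

```python
import sqlite3
import pandas as pd

#Hypothetical frame of parsed donation records
donations = pd.DataFrame([
    {'donor': 'Nael Farargy', 'amount_value': 20000.0},
    {'donor': 'Brian Griffiths', 'amount_value': 2000.0},
])

#Load the records into an in-memory SQLite table...
conn = sqlite3.connect(':memory:')
donations.to_sql('donations', conn, index=False)

#...and query them, e.g. totalling donations by donor
totals = pd.read_sql_query(
    'SELECT donor, SUM(amount_value) AS total '
    'FROM donations GROUP BY donor ORDER BY total DESC', conn)
totals
```

For anything more than a sketch, a file-backed database would make more sense than `:memory:`, so the parsed register could be queried across sessions.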


In [ ]: