Future of the Report - Sketches 1

This notebook contains notes and sketches created whilst exploring a particular committee report, the Women and Equalities Committee Gender pay gap inquiry report.

(From a cursory inspection of several other HTML published reports, there appears to be significant inconsistency in how reports from different committees are presented online. A closer look at other reports, and the major differences across them, is left for a later date.)

Scraping the Report Home Page


In [2]:
url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58402.htm'

Observation - from the report contents page, I can navigate via the Back button to https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58401.htm but that page gives no clear indication of where I am.

It would probably make sense to be able to get back to the inquiry page for the inquiry that resulted in the report.
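One possible hook: the report's Published written evidence page (scraped further below) contains a link whose text is inquiry web page, pointing back at the inquiry that produced the report. A minimal sketch of recovering it from an HTML fragment copied from that page - the assumption (unverified across other reports) is that the inquiry web page link text is stable:

```python
from bs4 import BeautifulSoup

# Fragment copied from the report's 'Published written evidence' page
html = '''<p class="EvidencePara">The following written evidence was received and can be
viewed on the Committee\u2019s <a href="http://www.parliament.uk/business/committees/committees-a-z/commons-select/women-and-equalities-committee/inquiries/parliament-2015/gender-pay-gap-15-16/publications/"><span class="Hyperlink">inquiry web page</span></a>.</p>'''

soup = BeautifulSoup(html, 'html.parser')

inquiry_url = None
for a in soup.select('p[class="EvidencePara"] a'):
    # Assumption: the back-link to the inquiry always carries this link text
    if 'inquiry web page' in a.text:
        inquiry_url = a['href']

print(inquiry_url)
```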


In [149]:
import pandas as pd

In [ ]:
import requests
import requests_cache
requests_cache.install_cache('parli_comm_cache')

from bs4 import BeautifulSoup

#https://www.dataquest.io/blog/web-scraping-tutorial-python/
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [23]:
#What does a ToC item look like?
soup.select('p[class*="ToC"]')[5].find('a')


Out[23]:
<a href="58404.htm#_idTextAnchor008">Is the gender pay gap disappearing?</a>

In [117]:
url_written=None
url_witnesses=None

for p in soup.select('p[class*="ToC"]'):
    a=p.find('a')
    if a is None:
        continue
    #witnesses - test the link text, not the tag itself
    if 'Witnesses' in a.text:
        url_witnesses=a['href']
    #written evidence
    if 'Published written evidence' in a.text:
        url_written=a['href']
        
url_written, url_witnesses


Out[117]:
('58415.htm#_idTextAnchor145', '58414.htm#_idTextAnchor144')

In [24]:
#https://stackoverflow.com/a/34661518/454773
pages=[]
for EachPart in soup.select('p[class*="ToC"]'):
    href=EachPart.find('a')['href']
    #Fudge to collect URLs of pages associated with report content
    if '#_' in href:
        pages.append(EachPart.find('a')['href'].split('#')[0])
pages=list(set(pages))
pages


Out[24]:
['58414.htm',
 '58416.htm',
 '58412.htm',
 '58409.htm',
 '58415.htm',
 '58413.htm',
 '58411.htm',
 '58407.htm',
 '58405.htm',
 '58410.htm',
 '58406.htm',
 '58408.htm',
 '58404.htm']

In [7]:
#We need to get the relative path for the page...
import os.path

stub=os.path.split(url)
stub


Out[7]:
('https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584',
 '58402.htm')

In [25]:
#Grab all the pages in the report, priming the requests cache
#(only the last response is kept in r; later cells re-request from the cache)
for p in pages:
    r=requests.get('{}/{}'.format(stub[0],p))

Report - Page Scraper

For each HTML Page in the report, extract references to oral evidence session questions and written evidence.


In [315]:
pagesoup=BeautifulSoup(r.content, 'html.parser')
print(str(pagesoup.select('div[id="shellcontent"]')[0])[:2000])


<div id="shellcontent"><strong>Gender Pay Gap <a href="58402.htm">Contents</a></strong>
<hr>
<!-- PASTE MAIN CONTENT AFTER THIS LINE -->
<h1 class="Heading1"><a id="_idTextAnchor145"></a>Published written evidence</h1>
<p class="EvidencePara">The following written evidence was received and can be viewed on the Committee’s <a href="http://www.parliament.uk/business/committees/committees-a-z/commons-select/women-and-equalities-committee/inquiries/parliament-2015/gender-pay-gap-15-16/publications/"><span class="Hyperlink">inquiry web page</span></a>. GPG numbers are generated by the evidence processing system and so may not be complete.</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">1</span>Age UK (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25793.html"><span class="Hyperlink">GPG0054</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">2</span>Alison Parken (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25645.html"><span class="Hyperlink">GPG0049</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">3</span>ARC Trade Union (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25993.html"><span class="Hyperlink">GPG0056</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">4</span>Barclays (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25554.html"><span class="Hyperlink">GPG0026</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">5</span>Behavioural Insights (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/28937.h

In [102]:
import re

def evidenceRef(pagesoup):
    qs=[]
    ws=[]
    #Grab list of questions
    for p in pagesoup.select('div[class="_idFootnote"]'):
        #Find oral question numbers
        q=re.search(r'^.*\s+(Q[0-9]*)\s*$', p.find('p').text)
        if q:
            qs.append(q.group(1))

        #Find links to written evidence
        links=p.find('p').findAll('a')
        if len(links)>1:
            if links[1]['href'].startswith('http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/'):
                ws.append(links[1].text.strip('()'))
    return qs, ws

In [103]:
evidenceRef(pagesoup)


Out[103]:
(['Q2', 'Q8', 'Q25'], ['GPG0037', 'GPG0051'])

In [104]:
qs=[]
ws=[]
for p in pages:
    r=requests.get('{}/{}'.format(stub[0],p))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    pagesoup.select('div[id="shellcontent"]')[0]
    qstmp,wstmp= evidenceRef(pagesoup)
    qs += qstmp
    ws +=wstmp

In [310]:
pd.DataFrame(qs)[0].value_counts().head()


Out[310]:
Q205    4
Q39     3
Q41     2
Q244    2
Q132    2
Name: 0, dtype: int64

In [309]:
pd.DataFrame(ws)[0].value_counts().head()


Out[309]:
GPG0037    4
GPG0031    4
GPG0030    3
GPG0041    3
GPG0053    3
Name: 0, dtype: int64

Report - Oral Session Page Scraper

Is this page reliably identified by the link text Witnesses?


In [206]:
#url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58414.htm'

if url_witnesses is not None:
    r=requests.get('{}/{}'.format(stub[0],url_witnesses))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    
    l1=[t.text.split('\t')[0] for t in pagesoup.select('h2[class="WitnessHeading"]')]
    l2=pagesoup.select('table')
        
pd.DataFrame({'a':l1,'b':l2})


Out[206]:
a b
0 Tuesday 15 December 2015 <table class="No-Table-Style" id="table007"> <...
1 Tuesday 12 January 2016 <table class="No-Table-Style" id="table008"> <...
2 Tuesday 19 January 2016 <table class="No-Table-Style" id="table009"> <...
3 Tuesday 26 January 2016 <table class="No-Table-Style" id="table010"> <...
4 Wednesday 10 February 2016 <table class="No-Table-Style" id="table011"> <...

In [308]:
#Just as easy to do this by hand

items=[]

items.append(['Tuesday 15 December 2015','Chris Giles', 'Economics Editor', 'The Financial Times','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Dr Alison Parken', 'Women Adding Value to the Economy (WAVE)', 'Cardiff University','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Professor Jill Rubery','', 'Manchester University','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Sheila Wild', 'Founder', 'Equal Pay Portal','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Professor the Baroness Wolf of Dulwich', "King's College", 'London','Q1', 'Q35'])

items.append(['Tuesday 15 December 2015','Neil Carberry', 'Director for Employment and Skills', 'CBI','Q36','Q58'])
items.append(['Tuesday 15 December 2015','Ann Francke', 'Chief Executive', 'Chartered Management Institute','Q36','Q58'])
items.append(['Tuesday 15 December 2015','Monika Queisser',' Senior Counsellor and Head of Social Policy', 'Organisation for Economic Cooperation and Development','Q36','Q58'])

items.append(['Tuesday 12 January 2016','Amanda Brown', 'Assistant General Secretary', 'NUT','Q59','Q99'])
items.append(['Tuesday 12 January 2016','Dr Sally Davies', 'President', "Medical Women's Federation",'Q59','Q99'])
items.append(['Tuesday 12 January 2016','Amanda Fone','Chief Executive Officer', 'F1 Recruitment and Search','Q59','Q99'])
items.append(['Tuesday 12 January 2016','Audrey Williams', 'Employment Lawyer and Partner',' Fox Williams','Q59','Q99'])

items.append(['Tuesday 12 January 2016','Anna Ritchie Allan', 'Project Manager', 'Close the Gap','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Christopher Brooks', 'Policy Adviser', 'Age UK','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Scarlet Harris', 'Head of Gender Equality', 'TUC','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Mr Robert Stephenson-Padron', 'Managing Director', 'Penrose Care','Q100','Q136'])

items.append(['Tuesday 19 January 2016','Sarah Jackson', 'Chief Executive', 'Working Families','Q137','Q164'])
items.append(['Tuesday 19 January 2016','Adrienne Burgess', 'Joint Chief Executive and Head of Research', 'Fatherhood Institute','Q137','Q164'])
items.append(['Tuesday 19 January 2016','Maggie Stilwell', 'Partner', 'Ernst & Young LLP','Q137','Q164'])

items.append(['Tuesday 26 January 2016','Michael Newman', 'Vice-Chair', 'Discrimination Law Association','Q165','Q191'])
items.append(['Tuesday 26 January 2016','Duncan Brown', '','Institute for Employment Studies','Q165','Q191'])
items.append(['Tuesday 26 January 2016','Tim Thomas', 'Head of Employment and Skills', "EEF, the manufacturers' association",'Q165','Q191'])

items.append(['Tuesday 26 January 2016','Helen Fairfoul', 'Chief Executive', 'Universities and Colleges Employers Association','Q192','Q223'])
items.append(['Tuesday 26 January 2016','Emma Stewart', 'Joint Chief Executive Officer', 'Timewise Foundation','Q192','Q223'])
items.append(['Tuesday 26 January 2016','Claire Turner','', 'Joseph Rowntree Foundation','Q192','Q223'])

items.append(['Wednesday 10 February 2016','Rt Hon Nicky Morgan MP', 'Secretary of State for Education and Minister for Women and Equalities','Department for Education','Q224','Q296'])
items.append(['Wednesday 10 February 2016','Nick Boles MP', 'Minister for Skills', 'Department for Business, Innovation and Skills','Q224','Q296'])


df=pd.DataFrame(items,columns=['Date','Name','Role','Org','Qmin','Qmax'])
#Cleaning check
df['Org']=df['Org'].str.strip()
df['n_qmin']=df['Qmin'].str.strip('Q').astype(int)
df['n_qmax']=df['Qmax'].str.strip('Q').astype(int)
df['session']=df['Qmin']+'-'+df['n_qmax'].astype(str)
df.head()


Out[308]:
Date Name Role Org Qmin Qmax n_qmin n_qmax session
0 Tuesday 15 December 2015 Chris Giles Economics Editor The Financial Times Q1 Q35 1 35 Q1-35
1 Tuesday 15 December 2015 Dr Alison Parken Women Adding Value to the Economy (WAVE) Cardiff University Q1 Q35 1 35 Q1-35
2 Tuesday 15 December 2015 Professor Jill Rubery Manchester University Q1 Q35 1 35 Q1-35
3 Tuesday 15 December 2015 Sheila Wild Founder Equal Pay Portal Q1 Q35 1 35 Q1-35
4 Tuesday 15 December 2015 Professor the Baroness Wolf of Dulwich King's College London Q1 Q35 1 35 Q1-35
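Rather than transcribing the witness list by hand (below), the No-Table-Style tables collected in l2 could in principle be parsed directly. A rough sketch of the idea on an invented fragment - the real cell markup inside those tables may well differ, so the bold-name assumption and the selectors need checking against the page source:

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for one No-Table-Style witness table;
# the real cell structure needs checking against the page source
html = '''<table class="No-Table-Style" id="table007">
<tr><td><strong>Chris Giles</strong>, Economics Editor, The Financial Times</td></tr>
<tr><td><strong>Sheila Wild</strong>, Founder, Equal Pay Portal</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

witnesses = []
for td in soup.select('table[class="No-Table-Style"] td'):
    strong = td.find('strong')
    if strong is None:
        continue
    # Assumes the name is in bold; the rest of the cell holds role and organisation
    rest = td.text.replace(strong.text, '', 1).strip(' ,')
    witnesses.append((strong.text, rest))

witnesses
```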

Report - Written Evidence Scraper

Is this page reliably identified by the link text Published written evidence?


In [307]:
#url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58415.htm'

all_written=[]

if url_written is not None:
    r=requests.get('{}/{}'.format(stub[0],url_written))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    for p in pagesoup.select('p[class="EvidenceList1"]'):
        #print(p)
        #Get rid of the numbering span so the org name is the first text node
        for match in p.findAll('span', {'class': 'EvidenceList1Span'}):
            match.extract()
        all_written.append((p.contents[0].strip('()').strip(), p.find('a')['href'],p.find('a').text))

written_df=pd.DataFrame(all_written)
written_df.columns=['Org','URL','RefNumber']
written_df.head()


Out[307]:
Org URL RefNumber
0 Age UK http://data.parliament.uk/WrittenEvidence/Comm... GPG0054
1 Alison Parken http://data.parliament.uk/WrittenEvidence/Comm... GPG0049
2 ARC Trade Union http://data.parliament.uk/WrittenEvidence/Comm... GPG0056
3 Barclays http://data.parliament.uk/WrittenEvidence/Comm... GPG0026
4 Behavioural Insights http://data.parliament.uk/WrittenEvidence/Comm... GPG0064

In [266]:
def getSession(q):
    #Return the session whose question number range contains q
    #(assumes q falls within one of the known session ranges)
    return df[(df['n_qmin']<=q) & (df['n_qmax']>=q)].iloc[0]['session']

getSession(33)


Out[266]:
'Q1-35'

In [282]:
#Count the report's question references by oral evidence session

df_qs=pd.DataFrame(qs, columns=['qn'])
df_qs['session']=df_qs['qn'].apply(lambda x: getSession(int(x.strip('Q'))) )
s_qs_cnt=df_qs['session'].value_counts()
s_qs_cnt


Out[282]:
Q224-296    19
Q100-136    12
Q192-223    11
Q36-58      10
Q1-35       10
Q165-191     9
Q137-164     8
Q59-99       8
Name: session, dtype: int64

In [289]:
pd.concat([s_qs_cnt,df.groupby('session')['Org'].apply(lambda x: '; '.join(list(x)))],
          axis=1).sort_values('session',ascending=False)


Out[289]:
session Org
Q224-296 19 Department for Education; Department for Busi...
Q100-136 12 Close the Gap; Age UK; TUC; Penrose Care
Q192-223 11 Universities and Colleges Employers Associatio...
Q1-35 10 The Financial Times; Cardiff University; Manch...
Q36-58 10 CBI; Chartered Management Institute; Organisat...
Q165-191 9 Discrimination Law Association; Institute for ...
Q137-164 8 Working Families; Fatherhood Institute; Ernst ...
Q59-99 8 NUT; Medical Women's Federation; F1 Recruitmen...

In [306]:
#Written evidence
df_ws=pd.DataFrame(ws,columns=['RefNumber'])
df_ws=df_ws.merge(written_df, on='RefNumber')
df_ws['Org'].value_counts().head()


Out[306]:
Fawcett Society                                        4
The UK Commission for Employment and Skills (UKCES)    4
Science Council                                        3
Family and Childcare Trust                             3
Timewise                                               3
Name: Org, dtype: int64

In [305]:
#Organisations that gave written and witness evidence
set(df_ws['Org']).intersection(set(df['Org']))

#Note there are more matches that are hidden by dirty data
#- e.g. NUT and National Union of Teachers are presumably the same
#- e.g. F1 Recruitment and Search and F1 Recruitment Ltd are presumably the same


Out[305]:
{'Age UK',
 'CBI',
 'Chartered Management Institute',
 'Close the Gap',
 'Department for Education',
 'Discrimination Law Association',
 'Penrose Care',
 'TUC',
 'Working Families'}
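Some of those hidden matches could be surfaced with simple fuzzy matching. A sketch using the standard library's difflib, on illustrative name pairs drawn from the notes above - acronym mismatches like NUT vs National Union of Teachers will still slip through, since they share few characters:

```python
import difflib

# Illustrative organisation names from the written evidence and witness lists
written_orgs = ['F1 Recruitment Ltd', 'National Union of Teachers', 'Age UK']
witness_orgs = ['F1 Recruitment and Search', 'NUT', 'Age UK']

# For each witness organisation, look for a close textual match
for org in witness_orgs:
    matches = difflib.get_close_matches(org, written_orgs, n=1, cutoff=0.6)
    print(org, '->', matches)
```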

Scraping the Government Response


In [ ]:
url='https://publications.parliament.uk/pa/cm201617/cmselect/cmwomeq/963/96302.htm'

In [ ]:
#Inconsistency across different reports in terms of presentation, linking to evidence