Future of the Report - Sketches 1

This notebook contains notes and sketches created whilst exploring a particular committee report, the Women and Equalities Committee Gender pay gap inquiry report.

(From a cursory inspection of several other HTML published reports, there appears to be significant inconsistency in how reports from different committees are presented online. A closer look at other reports, and the major differences across them, is left for a later date.)

Scraping the Report Home Page


In [2]:
url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58402.htm'

Observation - from the report contents page, I can navigate via the Back button to https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58401.htm but that page gives no clear indication of where I am.

It would probably make sense to be able to get back to the inquiry page for the inquiry that resulted in the report.
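One possible hook: the report's Published written evidence page (scraped further below) contains a link whose text is inquiry web page, pointing back at the inquiry that produced the report. A minimal sketch of recovering it from an HTML fragment copied from that page - the assumption (unverified across other reports) is that the inquiry web page link text is stable:

```python
from bs4 import BeautifulSoup

# Fragment copied from the report's 'Published written evidence' page
html = '''<p class="EvidencePara">The following written evidence was received and can be
viewed on the Committee\u2019s <a href="http://www.parliament.uk/business/committees/committees-a-z/commons-select/women-and-equalities-committee/inquiries/parliament-2015/gender-pay-gap-15-16/publications/"><span class="Hyperlink">inquiry web page</span></a>.</p>'''

soup = BeautifulSoup(html, 'html.parser')

inquiry_url = None
for a in soup.select('p[class="EvidencePara"] a'):
    # Assumption: the back-link to the inquiry always carries this link text
    if 'inquiry web page' in a.text:
        inquiry_url = a['href']

print(inquiry_url)
```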


In [149]:
import pandas as pd

In [ ]:
import requests
import requests_cache
requests_cache.install_cache('parli_comm_cache')

from bs4 import BeautifulSoup

#https://www.dataquest.io/blog/web-scraping-tutorial-python/
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

In [23]:
#What does a ToC item look like?
soup.select('p[class*="ToC"]')[5].find('a')


Out[23]:
<a href="58404.htm#_idTextAnchor008">Is the gender pay gap disappearing?</a>

In [117]:
url_written=None
url_witnesses=None

for p in soup.select('p[class*="ToC"]'):
    a=p.find('a')
    if a is None:
        continue
    #witnesses - test the link text, not the tag itself
    if 'Witnesses' in a.text:
        url_witnesses=a['href']
    #written evidence
    if 'Published written evidence' in a.text:
        url_written=a['href']
        
url_written, url_witnesses


Out[117]:
('58415.htm#_idTextAnchor145', '58414.htm#_idTextAnchor144')

In [24]:
#https://stackoverflow.com/a/34661518/454773
pages=[]
for EachPart in soup.select('p[class*="ToC"]'):
    href=EachPart.find('a')['href']
    #Fudge to collect URLs of pages associated with report content
    if '#_' in href:
        pages.append(EachPart.find('a')['href'].split('#')[0])
pages=list(set(pages))
pages


Out[24]:
['58414.htm',
 '58416.htm',
 '58412.htm',
 '58409.htm',
 '58415.htm',
 '58413.htm',
 '58411.htm',
 '58407.htm',
 '58405.htm',
 '58410.htm',
 '58406.htm',
 '58408.htm',
 '58404.htm']

In [7]:
#We need to get the relative path for the page...
import os.path

stub=os.path.split(url)
stub


Out[7]:
('https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584',
 '58402.htm')

In [25]:
#Grab all the pages in the report, priming the requests cache
#(only the last response is kept in r; later cells re-request from the cache)
for p in pages:
    r=requests.get('{}/{}'.format(stub[0],p))

Report - Page Scraper

For each HTML Page in the report, extract references to oral evidence session questions and written evidence.


In [315]:
pagesoup=BeautifulSoup(r.content, 'html.parser')
print(str(pagesoup.select('div[id="shellcontent"]')[0])[:2000])


<div id="shellcontent"><strong>Gender Pay Gap <a href="58402.htm">Contents</a></strong>
<hr>
<!-- PASTE MAIN CONTENT AFTER THIS LINE -->
<h1 class="Heading1"><a id="_idTextAnchor145"></a>Published written evidence</h1>
<p class="EvidencePara">The following written evidence was received and can be viewed on the Committee’s <a href="http://www.parliament.uk/business/committees/committees-a-z/commons-select/women-and-equalities-committee/inquiries/parliament-2015/gender-pay-gap-15-16/publications/"><span class="Hyperlink">inquiry web page</span></a>. GPG numbers are generated by the evidence processing system and so may not be complete.</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">1</span>Age UK (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25793.html"><span class="Hyperlink">GPG0054</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">2</span>Alison Parken (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25645.html"><span class="Hyperlink">GPG0049</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">3</span>ARC Trade Union (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25993.html"><span class="Hyperlink">GPG0056</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">4</span>Barclays (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/25554.html"><span class="Hyperlink">GPG0026</span></a>)</p>
<p class="EvidenceList1"><span class="EvidenceList1Span">5</span>Behavioural Insights (<a href="http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/Women%20and%20Equalities/Gender%20Pay%20Gap/written/28937.h

In [102]:
import re

def evidenceRef(pagesoup):
    qs=[]
    ws=[]
    #Grab list of questions
    for p in pagesoup.select('div[class="_idFootnote"]'):
        #Find oral question numbers
        q=re.search(r'^.*\s+(Q[0-9]*)\s*$', p.find('p').text)
        if q:
            qs.append(q.group(1))

        #Find links to written evidence
        links=p.find('p').findAll('a')
        if len(links)>1:
            if links[1]['href'].startswith('http://data.parliament.uk/WrittenEvidence/CommitteeEvidence.svc/EvidenceDocument/'):
                ws.append(links[1].text.strip('()'))
    return qs, ws

In [103]:
evidenceRef(pagesoup)


Out[103]:
(['Q2', 'Q8', 'Q25'], ['GPG0037', 'GPG0051'])

In [104]:
qs=[]
ws=[]
for p in pages:
    r=requests.get('{}/{}'.format(stub[0],p))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    pagesoup.select('div[id="shellcontent"]')[0]
    qstmp,wstmp= evidenceRef(pagesoup)
    qs += qstmp
    ws +=wstmp

In [310]:
pd.DataFrame(qs)[0].value_counts().head()


Out[310]:
Q205    4
Q39     3
Q41     2
Q244    2
Q132    2
Name: 0, dtype: int64

In [309]:
pd.DataFrame(ws)[0].value_counts().head()


Out[309]:
GPG0037    4
GPG0031    4
GPG0030    3
GPG0041    3
GPG0053    3
Name: 0, dtype: int64

Report - Oral Session Page Scraper

Is this page reliably identified by the link text Witnesses?


In [206]:
#url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58414.htm'

if url_witnesses is not None:
    r=requests.get('{}/{}'.format(stub[0],url_witnesses))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    
    l1=[t.text.split('\t')[0] for t in pagesoup.select('h2[class="WitnessHeading"]')]
    l2=pagesoup.select('table')
        
pd.DataFrame({'a':l1,'b':l2})


Out[206]:
a b
0 Tuesday 15 December 2015 <table class="No-Table-Style" id="table007"> <...
1 Tuesday 12 January 2016 <table class="No-Table-Style" id="table008"> <...
2 Tuesday 19 January 2016 <table class="No-Table-Style" id="table009"> <...
3 Tuesday 26 January 2016 <table class="No-Table-Style" id="table010"> <...
4 Wednesday 10 February 2016 <table class="No-Table-Style" id="table011"> <...

In [308]:
#Just as easy to do this by hand

items=[]

items.append(['Tuesday 15 December 2015','Chris Giles', 'Economics Editor', 'The Financial Times','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Dr Alison Parken', 'Women Adding Value to the Economy (WAVE)', 'Cardiff University','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Professor Jill Rubery','', 'Manchester University','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Sheila Wild', 'Founder', 'Equal Pay Portal','Q1', 'Q35'])
items.append(['Tuesday 15 December 2015','Professor the Baroness Wolf of Dulwich', "King's College", 'London','Q1', 'Q35'])

items.append(['Tuesday 15 December 2015','Neil Carberry', 'Director for Employment and Skills', 'CBI','Q36','Q58'])
items.append(['Tuesday 15 December 2015','Ann Francke', 'Chief Executive', 'Chartered Management Institute','Q36','Q58'])
items.append(['Tuesday 15 December 2015','Monika Queisser',' Senior Counsellor and Head of Social Policy', 'Organisation for Economic Cooperation and Development','Q36','Q58'])

items.append(['Tuesday 12 January 2016','Amanda Brown', 'Assistant General Secretary', 'NUT','Q59','Q99'])
items.append(['Tuesday 12 January 2016','Dr Sally Davies', 'President', "Medical Women's Federation",'Q59','Q99'])
items.append(['Tuesday 12 January 2016','Amanda Fone','Chief Executive Officer', 'F1 Recruitment and Search','Q59','Q99'])
items.append(['Tuesday 12 January 2016','Audrey Williams', 'Employment Lawyer and Partner',' Fox Williams','Q59','Q99'])

items.append(['Tuesday 12 January 2016','Anna Ritchie Allan', 'Project Manager', 'Close the Gap','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Christopher Brooks', 'Policy Adviser', 'Age UK','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Scarlet Harris', 'Head of Gender Equality', 'TUC','Q100','Q136'])
items.append(['Tuesday 12 January 2016','Mr Robert Stephenson-Padron', 'Managing Director', 'Penrose Care','Q100','Q136'])

items.append(['Tuesday 19 January 2016','Sarah Jackson', 'Chief Executive', 'Working Families','Q137','Q164'])
items.append(['Tuesday 19 January 2016','Adrienne Burgess', 'Joint Chief Executive and Head of Research', 'Fatherhood Institute','Q137','Q164'])
items.append(['Tuesday 19 January 2016','Maggie Stilwell', 'Partner', 'Ernst & Young LLP','Q137','Q164'])

items.append(['Tuesday 26 January 2016','Michael Newman', 'Vice-Chair', 'Discrimination Law Association','Q165','Q191'])
items.append(['Tuesday 26 January 2016','Duncan Brown', '','Institute for Employment Studies','Q165','Q191'])
items.append(['Tuesday 26 January 2016','Tim Thomas', 'Head of Employment and Skills', "EEF, the manufacturers' association",'Q165','Q191'])

items.append(['Tuesday 26 January 2016','Helen Fairfoul', 'Chief Executive', 'Universities and Colleges Employers Association','Q192','Q223'])
items.append(['Tuesday 26 January 2016','Emma Stewart', 'Joint Chief Executive Officer', 'Timewise Foundation','Q192','Q223'])
items.append(['Tuesday 26 January 2016','Claire Turner','', 'Joseph Rowntree Foundation','Q192','Q223'])

items.append(['Wednesday 10 February 2016','Rt Hon Nicky Morgan MP', 'Secretary of State for Education and Minister for Women and Equalities','Department for Education','Q224','Q296'])
items.append(['Wednesday 10 February 2016','Nick Boles MP', 'Minister for Skills', 'Department for Business, Innovation and Skills','Q224','Q296'])


df=pd.DataFrame(items,columns=['Date','Name','Role','Org','Qmin','Qmax'])
#Cleaning check
df['Org']=df['Org'].str.strip()
df['n_qmin']=df['Qmin'].str.strip('Q').astype(int)
df['n_qmax']=df['Qmax'].str.strip('Q').astype(int)
df['session']=df['Qmin']+'-'+df['n_qmax'].astype(str)
df.head()


Out[308]:
Date Name Role Org Qmin Qmax n_qmin n_qmax session
0 Tuesday 15 December 2015 Chris Giles Economics Editor The Financial Times Q1 Q35 1 35 Q1-35
1 Tuesday 15 December 2015 Dr Alison Parken Women Adding Value to the Economy (WAVE) Cardiff University Q1 Q35 1 35 Q1-35
2 Tuesday 15 December 2015 Professor Jill Rubery Manchester University Q1 Q35 1 35 Q1-35
3 Tuesday 15 December 2015 Sheila Wild Founder Equal Pay Portal Q1 Q35 1 35 Q1-35
4 Tuesday 15 December 2015 Professor the Baroness Wolf of Dulwich King's College London Q1 Q35 1 35 Q1-35
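Rather than transcribing the witness list by hand (below), the No-Table-Style tables collected in l2 could in principle be parsed directly. A rough sketch of the idea on an invented fragment - the real cell markup inside those tables may well differ, so the bold-name assumption and the selectors need checking against the page source:

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for one No-Table-Style witness table;
# the real cell structure needs checking against the page source
html = '''<table class="No-Table-Style" id="table007">
<tr><td><strong>Chris Giles</strong>, Economics Editor, The Financial Times</td></tr>
<tr><td><strong>Sheila Wild</strong>, Founder, Equal Pay Portal</td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')

witnesses = []
for td in soup.select('table[class="No-Table-Style"] td'):
    strong = td.find('strong')
    if strong is None:
        continue
    # Assumes the name is in bold; the rest of the cell holds role and organisation
    rest = td.text.replace(strong.text, '', 1).strip(' ,')
    witnesses.append((strong.text, rest))

witnesses
```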

Report - Written Evidence Scraper

Is this page reliably identified by the link text Published written evidence?


In [307]:
#url='https://publications.parliament.uk/pa/cm201516/cmselect/cmwomeq/584/58415.htm'

all_written=[]

if url_written is not None:
    r=requests.get('{}/{}'.format(stub[0],url_written))
    pagesoup=BeautifulSoup(r.content, 'html.parser')
    for p in pagesoup.select('p[class="EvidenceList1"]'):
        #print(p)
        #Get rid of the numbering span so the org name is the first text node
        for match in p.findAll('span', {'class': 'EvidenceList1Span'}):
            match.extract()
        all_written.append((p.contents[0].strip('()').strip(), p.find('a')['href'],p.find('a').text))

written_df=pd.DataFrame(all_written)
written_df.columns=['Org','URL','RefNumber']
written_df.head()


Out[307]:
Org URL RefNumber
0 Age UK http://data.parliament.uk/WrittenEvidence/Comm... GPG0054
1 Alison Parken http://data.parliament.uk/WrittenEvidence/Comm... GPG0049
2 ARC Trade Union http://data.parliament.uk/WrittenEvidence/Comm... GPG0056
3 Barclays http://data.parliament.uk/WrittenEvidence/Comm... GPG0026
4 Behavioural Insights http://data.parliament.uk/WrittenEvidence/Comm... GPG0064

In [266]:
def getSession(q):
    #Return the session whose question number range contains q
    #(assumes q falls within one of the known session ranges)
    return df[(df['n_qmin']<=q) & (df['n_qmax']>=q)].iloc[0]['session']

getSession(33)


Out[266]:
'Q1-35'

In [282]:
#Count the report's question references by oral evidence session

df_qs=pd.DataFrame(qs, columns=['qn'])
df_qs['session']=df_qs['qn'].apply(lambda x: getSession(int(x.strip('Q'))) )
s_qs_cnt=df_qs['session'].value_counts()
s_qs_cnt


Out[282]:
Q224-296    19
Q100-136    12
Q192-223    11
Q36-58      10
Q1-35       10
Q165-191     9
Q137-164     8
Q59-99       8
Name: session, dtype: int64

In [289]:
pd.concat([s_qs_cnt,df.groupby('session')['Org'].apply(lambda x: '; '.join(list(x)))],
          axis=1).sort_values('session',ascending=False)


Out[289]:
session Org
Q224-296 19 Department for Education; Department for Busi...
Q100-136 12 Close the Gap; Age UK; TUC; Penrose Care
Q192-223 11 Universities and Colleges Employers Associatio...
Q1-35 10 The Financial Times; Cardiff University; Manch...
Q36-58 10 CBI; Chartered Management Institute; Organisat...
Q165-191 9 Discrimination Law Association; Institute for ...
Q137-164 8 Working Families; Fatherhood Institute; Ernst ...
Q59-99 8 NUT; Medical Women's Federation; F1 Recruitmen...

In [306]:
#Written evidence
df_ws=pd.DataFrame(ws,columns=['RefNumber'])
df_ws=df_ws.merge(written_df, on='RefNumber')
df_ws['Org'].value_counts().head()


Out[306]:
Fawcett Society                                        4
The UK Commission for Employment and Skills (UKCES)    4
Science Council                                        3
Family and Childcare Trust                             3
Timewise                                               3
Name: Org, dtype: int64

In [305]:
#Organisations that gave written and witness evidence
set(df_ws['Org']).intersection(set(df['Org']))

#Note there are more matches that are hidden by dirty data
#- e.g. NUT and National Union of Teachers are presumably the same
#- e.g. F1 Recruitment and Search and F1 Recruitment Ltd are presumably the same


Out[305]:
{'Age UK',
 'CBI',
 'Chartered Management Institute',
 'Close the Gap',
 'Department for Education',
 'Discrimination Law Association',
 'Penrose Care',
 'TUC',
 'Working Families'}
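Some of those hidden matches could be surfaced with simple fuzzy matching. A sketch using the standard library's difflib, on illustrative name pairs drawn from the notes above - acronym mismatches like NUT vs National Union of Teachers will still slip through, since they share few characters:

```python
import difflib

# Illustrative organisation names from the written evidence and witness lists
written_orgs = ['F1 Recruitment Ltd', 'National Union of Teachers', 'Age UK']
witness_orgs = ['F1 Recruitment and Search', 'NUT', 'Age UK']

# For each witness organisation, look for a close textual match
for org in witness_orgs:
    matches = difflib.get_close_matches(org, written_orgs, n=1, cutoff=0.6)
    print(org, '->', matches)
```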

Scraping the Government Response


In [ ]:
url='https://publications.parliament.uk/pa/cm201617/cmselect/cmwomeq/963/96302.htm'

In [ ]:
#Inconsistency across different reports in terms of presentation, linking to evidence