01 - Raw Scraping

What we do here is scrape the Swiss Parliament website, following the hierarchy described in the metadata we were given. We built a generic URL from which all our queries for the different fields start, as described below. This URL is saved in base_url.txt and has the form

https://ws.parlament.ch/odata.svc/[]?$filter=()%20gt%20{0}L%20and%20()%20lt%20{1}L%20and%20Language%20eq%20%27FR%27

with several placeholders that will be filled in for each specific query further down.
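
For instance, filling the placeholders for the Session field filtered on LegislativePeriodNumber gives the query used in section 2.3 below. The following lines are only a sketch reproducing the substitutions that save_data (defined below) performs; the bounds 36 and 51 come from the LegislativePeriod IDs (37 to 50) retrieved further down.

In [ ]:
template = ("https://ws.parlament.ch/odata.svc/[]?$filter=()%20gt%20{0}L%20and%20()"
            "%20lt%20{1}L%20and%20Language%20eq%20%27FR%27")

# '[]' -> the field to query, '()' -> the field to filter on (both occurrences),
# '{0}' / '{1}' -> the exclusive lower and upper bounds of the filter.
url = (template.replace('[]', 'Session')
               .replace('()', 'LegislativePeriodNumber')
               .replace('{0}', '36')
               .replace('{1}', '51'))
print(url)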

0. Usual Imports

A number of libraries are required to successfully scrape the data and write it to CSV files. To install the xmljson package:

pip install xmljson

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import html5lib
from lxml import etree
import numpy as np
import xmljson
from xmljson import badgerfish as bf
import json
import xml.etree.ElementTree as ET
from io import StringIO
import webbrowser
import requests
import os

1. Key Parsing Functions

Before starting, we check whether the folders we will use to store our data exist, and create them if they do not.


In [2]:
if not os.path.exists("../datas"):
    os.makedirs("../datas")
if not os.path.exists("../datas/scrap"):
    os.makedirs("../datas/scrap")
directory = '../datas/scrap'

parse_data is the first parsing function we use. Given a completed URL, it fetches everything related to a field and formats it into a DataFrame. For instance, given a GET query on LegislativePeriod, it retrieves all the properties of each entry (e.g. ID, Language, LegislativePeriodNumber, ...) and formats them into a DataFrame with one row per entry, identified by its ID, and one column per property.

Important note: the Language field, present in every entity, is systematically filtered on the French entries (FR). Choosing another language only changes the wording of the values (party names in German instead of French, and so on), so the same information is available in the other languages.


In [3]:
def parse_data(base_url):
    """
        Fetches and parses data from the base_url given as parameter, then formats it into a DataFrame
        which is returned. The quadruple for loop is due to the particular structure of the Atom feed
        returned by the website.
        @param base_url : the precise GET request we want to formulate.
        @return data : a DataFrame which contains the formatted result of the query.
    """
    
    with urllib.request.urlopen(base_url) as url:
        s = url.read()
    
    root = ET.fromstring(s)
    
    dict_ = {}
    base = "{http://www.w3.org/2005/Atom}"
    # Each <entry> of the Atom feed is one record; its <content> wraps a properties element
    # whose children are the individual fields (ID, Language, ...).
    for entry in root.iter(base + 'entry'):
        for content in entry.iter(base + 'content'):
            for properties in content:
                for field in properties:
                    name = field.tag.split('}')[1]
                    if name in dict_:
                        dict_[name].append(field.text)
                    else:
                        dict_[name] = [field.text]
    data = pd.DataFrame(dict_)
    return data
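
As a quick, standalone illustration (assuming the web service is reachable; this call is not part of the pipeline), parse_data can be run on the URL for a single LegislativePeriod and the resulting DataFrame inspected. The exact column names depend on the entity queried.

In [ ]:
example_url = ("https://ws.parlament.ch/odata.svc/LegislativePeriod?"
               "$filter=LegislativePeriodNumber%20gt%2049%20and%20LegislativePeriodNumber%20lt%2051")
df = parse_data(example_url)
print(df.columns.tolist())   # e.g. ['ID', 'Language', 'LegislativePeriodNumber', ...]
print(df['ID'].unique())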

save_data is our most important function here. It builds the URL for the query we want to make, calls parse_data, and finally stores the resulting DataFrame in the corresponding directory.

TODO: describe why the URL completion is done this way. TODO: understand the id_ "game".


In [4]:
def save_data(parent_id, directory, id_name, parent_name=None, subject=None, url=None):
    """
        Forms the correct url to use to query the website. 
        Then, fetches the data and parses them into a usable csv file.
        @param parent_id :  the ID of the parent field 
                            (i.e. the last one we parsed and from which we got the ID 
                            -> necessary to make the query on the correct IDs) 
        @param directory : the directory in which the parsing will be saved
        @param id_name : the name of the parent field for exporting purposes
        @param parent_name : the name of the parent field formatted for the query
        @param subject : the topic we're currently parsing
        @param url :   the specific url for the topic we're treating, if we need a special one 
                       (otherwise we load the base_url)
        @return index : the indices on which our data range
        @return data : the formatted DataFrame containing all the infos that were scraped.
    """
    if url is None:
        with open('base_url.txt', 'r') as myfile:
            url = myfile.read()
    if subject is not None:
        url = url.replace('[]', subject)
    if parent_name is not None:
        url = url.replace('()', parent_name)
    url = url.replace('{0}', str(np.maximum(min(parent_id) - 1, 0)))
    url = url.replace('{1}', str(max(parent_id) + 1))
    print(url)
    
    data = parse_data(url)
    
    # The website might return empty data. In that case, we return nothing.
    # If useful data is returned, we save it to a specific location with a name given
    # by the parameters we passed.
    if not data.empty:
        if not os.path.exists(directory):
            os.makedirs(directory)
        index = list(map(int, data['ID'].unique().tolist()))
        data.to_csv(directory + '/' + id_name + 'id_' + str(min(parent_id)) + '-' + str(max(parent_id)) + '.csv')
        return index, data
    else:
        return None
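
Note that when the queried range contains no rows, save_data returns None instead of a tuple, so a caller that unpacks the result directly will raise a TypeError (this is exactly what stops the MemberCouncil cell at the very end of this notebook). If one wanted to loop blindly over ID ranges, a minimal guard could look like the following sketch, where some_directory and the ID range are placeholders, not values used in the pipeline.

In [ ]:
# Hypothetical defensive call; 'some_directory' and the ID range are illustrative only.
result = save_data([0, 100], some_directory, 'legi', 'IdLegislativePeriod', 'Vote')
if result is None:
    print("empty range, nothing saved")
else:
    ids, data = result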

2. Actual Parsing of the Data

Now that our parsing functions are defined, we can use them to retrieve the data we need from the website. Each cell performs the query for one specific field, and the data we retrieve lets us go further down the data tree (cf. the visualisation of the hierarchy of the data with XOData) and fetch the fields we are interested in. We especially need the Transcript and Voting fields: the first describes all the discussions that happen during a session at the Parliament, and the second the results of all the votes that take place during a session. Those are the two essential ingredients for the machine learning we will do later on. We proceed from the highest branch of the data tree, namely the LegislativePeriod. This is the year during which the Parliament met, with two sessions each time.
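
The part of the hierarchy we walk through below can be summarised as follows; each child field is queried by filtering on the IDs obtained from its parent:

LegislativePeriod
    Vote                (queried on IdLegislativePeriod)
    Session             (queried on LegislativePeriodNumber)
        Voting          (queried on IdSession)
        Meeting         (queried on IdSession)
            Subject     (queried on IdMeeting)
                Transcript  (queried on IdSubject)
MemberCouncil           (queried separately on its own ID)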

2.1. Saving Legislative Data

Every cell works roughly the same way: we first give the specific link to access the field if it requires one (otherwise we use the base_url described before), set the directory in which we save our data, and then retrieve it. We need to retrieve the IDs of the Legislative Periods in order to keep querying, as querying with an invalid LegislativePeriod ID makes the request crash.


In [ ]:
legislative_url ="https://ws.parlament.ch/odata.svc/LegislativePeriod?$filter=LegislativePeriodNumber%20gt%20{0}%20and%20LegislativePeriodNumber%20lt%20{1}"
base_legi_directory = directory+ "/legi"
legi_periode_id, _  = save_data([0,100], base_legi_directory,'legi',None,None,legislative_url)
print(legi_periode_id)

We see that the IDs are contiguous, ranging from 37 to 50. We will therefore query all the data whose LegislativePeriod ID lies between those two bounds. The first line of the output shows the link we built to make the query.

2.2. Saving Vote Data

A closely related field that we can access from the LegislativePeriod is the Vote field. It is the interface between an object that is voted on at the Parliament at a specific time and the results of this vote, which are available in the Voting field. We therefore query all the votes that happened during the Legislative Periods we are considering. We see below that the IDs range from 1 to 17983; an object may appear several times, as it can be voted on repeatedly, going back and forth between the National Council and the Council of States until the issue is settled.


In [ ]:
base_vote_directory= directory + "/Vote"
vote_id , _ = save_data(legi_periode_id,base_vote_directory,'legi','IdLegislativePeriod','Vote')
print(str(min(vote_id))+' '+str(max(vote_id)))

2.3. Saving Session Data

The Session field helps us identify precisely when an object was voted on. For a given Legislative Period, which is basically a year, there are several Sessions, usually a winter and a summer one, and sometimes special ones as well.


In [ ]:
base_session_directory= directory+ "/Session"
session_id, _ = save_data(legi_periode_id,base_session_directory,'Legi','LegislativePeriodNumber','Session')
print(str(min(session_id))+' '+str(max(session_id)))

2.4. Saving Voting Data

The routine below is not very efficient and is operated manually, being run several times in order to obtain all the Voting items; this is why it should not be run as-is in its current state. The complications come from the fact that the Voting IDs are not contiguous, and querying a nonexistent ID makes the request crash. Moreover, we can only query the Session IDs two by two, otherwise we receive a timeout as a response, because each ID encapsulates a lot of data. This is why we proceed as follows.


In [ ]:
base_voting_directory = directory + "/Voting"

# Iterate over some specific range of Session IDs
for i in range(np.int16((5005-4811)/2) + 1):
    # Particular URL to count the Voting entries for a pair of Sessions.
    url = "https://ws.parlament.ch/odata.svc/Voting/$count?$filter=Language%20eq%20%27FR%27%20and%20IdSession%20ge%20[1]%20and%20IdSession%20le%20[2]"
    session_id.sort()
    
    # First Session ID of the pair whose Votings we query
    id_ = 4811 + 2*i
    
    # Query the Sessions two by two
    take_id = [id_, id_+1]
    url = url.replace('[1]', str(min(take_id)))
    url = url.replace('[2]', str(max(take_id)))
    with urllib.request.urlopen(url) as response:
        s = response.read()
    print("count equals ====>" + str(s))
    voting_id, _ = save_data(take_id, base_voting_directory, 'Session', 'IdSession', 'Voting')
print(str(min(voting_id)) + ' ' + str(max(voting_id)))

2.5. Saving Meeting Data

This field depends on the Session, and records every Meeting that each chamber of the Parliament holds during a Session. We need it in order to access the Transcript field later, which is the transcription of every Subject discussed during a Meeting of either chamber during a Session of a Legislative Period. Nothing very surprising here: we access it from the Session field, which is why we need to record the session_id.


In [ ]:
base_meeting_directory = directory + "/Meeting"
meeting_id, _ = save_data(session_id, base_meeting_directory, 'Session', 'IdSession', 'Meeting')
print(str(min(meeting_id)) + ' ' + str(max(meeting_id)))

2.6. Saving Subject Data

This field, as described above, contains all the Subjects discussed during a single Meeting, and we hence need the meeting_id list to query it. It is the last step before being able to access the much desired Transcript data and retrieve it in a coherent way.


In [ ]:
base_subject_directory = directory + "/Subject"
subject_id, _ = save_data(meeting_id, base_subject_directory, 'Meeting', 'IdMeeting', 'Subject')
print(str(min(subject_id)) + ' ' + str(max(subject_id)))

2.7. Saving Transcript Data

Now that we have the list of subject_id, we are finally able to get the Transcript field, a record of everything discussed at the Parliament, on which we will base our Natural Language Processing analysis later on. The query is a bit complicated.

A single request over the whole subject_id range does not return all the Transcript entries at once. We therefore query repeatedly: after each call we drop the Subjects whose transcripts were already retrieved (those with an ID no greater than the largest IdSubject returned) and query the remaining ones, until we reach the last Transcript ID (206649).


In [ ]:
base_transcript_directory = directory + "/Transcript"
max_transcript_id = 206649
transcript_id = [0]
while max(transcript_id) < max_transcript_id:
    transcript_id, transcript = save_data(subject_id, base_transcript_directory, 'Subject', 'IdSubject', 'Transcript')
    # Drop the Subjects whose Transcripts we already retrieved, then query the remaining ones.
    max_id = max(list(map(int, transcript['IdSubject'])))
    subject_id = [i for i in subject_id if i > max_id]
    print(str(min(transcript_id)) + ' ' + str(max(transcript_id)))

2.8. Saving MemberCouncil Data

Finally, we scrape the MemberCouncil field, which describes the members of the councils. We query it in blocks of 1000 IDs starting from ID 5000, filtering on its own ID rather than on a parent ID. As soon as a queried block returns no data, save_data returns None, the tuple unpacking fails with the TypeError shown below, and the run stops there.

In [6]:
legislative_url ="https://ws.parlament.ch/odata.svc/[]?$filter=()%20gt%20{0}L%20and%20()%20lt%20{1}L%20and%20Language%20eq%20%27FR%27"
base_legi_directory = directory+ "/MemberCouncil"
legi_periode_id =[5000]
for i in range(10):
    legi_periode_id, _  = save_data([max(legi_periode_id),max(legi_periode_id)+1000], base_legi_directory,'MemberCouncil','ID','MemberCouncil',legislative_url)
    #print(legi_periode_id)


https://ws.parlament.ch/odata.svc/MemberCouncil?$filter=ID%20gt%204999L%20and%20ID%20lt%206001L%20and%20Language%20eq%20%27FR%27
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-9ceab321f36e> in <module>()
      3 legi_periode_id =[5000]
      4 for i in range(10):
----> 5     legi_periode_id, _  = save_data([max(legi_periode_id),max(legi_periode_id)+1000], base_legi_directory,'MemberCouncil','ID','MemberCouncil',legislative_url)
      6     #print(legi_periode_id)

TypeError: 'NoneType' object is not iterable
