In [14]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
from helpers import load_vote_df, load_voting,format_voting_session, split_df_dict
%matplotlib inline
%load_ext autoreload
%autoreload 2
# The DataFrame has many columns, so we raise the display limit
# in order to see more of them at once.
pd.options.display.max_columns = 100
In [2]:
voting_df = format_voting_session(load_voting())
There is a discrepancy here: in our Voting file, the smallest IdVote is 6392, so about a third of the laws we actually have are missing from it. The scraping did not retrieve everything, but this is not a problem, since the missing entries are the oldest ones, which are also the least relevant.
In [3]:
voting_df.IdVote.sort_values().unique()
Out[3]:
In [15]:
vote_df = load_vote_df()
vote_df.head()
Out[15]:
Indeed, the ID column here, which corresponds to IdVote in the Voting file, starts from 1, and these first entries concern the oldest votes. We will now generate a filtered version of the Vote dataframe, keeping only the votes that have a counterpart in the Voting file.
Moreover, we know that some BillTitle entries are empty, so we fill them with the BusinessTitle in order to handle the subject uniformly.
In [16]:
# Keep only the votes that also appear in the Voting file (smallest IdVote: 6392).
# .copy() avoids a SettingWithCopyWarning on the assignment below.
vote_df = vote_df.loc[vote_df.ID >= 6392].copy()
# Entries with a null BillTitle take their BusinessTitle instead
missing_title = vote_df.BillTitle.isnull()
vote_df.loc[missing_title, 'BillTitle'] = vote_df.loc[missing_title, 'BusinessTitle']
directory = '../../datas/treated_data/Vote/'
if not os.path.exists(directory):
    os.makedirs(directory)
vote_df.to_csv(directory+'legiid_47-50.csv')
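The cell above relies on the hard-coded threshold 6392. As a hedged alternative, the same filtering could be expressed as a membership test against the IdVote values, which would also drop any ID above the threshold that has no counterpart. The sketch below runs on invented toy data; `keep_matching_votes`, `toy_vote` and `toy_voting` are hypothetical names, not part of `helpers`.

```python
import pandas as pd

# Toy stand-ins for the real frames; all values are invented for illustration.
toy_vote = pd.DataFrame({
    "ID": [1, 2, 6392, 6393, 6394],
    "BillTitle": ["A", "B", None, "C", "D"],
    "BusinessTitle": ["a", "b", "c", "c2", "d"],
})
toy_voting = pd.DataFrame({"IdVote": [6392, 6392, 6394]})


def keep_matching_votes(vote, voting):
    """Keep only the votes whose ID appears as an IdVote in the voting frame."""
    matched = vote.loc[vote["ID"].isin(voting["IdVote"].unique())].copy()
    # Fall back to BusinessTitle where BillTitle is missing.
    matched["BillTitle"] = matched["BillTitle"].fillna(matched["BusinessTitle"])
    return matched


filtered = keep_matching_votes(toy_vote, toy_voting)
```

Whether this is preferable depends on the data: if every ID from 6392 onwards is present in the Voting file, both approaches select the same rows.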
In [17]:
vote_df = format_voting_session(vote_df)
# Fill NaN values with placeholder text so that the JavaScript does not crash later on.
vote_df = vote_df.fillna('Not specified')
vote_df.head()
Out[17]:
First of all, we associate a unique ID with each BillTitle. As its ID, we take the one that a given bill has the last time it appears in our DataFrame.
In [18]:
def map_BillTitle_Vote(vote_df):
    # One row per BillTitle, keeping the ID of its last occurrence
    df_link = vote_df[['BillTitle', 'ID']].rename(columns={'ID': 'ID_Bill'})
    df_link = df_link.drop_duplicates(['BillTitle'], keep='last')
    df_link = df_link.set_index('BillTitle')
    # Attach the per-bill ID to every vote
    vote_df = vote_df.join(df_link, on='BillTitle')
    return df_link, vote_df

df_link, vote_df = map_BillTitle_Vote(vote_df)
directory = '../../datas/analysis/'
if not os.path.exists(directory):
    os.makedirs(directory)
df_link.to_csv(directory+'map_bill_ID.csv')
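To make the "last occurrence" rule concrete, here is a minimal sketch on a toy frame. The titles and IDs are invented, and `toy`, `link` and `toy_with_bill` are illustrative names only.

```python
import pandas as pd

# Two bills, each voted on twice; IDs are invented for illustration.
toy = pd.DataFrame({
    "ID": [10, 11, 12, 13],
    "BillTitle": ["Budget", "Energy", "Budget", "Energy"],
})

# Last ID seen for each title, indexed by title.
link = (
    toy[["BillTitle", "ID"]]
    .rename(columns={"ID": "ID_Bill"})
    .drop_duplicates("BillTitle", keep="last")
    .set_index("BillTitle")
)

# Every row of a given bill now carries the same ID_Bill.
toy_with_bill = toy.join(link, on="BillTitle")
```

Both "Budget" rows end up with ID_Bill 12 and both "Energy" rows with 13, so ID_Bill can serve as a stable key for grouping all iterations of one bill.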
We first need a table linking each BillTitle to the corresponding file, which will contain all the information regarding that BillTitle.
In [19]:
bills_dict = split_df_dict(vote_df, 'ID_Bill')
In [20]:
directory = '../../datas/analysis/bill_link/'
if not os.path.exists(directory):
    os.makedirs(directory)
for ID_Bill, df in bills_dict.items():
    df.to_csv(directory+'bill_'+str(ID_Bill)+'.csv')
The last task is to link the ID of a given iteration of the law (in the files in the bill_link folder) to the votes that took place on the subject. We simply need to group the voting records by IdVote and export each group to its own file.
In [21]:
voting_df.head()
Out[21]:
We split the voting DataFrame by the IdVote field to get one sub-DataFrame per vote.
In [22]:
voting_dict = split_df_dict(voting_df, 'IdVote')
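The implementation of `split_df_dict` lives in `helpers` and is not shown here. A plausible reading, given how its result is used, is that it returns a dictionary mapping each distinct value of the chosen column to the matching rows. The sketch below is an assumption about that behaviour, not the actual helper; `split_by_column` and the toy columns `Councillor` and `Decision` are hypothetical.

```python
import pandas as pd


def split_by_column(df, column):
    """Hypothetical stand-in for split_df_dict: one sub-DataFrame per distinct value."""
    return {key: group for key, group in df.groupby(column)}


# Invented voting records: two members voted on 6392, one on 6393.
toy_voting = pd.DataFrame({
    "IdVote": [6392, 6392, 6393],
    "Councillor": ["X", "Y", "X"],
    "Decision": ["Yes", "No", "Yes"],
})

parts = split_by_column(toy_voting, "IdVote")
```

Each value in `parts` keeps all original columns, including the one used for splitting.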
Before exporting, we remove many redundant columns. Since we will reach these files through the .csv in the bill_link folder, we will already have all the information about the vote itself; we only need to keep what each member voted.
In [24]:
directory = '../../datas/analysis/bill_voting/'
if not os.path.exists(directory):
    os.makedirs(directory)
for IdVote, df in voting_dict.items():
    df = df.drop(['BillTitle', 'BusinessShortNumber', 'IdSession', 'VoteEnd', 'SessionName', 'Date'], axis=1)
    df.to_csv(directory+'voting_'+str(IdVote)+'.csv')
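To illustrate how the exported files can later be reached from a vote ID, here is a minimal round-trip sketch. It writes to a temporary directory rather than the project folders, reuses the `voting_<IdVote>.csv` naming from the cell above, and runs on invented data; `load_votes_for` and the columns `Councillor` and `Decision` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Invented voting records for two votes.
toy_voting = pd.DataFrame({
    "IdVote": [6392, 6392, 6393],
    "Councillor": ["X", "Y", "X"],
    "Decision": ["Yes", "No", "Yes"],
})

# Export one file per vote, mirroring the naming scheme used above.
out_dir = tempfile.mkdtemp()
for id_vote, group in toy_voting.groupby("IdVote"):
    group.to_csv(os.path.join(out_dir, f"voting_{id_vote}.csv"), index=False)


def load_votes_for(id_vote, directory):
    """Load the per-member votes for one vote ID from its exported file."""
    return pd.read_csv(os.path.join(directory, f"voting_{id_vote}.csv"))


votes = load_votes_for(6392, out_dir)
```

A row in a bill_link file carries the vote's ID, so that ID alone is enough to locate the matching file of individual votes.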