What we want to do here is work with our voting_with_topics.csv file, which records the vote that every person cast on each subject. We want to train a model and see whether, given a new text of law, it can successfully predict what a person would have voted. We will start by exploring our data.
In [ ]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, learning_curve
from ML_helpers import *
import sklearn.metrics
%matplotlib inline
%load_ext autoreload
%autoreload 2
# There are a lot of columns in the DataFrame.
# Therefore, we add this option so that we can see more of them
pd.options.display.max_columns = 100
We import the rather heavy DataFrame with the voting fields and the results. We drop a useless column and create a Name field containing both the first and the last name of a person, so that we can later create a model for each unique deputy in the parliament.
In [ ]:
path = '../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics_sentiment.csv')
print('Entries in the DataFrame',voting_df.shape)
#Dropping the useless column
#voting_df = voting_df.drop('Unnamed: 0',1)
#Converting the columns that should be numerical to numeric types (the text columns are set aside)
text_cols = ['BillTitle', 'BusinessTitle','text','text_eng','FirstName','LastName']
voting = voting_df.drop(text_cols,axis=1).apply(pd.to_numeric)
voting['text'] = voting_df.text
#Inserting the full name at the second position
voting.insert(1,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])
voting.head(10)
The first reduction of our DataFrame is to remove all entries whose Decision field contains a 4, 5, 6 or 7. These values mean that the person did not take part in the vote, which is hence not useful for our purpose.
In [ ]:
voting_df = voting[((voting.Decision != 4) & (voting.Decision != 5) & (voting.Decision != 6) & (voting.Decision != 7))]
print(voting_df.shape)
#print('Top number of entries in the df :\n', voting_df.Name.value_counts())
We now want to slice the DataFrame into multiple smaller DataFrames, each containing all the entries for a single person. This is done in order to be able to apply machine learning to each person individually. The function split_df below splits the DataFrame into a dictionary that maps each unique entry of a given field to the corresponding rows.
In [ ]:
def split_df(df, field):
    """
    Splits the input df along a certain field into a dictionary that links each unique
    entry of the field to the corresponding entries of the DataFrame
    """
    # Retrieve all the unique entries of the field
    unique_field = df[field].unique()
    print('Number of unique entries in',field,':',len(unique_field))
    #Create a dictionary of DataFrames which stores all the info relative to a single deputy
    df_dict = {elem : pd.DataFrame for elem in unique_field}
    for key in df_dict.keys():
        df_dict[key] = df.loc[df[field] == key]
    return df_dict
voting_dict = split_df(voting_df, 'Name')
From now on, we will work on a single example, to see whether we are able to produce anything reasonably accurate with a machine learning prediction. The cell below selects Silvia Schenker (Guy Parmelin, our former member of the National Council and now member of the Federal Council, is left as a commented-out alternative). Note that as the process of voting a law is iterative, going from one chamber to the other until the project is accepted, we first make the simplifying assumption of keeping only the last vote the person cast on a given subject. This reduces the size of the data we are working on quite a lot, but we still have enough entries.
In [ ]:
#df_deputee = voting_dict['Guy Parmelin'].drop_duplicates('text', keep = 'last')
df_deputee = voting_dict['Silvia Schenker'].drop_duplicates(['text','Name'], keep = 'last')#
print(df_deputee.shape)
df_deputee.head(10)
In [ ]:
#Encode the sentiment as +1 if the positive score dominates, -1 if the negative score dominates, 0 otherwise
df_deputee['sentiment'] = 1*(df_deputee['positive']>df_deputee['negative'])-1*(df_deputee['negative']>df_deputee['positive'])
In [ ]:
print(df_deputee.Decision.value_counts())
We see far fewer abstentions than yes and no votes, which is why we choose to ignore them at first. We also rescale the decision output to 0 and 1, otherwise the algorithm will not understand that it is a classification problem.
In [ ]:
#Decision 3 corresponds to an abstention, which we ignore here
df_deputee = df_deputee[df_deputee.Decision!=3]
df_deputee.shape
We will now format the data and keep only the relevant columns. The X DataFrame will contain the topic probabilities we got from the NLP step, X_text will hold the textual data, which we store to visualise the results later on. The Y vector contains the Decision taken by the person, which is what we want to predict. We will use the Random Forest Classifier, as we did in homework 4 of the course, as our prediction algorithm.
In [ ]:
no_pred_field = ['Decision','ParlGroupCode', 'text', 'Name','sentiment','compound']
no_scaled = ['positive','negative','neutral','compound']
#Keep only the entries with a non-zero compound sentiment score
df_deputee = df_deputee[df_deputee['compound'] !=0]
pred_field = df_deputee.columns.difference(no_pred_field)
X = df_deputee.drop(no_pred_field,axis=1)
#Binarise the topic probabilities with a 0.2 threshold
topic_cols = pred_field.difference(no_scaled)
X[X[topic_cols]>0.2]=1
X[X[topic_cols]<0.2]=0
#Keep only the entries classified under the ' assurances' topic
X = X[X[' assurances']==1]
X = X[['positive','negative','neutral']]
#X[topic_cols]=X[topic_cols].multiply(df_deputee.sentiment,axis=0)
X_text = df_deputee[['Name','text']]
Y = df_deputee[df_deputee[' assurances']>0.2]['Decision'] -1
The classifier we will use is the Random Forest. In order to evaluate the performance of our results, we will use several tools, which will help us to better understand the results we obtain. The cross_val_score method returns the accuracy of our classification (the average test score over the iterations of the cross-validation). Its cv parameter lets us choose the number of folds of cross-validation we want to perform (e.g. cv=5 -> 5-fold cross-validation).
Turning now to the code, we have several functions implemented in the ML_helpers.py file, which are as follows:
The cross_validation function performs the cv_param-fold cross-validation and outputs the cross-validation result, along with the F1-score and the confusion matrix, so that we can understand the shape of our results. By default we perform a 20-fold cross-validation, as we want a result that is as stable as possible.
The plot_feature_importances function ranks each feature of X according to the role it plays in the classification of the data we give it. This allows us to see whether a key subset of our features turns out to be markedly better than the rest at determining the outcome of the deputy's vote.
print_confusion_matrix is a helper function that visualises the confusion matrix in a more elegant way than the usual display of a 2D numpy array.
The learning curves are useful to see whether our model is massively overfitting, and to help tune the best parameters for our model. We will therefore plot the learning curves here and pick the best model we have for a given deputy.
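The implementations of these helpers live in ML_helpers.py and are not reproduced in this notebook. As a rough illustration only, here is a minimal sketch of what cross_validation and print_confusion_matrix could look like; the signatures are inferred from how they are called below and the bodies are assumptions, not the actual code of the file.
In [ ]:
# Hypothetical sketch only: the real implementations are in ML_helpers.py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import f1_score, confusion_matrix

def print_confusion_matrix(conf_mat, class_names=None):
    # Display the 2D confusion matrix as a labelled DataFrame rather than a raw numpy array
    if class_names is None:
        class_names = range(conf_mat.shape[0])
    print(pd.DataFrame(conf_mat, index=class_names, columns=class_names))

def cross_validation(X, Y, cv_param=20, max_depth=None):
    # cv_param-fold cross-validation of a random forest with a bounded depth
    forest = RandomForestClassifier(max_depth=max_depth)
    print('Cross-validation accuracy :', cross_val_score(forest, X, Y, cv=cv_param).mean())
    # Out-of-fold predictions, used for the F1-score and the confusion matrix
    Y_predicted = cross_val_predict(forest, X, Y, cv=cv_param)
    print('F1-score :', f1_score(Y, Y_predicted))
    print_confusion_matrix(confusion_matrix(Y, Y_predicted))
    return Y_predicted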
In [ ]:
estimator = RandomForestClassifier()
title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)
The plot describes the classification into a binary output. The default RandomForestClassifier clearly overfits our data, notably because the depth of the trees is not bounded. This is why we will iterate over different depths and fix the max depth to a value that yields a good result; bounding the depth will mitigate the overfitting.
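plot_learning_curve also comes from ML_helpers.py. For readers without that file, a plausible minimal version is sketched below, under the assumption that it wraps sklearn.model_selection.learning_curve and plots the mean training and cross-validation scores; the actual helper may differ.
In [ ]:
# Hypothetical sketch of plot_learning_curve (assumed to wrap sklearn's learning_curve);
# the real helper in ML_helpers.py may differ
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, X, Y, title, cv):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, Y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # Plot the mean training and cross-validation scores against the training set size
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label="Training score")
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label="Cross-validation score")
    plt.legend(loc="best")
    plt.show()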
We now compute the cross-validation score for a max depth varying from 1 to 20 and plot the result.
In [ ]:
cv_score = np.zeros(20)
tr_score = np.zeros(20)
cv_param = 20
for i in range(1,21):
    forest = RandomForestClassifier(max_depth = i)
    cv_score[i-1] = cross_val_score(forest,X,Y,cv = cv_param).mean()
    forest.fit(X,Y)
    tr_score[i-1] = forest.score(X,Y)
plot_fig(cv_score,tr_score,
         "Cross validation score against the depth of the random forest",
         "Max depth of the random forest","Cross validation score")
Now, we plot the learning curve with a fixed max depth.
In [ ]:
estimator = RandomForestClassifier(max_depth = 10)
title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)
In [ ]:
Y_predicted = cross_validation(X, Y, cv_param=20, max_depth=10)
features_2 = plot_feature_importances(X,Y,pred_field, max_depth=10)
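Like the other helpers, plot_feature_importances is defined in ML_helpers.py. A minimal sketch of what it might do, assuming it fits a forest and plots the feature_importances_ attribute (with the signature inferred from the call above), is given below; the real implementation may differ.
In [ ]:
# Hypothetical sketch of plot_feature_importances (signature inferred from the call above;
# the real helper in ML_helpers.py may differ)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def plot_feature_importances(X, Y, feature_names=None, max_depth=None):
    # Fit a forest and rank the features by the role they play in the classification
    forest = RandomForestClassifier(max_depth=max_depth)
    forest.fit(X, Y)
    # Fall back to the DataFrame's own columns if the provided names do not match
    names = X.columns if feature_names is None or len(feature_names) != X.shape[1] else feature_names
    importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
    importances.plot(kind='bar', title='Feature importances')
    plt.show()
    return importances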
We format the output DataFrame in a useful way, storing the deputy's actual Decision, our Predicted Decision and the fields that were used in the learning; the deputy's name is used in the name of the output file.
In [ ]:
df_out = df_deputee[np.r_[['Decision'],pred_field]]
df_out.insert(1,'Predicted Decision',Y_predicted + 1)
#Optional line to round the probabilities to 3 decimals (they will no longer sum exactly to one, but are simpler to read)
df_out = df_out.apply(np.around,decimals=3)
if not os.path.exists("../datas/treated_data/Voting_prediction"):
    os.makedirs("../datas/treated_data/Voting_prediction")
df_out.to_csv('../datas/treated_data/Voting_prediction/'+'prediction_'+
              df_deputee['Name'].unique()[0].lower().replace(' ','_')+'.csv',index=False)
df_out.head(100)
TODO : What could be improved :