01 - Machine Learning

What we want to do here is work with our voting_with_topics_sentiment.csv file, which records the vote that every person cast on each subject. We want to train a model and see whether, given a new text of law, our model will successfully predict what a person would have voted. We start by exploring our data.

1.0 - Imports and Loading of the Data


In [ ]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, learning_curve
from ML_helpers import *
import sklearn.metrics

%matplotlib inline
%load_ext autoreload
%autoreload 2

# There are a lot of columns in the DF,
# so we add this option to be able to see more of them
pd.options.display.max_columns = 100

We import the rather heavy DataFrame with the voting fields and the results. We drop a useless column and create a Name field, which contains both the first and the last name of a person, so that we can later create a model for each unique deputy in the parliament.


In [ ]:
path = '../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics_sentiment.csv')
print('Entries in the DataFrame',voting_df.shape)

#Dropping the useless column
#voting_df = voting_df.drop('Unnamed: 0',1)

#Converting the columns that should hold numerical values to numeric types
text_cols = ['BillTitle', 'BusinessTitle','text','text_eng','FirstName','LastName']
voting = voting_df.drop(text_cols,axis=1).apply(pd.to_numeric)
voting['text'] = voting_df.text
#Inserting the full name at the second position
voting.insert(1,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])

voting.head(10)

The first reduction of our DataFrame is to remove all entries whose Decision field contains a 4, a 5, a 6 or a 7. These codes mean that the person did not take part in the vote, which is hence not useful for our purpose.


In [ ]:
voting_df = voting[~voting.Decision.isin([4, 5, 6, 7])]
print(voting_df.shape)
#print('Top number of entries in the df :\n', voting_df.Name.value_counts())

We now want to slice the DataFrame into multiple smaller DataFrames, each containing all the entries for a single person. This is done so that we can apply machine learning to each person individually. The function split_df below splits the DataFrame into a dictionary mapping each unique entry of a given field to the corresponding rows.


In [ ]:
def split_df(df, field):
    """
        Splits the input df along a given field, returning a dictionary which maps each
        unique entry of the field to the corresponding entries of the DataFrame.
    """
    # First, retrieve all the unique entries of the field
    unique_field = df[field].unique()
    print('Number of unique entries in',field,':',len(unique_field))
    # Build a dictionary of DataFrames, each storing all the info relative to a single entry
    return {key: df.loc[df[field] == key] for key in unique_field}

voting_dict = split_df(voting_df, 'Name')

1.1 Machine Learning on a single deputy

From now on, we will work on an example, to see whether we can produce anything reasonably accurate with a machine learning prediction. We will work with the data of the deputy Silvia Schenker (the commented-out line below shows how to select Guy Parmelin instead). Note that, as the process of voting a law is iterative, going from one chamber to the other until the project is accepted, we first make the simple assumption of keeping only the last vote the person cast on a given subject. This reduces the size of the data we work with quite a lot, but we still have enough of it.


In [ ]:
#df_deputee = voting_dict['Guy Parmelin'].drop_duplicates('text', keep = 'last')
df_deputee = voting_dict['Silvia Schenker'].drop_duplicates(['text','Name'], keep = 'last')

print(df_deputee.shape)
df_deputee.head(10)

In [ ]:
# Sign of the sentiment: +1 if the positive score dominates, -1 if the negative one does, 0 on a tie
df_deputee['sentiment'] = 1*(df_deputee['positive']>df_deputee['negative'])-1*(df_deputee['negative']>df_deputee['positive'])

1.1.1 Preparing the Features

Before moving on to machine learning proper, let us visualise the number of votes in each of the 3 possible categories (1: yes, 2: no, 3: abstention).


In [ ]:
print(df_deputee.Decision.value_counts())

We see far fewer abstentions than yes and no votes, which is why we choose to ignore them at first. We also rescale the decision output to 0 and 1 (when building the Y vector below); otherwise the algorithm will not understand that it is a classification problem.


In [ ]:
df_deputee = df_deputee[df_deputee.Decision!=3]
df_deputee.shape

We will now format the data and keep only the relevant columns. The X DataFrame will contain the probabilities we got from the NLP, and X_text will contain the textual data, which we store in order to visualise the results later on. The Y vector contains the Decision taken by the person; this is what we want to predict. As our prediction algorithm, we will use the Random Forest Classifier, as in homework 4 of the course.


In [ ]:
no_pred_field = ['Decision','ParlGroupCode', 'text', 'Name','sentiment','compound']
no_scaled = ['positive','negative','neutral','compound']
# Keep only the entries with a non-zero compound sentiment score
df_deputee = df_deputee[df_deputee['compound'] != 0]
pred_field = df_deputee.columns.difference(no_pred_field)
topic_cols = pred_field.difference(no_scaled)
X = df_deputee.drop(no_pred_field,axis=1)
# Binarise the topic probabilities: a topic is considered present above 0.2
X[X[topic_cols] > 0.2] = 1
X[X[topic_cols] < 0.2] = 0
# Keep only the votes about the ' assurances' topic (note the leading space in the column name)
X = X[X[' assurances'] == 1]
X = X[['positive','negative','neutral']]
#X[topic_cols] = X[topic_cols].multiply(df_deputee.sentiment,axis=0)
X_text = df_deputee[['Name','text']]
# Rescale the Decision (1: yes, 2: no) to a 0/1 binary target
Y = df_deputee[df_deputee[' assurances'] > 0.2]['Decision'] - 1

1.1.2 Classification of our data

The classifier we will use is the Random Forest. To evaluate the performance of our classification, we will use several tools that help us better understand the results we obtain.

  1. The cross_validation module of scikit-learn allows us to test the performance of our classification. The cross_val_score method returns the accuracy of our classification (the average test score over the iterations of the cross-validation). The cv parameter lets us choose the number of folds of cross-validation we want to perform (e.g. cv=5 -> 5-fold cross-validation).
  2. The F1 score takes the false positives and the false negatives into account when computing the score. Hence, a model with a high prediction accuracy can get very poor results in the F1-metric if, for instance, it assigns everything to the class which is dominant in the population (cf. the "Everybody has cancer" example in lecture 07 of the course, and the toy example right after this list).
  3. The confusion matrix shows the detail of the classification and allows us to visualise the false positives and false negatives. The F1-score can be computed from this matrix.
  4. The feature importances allow us to see which features are the most significant for the classification.
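
As a toy illustration of point 2 (not part of our pipeline), consider a dummy classifier that always predicts the dominant class: its accuracy looks good, yet its F1-score on the minority class is zero.


In [ ]:
# Toy example: a "majority class" classifier on an imbalanced population.
# Its accuracy is 0.9, yet it never detects the minority class
# (F1 = 0; sklearn may warn that the precision is ill-defined).
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = np.array([1]*90 + [0]*10)   # 90% of the votes are "yes"
y_pred = np.ones(100)                # always predict "yes"
print('Accuracy :', accuracy_score(y_true, y_pred))
print('F1-score (class 0) :', f1_score(y_true, y_pred, pos_label=0))
print(confusion_matrix(y_true, y_pred))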

If we now turn to the code, we have several functions implemented in the ML_helpers.py file, which are as follows (a minimal sketch of the first one is given after this list):

  • The cross_validation function performs the cv_param-fold cross-validation and outputs the cross-validation result, along with the F1-score and the confusion matrix, in order for us to understand the shape of our results. We perform a 20-fold cross-validation by default, as we want a result as stable as possible.
  • The plot_feature_importances function ranks each feature of X according to the role it plays in the classification of the data we give it. This allows us to see whether a key subset of our features turns out to be markedly better than the rest at determining the outcome of the deputy's vote.
  • The print_confusion_matrix function is a helper that visualises the confusion matrix in a more elegant way than the usual display of a 2D numpy array.
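
Since ML_helpers.py is not reproduced in this notebook, here is a minimal sketch of what the cross_validation function could look like; the signature and the exact output format are assumptions based on how the function is used below.


In [ ]:
# Assumed sketch of the cross_validation helper (the real one lives in
# ML_helpers.py): k-fold cross-validation accuracy, plus F1-score and confusion
# matrix computed from the out-of-fold predictions, which are also returned
from sklearn.metrics import f1_score, confusion_matrix

def cross_validation_sketch(X, Y, cv_param=20, max_depth=None):
    forest = RandomForestClassifier(max_depth=max_depth)
    print('Cross-validation accuracy :', cross_val_score(forest, X, Y, cv=cv_param).mean())
    Y_pred = cross_val_predict(forest, X, Y, cv=cv_param)
    print('F1-score :', f1_score(Y, Y_pred))
    print('Confusion matrix :\n', confusion_matrix(Y, Y_pred))
    return Y_pred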

Learning curves are useful to see whether our model is massively overfitting, and to help tune the best parameters on which to run our model. We will therefore plot the learning curves and pick the best model we have for a given deputy.
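
The plot_learning_curve function also lives in ML_helpers.py; a minimal sketch of such a function, assuming it wraps the learning_curve utility imported at the top of this notebook, could be:


In [ ]:
# Assumed sketch of a learning-curve plot (the real plot_learning_curve is in
# ML_helpers.py): a persistent gap between the training and the cross-validation
# curves is the signature of overfitting
def plot_learning_curve_sketch(estimator, X, y, title, cv):
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))
    plt.figure()
    plt.title(title)
    plt.xlabel('Training examples')
    plt.ylabel('Score')
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
    plt.plot(train_sizes, test_scores.mean(axis=1), 'o-', label='Cross-validation score')
    plt.legend(loc='best')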


In [ ]:
estimator = RandomForestClassifier()    

title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)

The plot describes the classification into a binary output. The default RandomForestClassifier clearly overfits our data, notably because the depth of the trees is not bounded. This is why we will iterate over different depths and fix the maximum depth to a value that yields a good result; bounding the depth will mitigate the overfitting.

We now compute the cross-validation score for a maximum depth varying from 1 to 20, and plot the result.


In [ ]:
cv_score = np.zeros(20)
tr_score = np.zeros(20)

cv_param = 20
for i in range(1,21):
    forest = RandomForestClassifier(max_depth = i)
    cv_score[i-1] = cross_val_score(forest,X,Y,cv = cv_param).mean()
    forest.fit(X,Y)
    tr_score[i-1] = forest.score(X,Y)

plot_fig(cv_score,tr_score,
         "Cross validation score against the depth of the random forest",
         "Max depth of the random forest","Cross validation score")

Now, we plot the learning curve with a maximum depth of 10.


In [ ]:
estimator = RandomForestClassifier(max_depth = 10)    

title = "Learning Curves (Random Forest Classifier with 2 classes)"
plot_learning_curve(estimator,X,Y,title,20)

1.1.3 Results

Having found the depth at which our tree does not overfit, we want to focus on understanding the results we get. We will see, given the features we have, whether our algorithm is able to classify correctly.

N.B. The shallower the tree, the dumber the classifier.


In [ ]:
Y_predicted = cross_validation(X, Y, cv_param=20, max_depth=10)
features_2 = plot_feature_importances(X,Y,pred_field, max_depth=10)

We format the output DataFrame in a useful way, storing the deputy's actual Decision, our Predicted Decision and the fields that were used in the learning, and we save it to a CSV file named after the deputy.


In [ ]:
#Keep only the rows that were used in the learning (the ' assurances' subset),
#so that the predictions align with the rows of the DataFrame
df_out = df_deputee[df_deputee[' assurances']>0.2][np.r_[['Decision'],pred_field]]
df_out.insert(1,'Predicted Decision',Y_predicted + 1)


#Optional line to round the probabilities to 3 decimals (will not sum to one, but simpler to look at it.)
df_out = df_out.apply(np.around,decimals=3)

if not os.path.exists("../datas/treated_data/Voting_prediction"):
    os.makedirs("../datas/treated_data/Voting_prediction")
df_out.to_csv('../datas/treated_data/Voting_prediction/'+'prediction_'+
              df_deputee['Name'].unique()[0].lower().replace(' ','_')+'.csv',index=False)



df_out.head(100)

TODO : What could be improved :

  1. Not simply taking the last iteration of the law: maybe take the intermediary votes into account as well
  2. Averaging by party (see the sketch below)
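
For the second point, a possible starting point would be to reuse split_df on the ParlGroupCode column (assuming it identifies the parliamentary group) and train one model per party instead of one per deputy:


In [ ]:
# Hypothetical sketch for point 2: one DataFrame per party instead of per deputy,
# assuming the ParlGroupCode column identifies the parliamentary group
party_dict = split_df(voting_df, 'ParlGroupCode')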