In this notebook, we analyse votes using the topic modelling results: how topics evolve over time, and statistical tests of whether votes on some topics are more likely to be accepted than others.


In [ ]:
import pandas as pd
import glob
import os
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, cross_val_predict, learning_curve
import sklearn.metrics

%matplotlib inline
%load_ext autoreload
%autoreload 2

# The DataFrame has many columns,
# so we raise the display limit to see more of them
pd.options.display.max_columns = 100

In [ ]:
path = '../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics.csv')
print('Entries in the DataFrame',voting_df.shape)

#Dropping the useless column
voting_df = voting_df.drop(columns='Unnamed: 0')

#Putting numerical values into the columns that should have numerical values
#print(voting_df.columns.values)

num_cols = ['Decision', ' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance',
           ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']
voting_df[num_cols] = voting_df[num_cols].apply(pd.to_numeric)

#Inserting the full name at the second position
voting_df.insert(2,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])

voting_df.head(3)

Evolution of topics in the votes

We look at the data to see which topics come up most often in the votes, and whether this changes over the years.
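
As a compact illustration of the idea, the yearly mean weight of each topic can also be obtained with a single `groupby`. This is only a sketch on invented toy data: the `toy` frame, its values and the `Year` column are assumptions for the example, not part of the real dataset.

```python
import pandas as pd

# Toy data standing in for voting_df; all values are invented for illustration.
toy = pd.DataFrame({
    'VoteEnd': ['2009-09-25', '2009-12-11', '2010-03-19', '2010-06-18'],
    ' armée': [0.2, 0.4, 0.1, 0.3],
    ' budget': [0.8, 0.6, 0.9, 0.7],
})

# Keep only the year part of the date string (hypothetical helper column).
toy['Year'] = toy['VoteEnd'].str[:4]

# Mean topic weight per year: one row per year, one column per topic.
yearly_topic_means = toy.groupby('Year')[[' armée', ' budget']].mean()
print(yearly_topic_means)
```

The resulting frame has the same shape as `topicEvolution` in the cells that follow, and can be plotted directly with `yearly_topic_means.plot()`.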


In [ ]:
# Work on a real copy so that voting_df itself is left unchanged
voting_df_copy = voting_df.copy()
voting_df_copy.head(3)

In [ ]:
voting_df_copy['VoteEnd'] = [x[:4] for x in voting_df_copy['VoteEnd']]
voting_df_copy.head(3)
dates = voting_df_copy['VoteEnd'].drop_duplicates()
dates = np.sort(dates)
dates

In [ ]:
#voting_df_copy.index = voting_df_copy['VoteEnd']
voting_df_copy = voting_df_copy.set_index(['VoteEnd', 'Name'])

voting_df_copy.head(3)

In [ ]:
voting_df_copy.columns

In [ ]:
#voting_df_copy.loc['2009-09'][' armée']
#topicArmee = np.mean(voting_df_copy.loc['2009-09'][' armée'])

#print(topicArmee)
#indices = voting_df_copy.index[:].values
#indices

topics = ['armée', 'asile / immigration', 'assurances', 'budget', 'dunno', 'entreprise/ finance', 'environnement', 'famille / enfants', 'imposition', 'politique internationale', 'retraite  ']
topicEvolution = pd.DataFrame(index = dates, columns = topics)
#topicArmeeEvolution = []

for topic in topics:
    for date in dates:
        # Mean weight of this topic over all votes of the given year.
        # A single .loc[row, column] call avoids chained assignment, which may not write to the frame.
        topicEvolution.loc[date, topic] = voting_df_copy.loc[date][' ' + topic].mean()
    plt.plot(dates, topicEvolution[topic])

plt.legend(topics)
topicEvolution

In [ ]:
topicEvolution.to_json("topicEvolution.json")

Extract vote decision for each subject

For each subject put to a vote, we extract the final result by comparing the number of Yes votes to the number of No votes.
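
The comparison can be sketched on toy data as follows. The frame below is invented, and the coding of `Decision` (1 for Yes, 2 for No, any other value ignored) is an assumption inferred from how the notebook compares the counts of 1 and 2.

```python
import pandas as pd

# Toy ballots: one row per (text, voter).
# Assumed coding: Decision 1 = Yes, 2 = No; other values are ignored.
toy = pd.DataFrame({
    'text': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Decision': [1, 1, 2, 2, 2, 3],
})

# Count the Yes and No votes separately for each text.
yes_counts = toy['Decision'].eq(1).groupby(toy['text']).sum()
no_counts = toy['Decision'].eq(2).groupby(toy['text']).sum()

# A text counts as accepted when Yes votes strictly outnumber No votes.
accepted = yes_counts > no_counts
print(accepted)
```

With a strict comparison, a tie is treated as a rejection, which matches the `>` used in the cells that follow.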


In [ ]:
voting_df = pd.read_csv(path+'voting_with_topics_sentiment.csv')
voting_df_Decision = voting_df
#voting_df_TopicAcceptation = voting_df_TopicAcceptation.set_index(['IdVote', 'Name'])
voting_df_Decision.head(3)

In [ ]:
texts = voting_df_Decision['text'].unique()
print(len(texts))
texts[0]

In [ ]:
voting_df_Decision = voting_df_Decision.set_index(['text', 'LastName'])
voting_df_Decision.head()

In [ ]:
# A text is accepted when its Yes votes (Decision == 1) outnumber its No votes (Decision == 2)
decisions_dict = {}
for t in texts:
    decisions = voting_df_Decision.loc[t].Decision
    decisions_dict[t] = np.sum(decisions == 1) > np.sum(decisions == 2)

In [ ]:
len(decisions_dict)

In [ ]:
decisions_df = pd.DataFrame.from_dict(decisions_dict, 'index')
decisions_df.columns = ['Decision']
decisions_df.head()

Acceptance ratio for each topic

We now want to see whether there is a significant association between the topic of a vote and the vote being accepted.


In [ ]:
path = '../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics_unique_sentiment.csv')
voting_df_TopicAcceptation = voting_df
#voting_df_TopicAcceptation = voting_df_TopicAcceptation.set_index(['IdVote', 'Name'])
voting_df_TopicAcceptation.head(3)

In [ ]:
voting_df_TopicAcceptation['Accepted'] = decisions_df.loc[voting_df_TopicAcceptation['text']].Decision.values
voting_df_TopicAcceptation.head(3)

In [ ]:
topics = [' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance', ' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite  ']

voting_df_TopicAcceptation['Topic'] = voting_df_TopicAcceptation[topics].idxmax(axis=1)

voting_df_TopicAcceptation.head(3)

In [ ]:
decisions_df2 = voting_df_TopicAcceptation[['text', 'Accepted', 'Topic']]
decisions_df2 = decisions_df2.set_index('text')
decisions_df2.to_csv('topic_accepted.csv')

In [ ]:
meanAcceptation = np.mean(voting_df_TopicAcceptation.loc[:, 'Accepted'])
print('Fraction of accepted votes:', meanAcceptation)

In [ ]:
voting_df_TopicAcceptation = voting_df_TopicAcceptation[['Topic', 'Accepted']]
voting_df_MeanAcceptation = voting_df_TopicAcceptation.groupby('Topic').mean()
voting_df_MeanAcceptation

Qualitatively, votes about armée, assurances and environnement are accepted less often than votes about other topics. Conversely, votes about asile / immigration, entreprise/ finance and imposition are more likely to be accepted. This observation comes from a quick look at the data; the aim of the next part is to determine whether these observed variations are statistically significant.


In [ ]:
#voting_df_MeanAcceptation['Accepted'].plot.hist(by = 'Topic')
#voting_df_MeanAcceptation.hist()
#plt.show()

In [ ]:
voting_df_TopicAcceptation = voting_df_TopicAcceptation.set_index('Topic')
voting_df_TopicAcceptation.head(3)

The statistical test determines whether a sample of votes has a mean acceptance rate that differs from the global mean acceptance rate. Each sample consists of the votes about one topic. We therefore apply one test per topic, with the null hypothesis that the votes about this topic have the same mean acceptance rate as the global mean, i.e. 65.5%. The test we use is a one-sample t-test.
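
To make the test statistic concrete, here is a minimal sketch that computes the one-sample t statistic by hand for a single hypothetical topic. The `accepted` list is invented; only the global mean of 65.5% comes from the text. `stats.ttest_1samp` reports this same statistic together with a p-value.

```python
import math
import statistics

# Hypothetical outcomes for one topic: 1 = accepted, 0 = rejected.
accepted = [1, 0, 0, 1, 0, 0, 0, 1]
global_mean = 0.655  # global mean acceptance rate quoted in the text

n = len(accepted)
sample_mean = statistics.mean(accepted)
sample_std = statistics.stdev(accepted)  # sample standard deviation (divides by n - 1)

# One-sample t statistic: distance between the sample mean and the
# global mean, measured in standard errors of the sample mean.
t_statistic = (sample_mean - global_mean) / (sample_std / math.sqrt(n))
print(sample_mean, t_statistic)
```

A negative statistic means the topic is accepted less often than the global mean; whether the difference is significant depends on the p-value, which requires the t distribution with n - 1 degrees of freedom.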


In [ ]:
import scipy.stats as stats

for t in topics:
    votationAboutTopic = voting_df_TopicAcceptation.loc[t]
    print('Topic ' + t,'- Number of votations',len(votationAboutTopic.loc[:, 'Accepted']))
    print(stats.ttest_1samp(a= np.array(votationAboutTopic.loc[:,'Accepted']), popmean=meanAcceptation))
    print()

We use a significance level of 5% for the p-value. At this level, we can reject the null hypothesis for several topics:

  • armée : the mean acceptance rate is lower than the global mean
  • asile / immigration : the mean acceptance rate is higher than the global mean
  • assurances : the mean acceptance rate is lower than the global mean
  • entreprise/ finance : the mean acceptance rate is higher than the global mean
  • environnement : the mean acceptance rate is lower than the global mean
  • imposition : the mean acceptance rate is higher than the global mean