In this notebook, we apply an unsupervised learning algorithm to the voting profile of each person in order to detect clusters, and observe whether these clusters match the political parties. To do so, we first build a network with people as nodes, and connect each node to its k (e.g. 3) nearest neighbours based on the distance matrix computed previously. The ML algorithm is spectral clustering, which uses the adjacency matrix of this network.
In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import sklearn
import sklearn.ensemble
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
import csv
In [2]:
path = '../../datas/nlp_results/'
voting_df = pd.read_csv(path+'voting_with_topics.csv')
print('Entries in the DataFrame',voting_df.shape)
#Dropping the useless column
voting_df = voting_df.drop(columns='Unnamed: 0')
#Putting numerical values into the columns that should have numerical values
#print(voting_df.columns.values)
num_cols = ['Decision', ' armée', ' asile / immigration', ' assurances', ' budget', ' dunno', ' entreprise/ finance',
' environnement', ' famille / enfants', ' imposition', ' politique internationale', ' retraite ']
voting_df[num_cols] = voting_df[num_cols].apply(pd.to_numeric)
#Inserting the full name at the second position
voting_df.insert(2,'Name', voting_df['FirstName'] + ' ' + voting_df['LastName'])
voting_df = voting_df.drop_duplicates(['Name'], keep = 'last')
voting_df = voting_df.set_index(['Name'])
voting_df.head(3)
Out[2]:
In [3]:
profileMatrixFile = 'profileMatrix.csv'
profileMatrix = pd.read_csv(profileMatrixFile, index_col = 0)
profileArray = profileMatrix.values
print(profileArray.shape)
profileMatrix.head()
Out[3]:
In [24]:
distanceMatrixFile = 'distanceMatrix.csv'
distances = pd.read_csv(distanceMatrixFile, index_col = 0)
distances = distances.replace(-0.001, 0)
distancesArray = distances.values
print(distancesArray.shape)
distances.head()
Out[24]:
In [78]:
k = 4 # number of nearest neighbours that we take into account in the adjacency matrix
for i in distances.index:
    # k-th smallest distance in row i (np.sort returns a sorted copy, so we index the result)
    threshold = np.sort(distances.loc[i].values)[k-1]
    # 1 if the distance is within the threshold (nearest neighbours), 0 otherwise
    distances.loc[i] = (distances.loc[i] <= threshold).astype(int)
distances.head()
Out[78]:
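The neighbour selection in the cell above can also be written without explicit loops. The following is a minimal sketch on toy data, not the notebook's own code: the helper name `knn_adjacency` and the toy distances are assumptions made for illustration. It also symmetrises the result, since scikit-learn's spectral clustering expects a symmetric affinity matrix.

```python
import numpy as np

def knn_adjacency(dist, k):
    """Binary k-nearest-neighbour adjacency from a square distance matrix.

    Hypothetical helper, not part of the notebook. Each row gets a 1 for
    every distance not larger than its k-th smallest one (so ties can add
    extra neighbours, and a zero self-distance counts among the k).
    """
    dist = np.asarray(dist, dtype=float)
    # k-th smallest distance of every row, kept as a column for broadcasting
    thresholds = np.sort(dist, axis=1)[:, k - 1][:, None]
    adjacency = (dist <= thresholds).astype(int)
    # Keep an edge if either endpoint selected the other: symmetric result
    return np.maximum(adjacency, adjacency.T)

# Toy distances between four points on a line at positions 0, 1, 2 and 10
positions = np.array([0.0, 1.0, 2.0, 10.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
toy_adj = knn_adjacency(toy_dist, k=2)
print(toy_adj)
```

Taking the element-wise maximum with the transpose is one possible choice; averaging the matrix with its transpose would be another.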
In [113]:
nbClust = 4
clusterDist = sklearn.cluster.spectral_clustering(affinity = distances.values, n_clusters = nbClust)
clusterDist
Out[113]:
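As a sanity check of the call above, the same function can be tried on a synthetic affinity matrix whose group structure is known in advance. This is only a sketch: the matrix `toy_affinity` is made up, and `random_state=0` is added here to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import spectral_clustering

# Two groups of four nodes, fully connected inside each group (toy data)
toy_affinity = np.zeros((8, 8))
toy_affinity[:4, :4] = 1.0
toy_affinity[4:, 4:] = 1.0
# A single weak link between the groups keeps the graph connected
toy_affinity[3, 4] = toy_affinity[4, 3] = 0.01

# Cluster label numbering is arbitrary; only the grouping is meaningful
toy_labels = spectral_clustering(toy_affinity, n_clusters=2, random_state=0)
print(toy_labels)
```

The first four nodes should share one label and the last four the other.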
In [121]:
ratio_df = pd.DataFrame(index = voting_df.ParlGroupName.unique())
ratio_df['ratio'] = 0
np.array(ratio_df.index)
Out[121]:
In [122]:
def ratioPartite(cluster, clusterDist):
    # Compute the party distribution for all people within this cluster
    people = distances.index[clusterDist == cluster]
    size = len(people)
    ratio_df = pd.DataFrame(index = voting_df.ParlGroupName.unique())
    ratio_df['ratio'] = 1.0
    for group in np.array(ratio_df.index):
        print(group)
        # People of this cluster who belong to the current parliamentary group
        peopleGroup = people[(voting_df.loc[people].ParlGroupName == group).values]
        print(len(peopleGroup) / float(size))
        # .at replaces set_value, which was removed from pandas
        ratio_df.at[group, 'ratio'] = len(peopleGroup) / float(size)
    return ratio_df
In [126]:
ratio_df = pd.DataFrame(index = voting_df.ParlGroupName.unique(), columns = range(nbClust))
ratio_df[0] = range(8)
ratio_df
Out[126]:
In [125]:
ratio_df = pd.DataFrame(index = voting_df.ParlGroupName.unique(), columns = range(nbClust))
for cluster in range(nbClust):
    ratio = ratioPartite(cluster, clusterDist)
    # Take the 1-D 'ratio' column rather than the 2-D .values array
    ratio_df[cluster] = ratio['ratio'].values
ratio_df
Out[125]:
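The same kind of table (share of each parliamentary group within each cluster) can also be obtained with `pd.crosstab`. The sketch below uses a made-up toy DataFrame; the column name `cluster` and the group labels are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the notebook's data: one row per person
toy = pd.DataFrame({
    'ParlGroupName': ['A', 'A', 'B', 'B', 'B', 'C'],
    'cluster':       [0,   0,   1,   1,   0,   2],
})

# normalize='columns' divides each count by the size of its cluster,
# so every column sums to 1
toy_ratio = pd.crosstab(toy['ParlGroupName'], toy['cluster'], normalize='columns')
print(toy_ratio)
```

On the real data, one would pass the `ParlGroupName` of each person and the corresponding cluster labels, aligned on the same index.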
We observe that when we group people into 4 clusters, the parties are well separated.
Note that we could also separate the data into 3 clusters. In this case, we observe that clusters 2 and 3 are merged together.