Social media such as Twitter and Facebook are now channels through which people continuously express their opinions on any matter. Data mining combined with data analysis can therefore be a good way to understand how users feel about a given topic. With this in mind, we decided to try to predict the result of an election using data from Twitter.
We focused on the 2016 US Senate elections because they provide enough meaningful data for our analysis: there are many candidates and many people tweet about them, which gives us a large dataset to train our algorithm on. We started by mining data from Twitter, collecting every tweet about most of the Republican and Democratic Senate candidates during the two months preceding the elections. We collected tweets aimed at each candidate, meaning tweets containing the candidate's Twitter name.
The second step was to perform a sentiment analysis on these tweets to understand whether the user was expressing a positive or a negative feeling towards the candidate. Sentiment analysis combines natural language processing, text analysis and computational linguistics to assess the attitude of a writer towards a topic. There are various tools for it; we picked one called Pattern, developed at the University of Antwerp in Belgium.
Afterwards we trained two variations of the same machine learning algorithm on the sentiment analysis data and some other features, such as the candidate's number of followers.
We started by finding the list of candidates for the 2016 United States Senate elections and looking up their Twitter accounts. We decided to use only Democratic and Republican candidates, since the other parties play only a minor role in US elections.
The Twitter APIs only allow searching tweets from the past week, so we used a tool to bypass this limitation and collect data from the 8th of September 2016 to the 8th of November 2016, the day of the elections.
The tool is a Python script found on GitHub (link in the Resources section). It let us retrieve all the tweets in a desired time frame according to some custom parameters. For example, to mine tweets about Evan Bayh from the 8th of September 2016 to the 8th of November 2016 we simply ran the following command:
python Exporter.py --querysearch "SenEvanBayh" --since 2016-09-08 --until 2016-11-08
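Since the same command is run once per candidate, it can help to build the argument list programmatically. The following is a minimal sketch; the helper name `exporter_command` is hypothetical, and only the three flags shown in the command above are used.

```python
def exporter_command(handle, since="2016-09-08", until="2016-11-08"):
    """Build the argument list for one candidate (hypothetical helper)."""
    return ["python", "Exporter.py",
            "--querysearch", handle,
            "--since", since,
            "--until", until]

# The resulting list could be passed to subprocess.call, for instance.
cmd = exporter_command("SenEvanBayh")
print(" ".join(cmd))
```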
The output of this step is kept in the data folder of our project in .csv format. The files contain the date of the tweet, the username of its author, the tweet itself, the number of likes and retweets it received and some other information.
In [2]:
# Imported libraries
from pattern.en import sentiment
import numpy as np
import os
import csv
from sklearn import linear_model
import time
from statsmodels.tsa.ar_model import AR
Sentiment analysis is used to measure the polarity of a text. Polarity indicates whether the text leans towards a negative or a positive attitude towards its topic.
Pattern, which we used to perform the sentiment analysis for this project, is a natural language processing toolkit. In particular, we used pattern.en, which was made for the English language.
The sentiment() function returns a (polarity, subjectivity) tuple for the given sentence, based on the adjectives it contains, where polarity is a value between -1.0 and +1.0 and subjectivity is a value between 0.0 and 1.0.
Before performing the sentiment analysis on the data, we replaced the # and @ symbols with spaces, so that a meaningful word following one of these symbols is still taken into account by the analysis.
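The cleaning step can be illustrated on a single string. This is a small sketch using plain `str.replace`; the helper name `strip_markers` and the example tweet are ours and are not part of the project.

```python
def strip_markers(tweet):
    """Replace '#' and '@' with spaces so that the word that follows is kept."""
    return tweet.replace('#', ' ').replace('@', ' ')

# Made-up example tweet
cleaned = strip_markers("Great speech by @SenEvanBayh #inspiring")
print(cleaned)
```

The word after each symbol survives as an ordinary token, at the cost of an extra space.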
Important note: this sentiment analysis tool requires Python 2.
In [3]:
# Use all the .csv files inside the data folder
for file in os.listdir("./data"):
    if file.endswith(".csv"):
        path = r'data/'+file
        print(file)
        # Extract the candidate identifier from the file name
        var = os.path.basename(path)
        var = str.split(var,'_')
        var = str.split(var[1],'.')
        sentiment_path = 'sentim/sentiment_'+var[0]+'.csv'
        # Read the text, the date and the number of retweets and favourites from the data files
        text=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        retweet=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(2,))
        favourites=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(3,))
        # Replace the # and @ symbols in the tweets with spaces
        text=np.core.defchararray.replace(text,'#',' ')
        text=np.core.defchararray.replace(text,'@',' ')
        array_sentiment=np.zeros((len(text),2))
        # Perform sentiment analysis and set values that are too close to zero to exactly zero
        for i in range(len(text)):
            array_sentiment[i] = sentiment(text[i])
            if (array_sentiment[i][0]<0.0000000000000001 and array_sentiment[i][0]> -0.0000000000000001 and array_sentiment[i][0]!=0.0):
                array_sentiment[i]=0
        # Save the sentiment analysis to the output file (columns: polarity, then subjectivity)
        np.savetxt(sentiment_path,np.transpose([date,array_sentiment[:,0],array_sentiment[:,1],retweet,favourites]),fmt="%s;%s;%s;%s;%s",delimiter=';',header="date;sentiment;subjectivity;retweets;favourites")
In [4]:
# Use all the .csv files inside the data folder
for file in os.listdir("./data"):
    if file.endswith(".csv"):
        cand = r'data/'+file
        mydata = np.loadtxt(cand, dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        # Print the total number of tweets and the number of unique tweet authors for this candidate
        print(file + " " + str(len(mydata)) + " " + str(len(np.unique(mydata))))
We computed a daily mean of the polarity values returned by the sentiment analysis of the tweets. The mean is weighted by the number of likes and retweets that each tweet received: we assumed that likes and retweets signal agreement with the content of a tweet, so such tweets deserve a higher weight. A tweet with no likes and no retweets gets a weight of 1, and tweets with a polarity of exactly 0 are left out of the mean.
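For one day, the weighted mean is the sum of weight times polarity divided by the sum of the weights, where the weight of a tweet is its retweets plus its likes (or 1 if it has neither). The following is a compact sketch of that rule using only the standard library. The function name, the tuple layout and the sample rows are assumptions made for illustration; the notebook itself works on NumPy arrays.

```python
from collections import defaultdict

def weighted_daily_means(rows):
    """rows: iterable of (day, polarity, retweets, favourites) tuples."""
    sums = defaultdict(float)     # per-day sum of weight * polarity
    weights = defaultdict(float)  # per-day sum of weights
    for day, polarity, retweets, favourites in rows:
        if polarity == 0.0:
            continue              # neutral tweets are left out
        w = retweets + favourites
        if w == 0:
            w = 1                 # a tweet without likes or retweets counts once
        sums[day] += w * polarity
        weights[day] += w
    return dict((day, sums[day] / weights[day]) for day in sums)

# Made-up sample rows
rows = [
    ("2016-11-07", 0.5, 3, 1),
    ("2016-11-07", -0.5, 0, 0),
    ("2016-11-08", 0.0, 10, 10),
    ("2016-11-08", 0.2, 0, 0),
]
means = weighted_daily_means(rows)
```

On the first day the popular positive tweet (weight 4) outweighs the negative one (weight 1), so the daily mean stays positive.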
In [5]:
# Use all the .csv files inside the sentim folder
for file in os.listdir("./sentim"):
    if file.endswith(".csv"):
        path = r'sentim/'+file
        print(file)
        var = os.path.basename(path)
        var = str.split(var,'_')
        var = str.split(var[1],'.')
        mean_path = 'means/mean_'+var[0]+'.csv'
        # Read sentiment, date and number of retweets and favourites from the sentiment analysis files
        sentim=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        retweet=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(3,))
        favourites=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        array_length=62
        mean=[0]*array_length
        # Create day and month arrays, from the 8th of November back to the 8th of September
        array_day=np.append(np.arange(8,0,-1),(np.append(np.arange(31,0,-1),np.arange(30,7,-1))))
        array_month=np.append(np.ones(8)*11,(np.append(np.ones(31)*10,np.ones(23)*9)))
        # Parse the day and the month of every tweet
        day=np.zeros(len(date))
        month=np.zeros(len(date))
        for i in range(len(date)):
            day[i]=np.datetime64(date[i]).astype(object).day
            month[i]=np.datetime64(date[i]).astype(object).month
        # Create the array of weights (based on retweets and favourites)
        array_weight = np.zeros(len(sentim))
        for i in range(len(retweet)):
            if(retweet[i]!=0.0 or favourites[i]!=0.0):
                array_weight[i]=retweet[i]+favourites[i]
            else:
                array_weight[i]=1
        # Accumulate the weighted sums day by day, skipping neutral tweets
        cnt_date=0
        cnt_mean=[0]*array_length
        for i in range(len(sentim)):
            if (sentim[i]!=0.0):
                if (array_day[cnt_date]==day[i] and array_month[cnt_date]==month[i]):
                    mean[cnt_date]=mean[cnt_date]+array_weight[i]*sentim[i]
                    cnt_mean[cnt_date]=cnt_mean[cnt_date]+array_weight[i]
                else:
                    # Move forward to the slot matching the date of this tweet
                    while (array_day[cnt_date]!=day[i] or array_month[cnt_date]!=month[i]):
                        if (cnt_date<len(array_day)-1):
                            cnt_date=cnt_date+1
                        else:
                            break
                    mean[cnt_date]=mean[cnt_date]+array_weight[i]*sentim[i]
                    cnt_mean[cnt_date]=cnt_mean[cnt_date]+array_weight[i]
        # Divide each daily sum by the total weight of that day
        weigthed_mean=[None]*array_length
        for i in range(len(mean)):
            if (mean[i]!=0.0):
                weigthed_mean[i]=mean[i]/cnt_mean[i]
            else:
                weigthed_mean[i]=0
        # Save the output to file
        np.savetxt(mean_path,np.transpose([array_day,array_month,mean,cnt_mean,weigthed_mean]),fmt="%d;%d;%s;%s;%s",delimiter=';',header="day;month;sum;weight;mean")
After creating the files containing the daily mean values, we built the data structures needed by the machine learning algorithm. In addition to the previous features, we collected by hand the number of followers of every candidate. Concerning the structure, we used dictionaries with the Twitter name of the candidate as the key.
d, name_dict and id_dict are dictionaries that share the same keys. For each key, d's values 0 to 61 are the daily means, value 62 is the number of followers, value 63 is the number of unique tweet authors and value 64 is the total number of tweets. For each key, id_dict's value 0 is the election result (1 for a win, 0 for a loss), value 1 is the percentage of votes that the candidate received and value 2 is the pair identifier that pairs up candidates running in the same election. For each key, name_dict's value 0 is the first name of the candidate and value 1 is the last name.
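To make the index layout of one entry of d easier to follow, here is a short sketch that splits such a 65-value vector into named parts. The helper name `split_features` and the numbers in the example are made up for illustration.

```python
N_DAYS = 62  # the daily means occupy indices 0..61

def split_features(values):
    """Split one entry of d into named parts (layout as described above)."""
    return {
        "daily_means": list(values[:N_DAYS]),
        "followers": values[62],
        "unique_authors": values[63],
        "total_tweets": values[64],
    }

# Made-up vector: 62 daily means followed by the three extra features
example = [0.1] * 62 + [15000.0, 420.0, 980.0]
parts = split_features(example)
```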
In [6]:
d={}
id_dict={}
name_dict={}
# Use all the .csv files inside the means folder
for file in os.listdir("./means"):
    if file.endswith(".csv"):
        path = r'means/'+file
        var = os.path.basename(path)
        var = str.split(var,'_got')
        var = str.split(var[1],'.')
        key = var[0]
        d.setdefault(key,[])
        d[key]=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        id_dict.setdefault(key,[])
        name_dict.setdefault(key,[])
with open('listDeputee.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar=';')
    for row in reader:
        if (row[0]!='Id'):
            d[row[0]]=np.append(d[row[0]],[float(row[5]),float(row[6]),float(row[7])])
            id_dict[row[0]]=np.append(id_dict[row[0]],[float(row[3]),float(row[4]),float(row[8])])
            name_dict[row[0]]=np.append(name_dict[row[0]],[row[1],row[2],])
The first step of the machine learning algorithm is to fit the daily mean values with an autoregressive (AR) model. The goal of this step is to reduce the number of features used for the prediction, because we have too many of them compared to the amount of data available: an AR model of a given order summarizes the whole series with one constant and one coefficient per lag. We tried multiple orders for the AR fit and chose the one that gave the best prediction result.
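To illustrate how an AR fit compresses a series into a few numbers, here is a sketch of the simplest case, order 1, fitted by ordinary least squares: x[t] = c + phi * x[t-1]. This is a simplification and not the notebook's method, which uses the statsmodels `AR` class with orders from 1 to 24. The function name `fit_ar1` and the test series are assumptions.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev = series[:-1]
    curr = series[1:]
    n = float(len(prev))
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi

# Synthetic series generated exactly by x[t] = 0.1 + 0.5 * x[t-1]
series = [1.0]
for _ in range(10):
    series.append(0.1 + 0.5 * series[-1])

c, phi = fit_ar1(series)  # two numbers now summarize the whole series
```

With order p, the series is summarized by p + 1 numbers in the same way, which is what makes the feature vector short enough for the small number of candidates.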
The second step is the prediction itself. First we concatenate the AR model coefficients with the extra features: the number of followers, the number of unique tweet authors and the total number of tweets. The idea of using these extra features came from papers we read on election prediction. Then we fit a linear regression model between the features and the actual percentage of votes that the candidate received.
The data set was divided into a training set and a test set. The split differs between the two algorithms and is described for each of them individually.
Concerning the training phase, we used the leave-one-out procedure to train the coefficients. This means that each time we train the coefficients on the training set minus one sample and test them on the sample that was left out. If the prediction is correct, we keep the coefficients. At the end we average all the coefficients that were kept. After the model is trained, we apply it to the test set and count the number of successful predictions and the total number of tries. What counts as a successful prediction depends on the algorithm.
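The leave-one-out step with coefficient averaging can be sketched on a toy problem with a single feature and no intercept. The correctness rule is passed in as a function because it differs between the two algorithms; the rule used in the example (both values on the same side of 50) is only one possibility. All names and data are made up for illustration.

```python
def fit_slope(xs, ys):
    """No-intercept least squares for one feature: y ~ slope * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def loo_average_slope(xs, ys, is_correct):
    """Average the slopes of the leave-one-out fits that predict
    their held-out sample correctly."""
    kept = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope = fit_slope(train_x, train_y)
        if is_correct(ys[i], slope * xs[i]):
            kept.append(slope)
    if not kept:
        return None  # no fit predicted its held-out sample correctly
    return sum(kept) / len(kept)

def same_side_of_50(actual, predicted):
    return (actual > 50 and predicted > 50) or (actual < 50 and predicted < 50)

# Made-up data lying exactly on y = 20 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [20.0, 40.0, 60.0, 80.0]
slope = loo_average_slope(xs, ys, same_side_of_50)
```

The sketch returns `None` when no fit is kept, so that the average is never computed over an empty list.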
The process described above is repeated multiple times (1000 times in the code below), each time with a new random division between the training and the test sets. After all the repetitions, we estimate the accuracy by dividing the total number of successful predictions by the total number of tries.
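The pooled accuracy estimate can be sketched as follows. The `evaluate` callback stands in for training on the training indices and counting correct predictions on the test indices. Both the function and the dummy evaluator are assumptions for illustration, and `random.sample` replaces the repeated draw-and-delete used in the notebook.

```python
import random

def pooled_accuracy(n_items, n_test, n_repeats, evaluate, seed=0):
    """Pool correct/total counts over repeated random train/test splits."""
    rng = random.Random(seed)
    right = 0
    total = 0
    for _ in range(n_repeats):
        test_idx = rng.sample(range(n_items), n_test)
        test_set = set(test_idx)
        train_idx = [i for i in range(n_items) if i not in test_set]
        right += evaluate(train_idx, test_idx)  # number of correct predictions
        total += n_test
    return right / float(total)

# Dummy evaluator: pretends that every test item with an even index is predicted correctly
acc = pooled_accuracy(38, 8, 100, lambda train, test: sum(1 for i in test if i % 2 == 0))
```

Dividing the pooled counts, rather than averaging per-split accuracies, gives the same result here because every split has the same test-set size.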
In this algorithm, we consider every candidate individually and the goal is to predict whether they won or lost. To do this we follow the procedure described above, with the output of the linear regression being the percentage of votes received by the candidate. The prediction is considered correct if a winning candidate receives a score higher than 50 from the algorithm, or if a losing candidate receives a score lower than 50. In each repetition, 8 of the 38 candidates are drawn at random to form the test set and the remaining 30 form the training set.
In [9]:
# Rearranging the result of the election
Y_all_init =[]
for keys in id_dict :
    Y_all_init=np.append(Y_all_init,id_dict[keys][1])
orders = np.array(range(1,25)) # variable to test different orders of the AR model
for order in orders: # looping the whole algorithm with different orders for the AR
    i=0
    coeffs = np.zeros([38,order+4]) # initialization of the autoregression coefficients
    for keys in d:
        mean = d[keys]
        # initialization of the AR model with the mean sentiment values of one candidate
        # (note: the slice 0:61 uses the first 61 of the 62 daily means)
        ar_mod = AR(mean[0:61])
        ar_res = ar_mod.fit(maxlag = order,method = 'cmle',ic='aic',trend = 'c',tol = 1e-2) # fitting the AR model
        for n in range (len(ar_res.params)):
            coeffs[i][n] = ar_res.params[n] # assignment of the AR model coefficients, with zero padding if the order is below the maximum
        for m in range (3):
            coeffs[i][m+order+1] = mean[62+m] # appending the extra values
        i += 1
    shapeMean = np.shape(coeffs)
    nbFeature = shapeMean[1] # getting the number of features
    MEANS_all_init_2 = coeffs
    nbRight = 0
    nbAll = 0
    # Number of times we want to try our prediction with a different set of training and test data
    for iteration in range(1000):
        MEANS_all = MEANS_all_init_2
        Y_all = Y_all_init
        # vectors containing the data to test
        Y_predict_final = np.zeros((8,1))
        MEANS_predict_final = np.zeros((8,nbFeature))
        for i in range(8):
            # Randomly choose 8 candidates for the test data
            selected = np.random.randint(0,38-i, 1)
            Y_predict_final[i] = Y_all[selected]
            MEANS_predict_final[i] = MEANS_all[selected]
            # Remove the test data from the full vectors to create the training data
            MEANS_all = np.delete(MEANS_all, selected, 0)
            Y_all = np.delete(Y_all, selected, 0)
        coef_all = np.zeros((nbFeature))
        nbKeep = 0
        # We loop over all the data of the training set
        for i in range(30):
            # Prediction phase - we create the prediction model
            clf = linear_model.LinearRegression(fit_intercept=False)
            # We remove one sample from the training set
            MEANS_fit = np.delete(MEANS_all, i, 0)
            Y_fit = np.delete(Y_all, i, 0)
            # We fit the data of 29 candidates to the model and keep 1 for testing
            clf.fit(MEANS_fit, Y_fit)
            # The prediction for the sample we removed
            predIt = clf.predict(MEANS_all[i].reshape(1, -1))
            # If the prediction is correct, we keep the coefficients of the linear regression (we add them to an array)
            if (Y_all[i] >50 and predIt > 50 ) or (Y_all[i] < 50 and predIt < 50) :
                nbKeep += 1
                coef_all += clf.coef_
        # The average of all the coefficients we want to keep
        coef_all = coef_all/nbKeep
        # The prediction using the average coefficients computed before.
        # We count the number of prediction tries and the number of successful ones.
        for i in range(len(Y_predict_final)):
            nbAll +=1
            pred1 = np.dot(MEANS_predict_final[i], coef_all)
            if (Y_predict_final[i][0] > 50 and pred1 > 50 ) or (Y_predict_final[i][0] < 50 and pred1 < 50) :
                nbRight += 1
    print(str(order) + " " + str(float(nbRight)/float(nbAll)))
In this case we consider the candidates in pairs, meaning that they were opponents in the same state. The features are the concatenation of the features of the two candidates, and the prediction gives the percentage of each candidate with respect to the other. The prediction is successful if the winning candidate is the one with the highest predicted percentage in the pair. In each repetition, 4 of the 19 pairs are drawn at random to form the test set and the remaining 15 form the training set.
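The pairing and the success rule can be sketched in a few lines. The helper names and the numbers are made up for illustration; the comparison itself follows the rule described above.

```python
def make_pair(features_a, features_b, votes_a, votes_b):
    """Concatenate the features of two opponents; the target is both vote shares."""
    return list(features_a) + list(features_b), [votes_a, votes_b]

def pair_correct(actual, predicted):
    """True when the predicted ordering of the two candidates matches the real one."""
    return ((actual[0] > actual[1] and predicted[0] > predicted[1]) or
            (actual[0] < actual[1] and predicted[0] < predicted[1]))

# Made-up feature vectors and vote shares for two opponents
x, y = make_pair([0.2, 0.1], [0.05, 0.3], 54.0, 46.0)
```

Only the ordering inside the pair matters, so a prediction of 51 against 49 counts as correct for a real result of 54 against 46.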
In [10]:
Y_all_init =[]
MEANS_all_init=[]
for i in range(1,20): # creating the set of data by pair of candidates
    first=0
    first_array = []
    first_predict=[]
    for keys in id_dict:
        if id_dict[keys][2]==i :
            if first==0:
                first=1
                first_array = d[keys]
                first_predict = id_dict[keys][1]
            else:
                MEANS_all_init=np.append(MEANS_all_init,np.append(first_array,d[keys]))
                Y_all_init = np.append(Y_all_init,np.append(first_predict,id_dict[keys][1]))
Y_all_init = np.reshape(Y_all_init,[19,2])
newmeans = np.reshape(MEANS_all_init,[38,65]) # list of all candidates individually
orders = np.array(range(1,25)) # variable to test different orders of the AR model
for order in orders:
    coeffs = np.zeros([38,order+4]) # initialisation of the autoregression coefficients
    for i in range(38):
        mean = newmeans[i]
        ar_mod = AR(mean[0:61]) # initialisation of the AR model with the mean sentiment values of 1 candidate
        ar_res = ar_mod.fit(maxlag = order,method = 'cmle',ic='aic',trend = 'c',tol = 1e-2) # fitting of the AR model
        for n in range (len(ar_res.params)):
            coeffs[i][n] = ar_res.params[n] # assignment of the AR model coefficients, with zero padding if the order is below the maximum
        for m in range (3):
            coeffs[i][m+order+1] = mean[62+m] # appending extra values (followers, unique IDs, total tweets)
    AR_coeff = np.reshape(coeffs,[19,2*order+8]) # reshaping coeffs to re-form the pairs of opponents
    # Prediction testing, by pairs, using the AR model
    nbRight = 0 # correct predictions
    nbAll = 0 # total predictions
    for iteration in range(1000):
        MEANS_all = AR_coeff # initialisation
        Y_all = Y_all_init
        Y_predict_final = np.zeros((4,2))
        MEANS_predict_final = np.zeros((4,2*order+8))
        # removing the data of some pairs to use as test set
        for i in range(4):
            selected = np.random.randint(0,19-i, 1)
            Y_predict_final[i] = Y_all[selected]
            MEANS_predict_final[i] = MEANS_all[selected]
            MEANS_all = np.delete(MEANS_all, selected, 0)
            Y_all = np.delete(Y_all, selected, 0)
        coef_all = np.zeros((2,2*order+8))
        nbKeep = 0
        # Training the linear regression model using the leave-one-out technique
        for i in range(15):
            # Prediction phase - we create the prediction model
            clf = linear_model.LinearRegression(fit_intercept=False)
            MEANS_fit = np.delete(MEANS_all, i, 0)
            Y_fit = np.delete(Y_all, i, 0)
            # We fit the data of 14 pairs to the model and keep 1 for testing
            clf.fit(MEANS_fit, Y_fit)
            predIt = clf.predict(MEANS_all[i].reshape(1, -1))
            if (Y_all[i][0] > Y_all[i][1] and predIt[0][0] > predIt[0][1] ) or (Y_all[i][0] < Y_all[i][1] and predIt[0][0] < predIt[0][1]) :
                nbKeep += 1
                coef_all += clf.coef_ # keeping only coefficients that result in a correct prediction
        # averaging the coefficients that gave a correct prediction
        coef_all = coef_all/nbKeep
        # testing the linear regression model on the test set data
        for i in range(len(Y_predict_final)):
            nbAll +=1
            pred1 = np.dot(MEANS_predict_final[i],coef_all[0])
            pred2 = np.dot(MEANS_predict_final[i], coef_all[1])
            if (Y_predict_final[i][0] > Y_predict_final[i][1] and pred1 > pred2 ) or (Y_predict_final[i][0] < Y_predict_final[i][1] and pred1 < pred2) :
                nbRight += 1
    print(str(order) + " " + str(float(nbRight)/float(nbAll)))
Sentiment analysis, like any other natural language processing task, is hard to perform and does not give extremely accurate results. Performing it on tweets is tricky because many slang words are used and the analysis often does not output anything. Moreover, tweets like "I hate how @Trump denies the work of @Obama" and "I hate how @Obama denies the work of @Trump" get the same sentiment score, although their real meanings are opposite. If a tweet is perceived as negative, it does not mean that the negativity is directed at the topic of the tweet itself. We were aware of this problem, but thought that it would be interesting to experiment with this tool anyway.
Another issue that we encountered was that the number of tweets differed a lot depending on the popularity of the candidate. However, since we looked at major elections in a large country, we were able to gather a fairly large dataset even though we did not consider every candidate, since some of them were not active at all on Twitter.
With the right tuning, both algorithms reach an accuracy of 65%, which means that there is only a weak correlation between the Twitter data and the result of the election. This poor result has multiple explanations. One of them is the imprecision of the sentiment analysis. Another is that the Twitter population is not a good sample of the voting population. Moreover, tweets do not always represent the point of view of their author; sometimes they are just meant to be provocative. Finally, the authors of the tweets may not have the right to vote in the election they are talking about. All these reasons, combined with a relatively small dataset, contributed to the result obtained.
To improve this work further, a bigger dataset would help to train the machine learning algorithm better. A different sentiment analysis tool could also be tried.