Predicting an Election from Tweets

Michaël Juillard, Mikhail Vorobiev, Chiara Ercolani

1 Problem Definition

Social media platforms such as Twitter and Facebook have become channels through which people continuously express their opinions on almost any matter. Data mining combined with data analysis could therefore be a good way to understand how users feel about a given topic. With this in mind, we decided to try to predict the result of an election using data from Twitter.

We focused on the 2016 US Senate elections, since they would provide enough meaningful data for our analysis: there are many candidates and many people tweet about them, which gives us a large dataset to train our algorithm on. We started by mining data from Twitter, collecting every tweet about most of the Republican and Democratic Senate candidates during the two months preceding the elections. We collected tweets aimed at each candidate, meaning tweets containing the candidate's Twitter username.

The second step was to perform a sentiment analysis on these tweets to understand whether the user was expressing a positive or a negative feeling towards the candidate. Sentiment analysis combines natural language processing, text analysis and computational linguistics to assess the attitude of a writer towards a topic. There are various tools for it; we picked one called Pattern, developed by the University of Antwerp in Belgium.

Afterwards, we trained two variations of the same machine learning algorithm on the sentiment analysis data and some other features, such as the candidate's number of followers.

2 Resources

For the project we used a variety of tools. Here are the links to their websites.

Links used for Twitter data mining

Tool used for the sentiment analysis

Sentiment Analysis Tool

Papers about election prediction with tweets

3 Web scraping

We started by finding the list of candidates for the 2016 United States Senate elections and looking up their Twitter accounts. We decided to only use Democratic and Republican candidates, as candidates from other parties rarely win Senate seats in the US.

The Twitter APIs only give access to tweets from the past week, so we used a tool to bypass this limitation and collect data from the 8th of September 2016 to the 8th of November 2016, the day of the elections.

The tool is a Python script found on GitHub (link in the Resources section). It enabled us to get all the tweets in a desired time frame according to some custom parameters. For example, to mine tweets about Evan Bayh from the 8th of September 2016 to the 8th of November 2016 we simply ran the following command:

python Exporter.py --querysearch "SenEvanBayh" --since 2016-09-08 --until 2016-11-08

The output of this step is kept in the data folder of our project in .csv format. Each file contains the date of the tweet, the username of its author, the tweet text, the number of likes and retweets for the tweet and some other information.
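As a rough illustration of how such a file can be read, here is a minimal sketch using only the standard library. The column order (username, date, retweets, favorites, text) is inferred from the column indices used by the loading code later in this notebook, and the two sample rows and the `read_tweets` helper are made up for the example.

```python
import csv
import io

# Hypothetical two-row sample; the column order is an assumption inferred
# from the indices used by the loading code in this notebook.
SAMPLE = (
    "username;date;retweets;favorites;text\n"
    "alice;2016-11-08 10:15;3;5;Go vote for @SomeCandidate today\n"
    "bob;2016-11-07 21:02;0;1;Not convinced by @SomeCandidate #Senate\n"
)

def read_tweets(handle):
    """Return one dict per tweet from a semicolon-delimited export."""
    reader = csv.reader(handle, delimiter=";")
    next(reader)  # skip the header row
    tweets = []
    for row in reader:
        tweets.append({
            "username": row[0],
            "date": row[1],
            "retweets": int(row[2]),
            "favorites": int(row[3]),
            "text": row[4],
        })
    return tweets

tweets = read_tweets(io.StringIO(SAMPLE))
print(len(tweets), tweets[0]["username"], tweets[1]["favorites"])
```

A tweet whose text itself contains a semicolon would be split into extra columns. This sketch ignores that case.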

4 Data Analysis

This section presents the analysis performed on the data to make it usable for the machine learning part.



In [2]:

    
# Imported libraries
from pattern.en import sentiment
import numpy as np 
import os
import csv
from sklearn import linear_model
import time
from statsmodels.tsa.ar_model import AR

4.1 Sentiment Analysis

Sentiment analysis is used to measure the polarity of a text, that is, whether the text leans towards a negative or a positive attitude regarding its topic.

Pattern, used to perform sentiment analysis for this project, is a natural language processing toolkit. In particular, we used pattern.en, which was made for the English language.

The sentiment() function returns a (polarity, subjectivity) tuple for the given sentence, based on the adjectives it contains, where polarity is a value between -1.0 and +1.0 and subjectivity is a value between 0.0 and 1.0.

Before performing the sentiment analysis, we replaced the hashtag (#) and @ symbols with spaces, so that a meaningful word following one of these symbols could still be analysed.
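The cleaning step can be sketched on a single string as follows. The `strip_markers` helper and the sample tweet are hypothetical, and the sketch does not call Pattern itself.

```python
def strip_markers(text):
    """Replace '#' and '@' with spaces so the word that follows is kept as plain text."""
    return text.replace("#", " ").replace("@", " ")

# Made-up tweet, for illustration only.
cleaned = strip_markers("Great speech by @SomeCandidate #proud")
print(cleaned)
```

Replacing the symbols with a space rather than deleting them keeps words separated even when a tweet glues a hashtag to the previous word.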

Important note: this sentiment analysis tool requires Python 2.



In [3]:

    
# Use all the .csv files inside the data folder
for file in os.listdir("./data"):
    if file.endswith(".csv"):
        path = r'data/'+file
        print(file)
        var = os.path.basename(path)
        var= str.split(var,'_')
        var=str.split(var[1],'.')
        sentiment_path = 'sentim/sentiment_'+var[0]+'.csv'

        # read the text, the date and the number of retweets and favourites from the data files
        text=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        retweet=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(2,))
        favourites=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(3,))

        # remove hashtags and @ from the tweets 
        text=np.core.defchararray.replace(text,'#',' ')
        text=np.core.defchararray.replace(text,'@',' ')

        array_sentiment=np.zeros((len(text),2))


        # perform sentiment analysis; when the polarity is nonzero but negligibly small,
        # reset the whole row (polarity and subjectivity) to zero
        for i in range(len(text)):
            array_sentiment[i] = sentiment(text[i])
            if (0.0 < abs(array_sentiment[i][0]) < 0.0000000000000001):
                array_sentiment[i]=0
                

        # save sentiment analysis to output file
        np.savetxt(sentiment_path,np.transpose([date,array_sentiment[:,0],array_sentiment[:,1],retweet,favourites]),fmt="%s;%s;%s;%s;%s",delimiter=';',header="date;sentiment;subjectivity;retweets;favourites")









    



output_gotCampbellforLa.csv






    



pattern/text/__init__.py:1943: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if w in imap(lambda e: e.lower(), e):
pattern/text/__init__.py:979: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  and tokens[j] in ("'", "\"", u"”", u"’", "...", ".", "!", "?", ")", EOS):






    



output_gotCatherineForNV.csv
output_gotChrisvance123.csv
output_gotChrisVanHollen.csv
output_gotChuckGrassley.csv
output_gotDanCarterCT.csv
output_gotGovernorHassan.csv
output_gotJasonKander.csv
output_gotJerryMoran.csv
output_gotjimbarksdale.csv
output_gotJimGrayLexKY.csv
output_gotJohnKennedyLA.csv
output_gotKamalaHarris.csv
output_gotKathyforMD.csv
output_gotKellyAyotte.csv
output_gotLorettaSanchez.csv
output_gotmarcorubio.csv
output_gotMikeCrapo.csv
output_gotPatrickMurphyFL.csv
output_gotpattyforiowa.csv
output_gotPattyMurray.csv
output_gotRandPaul.csv
output_gotRepJoeHeck.csv
output_gotRepKirkpatrick.csv
output_gotRoyBlunt.csv
output_gotSenatorIsakson.csv
output_gotSenatorKirk.csv
output_gotSenBlumenthal.csv
output_gotSenEvanBayh.csv
output_gotSenJohnMcCain.csv
output_gotsenrobportman.csv
output_gotSenSchumer.csv
output_gotSturgill4Idaho.csv
output_gotTammyforIL.csv
output_gotTedStrickland.csv
output_gotToddYoungIN.csv
output_gotWendyLongNY.csv
output_gotwiesner4senate.csv

4.2 Unique User Identification

We decided to identify the number of unique users who tweeted about a certain candidate and use this number as a feature for our machine learning algorithm.
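The counting itself is simple. A minimal sketch on a made-up list of author usernames (the `author_stats` helper is hypothetical) could look like this:

```python
from collections import Counter

def author_stats(usernames):
    """Return (total tweets, number of distinct authors) for one candidate."""
    counts = Counter(usernames)
    return len(usernames), len(counts)

# Hypothetical author column for the tweets about one candidate.
authors = ["alice", "bob", "alice", "carol", "bob", "alice"]
total, unique = author_stats(authors)
print(total, unique)  # 6 3
```

The ratio between the two numbers gives a rough idea of whether a few very active accounts dominate the tweets about a candidate.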



In [4]:

    
# Use all the .csv files inside the data folder
for file in os.listdir("./data"):
    if file.endswith(".csv"):
        cand = r'data/'+file
        mydata = np.loadtxt(cand, dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        # Print the number of tweets about one candidate and the number of unique authors of those tweets
        print(file + " " + str(len(mydata)) + " " + str(len(np.unique(mydata))))









    



output_gotCampbellforLa.csv 904 230
output_gotCatherineForNV.csv 6595 2145
output_gotChrisvance123.csv 723 252
output_gotChrisVanHollen.csv 719 518
output_gotChuckGrassley.csv 7642 2839
output_gotDanCarterCT.csv 320 128
output_gotGovernorHassan.csv 1296 672
output_gotJasonKander.csv 6148 2573
output_gotJerryMoran.csv 770 483
output_gotjimbarksdale.csv 1058 354
output_gotJimGrayLexKY.csv 505 306
output_gotJohnKennedyLA.csv 722 259
output_gotKamalaHarris.csv 5928 3200
output_gotKathyforMD.csv 960 411
output_gotKellyAyotte.csv 25556 11030
output_gotLorettaSanchez.csv 1333 885
output_gotmarcorubio.csv 73326 27106
output_gotMikeCrapo.csv 3028 2105
output_gotPatrickMurphyFL.csv 21705 5978
output_gotpattyforiowa.csv 1596 540
output_gotPattyMurray.csv 5059 1694
output_gotRandPaul.csv 26216 13883
output_gotRepJoeHeck.csv 2547 1471
output_gotRepKirkpatrick.csv 968 534
output_gotRoyBlunt.csv 2309 1202
output_gotSenatorIsakson.csv 1626 834
output_gotSenatorKirk.csv 6537 4546
output_gotSenBlumenthal.csv 1568 911
output_gotSenEvanBayh.csv 459 351
output_gotSenJohnMcCain.csv 40786 21604
output_gotsenrobportman.csv 3770 1567
output_gotSenSchumer.csv 4953 2751
output_gotSturgill4Idaho.csv 148 79
output_gotTammyforIL.csv 4584 2853
output_gotTedStrickland.csv 6301 2566
output_gotToddYoungIN.csv 2959 1088
output_gotWendyLongNY.csv 1872 1007
output_gotwiesner4senate.csv 23 13

4.3 Mean

We computed a daily mean of the polarity values given by the sentiment analysis of the tweets. The mean is weighted by the number of likes and retweets that each tweet received: we assumed that likes and retweets mean that people agree with the content of the tweet, so such tweets deserve a higher weight.
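The weighting can be illustrated on a few made-up tweets. In this sketch, which is not the notebook's own implementation, each tweet is a hypothetical (day, polarity, retweets, favorites) tuple. As in the notebook code, a tweet without any reaction gets a weight of 1 and tweets with a polarity of exactly zero are left out.

```python
from collections import defaultdict

def weighted_daily_means(tweets):
    """tweets: iterable of (day, polarity, retweets, favorites) tuples."""
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for day, polarity, retweets, favorites in tweets:
        if polarity == 0.0:
            continue  # neutral tweets are skipped
        weight = retweets + favorites
        if weight == 0:
            weight = 1  # a tweet with no reactions still counts once
        weighted_sum[day] += weight * polarity
        total_weight[day] += weight
    return {day: weighted_sum[day] / total_weight[day] for day in weighted_sum}

# Made-up sample data.
sample = [
    ("2016-11-08", 0.5, 3, 1),    # weight 4
    ("2016-11-08", -0.25, 0, 0),  # weight 1
    ("2016-11-08", 0.0, 9, 9),    # skipped: neutral
    ("2016-11-07", 1.0, 0, 2),    # weight 2
]
means = weighted_daily_means(sample)
print(means)
```

For the first day the result is (4 * 0.5 + 1 * (-0.25)) / 5 = 0.35: the popular positive tweet outweighs the negative one that nobody reacted to.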



In [5]:

    
# Use all the .csv files inside the sentim folder
for file in os.listdir("./sentim"):
    if file.endswith(".csv"):
        path = r'sentim/'+file
        print(file)
        var = os.path.basename(path)
        var= str.split(var,'_')
        var=str.split(var[1],'.')
        mean_path = 'means/mean_'+var[0]+'.csv'
        
        #read sentiment, date and number of retweets and favourites from the sentiment analysis files
        sentim=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(1,))
        date=np.loadtxt(path,dtype=str, comments='++++',delimiter=';',skiprows=1,usecols=(0,))
        retweet=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(3,))
        favourites=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(4,))

        array_length=62
        mean=[0]*array_length


        #create day and month arrays
        array_day=np.append(np.arange(8,0,-1),(np.append(np.arange(31,0,-1),np.arange(30,7,-1))))
        array_month=np.append(np.ones(8)*11,(np.append(np.ones(31)*10,np.ones(23)*9)))

        # parse the day and the month of each tweet and store them in arrays
        day =np.zeros(len(date))
        month=np.zeros(len(date))
        for i in range(len(date)):
            day[i]=np.datetime64(date[i]).astype(object).day
            month[i]=np.datetime64(date[i]).astype(object).month


        #create array of the weights (based on retweets and favourites)
        array_weight = np.zeros(len(sentim))
        for i in range(len(retweet)):
            if(retweet[i]!=0.0 or favourites[i]!=0.0):
                array_weight[i]=retweet[i]+favourites[i]
            else:
                array_weight[i]= 1

        #compute the mean
        cnt_date=0
        cnt_mean=[0]*array_length

        # the loop assumes that the tweets are in reverse chronological order,
        # like array_day/array_month, so cnt_date only moves forward
        for i in range(len(sentim)):
            if (sentim[i]!=0.0):
                # advance to the slot of this tweet's date (stops at the last slot if the date is not found)
                while ((array_day[cnt_date]!=day[i] or array_month[cnt_date]!=month[i]) and cnt_date<len(array_day)-1):
                    cnt_date=cnt_date + 1

                # accumulate the weighted polarity and the total weight for that day
                mean[cnt_date]=mean[cnt_date]+array_weight[i]*sentim[i]
                cnt_mean[cnt_date]=cnt_mean[cnt_date]+array_weight[i]



        weighted_mean=[None]*array_length

        # divide each daily weighted sum by its total weight (days with a zero sum get 0)
        for i in range(len(mean)):
            if (mean[i]!=0.0):
                weighted_mean[i]=mean[i]/cnt_mean[i]
            else:
                weighted_mean[i]=0
                
        # save output on file
        np.savetxt(mean_path,np.transpose([array_day,array_month,mean,cnt_mean,weighted_mean]),fmt="%d;%d;%s;%s;%s",delimiter=';',header="day;month;sum;weight;mean")









    



sentiment_gotCampbellforLa.csv
sentiment_gotCatherineForNV.csv
sentiment_gotChrisvance123.csv
sentiment_gotChrisVanHollen.csv
sentiment_gotChuckGrassley.csv
sentiment_gotDanCarterCT.csv
sentiment_gotGovernorHassan.csv
sentiment_gotJasonKander.csv
sentiment_gotJerryMoran.csv
sentiment_gotjimbarksdale.csv
sentiment_gotJimGrayLexKY.csv
sentiment_gotJohnKennedyLA.csv
sentiment_gotKamalaHarris.csv
sentiment_gotKathyforMD.csv
sentiment_gotKellyAyotte.csv
sentiment_gotLorettaSanchez.csv
sentiment_gotmarcorubio.csv
sentiment_gotMikeCrapo.csv
sentiment_gotPatrickMurphyFL.csv
sentiment_gotpattyforiowa.csv
sentiment_gotPattyMurray.csv
sentiment_gotRandPaul.csv
sentiment_gotRepJoeHeck.csv
sentiment_gotRepKirkpatrick.csv
sentiment_gotRoyBlunt.csv
sentiment_gotSenatorIsakson.csv
sentiment_gotSenatorKirk.csv
sentiment_gotSenBlumenthal.csv
sentiment_gotSenEvanBayh.csv
sentiment_gotSenJohnMcCain.csv
sentiment_gotsenrobportman.csv
sentiment_gotSenSchumer.csv
sentiment_gotSturgill4Idaho.csv
sentiment_gotTammyforIL.csv
sentiment_gotTedStrickland.csv
sentiment_gotToddYoungIN.csv
sentiment_gotWendyLongNY.csv
sentiment_gotwiesner4senate.csv

After creating files containing the daily mean values, we built the data structures needed by the machine learning algorithm. In addition to the previous features, we collected by hand the number of followers of every candidate. For the structure, we used dictionaries keyed by the Twitter username of the candidate.

d, name_dict and id_dict are dictionaries that share the same keys. For each key, values 0 to 61 of d are the daily means, value 62 is the number of followers, value 63 is the number of unique author ids and value 64 is the total number of tweets. For each key, value 0 of id_dict is the election result (1 for a win, 0 for a loss), value 1 is the percentage of votes that the candidate received and value 2 is the identifier that pairs up candidates running in the same election. For each key, value 0 of name_dict is the first name of the candidate and value 1 is the last name.
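To make the layout concrete, here is a small sketch of one entry in each dictionary. The key "ExampleCandidate" and all the numbers are made up, and plain lists are used instead of NumPy arrays. Only the positions follow the description above.

```python
# Hypothetical values for one candidate; only the layout follows the text.
daily_means = [0.0] * 62  # indices 0..61: daily weighted means
followers, unique_ids, total_tweets = 1000.0, 350.0, 460.0

d = {"ExampleCandidate": daily_means + [followers, unique_ids, total_tweets]}
id_dict = {"ExampleCandidate": [1.0, 55.0, 3.0]}   # result, vote percentage, pair identifier
name_dict = {"ExampleCandidate": ["Jane", "Doe"]}  # first name, last name

features = d["ExampleCandidate"]
print(len(features), features[62], features[63], features[64])
```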



In [6]:

    
d={}
id_dict={}
name_dict={}

# Use all the .csv files inside the means folder
for file in os.listdir("./means"):
    if file.endswith(".csv"):
        path = r'means/'+file
        var = os.path.basename(path)
        var= str.split(var,'_got')
        var=str.split(var[1],'.')
        key = var[0]
        d.setdefault(key,[])
        d[key]=np.loadtxt(path, comments='++++',delimiter=';',skiprows=1,usecols=(4,))
        id_dict.setdefault(key,[])
        name_dict.setdefault(key,[])

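# Append the values collected by hand (listDeputee.csv) to the entry of each candidate
with open('listDeputee.csv', 'rb') as csvfile:  # 'rb' mode is the Python 2 convention for csv files
    reader = csv.reader(csvfile, delimiter=';', quotechar=';')
    for row in reader:
        if (row[0]!='Id'):  # skip the header row
            # columns 5 to 7: followers, unique ids, total tweets
            d[row[0]]= np.append(d[row[0]],[float(row[5]),float(row[6]),float(row[7])])
            # columns 3, 4 and 8: election result, percentage of votes, pair identifier
            id_dict[row[0]]=np.append(id_dict[row[0]],[float(row[3]),float(row[4]),float(row[8])])
            # columns 1 and 2: first name and last name
            name_dict[row[0]]=np.append(name_dict[row[0]],[row[1],row[2],])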

5 Machine Learning Algorithm

The first step of our machine learning algorithm is to fit the 62 daily mean values with an autoregressive (AR) model. The goal of this step is to reduce the number of features used for the prediction, because we have too many of them compared to the amount of data available. We tried multiple orders for the AR fit and chose the one that gives the best prediction result.
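To illustrate how an AR fit compresses a time series into a few coefficients, here is a minimal pure-Python sketch for order 1 only. It is a simplified stand-in, not the statsmodels routine used in the notebook: it fits x[t] = c + phi * x[t-1] by ordinary least squares, and the `fit_ar1` helper and the synthetic series are assumptions made for the example.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = cov_xy / var_x
    c = mean_y - phi * mean_x
    return c, phi

# Synthetic noise-free series generated with c = 0.1 and phi = 0.5.
series = [1.0]
for _ in range(20):
    series.append(0.1 + 0.5 * series[-1])

c, phi = fit_ar1(series)
print(round(c, 6), round(phi, 6))
```

With an order-p model, a series of 62 values is summarised by p + 1 numbers (the constant and p lag coefficients), which is the feature reduction described above.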

The second step is the prediction itself. First we concatenate the AR model coefficients with the extra features: the number of followers, the number of unique authors of tweets and the total number of tweets. The idea of using these extra features came from some papers we read about election prediction. Then we fit a linear regression model between the features and the actual percentage of votes that the candidate received.

The data set was divided into a training set and a test set. The split differs between the two algorithms and is described for each of them individually.

For the training phase, we used a leave-one-out procedure to train the coefficients: each time, we train the coefficients on the training set minus one sample and test them on the sample that was left out. If the prediction is correct, we keep the coefficients. At the end we average all the coefficients that were kept. After the model is trained, we apply it to the test set and count the number of successful predictions and the total number of tries. What counts as a successful prediction depends on the algorithm.
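The leave-one-out step with coefficient averaging can be sketched as follows. To stay self-contained, the sketch uses a single feature and a no-intercept least-squares fit instead of scikit-learn, and the success criterion is passed in as a function. All names and data are hypothetical.

```python
def fit_slope(xs, ys):
    """No-intercept least squares for one feature: y ~ coef * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def leave_one_out_average(xs, ys, is_correct):
    """Average the coefficients of the folds whose held-out prediction was correct."""
    kept = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        coef = fit_slope(train_x, train_y)
        if is_correct(ys[i], coef * xs[i]):
            kept.append(coef)
    if not kept:
        return None  # no fold predicted its held-out sample correctly
    return sum(kept) / len(kept)

# Hypothetical criterion: actual and predicted values on the same side of 50.
def same_side_of_50(actual, predicted):
    return (actual > 50 and predicted > 50) or (actual < 50 and predicted < 50)

# Made-up single feature and vote percentages.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [20.0, 40.0, 60.0, 80.0]
coef = leave_one_out_average(xs, ys, same_side_of_50)
print(coef)
```

Returning None when no fold is kept avoids a division by zero. Whether to retry with another split or to fall back to the plain fit in that case is a design choice this sketch leaves open.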

The process described above is repeated multiple times, and each time the division between the training and the test sets is done randomly. After all the repetitions, we estimate the accuracy by dividing the total number of successful predictions by the total number of tries.
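The repetition scheme can be sketched independently of the model. Below, `evaluate` is a hypothetical callback that trains on the training indices and returns the number of correct predictions on the test indices. The toy version used here only serves to make the example runnable. The sizes (38 samples, 8 of them for testing) are the ones used by Algorithm 1.

```python
import random

def split_indices(n_samples, n_test, rng):
    """Randomly pick n_test indices for testing; the rest form the training set."""
    test = set(rng.sample(range(n_samples), n_test))
    train = [i for i in range(n_samples) if i not in test]
    return train, sorted(test)

def estimate_accuracy(n_samples, n_test, n_repeats, evaluate, rng):
    """Pool the results over all repetitions: total correct / total tries."""
    n_right = 0
    n_all = 0
    for _ in range(n_repeats):
        train, test = split_indices(n_samples, n_test, rng)
        n_right += evaluate(train, test)
        n_all += len(test)
    return n_right / n_all

# Hypothetical evaluator: pretends that even-numbered test samples are predicted correctly.
def toy_evaluate(train, test):
    return sum(1 for i in test if i % 2 == 0)

rng = random.Random(0)
train, test = split_indices(38, 8, rng)
accuracy = estimate_accuracy(38, 8, 100, toy_evaluate, rng)
print(len(train), len(test), accuracy)
```

Pooling the counts over all repetitions, rather than averaging per-repetition accuracies, matches the estimate described above. The two coincide when every test set has the same size.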

5.1 Algorithm 1

In this algorithm, we consider every candidate individually and the goal is to predict whether they won or lost. To do this we follow the procedure described above, with the output of the linear regression being the percentage of votes received by the candidate. The prediction is considered correct if a winning candidate receives a score higher than 50 from the algorithm, or if a losing candidate receives a score lower than 50.
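The success criterion can be written as a small function. This is a sketch that mirrors the comparison made in the cell below on the actual and predicted percentages. The function name and the sample values are made up.

```python
def prediction_is_correct(actual_pct, predicted_pct, threshold=50.0):
    """Algorithm 1 criterion: both percentages on the same side of the threshold."""
    both_win = actual_pct > threshold and predicted_pct > threshold
    both_lose = actual_pct < threshold and predicted_pct < threshold
    return both_win or both_lose

print(prediction_is_correct(54.3, 61.0))  # True: winner predicted above 50
print(prediction_is_correct(45.0, 52.5))  # False: loser predicted above 50
print(prediction_is_correct(50.0, 50.0))  # False: exactly 50 matches neither case
```

Because the comparison is made on the vote percentage, a candidate who won with less than 50% of the votes would be counted as a loser by this rule.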



In [9]:

    
# Rearranging the result of the election
Y_all_init =[]
for keys in id_dict :
    Y_all_init=np.append(Y_all_init,id_dict[keys][1])


orders = np.array(range(1,25)) #variable to test different order of the AR model
for order in orders:           # looping the whole algorithm with different orders for the AR
    i=0
    coeffs = np.zeros([38,order+4]) #initialization of autoregression coefficients
    for keys in d:    
        mean = d[keys]
        ar_mod = AR(mean[0:61])  # initialization of the AR model with the mean sentiment values of one candidate (the slice covers the first 61 daily values)
        ar_res = ar_mod.fit(maxlag = order,method = 'cmle',ic='aic',trend = 'c',tol = 1e-2) #fitting the AR model
        for n in range (len(ar_res.params)):
            coeffs[i][n] = ar_res.params[n]  # store the AR coefficients; the remaining columns stay zero when the selected order is below the maximum
        for m in range (3):
            coeffs[i][m+order+1] = mean[62+m] # appending extra values
        i += 1
    shapeMean = np.shape(coeffs)
    nbFeature = shapeMean[1]     #getting the number of features 
    MEANS_all_init_2 = coeffs
    nbRight = 0
    nbAll = 0

    # Number of times we repeat the prediction with a different random split into training and test data.
    for iteration in range(1000):
        MEANS_all = MEANS_all_init_2
        Y_all = Y_all_init

        # arrays containing the test data
        Y_predict_final = np.zeros((8,1))
        MEANS_predict_final = np.zeros((8,nbFeature))
        for i in range(8):
            # Randomly choose 8 candidates for the test data
            selected = np.random.randint(0,38-i, 1)
            Y_predict_final[i] = Y_all[selected]
            MEANS_predict_final[i] = MEANS_all[selected]

            # Remove the test data from the full arrays to create the training data
            MEANS_all = np.delete(MEANS_all, selected, 0)
            Y_all = np.delete(Y_all, selected, 0)


        coef_all = np.zeros((nbFeature))
        nbKeep = 0
        # We loop over all the samples of the training set
        for i in range(30):
            # Prediction phase - we create the prediction model
            clf = linear_model.LinearRegression(fit_intercept=False)

            # We remove one sample from the training set
            MEANS_fit = np.delete(MEANS_all, i, 0)
            Y_fit = np.delete(Y_all, i, 0)

            # We fit the model on the data of 29 candidates and keep 1 for testing.
            clf.fit(MEANS_fit, Y_fit)

            # The prediction for the sample we removed
            predIt = clf.predict(MEANS_all[i].reshape(1, -1))
            # If the prediction is correct, we keep the coefficients of the linear regression (we add them to an accumulator)
            if (Y_all[i] >50 and predIt > 50 ) or (Y_all[i] < 50 and predIt < 50) :
                nbKeep += 1
                coef_all += clf.coef_


        # The average of all the coefficients we kept
        coef_all = coef_all/nbKeep

        # The prediction using the average coefficients computed above.
        # We count the number of prediction tries and the number of successful ones.
        for i in range(len(Y_predict_final)):
            nbAll +=1
            pred1 = np.dot(MEANS_predict_final[i], coef_all)
            if (Y_predict_final[i][0] > 50 and pred1 > 50 ) or (Y_predict_final[i][0] < 50 and pred1 < 50) :
                nbRight += 1

    print(str(order) + " " + str(float(nbRight)/float(nbAll)))









    



1 0.586875
2 0.585125
3 0.572625
4 0.565125
5 0.56725
6 0.554625
7 0.565
8 0.541
9 0.533875
10 0.58725
11 0.588625
12 0.584125
13 0.58825
14 0.579625
15 0.579875
16 0.61825
17 0.6335
18 0.626625
19 0.645625
20 0.6145
21 0.594
22 0.639125
23 0.618
24 0.615875

5.2 Algorithm 2

In this case we consider the candidates in pairs (two candidates who were opponents in the same state). The features are the concatenation of the features of both candidates, and the prediction should give the vote percentage of each of the two candidates. The prediction is successful if the winning candidate is the one with the higher predicted percentage in the pair.
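For pairs, the criterion only looks at the ordering inside each pair. A minimal sketch, with a hypothetical function name and made-up percentages:

```python
def pair_prediction_is_correct(actual_pair, predicted_pair):
    """Algorithm 2 criterion: the predicted ordering of the pair matches the real one."""
    a_first, a_second = actual_pair
    p_first, p_second = predicted_pair
    first_wins = a_first > a_second and p_first > p_second
    second_wins = a_first < a_second and p_first < p_second
    return first_wins or second_wins

print(pair_prediction_is_correct((54.0, 44.0), (48.0, 40.0)))  # True
print(pair_prediction_is_correct((54.0, 44.0), (30.0, 35.0)))  # False
```

The first call counts as a success even though the predicted 48 is below 50: unlike in Algorithm 1, only the relative order of the two predictions matters.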



In [10]:

    
Y_all_init =[]

MEANS_all_init=[]

for i in range(1,20):      #creating the set of data by pair of candidates
    first=0
    first_array = []
    first_predict=[]
    for keys in id_dict:
        if id_dict[keys][2]==i :
            if first==0:
                first=1
                first_array = d[keys]  
                first_predict = id_dict[keys][1]
            else:
                MEANS_all_init=np.append(MEANS_all_init,np.append(first_array,d[keys]))
                Y_all_init = np.append(Y_all_init,np.append(first_predict,id_dict[keys][1]))

Y_all_init = np.reshape(Y_all_init,[19,2])


newmeans = np.reshape(MEANS_all_init,[38,65]) #list of all candidates individually
orders = np.array(range(1,25)) #variable to test different order of the AR model
for order in orders:
    coeffs = np.zeros([38,order+4]) #initialisation of autoregression coefficients
    for i in range(38):    
        mean = newmeans[i]
        ar_mod = AR(mean[0:61]) # initialisation of the AR model with the mean sentiment values of 1 candidate (the slice covers the first 61 daily values)
        ar_res = ar_mod.fit(maxlag = order,method = 'cmle',ic='aic',trend = 'c',tol = 1e-2) #fitting of the AR model
        for n in range (len(ar_res.params)):
            coeffs[i][n] = ar_res.params[n] #assignment of the AR model coefficients, with 0 padding in case not max order
        for m in range (3):
            coeffs[i][m+order+1] = mean[62+m] #appending extra values (followers, unique IDs, total tweets)
    AR_coeff = np.reshape(coeffs,[19,2*order+8]) #reshaping coeff to reform pairs win/lose
    
    
    #Prediction testing, By pairs, Using AR model

    nbRight = 0 # correct predictions
    nbAll = 0 # total predictions
    for iteration in range(1000):
        MEANS_all = AR_coeff #initialisation
        Y_all = Y_all_init

        Y_predict_final = np.zeros((4,2))
        MEANS_predict_final = np.zeros((4,2*order+8))
        
        #removing data of some candidates to use as test set
        for i in range(4):
            selected = np.random.randint(0,19-i, 1)
            Y_predict_final[i] = Y_all[selected]
            MEANS_predict_final[i] = MEANS_all[selected]

            MEANS_all = np.delete(MEANS_all, selected, 0)
            Y_all = np.delete(Y_all, selected, 0)



        coef_all = np.zeros((2,2*order+8))
        nbKeep = 0
        
        #Training linear regression model using leave-one-out technique
        for i in range(15):
            #Prediction phase - We create the prediction model
            clf = linear_model.LinearRegression(fit_intercept=False)

            MEANS_fit = np.delete(MEANS_all, i, 0)
            Y_fit = np.delete(Y_all, i, 0)

            # We fit the model on the data of 14 pairs of candidates and keep 1 pair for testing.
            clf.fit(MEANS_fit, Y_fit)

            predIt = clf.predict(MEANS_all[i].reshape(1, -1))

            if (Y_all[i][0] > Y_all[i][1] and predIt[0][0] > predIt[0][1] ) or (Y_all[i][0] < Y_all[i][1] and predIt[0][0] < predIt[0][1]) :
                nbKeep += 1
                coef_all += clf.coef_ #keeping only coefficients that result in a correct prediction

        #averaging the coefficients that gave a correct prediction
        coef_all = coef_all/nbKeep
        
        #testing the linear regression model on test set data
        for i in range(len(Y_predict_final)):
            nbAll +=1
            pred1 = np.dot(MEANS_predict_final[i],coef_all[0])
            pred2 = np.dot(MEANS_predict_final[i], coef_all[1])
            if (Y_predict_final[i][0] > Y_predict_final[i][1] and pred1 > pred2 ) or (Y_predict_final[i][0] < Y_predict_final[i][1] and pred1 < pred2) :
                nbRight += 1

    print(str(order) + " " + str(float(nbRight)/float(nbAll)))

6 Conclusions

Sentiment analysis, like any other natural language processing task, is hard to perform and does not give extremely accurate results. Performing it on tweets is tricky because they contain a lot of slang, and the analysis often does not output anything. Moreover, tweets like "I hate how @Trump denies the work of @Obama" and "I hate how @Obama denies the work of @Trump" get the same sentiment score although their meanings are opposite. If a tweet is perceived as negative, it does not mean that the negativity is directed at the topic of the tweet itself. We were aware of this problem, but thought that it would be interesting to experiment with this tool anyway.

Another issue that we encountered was that the number of tweets differed a lot depending on the popularity of the candidate, from 23 tweets for the least mentioned candidate to 73,326 for the most mentioned one. However, since we looked at major elections in a large country, we were able to build a fairly large dataset even though we did not consider every candidate, since some of them were not active on Twitter at all.

With the right tuning, both algorithms reach an accuracy of about 65%, which means that there is only a weak correlation between the Twitter data and the result of the election. This poor result has multiple explanations. One of them is the imprecision of the sentiment analysis. Another is that the Twitter population is not a representative sample of the voters. Moreover, tweets do not always represent the point of view of their author; sometimes they are just meant to be provocative. Finally, the authors of tweets may not have the right to vote in the election they are talking about. All these reasons, combined with a relatively small dataset, contributed to the obtained result.

To improve this work, a bigger dataset would help to train the machine learning algorithm better. Moreover, a different sentiment analysis tool could be tried.


