As promised in this post, I am back to predict all of the 2016 congressional races using the tweets from the month leading up to the election and a model trained on the races from the 2014 midterms.
Refresher: I used a script that queries Twitter's advanced search feature for all tweets containing each candidate's full name. I saved all such tweets written in the month leading up to the 2014 midterm elections, and have now done the same for the month leading up to the 2016 elections.
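The collection script itself isn't shown in this post, but here is a rough sketch of the idea (the search URL, date operators, and HTML selector below are guesses for illustration, not the actual script):
import requests
from bs4 import BeautifulSoup

def search_tweets(full_name, since, until):
    ## build an advanced-search query for an exact name match within a date window
    params = {'q': '"%s" since:%s until:%s' % (full_name, since, until),
              'f': 'tweets'}
    resp = requests.get('https://twitter.com/search', params=params)
    soup = BeautifulSoup(resp.text, 'html.parser')
    ## the CSS selector here is a guess at the search page's markup
    return [p.get_text() for p in soup.select('p.tweet-text')]

## e.g. the month leading up to the 2014 midterms (hypothetical candidate)
tweets = search_tweets('Jane Doe', '2014-10-04', '2014-11-04')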
Now I will build the same model that I built in the previous post, except this time I will train on the 2014 races and evaluate on the 2016 races. Let's see how the model performs!
In [1]:
import pandas as pd
import numpy as np
import json
import codecs
import warnings
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
race_metadata = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata.csv')
race_metadata_2016 = pd.read_csv('~/election-twitter/elections-twitter/data/race-metadata-2016.csv')
In [3]:
race_metadata.head()
Out[3]:
In [4]:
race_metadata_2016.head()
Out[4]:
In [5]:
## How many races in the train and test sets?
race_metadata.shape[0], race_metadata_2016.shape[0]
Out[5]:
In [6]:
## add a winner column; the winner is listed first in the Result JSON
race_metadata['winner'] = race_metadata.Result.apply(lambda x: json.loads(x)[0][0])
race_metadata_2016['winner'] = race_metadata_2016.Result.apply(lambda x: json.loads(x)[0][0])
In [7]:
race_metadata.head()
Out[7]:
In [8]:
race_metadata_2016.head()
Out[8]:
In [9]:
## how many candidates in each race in the train set?
race_metadata.Result.apply(lambda x: len(json.loads(x))).describe()
Out[9]:
In [10]:
## how many candidates in each race in the test set?
race_metadata_2016.Result.apply(lambda x: len(json.loads(x))).describe()
Out[10]:
The code below is the same as in the last post, with a couple of modifications:
In [11]:
def make_ascii(s):
    return s.encode('ascii','ignore').decode('ascii')

def make_df(race_metadata, year=2014):
    values = []
    path = '/Users/adamwlevin/election-twitter/elections-twitter/data/tweets'
    if year==2016:
        path += '/t2016'
    for row_ind, row in race_metadata.iterrows():
        try:
            with codecs.open('%s/%s.json' % (path, make_ascii(row.Race).replace(' ','')),
                             'r', 'utf-8-sig') as f:
                tweets = json.load(f)
        except FileNotFoundError:
            print('Did not find %s' % (row.Race,))
            continue
        for candidate, data in tweets.items():
            ## skip the non-candidate entries that appear in some results
            if candidate in ('–','Blank/Void/Scattering','Write-Ins','Others'):
                continue
            ## record collects tweets, replies, retweets, favorites;
            ## each slot is reassigned below, so the shared [] is harmless
            record = [[]]*4
            for date, data_ in data.items():
                ## the misspelled sentinel matches what the scraper wrote on failure
                if data_ and data_!='Made 5 attempts, all unsucessful.':
                    data_ = np.array(data_)
                    for i in range(4):
                        record[i] = \
                            np.concatenate([record[i],
                                            data_[:,i].astype(int) if i!=0 else data_[:,i]])
            values.append([candidate]+record+[1 if candidate==row.winner else 0, row_ind])
    return pd.DataFrame(values, columns=['candidate','tweets','replies',
                                         'retweets','favorites',
                                         'winner','race_index'])
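For reference, make_df expects each race's JSON file to look roughly like this (inferred from the parsing code above; the candidate and numbers are made up, and each row is tweet text, replies, retweets, favorites):
## inferred structure of data/tweets/<Race>.json
example = {
    'Jane Doe': {  ## hypothetical candidate
        '2014-10-05': [['Great event with Jane Doe!', '2', '0', '11'],
                       ['Jane Doe on the radio at 9', '0', '1', '3']],
        ## the scraper's failure sentinel for a day it could not fetch
        '2014-10-06': 'Made 5 attempts, all unsucessful.'
    }
}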
In [12]:
## make the train set and test set
df_train = make_df(race_metadata)
df_test = make_df(race_metadata_2016,year=2016)
In [13]:
## take a look at the result
df_train.head()
Out[13]:
In [14]:
df_test.head()
Out[14]:
In [15]:
## who has the most tweets of the 2016 candidates?
df_test.loc[df_test.tweets.apply(len).idxmax()]
Out[15]:
Now, I will follow the same model building procedure as last time. There are three classes of features: metadata about the language within the tweets (e.g. average number of words per tweet), metadata about the tweets themselves (e.g. average number of replies per tweet), and a tf-idf vectorizer that uses words as tokens and treats all of the tweets about a candidate, concatenated together, as a document (e.g. the tf-idf score of the word "congressman"). The model is an XGBoost classifier.
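To make the document construction concrete, here is a toy illustration (standalone, with made-up tweets; not part of the pipeline below) of treating each candidate's concatenated tweets as one document:
from sklearn.feature_extraction.text import TfidfVectorizer

## two hypothetical candidates, each with a handful of tweets
tweets_by_candidate = {'candidate_a': ['the congressman spoke today',
                                       'great rally tonight'],
                       'candidate_b': ['challenger holds town hall',
                                       'get out and vote tuesday']}

## one concatenated document per candidate, as in the pipeline below
corpus = [' '.join(tweets) for tweets in tweets_by_candidate.values()]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)  ## (2 candidates, vocabulary size)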
In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
In [17]:
## This is useful for selecting a subset of features in the middle of a Pipeline
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys, ndim):
        self.keys = keys
        self.ndim = ndim

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]

## Making some features about the text itself
class TweetTextMetadata(BaseEstimator, TransformerMixin):
    names = ['ave_words_per_tweet','total_number_words','ave_word_len',
             'total_periods','total_q_marks']

    def fit(self, x, y=None):
        return self

    def transform(self, docs):
        ave_words_per_tweet = [sum(len(tweet.split(' ')) for tweet in tweets)/len(tweets)
                               if len(tweets) else 0
                               for tweets in docs]
        total_number_words = [sum(len(tweet.split(' ')) for tweet in tweets)
                              for tweets in docs]
        ave_word_len = [sum(len(word) for tweet in tweets
                            for word in tweet.split(' '))/
                        sum(1 for tweet in tweets
                            for word in tweet.split(' '))
                        if len(tweets) else 0
                        for tweets in docs]
        total_periods = [sum(tweet.count('.') for tweet in tweets)
                         for tweets in docs]
        total_q_marks = [sum(tweet.count('?') for tweet in tweets)
                         for tweets in docs]
        ## stack every local list into a feature matrix (column order matches names)
        return np.column_stack([value
                                for key, value in locals().items()
                                if isinstance(value, list)])

## Making some features about the favorites, retweets, etc.
class TweetStats(BaseEstimator, TransformerMixin):
    names = ['total_replies','total_retweets','total_favorites',
             'num_tweets','ave_replies_per_tweet','ave_retweets_per_tweet',
             'ave_favorites_per_tweet','ninety_eighth_percentile_replies',
             'ninety_eighth_percentile_retweets',
             'ninety_eighth_percentile_favorites']

    def fit(self, x, y=None):
        return self

    def transform(self, df):
        warnings.filterwarnings("ignore",
                                message="Mean of empty slice.")
        total_replies = df.replies.apply(sum)
        total_retweets = df.retweets.apply(sum)
        total_favorites = df.favorites.apply(sum)
        num_tweets = df.replies.apply(len)
        ave_replies_per_tweet = df.replies.apply(np.mean).fillna(0)
        ave_retweets_per_tweet = df.retweets.apply(np.mean).fillna(0)
        ave_favorites_per_tweet = df.favorites.apply(np.mean).fillna(0)
        ninety_eighth_percentile_replies = df.replies.apply(
            lambda x: np.percentile(x, 98.) if len(x) else 0.)
        ninety_eighth_percentile_retweets = df.retweets.apply(
            lambda x: np.percentile(x, 98.) if len(x) else 0.)
        ninety_eighth_percentile_favorites = df.favorites.apply(
            lambda x: np.percentile(x, 98.) if len(x) else 0.)
        ## stack every local Series into a feature matrix (column order matches names)
        return np.column_stack([value.values for key, value in locals().items()
                                if isinstance(value, pd.Series)])

## This inherits from TfidfVectorizer and just cleans the tweets a little before
## vectorizing them (this is probably unnecessary but I haven't tested it)
class CustomTfidfVectorizer(TfidfVectorizer):
    def cleanse_tweets(self, tweets):
        ## concatenate a candidate's tweets into one document,
        ## dropping URLs and @-mentions
        return ' '.join([word for tweet in tweets
                         for word in tweet.split(' ')
                         if 'http://' not in word
                         and 'www.' not in word
                         and '@' not in word
                         and 'https://' not in word
                         and '.com' not in word
                         and '.net' not in word])

    def fit(self, x, y=None):
        return super().fit(x.apply(self.cleanse_tweets).values)

    def transform(self, x):
        return super().transform(x.apply(self.cleanse_tweets).values)

    def fit_transform(self, x, y=None):
        self.fit(x, y)
        return self.transform(x)

## This takes an XGBClassifier and finds the optimal number of trees using CV
def get_num_trees(clf, X, y, cv, eval_metric='logloss', early_stopping_rounds=10):
    n_trees = []
    for train, test in cv.split(X, y):
        clf.fit(X[train], y[train],
                eval_set=[[X[test], y[test]]],
                eval_metric=eval_metric,
                early_stopping_rounds=early_stopping_rounds,
                verbose=False)
        n_trees.append(clf.best_iteration)
    ## average the best iteration across folds
    print('Number of trees selected: %d' %
          (int(sum(n_trees)/len(n_trees)),))
    return int(sum(n_trees)/len(n_trees))
In [18]:
## stop words: every candidate name token plus the standard English stop words
names = [name_.lower() for result in race_metadata.Result
         for name,_,_ in json.loads(result) for name_ in name.split()]
stop_words = names + list(ENGLISH_STOP_WORDS)
In [19]:
## I grid-searched some of the hyperparameters below using grouped CV
features = FeatureUnion(
[
('tfidf',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('tfidf',CustomTfidfVectorizer(use_idf=False,
stop_words=stop_words,
ngram_range=(1,1),
min_df=.05))
])),
('tweet_metadata',Pipeline([
('selector',ItemSelector(keys='tweets',ndim=1)),
('metadata_extractor',TweetTextMetadata())
])),
('tweet_stats',Pipeline([
('selector',ItemSelector(keys=['replies','retweets',
'favorites'],
ndim=2)),
('tweet_stats_extractor',TweetStats())
]))
])
clf = XGBClassifier(learning_rate=.01,n_estimators=100000,
subsample=.9,max_depth=2)
In [20]:
## make train matrix, fit model on train set
X = features.fit_transform(df_train[['tweets','replies',
'retweets','favorites']])
y = df_train['winner'].values
cv = StratifiedKFold(n_splits=6,shuffle=True)
n_estimators = get_num_trees(clf,X,y,cv)
clf.n_estimators = n_estimators
clf.fit(X,y)
feature_names = sorted(['WORD_%s' % (word,)
for word in features.get_params()['tfidf'].get_params()['tfidf'].vocabulary_.keys()]) +\
TweetTextMetadata.names +\
TweetStats.names
In [21]:
## print top 10 importances and their names
importances = clf.feature_importances_
importances = {u: val for u, val in enumerate(importances)}
for ind in sorted(importances, key=importances.get, reverse=True)[:10]:
    print(feature_names[ind], importances[ind])
Looking at these feature importances a second time, it looks like the words chosen as important are proxies for whether the candidate is the incumbent. This makes sense, from the little that I know about politics.
Let's look at the training accuracy:
In [22]:
preds = clf.predict_proba(X)[:,1]
## put the raw predictions in the dataframe so we can use df.groupby
df_train['pred_raw'] = preds
In [23]:
df_train.head()
Out[23]:
In [24]:
## get dictionaries mapping race index to the row index of the predicted and true winners
preds = df_train.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true = df_train.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
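As a toy illustration of what these dictionaries hold (hypothetical numbers; each value is the row index of the top candidate within a race):
toy = pd.DataFrame({'race_index': [0, 0, 1, 1],
                    'pred_raw':   [0.2, 0.7, 0.9, 0.4],
                    'winner':     [0, 1, 0, 1]})
## row index of the highest-probability candidate per race
print(toy.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict())  ## {0: 1, 1: 2}
## row index of the actual winner per race
print(toy.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict())  ## {0: 1, 1: 3}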
In [25]:
## get train accuracy at the race level
acc = np.mean([preds[race_ind]==true[race_ind] for race_ind in df_train.race_index.unique()])
acc
Out[25]:
Now let's test the model on the 2016 races to see how it performs. I will produce the features using the same pipeline as earlier, make predictions with the trained model, and then make a plot and compute the accuracy.
In [26]:
## get test matrix and predictions
X_test = features.transform(df_test[['tweets','replies','retweets','favorites']])
preds_test = clf.predict_proba(X_test)[:,1]
In [27]:
## make a plot
fig,ax = plt.subplots(1,1,figsize=(13,5))
plt.hist(preds_test[(df_test.winner==1).values],alpha=.5,
label='predictions for winners');
plt.hist(preds_test[(df_test.winner==0).values],alpha=.5,
label='predictions for non-winners');
plt.legend();
plt.title('Test Set Predictions');
In [28]:
## put the raw predictions in the test dataframe so we can use df.groupby
df_test['pred_raw'] = preds_test
In [29]:
## get dictionaries mapping race index to the row index of the predicted and true winners, this time on the test set
preds_test = df_test.groupby('race_index').pred_raw.apply(lambda x: x.idxmax()).to_dict()
true_test = df_test.groupby('race_index').winner.apply(lambda x: x.idxmax()).to_dict()
In [30]:
## get test accuracy at the race level
acc = np.mean([preds_test[race_ind]==true_test[race_ind] for race_ind in df_test.race_index.unique()])
acc
Out[30]:
88% accuracy is not bad, considering I used nothing but tweets and built in no prior knowledge.
Let's take a quick look at where the model failed. First, here's the highest raw prediction (the probability that the candidate will win) for a non-winner:
In [31]:
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).head(1)
Out[31]:
In [32]:
## take a look at the first 30 tweets for Sarah Lloyd
df_test[~df_test.winner.astype(bool)].sort_values('pred_raw',ascending=False).tweets.iloc[0][0:30]
Out[32]:
In [33]:
## the race Lloyd lost
df_test[df_test.race_index==151]
Out[33]:
In [34]:
print(race_metadata_2016.loc[151])
print(race_metadata_2016.loc[151].Result)
So this looks like a race where Democratic enthusiasm (or Twitter activism) was high but the Republican won. It could also have a little to do with the fact that Sarah Lloyd is also the name of a British travel writer, though I am not sure.
Now let's take a look at the lowest raw prediction for a winner:
In [35]:
df_test[df_test.winner.astype(bool)].sort_values('pred_raw').head(1)
Out[35]:
In [36]:
## this race
df_test[df_test.race_index==69]
Out[36]:
In [37]:
print(race_metadata_2016.loc[69])
print(race_metadata_2016.loc[69].Result)
This one makes more sense to me: "Jason T. Smith" is what I have as the candidate's name (from Wikipedia). Since Twitter search looks for an exact string match, it makes sense that the model would not have a good read on this candidate, as people are unlikely to tweet a name with its middle initial.
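One mitigation (not something I've implemented) would be to also query a normalized form of each name with the middle initial dropped; a minimal sketch:
import re

def drop_middle_initial(name):
    ## 'Jason T. Smith' -> 'Jason Smith'; names without an initial pass through
    return re.sub(r'\s+[A-Z]\.\s+', ' ', name)

print(drop_middle_initial('Jason T. Smith'))  ## Jason Smith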
Stay tuned: as a next step, I plan to collect the tweets for the month leading up to the 2018 congressional races and post my predictions on election day.