US Presidential Race 2016

I was curious about the different candidates' speeches and whether we could see any differences or trends in the actual text of those speeches.

Methodology and sources

The text was retrieved from various public sources; most came from UCSB's American Presidency Project site. I focused strictly on monologues, not debates or interviews, to keep the dataset's qualities uniform across candidates. Some texts were "as-prepared" and some were transcribed from audio; the latter included bracketed references to audio cues or other speakers. Unfortunately, by focusing on monologues, I had to leave out one prominent candidate -- Donald Trump. All of the public data I found on his speeches consisted of debates, interviews, or Q&A sessions where the audience influenced the subject matter.

Hillary Clinton had far more speeches available that met these criteria than the other candidates.


In [254]:
%matplotlib inline

from __future__ import print_function
from __future__ import unicode_literals

# Speeches dataset:
import candidates

from itertools import chain
import nltk
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set_context("notebook", font_scale=2.5, )

BRACKETED = re.compile(r'\[.*?\]')
# LancasterStemmer is a more aggressive alternative; we use the gentler Porter.
st = PorterStemmer()
def stem_tokens(tokens):
    return [st.stem(token) for token in tokens]

def text_filter(text):
    '''Delete bracketed references from transcripts, usually just
       descriptions of inaudible activity or audience reaction.
    '''
    de_bracketed = BRACKETED.sub(u'', text)
    tokens = nltk.word_tokenize(de_bracketed)

    return ' '.join(stem_tokens(tokens))

def fix_labels(ax, align='right'):
    labels = [l.get_text() for l in ax.get_xticklabels()]
    ax.set_xticklabels(labels, rotation=40, ha=align)


dem_speeches = [candidate.values() for candidate in candidates.DEMOCRAT.values()]
rep_speeches = [candidate.values() for candidate in candidates.REPUBLICAN.values()]
PARTIES = {'Democrat': dem_speeches, 'Republican': rep_speeches, }


def feature_by_party():
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=6,
                                 ngram_range=(2, 5),
                                 stop_words='english')

    for name, speeches in PARTIES.items():
        party = [text_filter(speech) for speech in list(chain(*speeches))]
        tfidf = vectorizer.fit_transform(party)
        feature_names = vectorizer.get_feature_names()
        print('\t', name, feature_names)

def feature_by_speaker():
    frames = []
    speech_text = {}
    for party, candidates_ in candidates.all_.items():
        for name, candidate in candidates_.items():
            speeches = [speech for date, speech in candidate.items()]
            # Join with a space so adjacent speeches don't fuse into one token.
            content = ' '.join(text_filter(speech) for speech in speeches)
                            
            speech_text[name] = content
            
            blob = TextBlob(content)

            # Subjectivity is available too, but only polarity is plotted below.
            speeches_sentiment = [sent.sentiment.polarity for sent in blob.sentences]
            
            speeches_sentence_len = [len(sent.words) for sent in blob.sentences]
 
            new_frame = pd.DataFrame({'sentence_words': speeches_sentence_len,
                                      'sentiment': speeches_sentiment,
                                      'candidate': name,
                                      'party': party,
                                      })
            frames.append(new_frame)

    return speech_text, pd.concat(frames)
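
As a quick sanity check, here's text_filter on a made-up transcript snippet (not from the dataset): the bracketed stage direction disappears and the remaining tokens are stemmed.

# Hypothetical snippet, purely to illustrate text_filter's behavior.
print(text_filter(u'Thank you. [applause] We support small businesses.'))
# roughly: thank you . we support small busi .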

Feature extraction

Let's look at the topics discussed by each party. It's pretty interesting: some of these trends are pretty much what you might traditionally expect from each party. We'll look for the highest-weighted n-grams of two to five consecutive words -- bigrams and longer.

In these speeches, Republicans tend to discuss the federal government and the nation; Democrats talk about topics like the middle class, small business, and young people, with plenty of national language as well.
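
If the n-gram settings are unfamiliar, here's a minimal sketch on two made-up phrases (not the speech data) showing what ngram_range produces; the real run above uses (2, 5) plus TF-IDF weighting and stop-word removal over whole speeches.

# Toy fit: with ngram_range=(2, 3), every consecutive pair and triple of
# words becomes a candidate feature.
toy = TfidfVectorizer(ngram_range=(2, 3))
toy.fit([u'grow the middle class', u'help the middle class'])
print(toy.get_feature_names())
# features are word pairs and triples, e.g. u'middle class', u'the middle class'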


In [255]:
feature_by_party()
speech_text, df = feature_by_speaker()


	 Republican [u'feder govern', u'god bless', u'imagin presid', u'polit class', u'thi countri', u'thi nation']
	 Democrat [u'middl class', u'million american', u'small busi', u'thi countri', u'unit state', u'young peopl']

Abbreviations

The features above are abbreviated in odd ways because I used a "stemmer" to reduce each word to its root. If we omitted the stemmer, we'd get distinct bigrams like "small business" and "small businesses."
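
A quick illustration (not part of the pipeline) of what the Porter stemmer does to those two forms:

# Both inflected forms reduce to the same root, which is why the
# feature list above shows u'small busi'.
porter = PorterStemmer()
print(porter.stem(u'business'), porter.stem(u'businesses'))
# busi busi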

Sentence length

Research by Braun et al. showed that "Lying politicians used more words ... than truth-tellers ..." Can we see any trends in sentence length among these candidates?

Braun's research was based on PolitiFact entries, and the selection of those entries is very different from general speech: much of what's shown below isn't a discrete claim to be evaluated the way PolitiFact entries are. So the finding might not really apply here at all.


In [256]:
plt.figure(figsize=(15,15))
ax = sns.violinplot(x='candidate', y='sentence_words', data=df)
fix_labels(ax)


Word count distribution

It's interesting to note that Senator Sanders trends mildly higher than any other candidate, while Dr. Ben Carson trends lower than any other. Some of the outliers on the high end are arguably artifacts of TextBlob's sentence tokenizer, which treats a long, many-clause stretch of Sanders's (and others') monologues as a single sentence with very many words. The transcriptions with these anomalies delimit those parts of the speech with ellipses and semicolons.
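
Here's a contrived example (not from the transcripts) of the tokenizer behavior in question:

# The tokenizer splits on terminal punctuation, so semicolon-joined
# clauses count as one long "sentence".
demo = TextBlob(u'We will fight; we will organize; we will win.')
print([len(sent.words) for sent in demo.sentences])
# [9] -- one nine-word sentence, not three short ones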


In [257]:
from nltk import SimpleGoodTuringProbDist, FreqDist
letter_count = 26
LETTERS = [chr(l) for l in range(ord('a'), ord('a') + letter_count)]

def letters_by_speaker(speech_text):
    f = plt.figure(figsize=(15,15))
    f.add_axes()

    letter_frequencies = {}
    for name, text in speech_text.items():
        freq = FreqDist(text.lower())
        p = SimpleGoodTuringProbDist(freq)
        lett_freq = [p.prob(letter) for letter in LETTERS]

        letter_frequencies[name] = lett_freq
        plt.plot(lett_freq, label=name)

    f.gca().set_xticks(range(len(LETTERS)))
    f.gca().set_xticklabels(LETTERS)
    plt.legend()

    return letter_frequencies

Letter Frequencies

How about letter frequencies by speaker? Well, I'd hoped that this might reveal something interesting about the words chosen by each candidate.

But as you can see below, they're barely distinguishable. This graph is really just showing us the letter frequency of English.

Maaaaybe if we're feeling generous -- it's possible that the word choice is a function of the speaker's perception of the audience, and the audiences among the candidates are similar.
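
For a concrete sense of the distribution being plotted, here's FreqDist on an arbitrary pangram (made-up input; the notebook additionally smooths the counts with SimpleGoodTuringProbDist):

# FreqDist over a raw string counts every character -- spaces and
# punctuation included -- but only the a-z probabilities get plotted.
sample = FreqDist(u'the quick brown fox jumps over the lazy dog')
print([(letter, sample.freq(letter)) for letter in (u'e', u't', u'q')])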


In [258]:
freq = letters_by_speaker(speech_text)


Scrabble

To follow through on this letter-frequency concept -- what if we scored the candidates' speeches using Scrabble tile values?

Well, since the graph above shows ever-so-slight differences in some letters, you can see the resulting differences in their ultimate scores. Again, they're not very distinct, but Clinton comes out on top and Jeb Bush at the bottom.

Ok, this particular exercise was a little silly, but fun.


In [259]:
SCRABBLE_TILES = {
    'e': 1, 'a': 1, 'i': 1, 'o': 1, 'n': 1, 'r': 1, 't': 1, 'l': 1, 's': 1, 'u': 1,
    'd': 2, 'g': 2,
    'b': 3, 'c': 3, 'm': 3, 'p': 3,
    'f': 4, 'h': 4, 'v': 4, 'w': 4, 'y': 4,
    'k': 5,
    'j': 8, 'x': 8,
    'q': 10, 'z': 10,
}
    
def score(letter_freqs):
    return sum(freq * SCRABBLE_TILES[l] for l, freq in zip(LETTERS, letter_freqs))
CANDIDATES = freq.keys()
g = plt.bar(range(len(CANDIDATES)),
        [score(freq[cand]) for cand in CANDIDATES]
        )
ax = plt.gca()
ax.set_xticks(range(len(CANDIDATES)))
ax.set_xticklabels(CANDIDATES)
ax.set_ylim((1.25, 1.35))
ax.set_ylabel('Relative Scrabble score')
fix_labels(ax, 'center')
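
As a sanity check on score(): it's the expected Scrabble tile value of a single letter drawn from the given distribution. A uniform distribution over the 26 letters would score 87/26, about 3.35, while these speeches sit near 1.3 because the most common English letters are all one-point tiles.

# Hypothetical uniform distribution: sum of all tile values (87) / 26.
uniform = [1.0 / len(LETTERS)] * len(LETTERS)
print(score(uniform))  # ~3.35, far above any of the speeches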


Sentiment analysis

What kinds of things are these candidates saying? I took TextBlob's sentiment for each candidate's sentences and plotted the sentiment polarity (ignoring the subjectivity). At first, I didn't see much of note: most of the candidates' sentences skew more positive than negative.

On closer examination, two candidates stand out -- Ben Carson and Carly Fiorina. Both differ from their peers in not having had previous political experience, and both show very little polarity in their sentences. Perhaps board meetings and morbidity/mortality conferences are structured differently from traditional stump speeches. In any case, it was an interesting alignment among the candidates.
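
For a sense of scale, polarity runs from -1 (negative) to +1 (positive); here it is on two made-up sentences, not lines from the speeches:

# TextBlob polarity is in [-1, 1]; 0 is neutral.
for sent in (u'This is a great country.', u'The system is broken.'):
    print(sent, TextBlob(sent).sentiment.polarity)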


In [260]:
plt.figure(figsize=(15,15))
ax = sns.violinplot(x='candidate', y='sentiment', data=df)
fix_labels(ax)
ax.set_ylabel('Sentiment Polarity per sentence')


Out[260]:
<matplotlib.text.Text at 0x7f770be69250>