arrows: Yet Another Twitter/Python Data Analysis

Geospatially, Temporally, and Linguistically Analyzing Tweets about Top U.S. Presidential Candidates with Pandas, TextBlob, Seaborn, and Cartopy

Hi, I'm Raj. For my internship this summer, I've been using data science and geospatial Python libraries like xray, numpy, rasterio, and cartopy. A week ago, I had a discussion about the relevance of Bernie Sanders among millennials - and so, I set out to get a rough idea by looking at recent tweets.

I don't explain any of the code in this document, so feel free to skip the code and just look at the results. If you're interested in going further with this data, I've posted the source code and the dataset at https://github.com/raj-kesavan/arrows.

If you have any comments or suggestions (on either code or analysis), please let me know at rajk@berkeley.edu. Enjoy!

First, I used Tweepy to pull down 20,000 tweets for each of Hillary Clinton, Bernie Sanders, Rand Paul, and Jeb Bush [retrieve_tweets.py].
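
If you want to pull a similar dataset yourself, the collection step boils down to a Tweepy Cursor over Twitter's search API - something like the sketch below, with placeholder credentials. This isn't the actual retrieve_tweets.py (that's in the repo), and note that in tweepy 3.x the method is api.search rather than api.search_tweets.


In [ ]:
# Minimal sketch of the collection step - not the actual retrieve_tweets.py.
# Replace the placeholder strings with your own Twitter API credentials.
import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

candidates = ['Hillary Clinton', 'Bernie Sanders', 'Rand Paul', 'Jeb Bush']
tweets = {
    candidate: [status._json  # raw tweet dict, ready to flatten into a CSV row
                for status in tweepy.Cursor(api.search_tweets, q=candidate, count=100).items(20000)]
    for candidate in candidates
}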

I've also already done some calculations, specifically of polarity, subjectivity, influence, influenced polarity, and longitude and latitude (all explained later) [preprocess.py].
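
For example, the polarity and subjectivity columns are just TextBlob's sentiment scores for each tweet's text (more on TextBlob below). Here's a minimal sketch of that step - add_sentiment is a made-up name; see preprocess.py for the real code.


In [ ]:
# Sketch of the sentiment step: run TextBlob over the text column and store
# its polarity and subjectivity scores as new columns.
from textblob import TextBlob
import pandas as pd

def add_sentiment(df: pd.DataFrame) -> pd.DataFrame:
    sentiments = df.text.apply(lambda text: TextBlob(text).sentiment)
    return df.assign(
        polarity=sentiments.apply(lambda s: s.polarity),
        subjectivity=sentiments.apply(lambda s: s.subjectivity),
    )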


In [2]:
from arrows.preprocess import load_df

Just adding some imports and setting graph display options.


In [3]:
from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import cartopy
pd.set_option('display.max_colwidth', 200)
pd.options.display.mpl_style = 'default'
matplotlib.style.use('ggplot')
sns.set_context('talk')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = [12.0, 8.0]
%matplotlib inline

Let's look at our data!

load_df loads it in as a pandas.DataFrame, excellent for statistical analysis and graphing.


In [ ]:
df = load_df('arrows/data/results.csv')

In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 80000 entries, 0 to 80025
Data columns (total 21 columns):
candidate               80000 non-null object
coordinates             87 non-null object
created_at              80000 non-null datetime64[ns]
favorite_count          80000 non-null object
geo                     87 non-null object
id                      80000 non-null float64
lang                    80000 non-null object
place                   743 non-null object
retweet_count           80000 non-null float64
text                    80000 non-null object
user_followers_count    79994 non-null float64
user_location           53628 non-null object
user_name               79973 non-null object
user_screen_name        79974 non-null object
user_time_zone          50386 non-null object
polarity                80000 non-null float64
subjectivity            80000 non-null float64
influence               79994 non-null float64
influenced_polarity     79994 non-null float64
latitude                743 non-null float64
longitude               743 non-null float64
dtypes: datetime64[ns](1), float64(9), object(11)
memory usage: 13.4+ MB

We'll be looking primarily at candidate, created_at, lang, place, user_followers_count, user_time_zone, polarity, influenced_polarity, and text.


In [15]:
df[['candidate', 'created_at', 'lang', 'place', 'user_followers_count', 
    'user_time_zone', 'polarity', 'influenced_polarity', 'text']].head(1)


Out[15]:
candidate created_at lang place user_followers_count user_time_zone polarity influenced_polarity text
0 Bernie Sanders 2015-07-06 01:52:42 en NaN 1642 Eastern Time (US & Canada) 0.285714 16.378184 RT @DrTomMartinPhD: BERNIE SANDERS QUOTE ON #BILLMAHER, "Hillary Clinton &amp; I Have The Right Message. We're Both Speaking The Truth." http:/…

First I'll look at sentiment, calculated with TextBlob using the text column. Sentiment is composed of two values: polarity, a measure of how positive or negative a text is, and subjectivity, a measure of how opinionated it is. Polarity ranges from -1.0 to 1.0; subjectivity from 0.0 to 1.0.


In [16]:
TextBlob("Tear down this wall!").sentiment


Out[16]:
Sentiment(polarity=-0.19444444444444448, subjectivity=0.2888888888888889)

Unfortunately, it doesn't work too well on anything other than English.


In [17]:
TextBlob("Radix malorum est cupiditas.").sentiment


Out[17]:
Sentiment(polarity=0.0, subjectivity=0.0)

TextBlob has a cool translate() function that uses Google Translate to take care of that for us, but we won't be using it here, since tweets include a lot of slang and abbreviations that don't translate very well.


In [6]:
sentence = TextBlob("Radix malorum est cupiditas.").translate()
print(sentence)
print(sentence.sentiment)


The root of evil.
Sentiment(polarity=-1.0, subjectivity=1.0)

All right - let's figure out the most (positively) polarized English tweets.


In [19]:
english_df = df[df.lang == 'en']
english_df.sort('polarity', ascending = False).head(3)[['candidate', 'polarity', 'subjectivity', 'text']]


Out[19]:
candidate polarity subjectivity text
2287 Bernie Sanders 1 1.0 Republicans Welcomed Bernie Sanders to Wisconsin By Calling Him an Extremist. His Response? Perfect. http://t.co/ksaN85UuS8
810 Bernie Sanders 1 0.3 BEST OF SUNDAY TALK CNN SOTU #Sanders draws 2016 record crowd in Iowa-http://t.co/XnSbMweMbW http://t.co/wWqpmsVdEI
31467 Hillary Clinton 1 0.3 @whitehouse but there is one thing i want to be known by all the world: my best wish goes to Lady Hillary Clinton. it's said and done

Extrema don't mean much. We might get more interesting data with mean polarities for each candidate. Let's also look at influenced polarity, which takes into account the number of retweets and followers.


In [20]:
candidate_groupby = english_df.groupby('candidate')
candidate_groupby[['polarity', 'influence', 'influenced_polarity']].mean()


Out[20]:
polarity influence influenced_polarity
candidate
Bernie Sanders 0.096348 162.142172 14.758500
Hillary Clinton 0.037577 176.315714 7.561452
Jeb Bush 0.026713 318.453703 16.174172
Rand Paul 0.086817 144.550312 10.042045

So tweets about Jeb Bush, on average, aren't as positive as those about the other candidates, but the people tweeting about Bush have more followers and get more retweets.

I used the formula influence = sqrt(followers + 1) * sqrt(retweets + 1). You can experiment with different functions if you like [preprocess.py:influence].
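
In code, that works out to roughly the sketch below. Judging from the numbers in the next table, influenced polarity is just polarity scaled by influence - see preprocess.py for the real definitions.


In [ ]:
# Sketch of the influence formula described above (see preprocess.py:influence).
def influence(followers, retweets):
    return np.sqrt(followers + 1) * np.sqrt(retweets + 1)

def influenced_polarity(polarity, followers, retweets):
    # polarity weighted by how far the tweet is likely to reach
    return polarity * influence(followers, retweets)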

We can look at the most influential tweets about Jeb Bush to see what's up.


In [23]:
jeb = candidate_groupby.get_group('Jeb Bush')
jeb_influence = jeb.sort('influence', ascending = False)
jeb_influence[['influence', 'polarity', 'influenced_polarity', 'user_name', 'text', 'created_at']].head(5)


Out[23]:
influence polarity influenced_polarity user_name text created_at
52023 89594.614397 0.500000 44797.307199 CNN Breaking News Jeb Bush on Donald Trump: "His views are way out of the mainstream of what most Republicans think." http://t.co/5K3Gi7AcQG 2015-07-05 02:14:26
55849 68470.590942 0.000000 0.000000 The New York Times Jeb Bush, whose wife is Mexican, says he takes Donald Trump’s remarks personally http://t.co/9CZFg4Nbcb 2015-07-04 20:10:06
47246 53754.716258 -0.066667 -3583.647751 Donald J. Trump Flashback – Jeb Bush says illegal immigrants breaking our laws is an “act of love” http://t.co/p8yFzVuw8w He will never secure the border. 2015-07-05 15:23:20
50459 53641.142046 0.000000 0.000000 CNN Jeb Bush: Trump comments meant 'to draw attention.'\nhttp://t.co/chKrOnsntE http://t.co/bZnmLgyN7l 2015-07-05 03:55:25
47616 51601.878338 0.200000 10320.375668 Donald J. Trump Jeb Bush will never secure our border or negotiate great trade deals for American workers. Jeb doesn't see &amp; can't solve the problems. 2015-07-05 15:02:22

Side note: you can see that sentiment analysis isn't perfect - the last tweet is certainly negative toward Jeb Bush, but it was actually assigned a positive polarity. Over a large number of tweets, though, sentiment analysis is more meaningful.

As for the high influence of tweets about Bush: it looks like Donald Trump (someone with a lot of followers) has been tweeting a lot about Bush rather than about the other candidates - one possible reason for Jeb's higher influenced_polarity.


In [26]:
df[df.user_name == 'Donald J. Trump'].groupby('candidate').size()


Out[26]:
candidate
Jeb Bush    4
dtype: int64

Looks like our favorite toupéed candidate hasn't even been tweeting about anyone else!

What else can we do? We know the language each tweet was (tweeted?) in.


In [27]:
language_groupby = df.groupby(['candidate', 'lang'])
language_groupby.size()


Out[27]:
candidate        lang
Bernie Sanders   ar          1
                 da          2
                 de         55
                 el         33
                 en      19208
                 es         55
                 et          1
                 fr        447
                 in          6
                 it          1
                 ko          1
                 nl         47
                 no          1
                 pl          7
                 pt          8
                 sk          2
                 sl          1
                 sv          7
                 tl          2
                 tr          5
                 und       107
                 vi          3
Hillary Clinton  ar          5
                 de        168
                 en      18100
                 es        841
                 et          2
                 fa          4
                 fr        202
                 hi         31
                         ...  
Jeb Bush         sl          1
                 tl          1
                 tr          1
                 und        67
                 vi          7
                 zh          2
Rand Paul        da          3
                 de         14
                 en      19607
                 es        165
                 et         12
                 fi          1
                 fr         42
                 hi          2
                 ht          6
                 in          7
                 it          5
                 ja          6
                 ko          1
                 lv          2
                 nl          2
                 pl          3
                 pt         15
                 ru          9
                 sk          2
                 sv          2
                 th          2
                 tl          1
                 tr          4
                 und        87
dtype: int64

That's a lot of languages! Let's try plotting to get a better idea, but first, I'll remove smaller language/candidate groups.

By the way, each lang value is an IANA language tag ('und' means the language couldn't be determined) - you can look them up at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.


In [28]:
largest_languages = language_groupby.filter(lambda group: len(group) > 10)

I'll also remove English, since it would just dwarf all the other languages.


In [40]:
non_english = largest_languages[largest_languages.lang != 'en']
non_english_groupby = non_english.groupby(['lang', 'candidate'], as_index = False)

sizes = non_english_groupby.text.agg(np.size)
sizes = sizes.rename(columns={'text': 'count'})
sizes_pivot = sizes.pivot_table(index='lang', columns='candidate', values='count', fill_value=0)

plot = sns.heatmap(sizes_pivot)
plot.set_title('Number of non-English Tweets by Candidate', family='Ubuntu')
plot.set_ylabel('language code', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)


Looks like Spanish and Portuguese speakers mostly tweet about Jeb Bush, while Francophones lean more liberal, and Clinton tweeters span the largest range of languages.

We also have the time-of-tweet information - I'll plot influenced polarity over time for each candidate. I'm also going to resample the influenced_polarity values to 1 hour intervals to get a smoother graph.


In [46]:
mean_polarities = df.groupby(['candidate', 'created_at']).influenced_polarity.mean()
plot = mean_polarities.unstack('candidate').resample('60min').plot()
plot.set_title('Influenced Polarity over Time by Candidate', family='Ubuntu')
plot.set_ylabel('influenced polarity', family='Ubuntu')
plot.set_xlabel('time', family='Ubuntu')
plot.figure.set_size_inches(12, 7)


Since I only took the most recent 20,000 tweets for each candidate, the Clinton tweets (a candidate with many, many tweeters) cover a much shorter timespan than the Rand Paul tweets.
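
One quick way to see how wide each candidate's window actually is:


In [ ]:
# Earliest and latest tweet in the dataset, per candidate.
df.groupby('candidate').created_at.agg(['min', 'max'])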

But we can still analyze the data in terms of hour-of-day. I'd like to know when tweeters in each language tweet each day, and I'm going to use percentages instead of raw number of tweets so I can compare across different languages easily.

By the way, the times in the dataframe are in UTC.


In [84]:
language_sizes = df.groupby('lang').size()
threshold = language_sizes.quantile(.75)

top_languages_df = language_sizes[language_sizes > threshold]
top_languages = set(top_languages_df.index) - {'und'}
top_languages


Out[84]:
{'de', 'en', 'es', 'fr', 'in', 'it', 'nl', 'pt'}

In [85]:
df['hour'] = df.created_at.apply(lambda datetime: datetime.hour) 
for language_code in top_languages:
    lang_df = df[df.lang == language_code]
    normalized = lang_df.groupby('hour').size() / lang_df.lang.count()
    plot = normalized.plot(label = language_code)

plot.set_title('Tweet Frequency by Language and Hour of Day', family='Ubuntu')
plot.set_ylabel('normalized frequency', family='Ubuntu')
plot.set_xlabel('hour of day (UTC)', family='Ubuntu')
plot.legend()
plot.figure.set_size_inches(12, 7)


Note that English, French, and Spanish are significantly flatter than the other languages - their speakers are spread across time zones all over the globe.

But why is Portuguese spiking at 11pm Brasilia time / 3 am Lisbon time? Let's find out! My first guess was that maybe there's a single person making a ton of posts at that time.
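
For reference, that spike sits in the 02:00 UTC bucket - converting that hour (on a date from the dataset) gives the local times above.


In [ ]:
# 02:00 UTC on a sample date from the dataset, shown in local time.
peak = pd.Timestamp('2015-07-06 02:00', tz='UTC')
print(peak.tz_convert('America/Sao_Paulo'))  # 23:00 the previous evening in Brasilia
print(peak.tz_convert('Europe/Lisbon'))      # 03:00 in Lisbon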


In [88]:
df_of_interest = df[(df.hour == 2) & (df.lang == 'pt')]

print('Number of tweets:', df_of_interest.text.count())
print('Number of unique users:', df_of_interest.user_name.unique().size)


Number of tweets: 446
Number of unique users: 407

So that's not it. Maybe there was a major event everyone was retweeting?


In [89]:
df_of_interest.text.head(25).unique()


Out[89]:
array([ 'Desabafo de garoto homossexual com medo do futuro comove Hillary Clinton http://t.co/7I49RkKtSm via @UOLNoticias @UOL',
       'Hillary Clinton eleva tom contra a China de olho na Presidência dos EUA http://t.co/qwgr57UsD0 Governo chinês tem discurso bem menos ácido.',
       'Desabafo de garoto homossexual com medo do futuro comove Hillary Clinton - Notícias - Internacional http://t.co/pSCrQgO8aQ',
       'RT @elpais_brasil: Facebook censurou o sofrimento de um garoto gay e Hillary Clinton saiu em sua defesa http://t.co/3R2XyiXrtr http://t.co/…',
       'RT @ReutersBrazil: Hillary Clinton diz que Irã continuará a ser ameaça a Israel apesar de acordo nuclear http://t.co/4gmjpI7sSQ',
       'RT @folha: Jeb Bush diz que foi atingido por críticas de Trump a mexicanos. http://t.co/qMt7QbaTVJ',
       'RT @jr140797: #betacaralhudosan Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-can... http://t.co/d6uUMIFiIV #betac…',
       'RT deigmar: Jeb Bush diz que foi atingido por críticas de Trump a mexicanos http://t.co/XbRXsyhyOi',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/J93xGpN13K',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos http://t.co/smxaTooY0M',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/duz9LU8NvY',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/EVtHvSInRr',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/yZspQsrNSF',
       '[FOLHA S.PAULO. BRA] Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candi... http://t.co/VM0QWlH5b9 vía J.A.M.V',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos http://t.co/yZspQsrNSF',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/hnEO3nSYHu',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/73eDAWXsCu',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos: O pré-candidato republicano à Casa Branca Jeb... http://t.co/JK8c6w68aP',
       'RT @thiago_beta51: #SegueSigoDeVolta Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:  http://t.co/Vhyb281I5c',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:  http://t.co/CMjD1e5sgP',
       '@DREWXAVECAO @EXFLOP Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:  http://t.co/mgTTwptBKe',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos http://t.co/Fct3cKj5f1 #folheando a @folha',
       'Jeb Bush diz que foi atingido por críticas de Trump a mexicanos:  http://t.co/iGnO3AnT1W',
       'Jeb Bush diz que foi atingido por críticas de Trump a\xa0mexicanos http://t.co/tMUUI5Dkpw'], dtype=object)

Seems to be a lot of these 'Jeb Bush diz que foi atingido...' tweets (roughly, "Jeb Bush says he was stung by Trump's criticism of Mexicans"). How many? We can't just count unique ones because they're all slightly different, but we can check for a long-enough substring.


In [90]:
df_of_interest[df_of_interest.text.str.contains('Jeb Bush diz que foi atingido')].text.count()


Out[90]:
440

Since a single language can span many countries, we might learn more if we look at location rather than just language.

We don't have very specific geolocation information other than timezone, so let's try plotting candidate sentiment over the 4 major U.S. timezones (Los Angeles, Denver, Chicago, and New York). This is also a good opportunity to look at a geographical map.


In [97]:
tz_df = english_df.dropna(subset=['user_time_zone'])
us_tz_df = tz_df[tz_df.user_time_zone.str.contains("US & Canada")]
us_tz_candidate_groupby = us_tz_df.groupby(['candidate', 'user_time_zone'])
us_tz_candidate_groupby.influenced_polarity.mean()


Out[97]:
candidate        user_time_zone             
Bernie Sanders   Central Time (US & Canada)     18.694226
                 Eastern Time (US & Canada)     20.221507
                 Mountain Time (US & Canada)    13.683829
                 Pacific Time (US & Canada)     16.496358
Hillary Clinton  Central Time (US & Canada)      3.302260
                 Eastern Time (US & Canada)     22.731770
                 Mountain Time (US & Canada)     0.196556
                 Pacific Time (US & Canada)      5.486306
Jeb Bush         Central Time (US & Canada)     14.766734
                 Eastern Time (US & Canada)     28.625515
                 Mountain Time (US & Canada)     6.356858
                 Pacific Time (US & Canada)     16.676979
Rand Paul        Central Time (US & Canada)      6.798783
                 Eastern Time (US & Canada)     15.359912
                 Mountain Time (US & Canada)    10.780279
                 Pacific Time (US & Canada)     12.918267
Name: influenced_polarity, dtype: float64

That's our raw data: now to plot it on a map. I got the timezone Shapefile from http://efele.net/maps/tz/world/. First, I read in the Shapefile with Cartopy.


In [95]:
tz_shapes = cartopy.io.shapereader.Reader('arrows/world/tz_world_mp.shp')
tz_records = list(tz_shapes.records())
tz_translator = {
     'Eastern Time (US & Canada)': 'America/New_York',
     'Central Time (US & Canada)': 'America/Chicago',
     'Mountain Time (US & Canada)': 'America/Denver',
     'Pacific Time (US & Canada)': 'America/Los_Angeles',
}
american_tz_records = {
    tz_name: next(filter(lambda record: record.attributes['TZID'] == tz_id, tz_records))
    for tz_name, tz_id 
    in tz_translator.items() 
}

Next, I have to choose a projection and plot it (again using Cartopy). The Albers equal-area projection is a good choice for maps of the contiguous U.S. I'll also download some feature sets from the Natural Earth dataset to display state borders.


In [98]:
albers_equal_area = cartopy.crs.AlbersEqualArea(-95, 35)
plate_carree = cartopy.crs.PlateCarree()

states_and_provinces = cartopy.feature.NaturalEarthFeature(
    category='cultural',
    name='admin_1_states_provinces_lines',
    scale='50m',
    facecolor='none'
)

cmaps = [matplotlib.cm.Blues, matplotlib.cm.Greens, 
         matplotlib.cm.Reds, matplotlib.cm.Purples]
norm = matplotlib.colors.Normalize(vmin=0, vmax=30) 

candidates = df['candidate'].unique()

plt.rcParams['figure.figsize'] = [6.0, 4.0]
for index, candidate in enumerate(candidates):
    plt.figure()
    plot = plt.axes(projection=albers_equal_area)
    plot.set_extent((-125, -66, 20, 50))
    plot.add_feature(cartopy.feature.LAND)
    plot.add_feature(cartopy.feature.COASTLINE)
    plot.add_feature(cartopy.feature.BORDERS)
    plot.add_feature(states_and_provinces, edgecolor='gray')
    plot.add_feature(cartopy.feature.LAKES, facecolor="#00BCD4")

    for tz_name, record in american_tz_records.items():
        tz_specific_df = us_tz_df[us_tz_df.user_time_zone == tz_name]
        tz_candidate_specific_df = tz_specific_df[tz_specific_df.candidate == candidate]
        mean_polarity = tz_candidate_specific_df.influenced_polarity.mean()

        plot.add_geometries(
            [record.geometry], 
            crs=plate_carree,
            color=cmaps[index](norm(mean_polarity)),
            alpha=.8
        )
    
    plot.set_title('Influenced Polarity toward {} by U.S. Timezone'.format(candidate), family='Ubuntu')
    plot.figure.set_size_inches(6, 3.5)
    plt.show()
    print()






My friend Gabriel Wang pointed out that U.S. timezones other than Pacific don't mean much since each timezone covers both blue and red states, but the data is still interesting.

As expected, Midwestern states lean toward Jeb Bush. I wasn't expecting Jeb Bush's highest-polarity tweets to come from the East; this is probably Donald Trump (New York, New York) messing with our data again.

In a few months I'll look at these statistics with the latest tweets and compare.

What are tweeters outside the U.S. saying about our candidates?

Outside of the U.S., if someone is in a major city, the timezone is often that city itself. Here are the top 25 non-American timezones in our dataframe, by number of tweets.


In [100]:
american_timezones = ('US & Canada|Canada|Arizona|America|Hawaii|Indiana|Alaska'
                      '|New_York|Chicago|Los_Angeles|Detroit|CST|PST|EST|MST')
foreign_tz_df = tz_df[~tz_df.user_time_zone.str.contains(american_timezones)]

foreign_tz_groupby = foreign_tz_df.groupby('user_time_zone')
foreign_tz_groupby.size().sort(inplace = False, ascending = False).head(25)


Out[100]:
user_time_zone
Quito                  1719
London                  967
Amsterdam               571
Athens                  368
Bangkok                 249
Beijing                 201
Brasilia                192
New Delhi               182
Tehran                  164
Jakarta                 164
Sydney                  134
Chennai                 133
Paris                   133
West Central Africa     130
Casablanca              129
Baghdad                 119
Dublin                  118
Tijuana                 117
Caracas                 103
Bucharest               100
Berlin                   99
Rome                     96
Madrid                   89
Greenland                86
Belgrade                 79
dtype: int64

I also want to look at polarity, so I'll only use English tweets.

(Sorry, Central/South Americans - my very rough method of filtering out American timezones gets rid of some of your timezones too. Let me know if there's a better way to do this.)


In [101]:
foreign_english_tz_df = foreign_tz_df[foreign_tz_df.lang == 'en']

Now we have a dataframe containing (mostly) world cities as time zones. Let's get the top cities by number of tweets for each candidate, then plot polarities.


In [102]:
foreign_tz_groupby = foreign_english_tz_df.groupby(['candidate', 'user_time_zone'])
top_foreign_tz_df = foreign_tz_groupby.filter(lambda group: len(group) > 40)

top_foreign_tz_groupby = top_foreign_tz_df.groupby(['user_time_zone', 'candidate'], as_index = False)

mean_influenced_polarities = top_foreign_tz_groupby.influenced_polarity.mean()

pivot = mean_influenced_polarities.pivot_table(
    index='user_time_zone', 
    columns='candidate', 
    values='influenced_polarity', 
    fill_value=0
)

plot = sns.heatmap(pivot)
plot.set_title('Influenced Polarity in Major Foreign Cities by Candidate', family='Ubuntu')
plot.set_ylabel('city', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)


Exercise for the reader: why is Rand Paul disliked in Athens? You can probably guess, but the actual tweets causing this are rather amusing.

Greco-libertarian relations aside, the data shows that London and Amsterdam are among the most influential cities, with the former leaning toward Jeb Bush and the latter about neutral.

In India, Clinton supporters reside in New Delhi while Chennai tweeters back Rand Paul. By contrast, in 2014, New Delhi constituents voted for the conservative Bharatiya Janata Party while Chennai voted for the more liberal All India Anna Dravida Munnetra Kazhagam - so there seems to be some kind of cultural difference between the voters of 2014 and the tweeters of today.

One last thing I found interesting: Athens is the only city where Bernie Sanders has the highest mean polarity of the four candidates. Could this have anything to do with the recent economic crisis, the 'no' vote on austerity, and Bernie's social-democratic leanings?

Finally, I'll look at specific geolocation (latitude and longitude) data. Since only about 750 out of 80,000 tweets had geolocation enabled, this data can't really be used for sentiment analysis, but we can still get a good idea of international spread.

First I'll plot everything on a world map, then break it up by candidate in the U.S.


In [106]:
df_place = df.dropna(subset=['place'])
mollweide = cartopy.crs.Mollweide()

plot = plt.axes(projection=mollweide)
plot.set_global()
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)

plot.scatter(
    list(df_place.longitude), 
    list(df_place.latitude), 
    transform=plate_carree, 
    zorder=2
)
plot.set_title('International Tweeters with Geolocation Enabled', family='Ubuntu')
plot.figure.set_size_inches(14, 9)



In [110]:
plot = plt.axes(projection=albers_equal_area)

plot.set_extent((-125, -66, 20, 50))

plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.add_feature(states_and_provinces, edgecolor='gray')
plot.add_feature(cartopy.feature.LAKES, facecolor="#00BCD4")

candidate_groupby = df_place.groupby('candidate', as_index = False)

colors = ['#1976d2', '#7cb342', '#f4511e', '#7b1fa2']
for index, (name, group) in enumerate(candidate_groupby):
    longitudes = group.longitude.values
    latitudes = group.latitude.values
    plot.scatter(
        longitudes, 
        latitudes, 
        transform=plate_carree, 
        color=colors[index], 
        label=name,
        zorder=2
    )
plot.set_title('U.S. Tweeters by Candidate', family='Ubuntu')
plt.legend(loc='lower left')
plot.figure.set_size_inches(12, 7)


As expected, U.S. tweeters are centered around L.A., the Bay Area, Chicago, New York, and Boston. Rand Paul and Bernie Sanders tweeters are more spread out over the country.

That's all I have for now.

If you found this interesting and are curious for more, I encourage you to download the dataset (or get your own dataset based on your interests) and share your findings.

Source code is at https://github.com/raj-kesavan/arrows, and I can be reached at raj.ksvn@gmail.com for any questions, comments, or criticism. Looking forward to hearing your feedback!