Hi, I'm Raj. For my internship this summer, I've been using data science and geospatial Python libraries like xray, numpy, rasterio, and cartopy. A week ago, I had a discussion about the relevance of Bernie Sanders among millennials - so I set out to get a rough idea by looking at recent tweets.
I don't explain any of the code in this document, but you can skip the code and just look at the results if you like. If you're interested in going further with this data, I've posted source code and the dataset at https://github.com/raj-kesavan/arrows.
If you have any comments or suggestions (on either the code or the analysis), please let me know at rajk@berkeley.edu. Enjoy!
First, I used Tweepy to pull down 20,000 tweets for each of Hillary Clinton, Bernie Sanders, Rand Paul, and Jeb Bush [retrieve_tweets.py].
I've also already done some calculations, specifically of polarity, subjectivity, influence, influenced polarity, and longitude and latitude (all explained later) [preprocess.py].
In [2]:
from arrows.preprocess import load_df
Just adding some imports and setting graph display options.
In [3]:
from textblob import TextBlob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import cartopy
pd.set_option('display.max_colwidth', 200)
matplotlib.style.use('ggplot')
sns.set_context('talk')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = [12.0, 8.0]
%matplotlib inline
Let's look at our data!
load_df loads the data in as a pandas.DataFrame, excellent for statistical analysis and graphing.
In [ ]:
df = load_df('arrows/data/results.csv')
In [11]:
df.info()
We'll be looking primarily at candidate, created_at, lang, place, user_followers_count, user_time_zone, polarity, influenced_polarity, and text.
In [15]:
df[['candidate', 'created_at', 'lang', 'place', 'user_followers_count',
'user_time_zone', 'polarity', 'influenced_polarity', 'text']].head(1)
Out[15]:
First I'll look at sentiment, calculated with TextBlob from the text column. Sentiment is composed of two values: polarity, a measure of how positive or negative a text is, and subjectivity. Polarity ranges from -1.0 to 1.0; subjectivity from 0.0 to 1.0.
In [16]:
TextBlob("Tear down this wall!").sentiment
Out[16]:
Unfortunately, it doesn't work too well on anything other than English.
In [17]:
TextBlob("Radix malorum est cupiditas.").sentiment
Out[17]:
TextBlob has a handy translate() function that uses Google Translate to take care of that for us, but we won't be using it here - tweets include a lot of slang and abbreviations that can't be translated very well.
In [6]:
sentence = TextBlob("Radix malorum est cupiditas.").translate()
print(sentence)
print(sentence.sentiment)
All right - let's figure out the most (positively) polarized English tweets.
In [19]:
english_df = df[df.lang == 'en']
english_df.sort_values('polarity', ascending = False).head(3)[['candidate', 'polarity', 'subjectivity', 'text']]
Out[19]:
Extrema don't mean much. We might get more interesting data with mean polarities for each candidate. Let's also look at influenced polarity, which takes into account the number of retweets and followers.
In [20]:
candidate_groupby = english_df.groupby('candidate')
candidate_groupby[['polarity', 'influence', 'influenced_polarity']].mean()
Out[20]:
So tweets about Jeb Bush, on average, aren't as positive as those about the other candidates, but the people tweeting about Bush get more retweets and have more followers.
I used the formula influence = sqrt(followers + 1) * sqrt(retweets + 1). You can experiment with different functions if you like [preprocess.py:influence].
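As a minimal sketch of that formula (the real implementation lives in preprocess.py:influence; this just restates the equation above):

```python
import math

def influence(followers, retweets):
    # influence = sqrt(followers + 1) * sqrt(retweets + 1), as above.
    # The +1 terms keep a tweet with zero followers or zero retweets
    # from having its score zeroed out entirely.
    return math.sqrt(followers + 1) * math.sqrt(retweets + 1)

# A user with 9999 followers whose tweet got 99 retweets:
influence(9999, 99)  # sqrt(10000) * sqrt(100) = 1000.0
```

The square roots damp the huge spread of follower counts, so a celebrity's tweet counts for more, but not a million times more.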
We can look at the most influential tweets about Jeb Bush to see what's up.
In [23]:
jeb = candidate_groupby.get_group('Jeb Bush')
jeb_influence = jeb.sort_values('influence', ascending = False)
jeb_influence[['influence', 'polarity', 'influenced_polarity', 'user_name', 'text', 'created_at']].head(5)
Out[23]:
Side note: you can see that sentiment analysis isn't perfect - the last tweet is certainly negative toward Jeb Bush, but it was actually assigned a positive polarity. Over a large number of tweets, though, sentiment analysis is more meaningful.
As to the high influence of tweets about Bush: it looks like Donald Trump (someone with a lot of followers) has been tweeting about Bush far more than about the other candidates - one possible reason for Jeb's greater influenced_polarity.
In [26]:
df[df.user_name == 'Donald J. Trump'].groupby('candidate').size()
Out[26]:
Looks like our favorite toupéed candidate hasn't even been tweeting about anyone else!
What else can we do? We know the language each tweet was (tweeted?) in.
In [27]:
language_groupby = df.groupby(['candidate', 'lang'])
language_groupby.size()
Out[27]:
That's a lot of languages! Let's try plotting to get a better idea, but first, I'll remove smaller language/candidate groups.
By the way, each lang value is an IANA language tag - you can look them up at https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.
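If you'd rather not read raw tags off the plot axes, a small lookup table (names taken from the registry; 'und' is Twitter's marker for an undetermined language - this helper is just an illustration, not part of the analysis code) makes labels friendlier:

```python
# Names from the IANA language subtag registry; 'und' means undetermined.
LANG_NAMES = {
    'en': 'English',
    'es': 'Spanish',
    'pt': 'Portuguese',
    'fr': 'French',
    'de': 'German',
    'und': 'undetermined',
}

def lang_name(tag):
    # Fall back to the raw tag for anything not in our small table.
    return LANG_NAMES.get(tag, tag)

lang_name('pt')  # 'Portuguese'
lang_name('tl')  # 'tl' (not in the table, returned as-is)
```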
In [28]:
largest_languages = language_groupby.filter(lambda group: len(group) > 10)
I'll also remove English, since it would just dwarf all the other languages.
In [40]:
non_english = largest_languages[largest_languages.lang != 'en']
non_english_groupby = non_english.groupby(['lang', 'candidate'], as_index = False)
sizes = non_english_groupby.text.agg(np.size)
sizes = sizes.rename(columns={'text': 'count'})
sizes_pivot = sizes.pivot_table(index='lang', columns='candidate', values='count', fill_value=0)
plot = sns.heatmap(sizes_pivot)
plot.set_title('Number of non-English Tweets by Candidate', family='Ubuntu')
plot.set_ylabel('language code', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Looks like Spanish and Portuguese speakers mostly tweet about Jeb Bush, while Francophones lean more liberal, and Clinton tweeters span the largest range of languages.
We also have the time-of-tweet information - I'll plot influenced polarity over time for each candidate, resampling the influenced_polarity values to one-hour intervals to get a smoother graph.
In [46]:
mean_polarities = df.groupby(['candidate', 'created_at']).influenced_polarity.mean()
plot = mean_polarities.unstack('candidate').resample('60min').mean().plot()
plot.set_title('Influenced Polarity over Time by Candidate', family='Ubuntu')
plot.set_ylabel('influenced polarity', family='Ubuntu')
plot.set_xlabel('time', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Since I only took the last 20,000 tweets for each candidate, the tweets about Clinton (a candidate with many, many tweeters) cover a much shorter timespan than those about Rand Paul.
But we can still analyze the data in terms of hour-of-day. I'd like to know when tweeters in each language tweet each day, and I'm going to use percentages instead of raw number of tweets so I can compare across different languages easily.
By the way, the times in the dataframe are in UTC.
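Since the timestamps are UTC, converting to a local zone helps when reasoning about local tweeting hours - this matters for the Portuguese spike later on. A throwaway example (not part of the analysis):

```python
import pandas as pd

# 02:00 UTC on July 6, 2015 is 11 pm the previous evening in Brasilia
# (America/Sao_Paulo is UTC-3 in July, with no daylight saving).
ts = pd.Timestamp('2015-07-06 02:00:00', tz='UTC')
local = ts.tz_convert('America/Sao_Paulo')
print(local.hour)  # 23
```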
In [84]:
language_sizes = df.groupby('lang').size()
threshold = language_sizes.quantile(.75)
top_languages_df = language_sizes[language_sizes > threshold]
top_languages = set(top_languages_df.index) - {'und'}
top_languages
Out[84]:
In [85]:
df['hour'] = df.created_at.apply(lambda datetime: datetime.hour)
for language_code in top_languages:
    lang_df = df[df.lang == language_code]
    normalized = lang_df.groupby('hour').size() / lang_df.lang.count()
    plot = normalized.plot(label = language_code)
plot.set_title('Tweet Frequency in non-English Languages by Hour of Day', family='Ubuntu')
plot.set_ylabel('normalized frequency', family='Ubuntu')
plot.set_xlabel('hour of day (UTC)', family='Ubuntu')
plot.legend()
plot.figure.set_size_inches(12, 7)
Note that English, French, and Spanish are significantly flatter than the other languages - this means that there's a large spread of speakers all over the globe.
But why is Portuguese spiking at 11pm Brasilia time / 3 am Lisbon time? Let's find out! My first guess was that maybe there's a single person making a ton of posts at that time.
In [88]:
df_of_interest = df[(df.hour == 2) & (df.lang == 'pt')]
print('Number of tweets:', df_of_interest.text.count())
print('Number of unique users:', df_of_interest.user_name.unique().size)
So that's not it. Maybe there was a major event everyone was retweeting?
In [89]:
df_of_interest.text.head(25).unique()
Out[89]:
Seems to be a lot of these 'Jeb Bush diz que foi atingido...' tweets. How many? We can't just count unique tweets, since they all differ slightly, but we can check for a large-enough common substring.
In [90]:
df_of_interest[df_of_interest.text.str.contains('Jeb Bush diz que foi atingido')].text.count()
Out[90]:
That's it!
Looks like there was a news article from a Brazilian website (http://jconline.ne10.uol.com.br/canal/mundo/internacional/noticia/2015/07/05/jeb-bush-diz-que-foi-atingido-por-criticas-de-trump-a-mexicanos-188801.php) that happened to get a lot of retweets at that time period.
A similar article in English is at http://www.nytimes.com/politics/first-draft/2015/07/04/an-angry-jeb-bush-says-he-takes-donald-trumps-remarks-personally/.
Since languages can span across different countries, we might get results if we search by location, rather than just language.
We don't have very specific geolocation information other than timezone, so let's try plotting candidate sentiment over the four major U.S. timezones (Los Angeles, Denver, Chicago, and New York). This is also a good opportunity to look at a geographical map.
In [97]:
tz_df = english_df.dropna(subset=['user_time_zone'])
us_tz_df = tz_df[tz_df.user_time_zone.str.contains("US & Canada")]
us_tz_candidate_groupby = us_tz_df.groupby(['candidate', 'user_time_zone'])
us_tz_candidate_groupby.influenced_polarity.mean()
Out[97]:
That's our raw data: now to plot it on a map. I got the timezone Shapefile from http://efele.net/maps/tz/world/. First, I read in the Shapefile with Cartopy.
In [95]:
tz_shapes = cartopy.io.shapereader.Reader('arrows/world/tz_world_mp.shp')
tz_records = list(tz_shapes.records())
tz_translator = {
    'Eastern Time (US & Canada)': 'America/New_York',
    'Central Time (US & Canada)': 'America/Chicago',
    'Mountain Time (US & Canada)': 'America/Denver',
    'Pacific Time (US & Canada)': 'America/Los_Angeles',
}
american_tz_records = {
    tz_name: next(filter(lambda record: record.attributes['TZID'] == tz_id, tz_records))
    for tz_name, tz_id in tz_translator.items()
}
Next, I have to choose a projection and plot it (again using Cartopy). The Albers equal-area projection is a good choice for maps of the U.S. I'll also download some feature sets from the Natural Earth dataset to display state borders.
In [98]:
albers_equal_area = cartopy.crs.AlbersEqualArea(-95, 35)
plate_carree = cartopy.crs.PlateCarree()
states_and_provinces = cartopy.feature.NaturalEarthFeature(
    category='cultural',
    name='admin_1_states_provinces_lines',
    scale='50m',
    facecolor='none'
)
cmaps = [matplotlib.cm.Blues, matplotlib.cm.Greens,
         matplotlib.cm.Reds, matplotlib.cm.Purples]
norm = matplotlib.colors.Normalize(vmin=0, vmax=30)
candidates = df['candidate'].unique()
plt.rcParams['figure.figsize'] = [6.0, 4.0]
for index, candidate in enumerate(candidates):
    plt.figure()
    plot = plt.axes(projection=albers_equal_area)
    plot.set_extent((-125, -66, 20, 50))
    plot.add_feature(cartopy.feature.LAND)
    plot.add_feature(cartopy.feature.COASTLINE)
    plot.add_feature(cartopy.feature.BORDERS)
    plot.add_feature(states_and_provinces, edgecolor='gray')
    plot.add_feature(cartopy.feature.LAKES, facecolor="#00BCD4")
    for tz_name, record in american_tz_records.items():
        tz_specific_df = us_tz_df[us_tz_df.user_time_zone == tz_name]
        tz_candidate_specific_df = tz_specific_df[tz_specific_df.candidate == candidate]
        mean_polarity = tz_candidate_specific_df.influenced_polarity.mean()
        plot.add_geometries(
            [record.geometry],
            crs=plate_carree,
            color=cmaps[index](norm(mean_polarity)),
            alpha=.8
        )
    plot.set_title('Influenced Polarity toward {} by U.S. Timezone'.format(candidate), family='Ubuntu')
    plot.figure.set_size_inches(6, 3.5)
    plt.show()
    print()
My friend Gabriel Wang pointed out that U.S. timezones other than Pacific don't mean much since each timezone covers both blue and red states, but the data is still interesting.
As expected, midwestern states lean toward Jeb Bush. I wasn't expecting Jeb Bush's highest-polarity tweets to come from the East; this is probably Donald Trump (New York, New York) messing with our data again.
In a few months I'll look at these statistics with the latest tweets and compare.
What are tweeters outside the U.S. saying about our candidates?
Outside of the U.S., if someone is in a major city, the timezone is often that city itself. Here are the top 25 non-American timezones in our dataframe, by number of tweets.
In [100]:
american_timezones = ('US & Canada|Canada|Arizona|America|Hawaii|Indiana|Alaska'
'|New_York|Chicago|Los_Angeles|Detroit|CST|PST|EST|MST')
foreign_tz_df = tz_df[~tz_df.user_time_zone.str.contains(american_timezones)]
foreign_tz_groupby = foreign_tz_df.groupby('user_time_zone')
foreign_tz_groupby.size().sort_values(ascending = False).head(25)
Out[100]:
I also want to look at polarity, so I'll only use English tweets.
(Sorry, Central/South Americans - my very rough method of filtering out American timezones gets rid of some of your timezones too. Let me know if there's a better way to do this.)
In [101]:
foreign_english_tz_df = foreign_tz_df[foreign_tz_df.lang == 'en']
Now we have a dataframe containing (mostly) world cities as time zones. Let's get the top cities by number of tweets for each candidate, then plot polarities.
In [102]:
foreign_tz_groupby = foreign_english_tz_df.groupby(['candidate', 'user_time_zone'])
top_foreign_tz_df = foreign_tz_groupby.filter(lambda group: len(group) > 40)
top_foreign_tz_groupby = top_foreign_tz_df.groupby(['user_time_zone', 'candidate'], as_index = False)
mean_influenced_polarities = top_foreign_tz_groupby.influenced_polarity.mean()
pivot = mean_influenced_polarities.pivot_table(
    index='user_time_zone',
    columns='candidate',
    values='influenced_polarity',
    fill_value=0
)
plot = sns.heatmap(pivot)
plot.set_title('Influenced Polarity in Major Foreign Cities by Candidate', family='Ubuntu')
plot.set_ylabel('city', family='Ubuntu')
plot.set_xlabel('candidate', family='Ubuntu')
plot.figure.set_size_inches(12, 7)
Exercise for the reader: why is Rand Paul disliked in Athens? You can probably guess, but the actual tweets causing this are rather amusing.
Greco-libertarian relations aside, the data shows that London and Amsterdam are among the most influential of cities, with the former leaning toward Jeb Bush and the latter about neutral.
In India, Clinton-supporters reside in New Delhi while Chennai tweeters back Rand Paul. By contrast, in 2014, New Delhi constituents voted for the conservative Bharatiya Janata Party while Chennai voted for the more liberal All India Anna Dravida Munnetra Kazhagam Party - so there seems to be some kind of cultural difference between the voters of 2014 and the tweeters of today.
Last thing I thought was interesting: Athens has the highest mean polarity for Bernie Sanders, the only city for which this is the case. Could this have anything to do with the recent economic crisis, 'no' vote for austerity, and Bernie's social democratic tendencies?
Finally, I'll look at specific geolocation (latitude and longitude) data. Since only about 750 out of 80,000 tweets had geolocation enabled, this data can't really be used for sentiment analysis, but we can still get a good idea of international spread.
First I'll plot everything on a world map, then break it up by candidate in the U.S.
In [106]:
df_place = df.dropna(subset=['place'])
mollweide = cartopy.crs.Mollweide()
plot = plt.axes(projection=mollweide)
plot.set_global()
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.scatter(
    list(df_place.longitude),
    list(df_place.latitude),
    transform=plate_carree,
    zorder=2
)
plot.set_title('International Tweeters with Geolocation Enabled', family='Ubuntu')
plot.figure.set_size_inches(14, 9)
In [110]:
plot = plt.axes(projection=albers_equal_area)
plot.set_extent((-125, -66, 20, 50))
plot.add_feature(cartopy.feature.LAND)
plot.add_feature(cartopy.feature.COASTLINE)
plot.add_feature(cartopy.feature.BORDERS)
plot.add_feature(states_and_provinces, edgecolor='gray')
plot.add_feature(cartopy.feature.LAKES, facecolor="#00BCD4")
candidate_groupby = df_place.groupby('candidate', as_index = False)
colors = ['#1976d2', '#7cb342', '#f4511e', '#7b1fa2']
for index, (name, group) in enumerate(candidate_groupby):
    longitudes = group.longitude.values
    latitudes = group.latitude.values
    plot.scatter(
        longitudes,
        latitudes,
        transform=plate_carree,
        color=colors[index],
        label=name,
        zorder=2
    )
plot.set_title('U.S. Tweeters by Candidate', family='Ubuntu')
plt.legend(loc='lower left')
plot.figure.set_size_inches(12, 7)
As expected, U.S. tweeters are centered around L.A., the Bay Area, Chicago, New York, and Boston. Rand Paul and Bernie Sanders tweeters are more spread out over the country.
That's all I have for now.
If you found this interesting and are curious for more, I encourage you to download the dataset (or get your own dataset based on your interests) and share your findings.
Source code is at https://github.com/raj-kesavan/arrows, and I can be reached at raj.ksvn@gmail.com for any questions, comments, or criticism. Looking forward to hearing your feedback!