Popular figures often have help managing their media presence. In the 2016 election, Twitter was an important communication medium for every major candidate, and many of the tweets posted by the top two candidates were actually written by their aides. You might wonder how this affected the content or language of the tweets.
In this assignment, we'll look at some of the patterns in tweets by the top two candidates, Clinton and Trump. We'll start with Clinton.
Along the way, you'll get a first look at Pandas. Pandas is a Python package that provides a DataFrame data structure similar to the datascience package's Table, which you might remember from Data 8. DataFrames are a bit harder to use than Tables, but they provide more advanced functionality and are a standard tool for data analysis in Python.
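If you remember Tables, the correspondence is fairly direct. Here's a minimal sketch with made-up data (example_df and its columns are just for illustration, not part of the assignment):
In [ ]:
# A tiny DataFrame with made-up data, for illustration only:
import pandas as pd
example_df = pd.DataFrame({'word': ['the', 'of', 'and'],
                           'count': [120, 87, 66]})
example_df['count']   # selecting a column, much like a Table column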
Some of the analysis in this assignment is based on a post by David Robinson. Feel free to read the post, but do not copy from it! David's post is written in the R programming language, which is a favorite of many data analysts, especially academic statisticians. Once you're done with your analysis, you may find it interesting to see whether R is easier to use for this task.
To start the assignment, run the cell below to set up some imports and the automatic tests.
In [ ]:
import math
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('hw2.ok')
There are instructions on using tweepy here, but we will give you example code.
Twitter requires you to have authentication keys to access their API. To get your keys, you'll have to sign up as a Twitter developer. Follow these instructions to get your keys.
If someone has your authentication keys, they can access your Twitter account and post as you! So don't give them to anyone, and don't write them down in this notebook. The usual way to store sensitive information like this is to put it in a separate file and read it programmatically. That way, you can share the rest of your code without sharing your keys. That's why we're asking you to put your keys in keys.json for this assignment.
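For instance, reading the keys back takes just a couple of lines with the json module. This is a sketch; the file's structure matches the docstrings in question 2 below, and the values come from your own developer account:
In [ ]:
import json
# keys.json should look like this (with real values filled in):
# {"consumer_key": "...", "consumer_secret": "...",
#  "access_token": "...", "access_token_secret": "..."}
with open("keys.json") as f:
    keys = json.load(f)  # keys is now a dict mapping key names to strings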
Twitter limits developers to a certain rate of requests for data. If you make too many requests in a short period of time, you'll have to wait awhile (around 15 minutes) before you can make more. So carefully follow the code examples you see and don't rerun cells without thinking. Instead, always save the data you've collected to a file. We've provided templates to help you do that.
In the example below, we have loaded some tweets by @BerkeleyData. Run it, inspect the output, and read the code.
In [ ]:
ds_tweets_save_path = "BerkeleyData_recent_tweets.pkl"
from pathlib import Path
# Guarding against attempts to download the data multiple
# times:
if not Path(ds_tweets_save_path).is_file():
    import json
    # Loading your keys from keys.json (which you should have filled
    # in in question 1):
    with open("keys.json") as f:
        keys = json.load(f)

    import tweepy

    # Authenticating:
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    api = tweepy.API(auth)

    # Getting as many recent tweets by @BerkeleyData as Twitter will let us have:
    example_tweets = list(tweepy.Cursor(api.user_timeline, id="BerkeleyData").items())

    # Saving the tweets to a file as "pickled" objects:
    with open(ds_tweets_save_path, "wb") as f:
        import pickle
        pickle.dump(example_tweets, f)

# Re-loading the results:
with open(ds_tweets_save_path, "rb") as f:
    import pickle
    example_tweets = pickle.load(f)
In [ ]:
# Looking at one tweet object, which has type Status:
example_tweets[0]
# You can try something like this:
# import pprint; pprint.pprint(vars(example_tweets[0]))
# ...to get a more easily-readable view.
Write code to download all the recent tweets by Hillary Clinton (@HillaryClinton). Follow our example code if you wish. Write your code in the form of four functions matching the documentation provided. (You may define additional functions as helpers.) Once you've written your functions, you can run the subsequent cell to download the tweets.
In [ ]:
def load_keys(path):
    """Loads your Twitter authentication keys from a file on disk.

    Args:
        path (str): The path to your key file.  The file should
            be in JSON format and look like this (but filled in):
            {
                "consumer_key": "<your Consumer Key here>",
                "consumer_secret": "<your Consumer Secret here>",
                "access_token": "<your Access Token here>",
                "access_token_secret": "<your Access Token Secret here>"
            }

    Returns:
        dict: A dictionary mapping key names (like "consumer_key") to
            key values."""
    import json
    with open(path) as f:
        return json.load(f)

def download_recent_tweets_by_user(user_account_name, keys):
    """Downloads tweets by one Twitter user.

    Args:
        user_account_name (str): The name of the Twitter account
            whose tweets will be downloaded.
        keys (dict): A Python dictionary with Twitter authentication
            keys (strings), like this (but filled in):
            {
                "consumer_key": "<your Consumer Key here>",
                "consumer_secret": "<your Consumer Secret here>",
                "access_token": "<your Access Token here>",
                "access_token_secret": "<your Access Token Secret here>"
            }

    Returns:
        list: A list of Status objects, each representing one tweet."""
    import tweepy
    # Authenticating:
    auth = tweepy.OAuthHandler(keys["consumer_key"], keys["consumer_secret"])
    auth.set_access_token(keys["access_token"], keys["access_token_secret"])
    api = tweepy.API(auth)
    return list(tweepy.Cursor(api.user_timeline, id=user_account_name).items())

def save_tweets(tweets, path):
    """Saves a list of tweets to a file in the local filesystem.

    This function makes no guarantee about the format of the saved
    tweets, **except** that calling load_tweets(path) after
    save_tweets(tweets, path) will produce the same list of tweets
    and that only the file at the given path is used to store the
    tweets.  (That means you can implement this function however
    you want, as long as saving and loading works!)

    Args:
        tweets (list): A list of tweet objects (of type Status) to
            be saved.
        path (str): The place where the tweets will be saved.

    Returns:
        None"""
    with open(path, "wb") as f:
        import pickle
        pickle.dump(tweets, f)

def load_tweets(path):
    """Loads tweets that have previously been saved.

    Calling load_tweets(path) after save_tweets(tweets, path)
    will produce the same list of tweets.

    Args:
        path (str): The place where the tweets were saved.

    Returns:
        list: A list of Status objects, each representing one tweet."""
    with open(path, "rb") as f:
        import pickle
        return pickle.load(f)
In [ ]:
# When you are done, run this cell to load @HillaryClinton's tweets.
# Note the function get_tweets_with_cache. You may find it useful
# later.
def get_tweets_with_cache(user_account_name, keys_path):
    """Get recent tweets from one user, loading from a disk cache if available.

    The first time you call this function, it will download tweets by
    a user.  Subsequent calls will not re-download the tweets; instead
    they'll load the tweets from a save file in your local filesystem.
    All this is done using the functions you defined in the previous cell.
    This has benefits and drawbacks that often appear when you cache data:

    +: Using this function will prevent extraneous usage of the Twitter API.
    +: You will get your data much faster after the first time it's called.
    -: If you really want to re-download the tweets (say, to get newer ones,
       or because you screwed up something in the previous cell and your
       tweets aren't what you wanted), you'll have to find the save file
       (which will look like <something>_recent_tweets.pkl) and delete it.

    Args:
        user_account_name (str): The Twitter handle of a user, without the @.
        keys_path (str): The path to a JSON keys file in your filesystem.
    """
    save_path = "{}_recent_tweets.pkl".format(user_account_name)
    from pathlib import Path
    if not Path(save_path).is_file():
        keys = load_keys(keys_path)
        tweets = download_recent_tweets_by_user(user_account_name, keys)
        save_tweets(tweets, save_path)
    return load_tweets(save_path)
In [ ]:
clinton_tweets = get_tweets_with_cache("HillaryClinton", "keys.json")
In [ ]:
# If everything is working properly, this should print out
# a Status object (a single tweet). clinton_tweets should
# contain around 3000 tweets.
clinton_tweets[0]
In [ ]:
_ = ok.grade('q02')
_ = ok.backup()
Twitter gives us a lot of information about each tweet, not just its text. You can read the full documentation here. Look at one tweet to get a sense of the information we have available.
Which fields contain: (1) the text of the tweet, (2) the time when the tweet was posted, and (3) the source (device or application) from which the tweet was posted?
To answer the question, write functions that extract each field from a tweet. (Each one should take a single Status object as its argument.)
In [ ]:
def extract_text(tweet):
    return tweet.text #SOLUTION

def extract_time(tweet):
    return tweet.created_at #SOLUTION

def extract_source(tweet):
    return tweet.source #SOLUTION
In [ ]:
_ = ok.grade('q03')
_ = ok.backup()
SOLUTION: Some possible answers: retweet_count or favorite_count might be useful if we think tweets by the candidate herself are retweeted or favorited more often. coordinates might be useful if we can identify some pattern in the aides' or candidate's locations (for example, if the aides always tweet from the same campaign office building, which Hillary rarely visits). quoted_status might be useful if aides are more likely to quote other tweets than the candidate herself.
JSON (and the Status object, which is just Tweepy's translation of the JSON produced by the Twitter API to a Python object) is nice for transmitting data, but it's not ideal for analysis. The data will be easier to work with if we put them in a table.
To create an empty table in Pandas, write:
In [ ]:
import pandas as pd
df = pd.DataFrame()
(pd is the standard abbreviation for Pandas.)
Now let's make a table with useful information in it. To add a column to a DataFrame called df, write:
df['column_name'] = some_list_or_array
(This page is a useful reference for many of the basic operations in Pandas. You don't need to read it now, but it might be helpful if you get stuck.)
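For example, here is a small sketch of that pattern with made-up data (example_df, fruit, and price are illustrative names, not part of the assignment):
In [ ]:
# Build a DataFrame one column at a time, from lists:
import pandas as pd
example_df = pd.DataFrame()
example_df['fruit'] = ['apple', 'banana', 'cherry']
example_df['price'] = [1.00, 0.25, 3.50]
example_df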
Write a function called make_dataframe. It should take as its argument a list of tweets like clinton_tweets and return a Pandas DataFrame. The DataFrame should contain columns for all the fields in question 3 and any fields you listed in question 4. Use the field names as the names of the corresponding columns.
In [ ]:
def make_dataframe(tweets):
    """Make a DataFrame from a list of tweets, with a few relevant fields.

    Args:
        tweets (list): A list of tweets, each one a Status object.

    Returns:
        DataFrame: A Pandas DataFrame containing one row for each element
            of tweets and one column for each relevant field."""
    df = pd.DataFrame() #SOLUTION
    df['text'] = [extract_text(t) for t in tweets] #SOLUTION
    df['created_at'] = [extract_time(t) for t in tweets] #SOLUTION
    df['source'] = [extract_source(t) for t in tweets] #SOLUTION
    return df
Now you can run the next line to make your DataFrame.
In [ ]:
clinton_df = make_dataframe(clinton_tweets)
In [ ]:
# The next line causes Pandas to display all the characters
# from each tweet when the table is printed, for more
# convenient reading. Comment it out if you don't want it.
pd.set_option('display.max_colwidth', 150)
clinton_df.head()
In [ ]:
_ = ok.grade('q05')
_ = ok.backup()
Create a plot showing how many tweets came from each kind of source. For a real challenge, try using the Pandas documentation and Google to figure out how to do this. Otherwise, hints are provided.
Hint: Start by grouping the data by source. df['source'].value_counts() will create an object called a Series (which is like a table that contains exactly 2 columns, where one column is called the index). You can create a version of that Series that's sorted by source (in this case, in alphabetical order) by calling sort_index() on it.
Hint 2: To generate a bar plot from a Series s, call s.plot.barh(). You can also use matplotlib's plt.barh, but it's a little complicated to use.
In [ ]:
clinton_df['source'].value_counts().sort_index().plot.barh(); #SOLUTION
You should find that most tweets come from TweetDeck.
Filter clinton_df to examine some tweets from TweetDeck and a few from the next-most-used platform. From examining only a few tweets (say, 10 from each category), can you tell whether Clinton's personal tweets are limited to one platform?
Hint: If df is a DataFrame and filter_array is an array of booleans of the same length, then df[filter_array] is a new DataFrame containing only the rows in df corresponding to True values in filter_array.
In [ ]:
# Do your analysis, then write your conclusions in a brief comment.
tweetdeck = clinton_df[clinton_df['source'] == 'TweetDeck']
twc = clinton_df[clinton_df['source'] == 'Twitter Web Client']

import numpy as np

def rounded_linspace(start, stop, count):
    # Evenly spaced integer indices into the range [start, stop):
    return np.linspace(start, stop, count, endpoint=False).astype(int)

print(tweetdeck.iloc[rounded_linspace(0, tweetdeck.shape[0], 10)]['text'])
print(twc.iloc[rounded_linspace(0, twc.shape[0], 10)]['text'])

# It does look like Twitter Web Client is used more for retweeting,
# but it's not obvious which tweets are by Hillary.
Check Hillary Clinton's Twitter page. It mentions an easy way to identify tweets by the candidate herself. All other tweets are by her aides.
Write a function called is_clinton that takes a tweet (a Status object) as its argument and returns True for personal tweets by Clinton and False for tweets by her aides. Use your function to create a column called is_personal in clinton_df.
Hint: You might find the string method endswith helpful.
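For example, endswith checks whether a string ends with a given suffix (the tweet text below is made up; the signature to look for is the one described on Clinton's Twitter page):
In [ ]:
# endswith returns True if the string ends with the given suffix:
"Thanks for a great evening! -H".endswith("-H")   # made-up example text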
In [ ]:
def is_clinton(tweet):
    """Distinguishes between tweets by Clinton and tweets by her aides.

    Args:
        tweet (Status): One tweet.

    Returns:
        bool: True if the tweet is written by Clinton herself."""
    return extract_text(tweet).endswith("-H") #SOLUTION

clinton_df['is_personal'] = [is_clinton(t) for t in clinton_tweets] #SOLUTION
Now we have identified Clinton's personal tweets. Let us return to our analysis of sources and see if there was any pattern we could have found.
You may recall that Tables from Data 8 have a method called pivot, which is useful for cross-classifying a dataset on two categorical attributes. DataFrames support a more complicated version of pivoting. The cell below pivots clinton_df for you.
In [ ]:
# This cell is filled in for you; just run it and examine the output.
def pivot_count(df, vertical_column, horizontal_column):
    """Cross-classifies df on two columns."""
    pivoted = pd.pivot_table(df[[vertical_column, horizontal_column]],
                             index=[vertical_column], columns=[horizontal_column],
                             aggfunc=len, fill_value=0)
    return pivoted.rename(columns={False: "False", True: "True"})

clinton_pivoted = pivot_count(clinton_df, 'source', 'is_personal')
clinton_pivoted
Do Clinton and her aides have different "signatures" of tweet sources? That is, does Clinton tweet from each source at roughly the same relative frequency as her aides? It's a little hard to tell from the pivoted table alone.
In [ ]:
clinton_pivoted["aides proportion"] = clinton_pivoted['False'] / sum(clinton_pivoted['False'])
clinton_pivoted["clinton proportion"] = clinton_pivoted['True'] / sum(clinton_pivoted['True'])
clinton_pivoted[["aides proportion", "clinton proportion"]].plot.barh();
You should see that there are some differences, but they aren't large. Do we need to worry that the differences (or lack thereof) are just "due to chance"?
Statistician Ani argues as follows:
"The tweets we see are not a random sample from anything. We have simply gathered every tweet by @HillaryClinton from the last several months. It is therefore meaningless to compute, for example, a confidence interval for the rate at which Clinton used TweetDeck. We have calculated exactly that rate from the data we have."
Statistician Belinda responds:
"We are interested in whether Clinton and her aides behave differently in general with respect to Twitter client usage in a way that we could use to identify their tweets. It's plausible to imagine that the tweets we see are a random sample from a huge unobserved population of all the tweets Clinton and her aides might send. We must worry about error due to random chance when we draw conclusions about this population using only the data we have available."
SOLUTION: Here is an argument for Belinda's position. Imagine that Clinton had tweeted only 5 times. Then we would probably not think we could come to a valid conclusion about her behavior patterns. So there is a distinction between the data and an underlying parameter that we're trying to learn about. However, this does not mean it's reasonable to use methods (like the simple bootstrap) that assume the data are a simple random sample from the population we're interested in.
Assume you are convinced by Belinda's argument. Perform a statistical test of the null hypothesis that the Clinton and aide tweets' sources are all independent samples from the same distribution (that is, that the differences we observe are "due to chance"). Briefly describe the test methodology and report your results.
Hint: If you need a refresher, this section of the Data 8 textbook from Fall 2016 covered this kind of hypothesis test.
Hint 2: Feel free to use datascience.Table to answer this question. However, it will be advantageous to learn how to do it with numpy alone. In our solution, we used some numpy functions you might not be aware of: np.append, np.random.permutation, np.bincount, and np.count_nonzero. We have provided the function expand_counts, which should help you solve a tricky problem that will arise.
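If these numpy functions are new to you, here is a quick demonstration of each on small made-up arrays:
In [ ]:
import numpy as np
print(np.append(np.array([0, 1, 1]), np.array([2, 2])))  # [0 1 1 2 2]
print(np.random.permutation([0, 1, 1, 2, 2]))            # a shuffled copy
print(np.bincount([0, 1, 1, 3], minlength=5))            # [1 2 0 1 0]
print(np.count_nonzero(np.array([0, 3, 0, 2])))          # 2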
In [ ]:
# Use this cell to perform your hypothesis test.
def expand_counts(source_counts):
    """Blow up a list/array of counts of categories into an array of
    individuals matching the counts.  For example, we can generate
    a list of 2 individuals of type 0, 4 of type 1, and 1 of type 3
    as follows:

    >>> expand_counts([2, 4, 0, 1])
    array([0, 0, 1, 1, 1, 1, 3])"""
    return np.repeat(np.arange(len(source_counts)), source_counts)

def tvd(a, b):
    """Total variation distance between two distributions given as counts."""
    return .5*sum(np.abs(a/sum(a) - b/sum(b)))

def test_difference_in_distributions(sample0, sample1, num_trials):
    """Permutation test for the null hypothesis that two samples of
    categorical data (each an array of per-category counts) come from
    the same distribution.  Returns an approximate P-value."""
    num_sources = len(sample0)
    individuals0 = expand_counts(sample0)
    individuals1 = expand_counts(sample1)
    count0 = len(individuals0)
    count1 = len(individuals1)
    all_individuals = np.append(individuals0, individuals1)

    def simulate_under_null():
        # Pool the individuals, shuffle, split into two groups of the
        # original sizes, and compute the TVD between the groups.
        permuted_pool = np.random.permutation(all_individuals)
        simulated_sample0 = np.bincount(permuted_pool[:count0], minlength=num_sources)
        simulated_sample1 = np.bincount(permuted_pool[count0:], minlength=num_sources)
        return tvd(simulated_sample0, simulated_sample1)

    actual_tvd = tvd(sample0, sample1)
    simulated_tvds = np.array([simulate_under_null() for _ in range(num_trials)])
    return np.count_nonzero(simulated_tvds > actual_tvd) / num_trials

p_value = test_difference_in_distributions(clinton_pivoted['True'], clinton_pivoted['False'], 100000)
print("P-value: {:.6f}".format(p_value))
SOLUTION: We simulated many times under the null hypothesis by pooling the data and permuting the sources. We found a P-value around .04%, so we have very strong evidence against the null hypothesis that Clinton and her aides tweet from the same distribution of sources. It's important to note that strong evidence that the difference is not zero (which we have found) is very different from evidence that the difference is large (which we have not found). The next question demonstrates this.
Suppose you sample a random @HillaryClinton tweet and find that it is from the Twitter Web Client. Your visualization in question 9 should show you that Clinton tweets from this source about twice as frequently as her aides do, so you might imagine it's reasonable to predict that the tweet is by Clinton. But what is the probability that the tweet is by Clinton? (You should find a relatively small number. Clinton's aides tweet much more than she does. So even though there is a difference in their tweet source usage, it would be difficult to classify tweets this way.)
Hint: Bayes' rule is covered in this section of the Data 8 textbook.
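In symbols, the quantity to compute is a conditional probability; writing WC for the Twitter Web Client source,

$$P(\text{Clinton} \mid \text{WC}) = \frac{\#\{\text{Clinton tweets from WC}\}}{\#\{\text{all tweets from WC}\}}.$$

Both counts can be read off a single row of clinton_pivoted, which is what the cell below does.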
In [ ]:
probability_clinton = clinton_pivoted.loc['Twitter Web Client']['True'] / sum(clinton_pivoted.loc['Twitter Web Client']) #SOLUTION
probability_clinton
In [ ]:
_ = ok.grade('q12')
_ = ok.backup()
Our results so far aren't Earth-shattering. Clinton uses different Twitter clients at slightly different rates than her aides.
Now that we've categorized the tweets, we could of course investigate their contents. A manual analysis (also known as "reading") might be interesting, but it is beyond the scope of this course. And we'll have to wait a few more weeks before we can use a computer to help with such an analysis.
Instead, let's repeat our analysis for Donald Trump.
In [ ]:
trump_tweets = get_tweets_with_cache("realDonaldTrump", "keys.json") #SOLUTION
trump_df = make_dataframe(trump_tweets) #SOLUTION
In [ ]:
trump_df.head()
In [ ]:
trump_df['source'].value_counts().sort_index().plot.barh(); #SOLUTION
You should find two major sources of tweets.
It is reported (for example, in this Gawker article) that Trump himself uses an Android phone (a Samsung Galaxy), while his aides use iPhones. But Trump has not confirmed this. Also, he has reportedly switched phones since his inauguration! How might we verify whether this is a way to identify his tweets?
A retweet is a tweet that replies to (or simply repeats) a tweet by another user. Twitter provides several mechanisms for this, as explained in this article. However, Trump has an unusual way of retweeting: He simply adds the original sender's name to the original message, puts everything in quotes, and then adds his own comments at the end.
For example, this is a tweet by user @melissa7889:
@realDonaldTrump @JRACKER33 you should run for president!
Here is Trump's retweet of this, from 2013:
"@melissa7889: @realDonaldTrump @JRACKER33 you should run for president!" Thanks,very nice!
Since 2015, the usual way of retweeting this message, and the method used by Trump's aides (but not Trump himself), would have been:
Thanks,very nice! RT @melissa7889: @realDonaldTrump @JRACKER33 you should run for president!
Write a function to identify Trump-style retweets, and another function to identify aide-style retweets. Then, use them to create a function called tweet_type that takes a tweet's text as its argument and returns "Trump retweet", "Aide retweet", or "Not a retweet" as appropriate. Use your function to add a 'tweet_type' column to trump_df.
Hint: Try the string method startswith and the Python keyword in.
In [ ]:
def is_trump_style_retweet(tweet_text):
    """Returns True if tweet_text looks like a Trump-style retweet."""
    return tweet_text.startswith('"@')

def is_aide_style_retweet(tweet_text):
    """Returns True if tweet_text looks like an aide-style retweet."""
    return "RT @" in tweet_text

def tweet_type(tweet_text):
    """Returns "Trump retweet", "Aide retweet", or "Not a retweet"
    as appropriate."""
    if is_trump_style_retweet(tweet_text):
        return "Trump retweet"
    elif is_aide_style_retweet(tweet_text):
        return "Aide retweet"
    return "Not a retweet"

trump_df['tweet_type'] = [tweet_type(t) for t in trump_df['text']]
In [ ]:
trump_df
In [ ]:
_ = ok.grade('q15')
_ = ok.backup()
In [ ]:
trump_pivoted = pivot_count(trump_df, 'source', 'tweet_type') #SOLUTION
trump_pivoted
In [ ]:
_ = ok.grade('q16')
_ = ok.backup()
Does the cross-classified table show evidence against the hypothesis that Trump and his advisors tweet from roughly the same sources? Again assuming you agree with Statistician Belinda, run a hypothesis test in the next cell to verify that there is a difference in the relevant distributions. Then use the subsequent cell to describe your methodology and results. Are there any important caveats?
In [ ]:
test_difference_in_distributions(trump_pivoted['Aide retweet'], trump_pivoted['Trump retweet'], 100000) #SOLUTION
SOLUTION: We eliminated the non-retweets and performed a test for a difference in categorical distributions as we did for Clinton. As should be obvious from the table, there is a difference! (We find a P-value of 0, though this is approximate, and the true P-value is merely extremely close to 0.) One small caveat is that we are looking only at retweets. It's plausible that people behave differently when retweeting: maybe they find one device or app more convenient for retweets. A bigger caveat is that we don't just care about there being any difference; we care whether the difference is large. This is obvious from looking at the table: Trump almost never retweets from an iPhone, and his aides never retweet from an Android phone. (Since we care about magnitudes, it would be useful to create confidence intervals for the chances of Trump and his aides tweeting from various devices. With a dataset this large, they would be narrow.)
We are really interested in knowing whether we can classify @realDonaldTrump tweets on the basis of the source. Just knowing that there is a difference in source distributions isn't nearly enough. Instead, we would like to claim something like this: "@realDonaldTrump tweets from Twitter for Android are generally authored by Trump himself. Other tweets are generally authored by his aides."
If you use bootstrap methods to compute a confidence interval for the proportion of Trump aide retweets from Android phones in "the population of all @realDonaldTrump retweets," you will find that the interval is [0, 0]. That's because there are no retweets from Android phones by Trump aides in our dataset. Is it reasonable to conclude from this that Trump aides definitely never tweet from Android phones?
SOLUTION: No, the bootstrap is misleading in this case. If we'd seen 1 million retweets by Trump aides, it might be okay to make this conclusion. But we have seen only 177, so the conclusion seems a bit premature.
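To see why the bootstrap is uninformative here, consider this minimal simulation sketch (it assumes, per the solution above, 177 observed aide retweets with zero from Android): every resample of an all-zeros sample is all zeros, so the interval must collapse to [0, 0].
In [ ]:
import numpy as np
# 177 observed aide retweets; a 1 would mark an Android retweet, and we saw none:
aide_android = np.zeros(177)
# Bootstrap: resample with replacement and record the Android proportion.
boot_proportions = [np.mean(np.random.choice(aide_android, size=177))
                    for _ in range(10000)]
# Every resample contains only zeros, so the interval collapses:
print(np.percentile(boot_proportions, [2.5, 97.5]))   # [0. 0.]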
In [ ]:
_ = ok.grade_all()
Now, run this code in your terminal to make a git commit that saves a snapshot of your changes in git. The last line runs git push, which will send your work to your personal GitHub repo.
Note: Don't add and commit your keys.json file! git add -A would do that, but the code we've written below won't.
# Tell git to commit your changes to this notebook
git add sp17/hw/hw2/hw2.ipynb
# Tell git to make the commit
git commit -m "hw2 finished"
# Send your updates to your personal private repo
git push origin master
Finally, we'll submit the assignment to OkPy so that the staff will know to grade it. You can submit as many times as you want, and you can choose which submission you want us to grade by going to https://okpy.org/cal/data100/sp17/.
In [ ]:
# Now, we'll submit to okpy
_ = ok.submit()
Congratulations, you're done!
We've only scratched the surface of this dataset. Twitter is a rich source of data about language and social interaction, and not only for political figures. Now you know how to access it!