This notebook explores the affordances of the Twitter API for retweets, replies, quotes, and favorites. It is motivated by questions from several George Washington University researchers who are interested in using Social Feed Manager to collect datasets for studying dialogues and interaction on Twitter.
We will not discuss affordances of the Twitter API that are perspectival, that is, that depend on the Twitter account used to access the API. So, for example, we will not consider GET statuses/retweets_of_me.
Before proceeding, we will install Twarc. Twarc is a Twitter client. It is generally used from the command line, but we will use it as a library.
This assumes that you have run Twarc locally and already have credentials stored in ~/.twarc.
As you are reading this, feel free to skip any of the sections of code.
In [1]:
# This installs Twarc
# !pip install twarc
# This is temporary until https://github.com/DocNow/twarc/pull/118 is merged.
!pip install git+https://github.com/justinlittman/twarc.git@retweets#egg=twarc
# This imports some classes and functions that will be used later in this notebook.
from twarc import Twarc, load_config, default_config_filename
import json
import codecs
# This creates an instance of Twarc.
credentials = load_config(default_config_filename(), 'main')
t = Twarc(consumer_key=credentials['consumer_key'],
          consumer_secret=credentials['consumer_secret'],
          access_token=credentials['access_token'],
          access_token_secret=credentials['access_token_secret'])
# Create a summary of a tweet, only showing relevant fields.
def summarize(tweet, extra_fields=None):
    new_tweet = {}
    for field, value in tweet.items():
        if field in ["text", "id_str", "screen_name", "retweet_count", "favorite_count", "in_reply_to_status_id_str", "in_reply_to_screen_name", "in_reply_to_user_id_str"] and value is not None:
            new_tweet[field] = value
        elif extra_fields and field in extra_fields:
            new_tweet[field] = value
        elif field in ["retweeted_status", "quoted_status", "user"]:
            new_tweet[field] = summarize(value)
    return new_tweet
# Print out a tweet, with optional colorizing of selected fields.
def dump(tweet, colorize_fields=None, summarize_tweet=True):
    for line in json.dumps(summarize(tweet) if summarize_tweet else tweet, indent=4, sort_keys=True).splitlines():
        for colorize_field in colorize_fields or []:
            if "\"{}\":".format(colorize_field) in line:
                # Wrap the line in ANSI escape codes so that it prints in red.
                print "\x1b[31m" + line + "\x1b[0m"
                break
        else:
            # No field matched (the loop ended without a break), so print the line unchanged.
            print line
In [2]:
%%html
<!-- This embeds a tweet in the notebook. -->
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">First day at Gelman Library. First tweet. <a href="http://t.co/Gz5ybAD6os">pic.twitter.com/Gz5ybAD6os</a></p>— Justin Littman (@justin_littman) <a href="https://twitter.com/justin_littman/status/503873833213104128">August 25, 2014</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
Tweets retrieved from the Twitter API are in JSON, a simple structured text format. Below I will provide the entire tweet; in the rest of this notebook I will only provide a subset of the tweet containing the relevant fields. Twitter provides documentation on the complete set of fields in a tweet.
In [3]:
# Retrieve a single tweet from the Twitter API
tweet = list(t.hydrate(['503873833213104128']))[0]
# Pretty-print the tweet
dump(tweet, summarize_tweet=False)
Here's the summary of that same tweet:
In [4]:
dump(tweet)
In [5]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">.<a href="https://twitter.com/DameWendyDBE">@DameWendyDBE</a>: Invest in data science training for librarians. In future, libraries will be data warehouses. <a href="https://twitter.com/hashtag/SaveTheWeb?src=hash">#SaveTheWeb</a></p>— Justin Littman (@justin_littman) <a href="https://twitter.com/justin_littman/status/743520583518920704">June 16, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
Let's retrieve the JSON for this tweet from the Twitter API.
In [6]:
tweet = list(t.hydrate(['743520583518920704']))[0]
dump(tweet, colorize_fields=['retweet_count'])
The relevant field is retweet_count. This field provides the number of times this tweet was retweeted. Note that this number may vary over time, as additional people retweet the tweet.
In [7]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Reproducible Research: Citing your execution env using <a href="https://twitter.com/docker">@Docker</a> and a DOI: <a href="https://t.co/S4DChzE9Au">https://t.co/S4DChzE9Au</a> via <a href="https://twitter.com/SoftwareSaved">@SoftwareSaved</a> <a href="https://t.co/SPMcKa35J4">pic.twitter.com/SPMcKa35J4</a></p>— Docker (@docker) <a href="https://twitter.com/docker/status/720856949407940608">April 15, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
Here is the JSON from the Twitter API.
In [8]:
tweet = list(t.hydrate(['724575206937899008']))[0]
dump(tweet, colorize_fields=['retweeted_status', 'retweet_count'])
Two fields are significant. First, the retweeted_status contains the source tweet (i.e., the tweet that was retweeted). The presence or absence of this field can be used to identify tweets that are retweets. Second, the retweet_count is the count of the retweets of the source tweet, not this tweet.
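A minimal sketch of how this distinction might be applied when processing collected tweets. The helper names `is_retweet` and `source_tweet` are hypothetical (they are not part of Twarc), and the sample dictionaries are hand-made stand-ins for API responses that contain only the fields discussed here.

```python
def is_retweet(tweet):
    """Return True if the tweet carries a retweeted_status, i.e., it is a retweet."""
    return 'retweeted_status' in tweet


def source_tweet(tweet):
    """Return the source tweet for a retweet, or the tweet itself otherwise."""
    return tweet.get('retweeted_status', tweet)


# Hand-made examples; real tweets contain many more fields.
original = {'id_str': '1', 'text': 'Hello', 'retweet_count': 2}
retweet = {'id_str': '2', 'text': 'RT @someone: Hello', 'retweet_count': 2,
           'retweeted_status': original}

print(is_retweet(original))             # False
print(is_retweet(retweet))              # True
print(source_tweet(retweet)['id_str'])  # 1
```

Because the counts on a retweet describe the source tweet, reading them through `source_tweet` makes it explicit which tweet they belong to.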
As a corollary to our look at a retweet, let's look at a tweet that is a retweet of a retweet. (I'll refer to this as a second order retweet.) Here's a tweet that I retweeted from my @jlittman_dev account that was a retweet from my @justin_littman account of a source tweet from @SocialFeedMgr.
In [9]:
tweet = list(t.hydrate(['794490469627686913']))[0]
dump(tweet, colorize_fields=['retweet_count', 'retweeted_status'])
The second order tweet is treated as if it is a retweet of the source tweet. The retweet_count of the source tweet is incremented and the retweeted_status that appears in the second order tweet is the source tweet. There is no indication that this is a retweet of a retweet. Thus, in reconstructing interaction, you can't determine from whom a user discovered a tweet that she later retweeted.
Third, we want to consider a tweet that has been quoted. A quote tweet is a retweet that contains some additional text.
To test this, I quoted my first tweet from a different Twitter account (@jlittman_dev).
In [10]:
tweet = list(t.hydrate(['503873833213104128']))[0]
dump(summarize(tweet))
There is nothing in the tweet to indicate that it has been quoted. This is similar to what you find on the Twitter website: if you look at the full rendering of this tweet, there is no indication that it was quoted.
Quotes don't count as retweets, as the retweet_count on the source tweet is 0.
In [11]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Let us know what can add to <a href="https://twitter.com/SocialFeedMgr">@SocialFeedMgr</a> docs to take the "crash" out of "crash course" <a href="https://twitter.com/ianmilligan1">@ianmilligan1</a>. And all other feedback welcome. <a href="https://t.co/BbjOLSvdCm">https://t.co/BbjOLSvdCm</a></p>— Justin Littman (@justin_littman) <a href="https://twitter.com/justin_littman/status/794162076717613056">November 3, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
In [12]:
tweet = list(t.hydrate(['794162076717613056']))[0]
dump(summarize(tweet, extra_fields=['quoted_status_id', 'quoted_status_id_str']), colorize_fields=['quoted_status', 'quoted_status_id', 'quoted_status_id_str'], summarize_tweet=False)
The relevant field in this quote tweet is quoted_status, which contains the source tweet. quoted_status_id and quoted_status_id_str hold the tweet id of the source tweet, which duplicates the tweet id contained in quoted_status.
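As a rough sketch, the id of the quoted tweet could be pulled out as follows. The helper name `quoted_tweet_id` is hypothetical and the sample tweets are hand-made. The fallback to quoted_status_id_str covers a case that is assumed here rather than demonstrated in this notebook: a tweet that carries the id but not the embedded quoted_status.

```python
def quoted_tweet_id(tweet):
    """Return the id of the quoted tweet as a string, or None if the tweet is not a quote."""
    if 'quoted_status' in tweet:
        return tweet['quoted_status']['id_str']
    # Assumption: the id field may be present even when the embedded tweet is not.
    return tweet.get('quoted_status_id_str')


# Hand-made examples containing only the relevant fields.
quote = {'id_str': '10', 'quoted_status_id_str': '9',
         'quoted_status': {'id_str': '9', 'text': 'source'}}
id_only = {'id_str': '11', 'quoted_status_id_str': '9'}
plain = {'id_str': '12'}

print(quoted_tweet_id(quote))    # 9
print(quoted_tweet_id(id_only))  # 9
print(quoted_tweet_id(plain))    # None
```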
In [13]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Yesterday I learned about the <a href="https://twitter.com/jefferson_bail">@jefferson_bail</a> test for projects: Is it sufficiently "do-goody and feel-goody"?</p>— Justin Littman (@justin_littman) <a href="https://twitter.com/justin_littman/status/789411809807572992">October 21, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
There is nothing to indicate that this tweet has a reply.
In [14]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/justin_littman">@justin_littman</a> Ha! I don't even remember what I was talking about. I believe that was my fifth meeting in a row starting at 7am, so...</p>— Jefferson Bailey (@jefferson_bail) <a href="https://twitter.com/jefferson_bail/status/789486128189444096">October 21, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
In [15]:
tweet = list(t.hydrate(['789486128189444096']))[0]
# in_reply_to_status_id and in_reply_to_user_id are not in the default summary, so request them explicitly.
dump(summarize(tweet, extra_fields=['in_reply_to_status_id', 'in_reply_to_user_id']), colorize_fields=['in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_screen_name', 'in_reply_to_user_id', 'in_reply_to_user_id_str'], summarize_tweet=False)
The relevant fields in a reply tweet are in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_screen_name, in_reply_to_user_id, in_reply_to_user_id_str. The names of each of these fields reasonably describe their contents. The most significant of these is in_reply_to_status_id, which supports finding the tweet to which the reply tweet is a reply.
Thus, based on the metadata that is provided for a tweet, a chain of replies can be followed backwards from the reply tweet to the replied-to tweet, but not vice versa, i.e., from the replied-to tweet to the reply tweet.
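Following a chain of replies backwards can be sketched as below. The function `reply_chain` is a hypothetical helper, and `lookup` is assumed to be any function that returns a tweet for a tweet id (or None if the tweet is unavailable). In this example it is a plain dictionary lookup over hand-made tweets; against the live API it would have to retrieve each tweet by id.

```python
def reply_chain(tweet, lookup):
    """Follow in_reply_to_status_id_str backwards.

    Returns a list that starts with the given tweet and ends with the
    earliest tweet in the conversation that could be found.
    """
    chain = [tweet]
    seen = {tweet['id_str']}
    parent_id = tweet.get('in_reply_to_status_id_str')
    # Stop when there is no parent, the parent is unavailable, or an id repeats.
    while parent_id and parent_id not in seen:
        parent = lookup(parent_id)
        if parent is None:
            break
        chain.append(parent)
        seen.add(parent_id)
        parent_id = parent.get('in_reply_to_status_id_str')
    return chain


# Hand-made conversation: tweet 3 replies to tweet 2, which replies to tweet 1.
tweets = {
    '1': {'id_str': '1', 'in_reply_to_status_id_str': None},
    '2': {'id_str': '2', 'in_reply_to_status_id_str': '1'},
    '3': {'id_str': '3', 'in_reply_to_status_id_str': '2'},
}
chain = reply_chain(tweets['3'], tweets.get)
print([tw['id_str'] for tw in chain])  # ['3', '2', '1']
```

The walk only works in this direction: nothing in tweet 1 points forward to tweet 2 or tweet 3.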
In [16]:
%%html
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Slides from my <a href="https://twitter.com/hashtag/iipcWAC16?src=hash">#iipcWAC16</a> presentation on aligning social media archiving and web archiving: <a href="https://t.co/Rj8LEbBOp8">https://t.co/Rj8LEbBOp8</a></p>— Justin Littman (@justin_littman) <a href="https://twitter.com/justin_littman/status/720621197550071808">April 14, 2016</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
In [17]:
tweet = list(t.hydrate(['720621197550071808']))[0]
dump(tweet, colorize_fields=['favorite_count'])
The favorite_count provides the number of times the tweet has been favorited.
In the case of a retweet, favorite_count is the favorite count of the source tweet. (This is similar to retweet_count.)
GET statuses/show/:id is used to retrieve a single tweet by tweet id. GET statuses/lookup is used to retrieve multiple tweets by tweet ids.
In the above examples, GET statuses/lookup using only a single tweet id was used to retrieve the tweets.
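When retrieving many tweets, the ids have to be sent to GET statuses/lookup in batches. The sketch below only illustrates the batching idea with a hypothetical helper `batches`; it does not describe how Twarc is implemented. The batch size of 100 is stated as an assumption about the per-request limit and should be checked against the current API documentation.

```python
def batches(ids, size=100):
    """Yield successive lists of at most `size` tweet ids."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    # Yield whatever is left over after the last full batch.
    if batch:
        yield batch


# 250 made-up ids are split into batches of 100, 100, and 50.
ids = [str(n) for n in range(250)]
print([len(b) for b in batches(ids)])  # [100, 100, 50]
```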
GET statuses/user_timeline retrieves a user timeline given a screen name or user id. This is one of the primary methods for collecting social media data.
While GET statuses/user_timeline supports getting tweets from the past, it is limited to the last 3,200 tweets.
To test this, we will retrieve the user timeline of @jlittman_dev and look for retweets, quotes, replies, favorited tweets, and retweeted tweets.
In [18]:
found_retweet = False
found_quote = False
found_reply = False
found_favorited = False
found_retweeted = False
for tweet in t.timeline(screen_name='jlittman_dev'):
    if 'retweeted_status' in tweet:
        print "{} is a retweet.".format(tweet['id_str'])
        found_retweet = True
    if 'quoted_status' in tweet:
        print "{} is a quote.".format(tweet['id_str'])
        found_quote = True
    if tweet['in_reply_to_status_id']:
        print "{} is a reply to {} by {}".format(tweet['id_str'], tweet['in_reply_to_status_id_str'], tweet['in_reply_to_screen_name'])
        found_reply = True
    if tweet['retweet_count'] > 0:
        print "{} has been retweeted {} times.".format(tweet['id_str'], tweet['retweet_count'])
        found_retweeted = True
    if tweet['favorite_count'] > 0:
        print "{} has been favorited {} times.".format(tweet['id_str'], tweet['favorite_count'])
        found_favorited = True
print "Found retweet: {}".format(found_retweet)
print "Found quote: {}".format(found_quote)
print "Found reply: {}".format(found_reply)
print "Found favorited: {}".format(found_favorited)
print "Found retweeted: {}".format(found_retweeted)
This demonstrates that retweets, quotes, and replies posted by the user are available from the user timeline, along with the retweet and favorite counts of the user's tweets.
Other than those counts, the timeline does not include anything about the tweets of other users, such as quotes of this user's tweets or replies to this user's tweets.
GET statuses/retweets/:id returns the most recent retweets for a tweet. Only the most recent 100 retweets are available.
To test this, let's compare the retweet_count against the number of tweets returned by GET statuses/retweets/:id for that tweet.
In [19]:
tweet = list(t.hydrate(['743520583518920704']))[0]
print "The retweet count is {}".format(tweet['retweet_count'])
retweets = t.retweets('743520583518920704')
print "Retrieved {} retweets".format(len(list(retweets)))
GET statuses/retweeters/ids retrieves the user ids that retweeted a tweet.
GET search/tweets (also known as the Twitter Search API) allows searching "against a sampling of recent Tweets published in the past 7 days."
Some of the query parameters that are relevant to retweets, quotes, and replies are:
Because the Search API is time limited and returns a sample of unknown size, it will not be further explored in this notebook.
POST statuses/filter allows filtering of the stream of tweets on the Twitter platform by keywords (track), users (follow), and geolocation (locations).
POST statuses/filter only allows collecting tweets moving forward; it cannot be used to retrieve past tweets.
For this test, I will use the follow parameter to determine what is captured when following a user. Note that the follow parameter takes a list of user ids. User ids do not change (unlike screen names).
Because this test requires creating tweets from multiple accounts and recording the filter stream, it will not be performed live in this notebook. Rather, I used Twarc to record the filter stream of @jlittman_dev (user id 2875189485):
twarc.py --follow 2875189485 > follow.json
I then performed the following actions on the Twitter website:
We will now look at the tweets that were captured by the filter stream.
The first tweet is the tweet posted by @jlittman_dev in step 1. Thus, tweets by the followed user are captured.
In [20]:
# Load the tweets
with codecs.open('./follow.json', 'r') as f:
    lines = f.readlines()
# Print the number of tweets
print len(lines)
# Print the first tweet
tweet1 = json.loads(lines[0])
dump(tweet1)
The second tweet is @jlittman_dev2's retweet of @jlittman_dev's tweet. This is step 2, showing that retweets by other users of tweets by the followed user are captured.
In [21]:
# Print the second tweet
tweet2 = json.loads(lines[1])
dump(tweet2)
The third tweet is the quote by @jlittman_dev of @jlittman_dev2's tweet. This is step 4, showing that quote tweets posted by the followed user are captured.
Note that the quoted tweet (step 3) is not captured because @jlittman_dev2 isn't being followed; however, it is available as the quoted_status of the quote tweet.
In [22]:
# Print the third tweet
tweet3 = json.loads(lines[2])
dump(tweet3)
The fourth tweet is a reply by @jlittman_dev2 to a tweet by @jlittman_dev. This is step 6. Thus, replies to the followed user are captured.
Note that the tweet from step 5 (@jlittman_dev2's quote tweet of @jlittman_dev's tweet) was not captured. Thus, quote tweets in which the followed user is quoted are not captured.
In [23]:
# Print the fourth tweet
tweet4 = json.loads(lines[3])
dump(tweet4)
The fifth tweet is from step 7, @jlittman_dev's reply to @jlittman_dev2's reply. Thus, replies by the followed user to replies are captured.
In [24]:
# Print the fifth tweet
tweet5 = json.loads(lines[4])
dump(tweet5)
The final tweet is a reply by @jlittman_dev to a tweet by @jlittman_dev2. Thus, replies by the followed user are captured.
In [25]:
# Print the sixth tweet
tweet6 = json.loads(lines[5])
dump(tweet6)
The only tweet in our test of the follow parameter of the Twitter filter stream that wasn't captured was the quote of a followed user's tweet by another user.
Let's see if we can capture that with the track parameter, by using the user's screen name as the keyword.
Note that a user can change her screen name, so that will need to be monitored if using this approach.
Again, I used Twarc to record the filter stream, this time tracking @jlittman_dev (as a keyword):
twarc.py --track @jlittman_dev > track.json
I then performed the following actions on the Twitter website:
Only a single tweet is captured.
In [26]:
# Load the tweets
with codecs.open('./track.json', 'r') as f:
    lines = f.readlines()
# Print the number of tweets
print len(lines)
# Print the first tweet
tweet1 = json.loads(lines[0])
dump(tweet1)
This is the tweet that resulted from the mention of @jlittman_dev (step 1). Again, the tweet quoting the followed user wasn't captured.
Thus, to summarize: for a given user, the filter stream with the follow parameter captures tweets, retweets, quotes, and replies by that user, as well as retweets of and replies to that user's tweets by other users.
It does not capture quotes of that user's tweets by another user, and the track parameter does not help with catching those quotes.
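When analyzing a recorded filter stream, each captured tweet can be labeled by how it relates to the followed user. The following is a minimal sketch: `relation_to_user` and its labels are hypothetical, the tweets are hand-made and contain only the fields needed, and the second user id ('999') is made up.

```python
def relation_to_user(tweet, user_id):
    """Describe how a tweet relates to the followed user (user_id is a string)."""
    by_user = tweet['user']['id_str'] == user_id
    if 'retweeted_status' in tweet:
        if by_user:
            return 'retweet by user'
        if tweet['retweeted_status']['user']['id_str'] == user_id:
            return 'retweet of user'
    if 'quoted_status' in tweet:
        if by_user:
            return 'quote by user'
        if tweet['quoted_status']['user']['id_str'] == user_id:
            return 'quote of user'
    if tweet.get('in_reply_to_user_id_str') == user_id and not by_user:
        return 'reply to user'
    if by_user:
        return 'reply by user' if tweet.get('in_reply_to_status_id_str') else 'tweet by user'
    return 'other'


USER = '2875189485'  # user id of @jlittman_dev
OTHER = '999'        # made-up id standing in for another account

samples = [
    {'id_str': '1', 'user': {'id_str': USER}},
    {'id_str': '2', 'user': {'id_str': OTHER},
     'retweeted_status': {'id_str': '1', 'user': {'id_str': USER}}},
    {'id_str': '3', 'user': {'id_str': USER},
     'quoted_status': {'id_str': '9', 'user': {'id_str': OTHER}}},
    {'id_str': '4', 'user': {'id_str': OTHER},
     'in_reply_to_user_id_str': USER, 'in_reply_to_status_id_str': '1'},
]
labels = [relation_to_user(tw, USER) for tw in samples]
print(labels)
# ['tweet by user', 'retweet of user', 'quote by user', 'reply to user']
```

The 'quote of user' label is included for completeness; based on the tests in this notebook, such tweets would not be expected to appear in the stream.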
The Twitter API provides extensive support for retrieving data for studying dialogues and interaction on Twitter.
The following table summarizes what is available in a tweet for retweets, replies, quotes, and favorites.
For a tweet that is ... | Available |
---|---|
Retweeted | Count of retweets |
A retweet | Source tweet |
Quoted | No |
A quote | Quoted tweet |
Favorited | Count of favorites |
Replied to | No |
A reply | Replied to tweet |
The two most helpful API methods for retweets, replies, quotes, and favorites are GET statuses/user_timeline and POST statuses/filter. The following table summarizes the affordances of these methods:
Tweet type | GET statuses/user_timeline | POST statuses/filter |
---|---|---|
Tweets by the user | Yes | Yes |
Retweets by the user | Yes | Yes |
Retweets by other users of tweets by the user | No | Yes |
Quotes by the user | Yes | Yes |
Quotes by other users of tweets by the user | No | No |
Replies by the user | Yes | Yes |
Replies by other users to tweets by the user | No | Yes |
Note that Social Feed Manager supports collecting using both of these methods.