This tutorial presents an overview of how to use the Python programming language to interact with the Twitter API, both for acquiring data and for posting it. We're using the Twitter API because it's useful in its own right but also presents an interesting "case study" of how to work with APIs offered by commercial entities and social media sites.
The Twitter API allows you to programmatically perform many of the same actions that you can perform with the Twitter app and the Twitter website, such as searching for tweets, following users, reading your timeline, and posting tweets and direct messages. Some parts of the Twitter user experience, like polls, are (as of this writing) unavailable through the API. You can use the API to do things like collect data from tweets and write automated agents that post to Twitter.
In particular, we're going to be talking about Twitter's REST API. ("REST" stands for Representational State Transfer, a popular style of API design.) For the kind of work we'll be doing, the streaming API is also worth a look, but it is left as an exercise for the reader.
All requests to the REST API—making a post, running a search, following or unfollowing an account—must be made on behalf of a user. Your code must be authorized to submit requests to the API on that user's behalf. A user can authorize code to act on their behalf through the medium of something called an application. You can see which applications you've authorized by logging into Twitter and following this link.
When making requests to the Twitter API, you don't use your username and password. Instead, you use four unique identifiers: the Consumer (Application) Key, the Consumer (Application) Secret, an Access Token, and a Token Secret. The Consumer Key and Secret identify the application, and the Access Token and Token Secret identify a particular account as having granted access through that application. You don't choose these values; they're random strings automatically generated by Twitter. Together, these strings act as a sort of "password" for the Twitter API.
In order to obtain these four magical strings, we need to...
This site has a good overview of the steps you need to perform in order to create a Twitter application. I'll demonstrate the process in class. You'll need to have already signed up for a Twitter account!
When you're done creating your application and getting your access token for it, assign the four values to the variables below:
In [5]:
api_key = ""
api_secret = ""
access_token = ""
token_secret = ""
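Because these four strings act like a password, one common precaution is to keep them out of the notebook itself. The following is a minimal sketch of reading them from environment variables instead. The environment variable names and the `load_credentials` helper are hypothetical, chosen only for illustration, and the demonstration uses placeholder values rather than real keys.

```python
import os

# Hypothetical environment variable names; use whatever names suit your setup.
ENV_NAMES = {
    "api_key": "TWITTER_API_KEY",
    "api_secret": "TWITTER_API_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "token_secret": "TWITTER_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credential strings, or raise if any is missing or empty."""
    missing = [var for var in ENV_NAMES.values() if not environ.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in ENV_NAMES.items()}


# Demonstration with a stand-in environment (placeholder values, not real keys).
fake_env = {var: "placeholder-" + name for name, var in ENV_NAMES.items()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

With real environment variables set, `load_credentials()` could be called without arguments and its values assigned to `api_key`, `api_secret`, `access_token` and `token_secret`.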
Twitter's API operates over HTTP and returns JSON objects, and technically you could use the requests library (or any other HTTP client) to make requests to and receive responses from the API. However, the Twitter API uses a somewhat complicated authentication process called OAuth, which requires generating a cryptographic signature for each request in order to ensure its security. This process is not worth implementing from scratch. For this reason, most programmers making use of the Twitter API use a third-party library to do so. These libraries wrap up and abstract away the particulars of working with OAuth authentication so that programmers don't have to worry about them. As a happy side effect, the libraries provide abstractions for API calls which make it slightly easier to use the API: you can just call methods with parameters instead of constructing URLs in your code "by hand".
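To give a sense of what such a library does on your behalf, here is a rough sketch of the HMAC-SHA1 signing step used by OAuth 1.0a. It is an illustration only: it leaves out generating the nonce and timestamp and building the Authorization header, the function names are made up for this example, and every key, token and parameter value is a placeholder. It is not a substitute for a tested library.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode a value, leaving only letters, digits and -._~ as-is."""
    return quote(str(value), safe="~")


def oauth1_signature(method, url, params, consumer_secret, token_secret):
    """Compute an OAuth 1.0a HMAC-SHA1 signature for one request (sketch)."""
    # 1. Encode every key and value, then sort the encoded pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    param_string = "&".join(k + "=" + v for k, v in encoded)
    # 2. Build the signature base string from method, URL and parameters.
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_string)]
    )
    # 3. The signing key combines the two secrets.
    signing_key = percent_encode(consumer_secret) + "&" + percent_encode(token_secret)
    digest = hmac.new(
        signing_key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# All values below are placeholders for illustration.
example_params = {
    "q": "data journalism",
    "oauth_consumer_key": "consumer-key-placeholder",
    "oauth_token": "access-token-placeholder",
    "oauth_nonce": "nonce-placeholder",
    "oauth_timestamp": "1400000000",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_version": "1.0",
}
signature = oauth1_signature(
    "GET",
    "https://api.twitter.com/1.1/search/tweets.json",
    example_params,
    "consumer-secret-placeholder",
    "token-secret-placeholder",
)
print(signature)
```

Every request needs a fresh signature like this, which is a large part of why delegating the work to a library is attractive.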
There are a number of different libraries for accessing the Twitter API. We're going to use one called Twython. You can install Twython with pip:
In [2]:
!pip3 install twython
In [80]:
import twython
# create a Twython object by passing the necessary secret passwords
twitter = twython.Twython(api_key, api_secret, access_token, token_secret)
response = twitter.search(q="data journalism", result_type="recent", count=20)
[r['text'] for r in response['statuses']]
Out[80]:
The .search() method performs a Twitter search, just as though you'd gone to the Twitter search page and typed in your query. The method returns a JSON object, which Twython converts to a dictionary for us. This dictionary contains a number of items; importantly, the value for the key statuses is a list of tweets that match the search term. Let's look at the underlying response in detail. I'm going to run the search again, this time asking for only two results instead of twenty:
In [11]:
response = twitter.search(q="data journalism", result_type="recent", count=2)
response
Out[11]:
As you can see, there's a lot of stuff in here, even for just two tweets. In the top-level dictionary, there's a key search_metadata whose value is a dictionary with, well, metadata about the search: how many results it returned, what the query was, and what URL to use to get the next page of results. The value for the statuses key is a list of dictionaries, each of which contains information about one matching tweet. Tweets are limited to 140 characters, but have much more than 140 characters of metadata. Twitter has a good guide to what each of the fields means here, but here are the most interesting key/value pairs from our perspective:
id_str: the unique numerical ID of the tweet
in_reply_to_status_id_str: the ID of the tweet that this tweet is a reply to, if applicable
retweet_count: the number of times this tweet has been retweeted
retweeted_status: the tweet that this tweet is a retweet of, if applicable
favorite_count: the number of times that this tweet has been favorited
text: the actual text of the tweet
user: a dictionary with information on the user who wrote the tweet, including the screen_name key, which has the Twitter screen name of the user
NOTE: You can do much more with the query than just search for raw strings. The "Query operators" section on this page shows the different bits of syntax you can use to make your query more expressive.
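As a small illustration of working with these fields, the sketch below pulls the listed keys out of each tweet dictionary. The response here is a hand-written stand-in with invented values, not real API output, and the `summarize` helper is a hypothetical name. The retweet key is written as `retweeted_status`, which is how it appears in tweet objects from this version of the API.

```python
# A hand-written stand-in for a search response; every value is invented.
sample_response = {
    "search_metadata": {"count": 2, "query": "data+journalism"},
    "statuses": [
        {
            "id_str": "1",
            "text": "An example tweet about data journalism",
            "retweet_count": 3,
            "favorite_count": 5,
            "in_reply_to_status_id_str": None,
            "user": {"screen_name": "example_user"},
        },
        {
            "id_str": "2",
            "text": "RT @example_user: An example tweet about data journalism",
            "retweet_count": 3,
            "favorite_count": 0,
            "in_reply_to_status_id_str": None,
            "retweeted_status": {"id_str": "1"},
            "user": {"screen_name": "another_example"},
        },
    ],
}


def summarize(tweet):
    """Collect the most interesting fields of one tweet into a flat dictionary."""
    return {
        "id": tweet["id_str"],
        "author": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "retweets": tweet.get("retweet_count", 0),
        "favorites": tweet.get("favorite_count", 0),
        "is_reply": tweet.get("in_reply_to_status_id_str") is not None,
        "is_retweet": "retweeted_status" in tweet,
    }


summaries = [summarize(t) for t in sample_response["statuses"]]
for s in summaries:
    print(s["author"], s["retweets"], s["is_retweet"])
```

Using `.get()` for the optional keys keeps the helper from failing on tweets where a field is absent.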
You can form the URL of a particular tweet by combining the tweet's ID and the user's screen name using the following function. (This is helpful if you want to view the tweet in a web browser.)
In [19]:
def tweet_url(tweet):
    return 'https://twitter.com/' + \
        tweet['user']['screen_name'] + \
        "/statuses/" + \
        tweet['id_str']
In [20]:
tweet_url(response['statuses'][1])
Out[20]:
In general, Twython has one method for every "endpoint" in the Twitter REST API. Usually the Twython method has a name that resembles or is identical to the corresponding URL in the REST API. The Twython API documentation lists the available methods and which parts of the Twitter API they map to.
As a means of becoming more familiar with this, let's dig a bit deeper into search. The .search() method of Twython takes a number of different parameters, which match up with the query string parameters of the REST API's search/tweets endpoint as documented here. Every parameter that can be specified on the query string in a REST API call can also be included as a named parameter in the call to Twython's .search() method. The preceding examples already demonstrate this:
In [22]:
response = twitter.search(q="data journalism", result_type="recent", count=2)
This call to .search() includes the parameters q (which specifies the search query), result_type (which can be set to either popular, recent or mixed, depending on how you want results to be returned) and count (which specifies how many tweets you want returned in the response, with an upper limit of 100). Looking at the documentation, it appears that there's another interesting parameter we could play with: the geocode parameter, which will make our search respond with tweets only within a given radius of a particular latitude/longitude. Let's use this to find the screen names of people tweeting about data journalism within a few miles of Columbia University:
In [36]:
response = twitter.search(q="data journalism",
                          result_type="recent",
                          count=100,
                          geocode="40.807511,-73.963265,4mi")
for resp in response['statuses']:
    print(resp['user']['screen_name'])
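To make the correspondence between named parameters and the query string concrete, here is a sketch that builds the kind of URL such a search corresponds to, using only the standard library. The base URL is an assumption based on the search/tweets endpoint name, and the sketch leaves out authentication entirely, so the resulting URL would not work on its own.

```python
from urllib.parse import parse_qs, urlencode

# Assumed base URL for the search/tweets endpoint; check the API documentation.
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

# The same named parameters passed to .search() in the geocode example.
params = {
    "q": "data journalism",
    "result_type": "recent",
    "count": 100,
    "geocode": "40.807511,-73.963265,4mi",
}

query_string = urlencode(params)  # percent-encodes spaces, commas, etc.
full_url = SEARCH_URL + "?" + query_string
print(full_url)

# Decoding the query string recovers the original values.
decoded = parse_qs(query_string)
print(decoded["geocode"])
```

A library like Twython handles this encoding (and the request signing) for you, which is why the method call only needs the parameter names and values.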
The Twitter API provides an endpoint for fetching the tweets of a particular user. The endpoint in the API for this is statuses/user_timeline and the function in Twython is .get_user_timeline(). This function looks a lot like .search() in the shape of its response. Let's try it out, fetching the last few tweets of the Twitter account of the Columbia Journalism School.
In [44]:
response = twitter.get_user_timeline(screen_name='columbiajourn',
                                     count=20,
                                     include_rts=False,
                                     exclude_replies=True)
for item in response:
    print(item['text'])
The screen_name parameter specifies whose timeline we want; the count parameter specifies how many tweets we want from that account. The include_rts and exclude_replies parameters control whether or not particular kinds of tweets are included in the response: setting include_rts to False means we don't see any retweets in our results, while setting exclude_replies to True means we don't see any tweets that are replies to other users. (According to the API documentation, "Using exclude_replies with the count parameter will mean you will receive up-to count tweets — this is because the count parameter retrieves that many tweets before filtering out retweets and replies," which is why asking for 20 tweets doesn't necessarily return 20 tweets in this case.)
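The effect of filtering after counting can be imitated locally. The sketch below is a simulation of the behavior the documentation describes, run over an invented list of tweets; it is not Twitter's implementation, and `simulate_timeline` is a hypothetical helper.

```python
def simulate_timeline(tweets, count, include_rts=True, exclude_replies=False):
    """Take the newest `count` tweets first, then drop retweets and replies."""
    newest = tweets[:count]  # counting happens before filtering
    kept = []
    for tweet in newest:
        if not include_rts and "retweeted_status" in tweet:
            continue
        if exclude_replies and tweet.get("in_reply_to_status_id_str") is not None:
            continue
        kept.append(tweet)
    return kept


# Invented timeline, newest first: one retweet and one reply among the top four.
fake_timeline = [
    {"text": "original tweet A", "in_reply_to_status_id_str": None},
    {"text": "RT something", "in_reply_to_status_id_str": None,
     "retweeted_status": {"id_str": "99"}},
    {"text": "@someone a reply", "in_reply_to_status_id_str": "42"},
    {"text": "original tweet B", "in_reply_to_status_id_str": None},
    {"text": "original tweet C", "in_reply_to_status_id_str": None},
]

result = simulate_timeline(fake_timeline, count=4,
                           include_rts=False, exclude_replies=True)
print(len(result))  # fewer than the 4 requested
```

Even though a third original tweet exists further back, it falls outside the first four, so only two tweets come back.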
Note that, unlike the .search() function, the .get_user_timeline() function does not return a dictionary with a key whose value is a list of tweets. Instead, it simply returns a JSON list:
In [46]:
response = twitter.get_user_timeline(screen_name='columbiajourn', count=1)
response
Out[46]:
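If you want to handle both response shapes with the same downstream code, one option is a small normalizing helper. This is a sketch; `extract_tweets` is a hypothetical name, and the sample responses are invented stand-ins rather than real API output.

```python
def extract_tweets(response):
    """Return a list of tweet dictionaries from either response shape."""
    if isinstance(response, dict):
        # Shape returned by .search(): tweets live under the 'statuses' key.
        return response.get("statuses", [])
    # Shape returned by .get_user_timeline(): the response is already a list.
    return list(response)


# Invented stand-ins for the two kinds of response.
search_like = {"search_metadata": {}, "statuses": [{"text": "from search"}]}
timeline_like = [{"text": "from timeline"}]

for resp in (search_like, timeline_like):
    for tweet in extract_tweets(resp):
        print(tweet["text"])
```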
The .search() and .get_user_timeline() functions by default only return the most recent results, up to the number specified with count (and sometimes even fewer than that). In order to find older tweets, you need to page through the results. If you were doing this "by hand," you would use the max_id or since_id parameters to find tweets older than the last tweet in the current result, repeating that process until you'd exhausted the results (or found as many tweets as you need). This is delicate work and thankfully Twython includes pre-built functionality to make this easier: the .cursor() function.
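For a sense of what paging "by hand" with max_id involves, here is a sketch that runs against a fake, in-memory fetch function rather than the real API. The names `fetch_older` and `fake_fetch` are hypothetical, and the sketch assumes each tweet dictionary carries a numeric `id` and that max_id is inclusive, which is why the loop subtracts one before asking for the next page.

```python
def fetch_older(fetch_page, limit, page_size=100):
    """Page backwards through results using max_id until `limit` is reached."""
    collected = []
    max_id = None
    while len(collected) < limit:
        kwargs = {"count": page_size}
        if max_id is not None:
            kwargs["max_id"] = max_id
        page = fetch_page(**kwargs)
        if not page:
            break  # no older tweets are available
        collected.extend(page)
        # Ask next time only for tweets strictly older than the oldest seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected[:limit]


# A fake data source standing in for the API: 250 invented tweets, newest first.
FAKE_TWEETS = [{"id": i, "text": "tweet %d" % i} for i in range(250, 0, -1)]


def fake_fetch(count, max_id=None):
    """Return up to `count` fake tweets whose id is at most `max_id`."""
    matching = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return matching[:count]


older = fetch_older(fake_fetch, limit=230)
print(len(older), older[0]["id"], older[-1]["id"])
```

Getting the off-by-one detail wrong produces duplicated tweets at page boundaries, which is part of what makes this work delicate.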
The .cursor() function takes the function you want to page through as the first parameter, and after that the keyword parameters that you would normally pass to that function. Given this information, it can repeatedly call the given function on your behalf, going back as far as it can. The object returned from the .cursor() function can be used as the iterable object in a for loop, allowing you to iterate over all of the results. Here's an example using .search():
In [65]:
cursor = twitter.cursor(twitter.search, q='"data journalism" -filter:retweets', count=100)
all_text = list()
for tweet in cursor:
    all_text.append(tweet['text'])
    if len(all_text) >= 500:  # stop after 500 tweets
        break
This snippet finds up to 500 tweets containing the phrase data journalism (excluding retweets) and stores the text of those tweets in a list. We can then use this text for data analysis, like a simple word count:
In [73]:
from collections import Counter
import re
c = Counter()
for text in all_text:
    c.update([t.lower() for t in text.split()])
# the 25 most common words that have a length greater than three and don't
# contain "data" or "journalism"
[k for k, v in c.most_common()
    if len(k) > 3 and not re.search(r"data|journalism", k)][:25]
Out[73]:
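Splitting on whitespace leaves punctuation attached to words, so "journalism," and "journalism" are counted separately. A possible refinement, sketched below with invented sample text, is to tokenize with a regular expression instead. The `common_words` helper and its parameters are hypothetical names for this example.

```python
import re
from collections import Counter


def common_words(texts, n=25, min_length=4, skip=("data", "journalism")):
    """Return the n most common words, ignoring short and skipped words."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; punctuation and digits are dropped.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common()
            if len(word) >= min_length
            and not any(s in word for s in skip)][:n]


# Invented sample text standing in for the collected tweets.
sample_texts = [
    "Data journalism workshop today!",
    "Great workshop on data journalism, thanks.",
    "New tools for #datajournalism: workshop notes",
]
print(common_words(sample_texts, n=1))
```

Note that this simple pattern also breaks URLs and mentions into fragments, so further cleanup may be needed for real tweet text.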
You can also use the Twitter API to post tweets on behalf of a user. In this tutorial, we're going to use this ability of the API to make a simple bot.
When you first create a Twitter application, the credentials you have by default (i.e., the ones you get when you click "Create my access token") are for your own user. This means that you can post tweets to your own account using these credentials, and to your own account only. This isn't normally very desirable, but let's give it a shot, just to see how to update your status (i.e., post a tweet) with Twython. Here we go:
In [82]:
twitter.update_status(status="This is a test tweet for a tutorial I'm going through, please ignore")
Out[82]:
Check your account, and you'll see that your status has been updated! (You can safely delete this tweet if you'd like.) As you can see, the .update_status() function takes a single named parameter, status, which should have a string as its value. Twitter will update your status with the given string. The function returns a dictionary with information about the tweet that was just created.
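Since tweets are limited to 140 characters, it can be useful to check or trim a string before passing it as the status. The sketch below uses a plain character count, which is an approximation, because Twitter applies its own counting rules (for example, for links). The `fit_status` helper is a hypothetical name.

```python
def fit_status(text, limit=140):
    """Collapse whitespace and trim the text so it fits within `limit` characters."""
    text = " ".join(text.split())
    if len(text) <= limit:
        return text
    # Leave room for a single ellipsis character at the end.
    return text[:limit - 1].rstrip() + "\u2026"


short = fit_status("hello,   world!")
long_status = fit_status("word " * 100)
print(short)
print(len(long_status))
```

The result of `fit_status(...)` could then be passed as the `status` argument.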
Of course, you generally don't want to update your own status. You want to write a program that updates someone else's status, even if that someone else is a bot user of your own creation.
Before you proceed, create a new Twitter account. You'll need to log out of your current account and then open up the Twitter website, or (preferably) use your browser's "private" or "incognito" functionality. Every Twitter account requires a unique e-mail address, and you'll need to have access to the e-mail address to "verify" your account, so make sure you have an e-mail address you can use (and check) other than the one you used for your primary Twitter account. (We'll go over this process in class.)
Once you've created a new Twitter account, you'll need to have that user authorize the Twitter application we created earlier to tweet on its behalf. Doing this is a two-step process. Run the cell below (making sure that the api_key and api_secret variables have been set to the consumer key and consumer secret of your application, respectively), and then open the URL it prints out while you are logged into your bot's account.
In [ ]:
twitter = twython.Twython(api_key, api_secret)
auth = twitter.get_authentication_tokens()
print("Log into Twitter as the user you want to authorize and visit this URL:")
print("\t" + auth['auth_url'])
On the page that appears, confirm that you want to authorize the application. A PIN will appear. Paste this PIN into the cell below, as the value assigned to the variable pin.
In [ ]:
pin = ""
twitter = twython.Twython(api_key, api_secret, auth['oauth_token'], auth['oauth_token_secret'])
tokens = twitter.get_authorized_tokens(pin)
new_access_token = tokens['oauth_token']
new_token_secret = tokens['oauth_token_secret']
print("your access token:", new_access_token)
print("your token secret:", new_token_secret)
Great! Now you have an access token and token secret for your bot's account. Run the cell below to create a new Twython object authorized with these credentials.
In [85]:
twitter = twython.Twython(api_key, api_secret, new_access_token, new_token_secret)
And run the following cell to post a test tweet:
In [86]:
twitter.update_status(status="hello, world!")
Out[86]:
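The access token and token secret obtained through the PIN process are only held in notebook variables, so they are lost when the notebook is closed. One way to avoid repeating the authorization is to save them to a file, as in this sketch. The file name, the JSON layout and the helper names are all assumptions made for illustration, and the values used here are placeholders.

```python
import json
import os
import tempfile


def save_tokens(path, access_token, token_secret):
    """Write the two token strings to a JSON file readable only by the owner."""
    with open(path, "w") as f:
        json.dump({"access_token": access_token, "token_secret": token_secret}, f)
    os.chmod(path, 0o600)  # best effort; has limited effect on some platforms


def load_tokens(path):
    """Read the two token strings back from the JSON file."""
    with open(path) as f:
        data = json.load(f)
    return data["access_token"], data["token_secret"]


# Demonstration in a temporary directory with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    token_path = os.path.join(tmp, "bot_tokens.json")  # hypothetical file name
    save_tokens(token_path, "access-token-placeholder", "token-secret-placeholder")
    loaded = load_tokens(token_path)
print(loaded[0])
```

Like the consumer key and secret, this file acts as a password for the bot's account and should be kept out of anything you share.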
In [88]:
import pg8000
lakes = list()
conn = pg8000.connect(database="mondial")
cursor = conn.cursor()
cursor.execute("SELECT name, area, depth, elevation, type, river FROM lake")
for row in cursor.fetchall():
    lakes.append({'name': row[0],
                  'area': row[1],
                  'depth': row[2],
                  'elevation': row[3],
                  'type': row[4],
                  'river': row[5]})
len(lakes)
Out[88]:
The following dictionary maps each column to a sentence frame:
In [107]:
sentences = {
    'area': 'The area of {} is {} square kilometers.',
    'depth': 'The depth of {} is {} meters.',
    'elevation': 'The elevation of {} is {} meters.',
    'type': 'The type of {} is "{}."',
    'river': '{} empties into a river named {}.'
}
The following cell selects a random lake from the list, and a random sentence frame from the sentences dictionary, and attempts to fill in the frame with relevant information from the lake.
In [112]:
import random
def random_lake_sentence(lakes, sentences):
    rlake = random.choice(lakes)
    # get the keys in the dictionary whose value is not None; we'll only try to
    # make sentences for these
    possible_keys = [k for k, v in rlake.items() if v is not None and k != 'name']
    rframe = random.choice(possible_keys)
    return sentences[rframe].format(rlake['name'], rlake[rframe])
for i in range(10):
    print(random_lake_sentence(lakes, sentences))
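If you don't have the database available, the same idea can be tried with a couple of hand-written rows. The sketch below uses invented placeholder lakes (not real data) and a hypothetical variant, `safe_lake_sentence`, which also guards against a lake where every column other than the name is empty, a case in which picking a random key from an empty list would raise an error.

```python
import random

# Invented placeholder rows standing in for the database query results.
sample_lakes = [
    {"name": "Example Lake", "area": 12.5, "depth": None,
     "elevation": 300, "type": None, "river": None},
    {"name": "Placeholder Pond", "area": None, "depth": None,
     "elevation": None, "type": None, "river": None},
]

frames = {
    "area": "The area of {} is {} square kilometers.",
    "depth": "The depth of {} is {} meters.",
    "elevation": "The elevation of {} is {} meters.",
    "type": 'The type of {} is "{}."',
    "river": "{} empties into a river named {}.",
}


def usable_keys(lake, frames):
    """Keys that have a sentence frame and a non-empty value for this lake."""
    return [k for k, v in lake.items() if k in frames and v is not None]


def safe_lake_sentence(lakes, frames, rng=random):
    """Return a random sentence, or None if no lake has any usable value."""
    candidates = [lake for lake in lakes if usable_keys(lake, frames)]
    if not candidates:
        return None
    lake = rng.choice(candidates)
    key = rng.choice(usable_keys(lake, frames))
    return frames[key].format(lake["name"], lake[key])


print(safe_lake_sentence(sample_lakes, frames))
```

Returning `None` lets the calling code decide whether to skip posting when there is nothing to say.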
We can now call the .update_status() function with the result of the random text generation function:
In [110]:
twitter.update_status(status=random_lake_sentence(lakes, sentences))
Out[110]:
To recap, there are two ways to get access tokens.
Getting access tokens for yourself: create an app, then authorize yourself to use that app using the "Create my access token" button.
Getting access tokens for someone else: create the app, then have the other user authorize the application using the PIN-based process described above.