A workshop by Dillon Niederhut and Juan Shishido.
Interacting with **tob_pohskrow**.
The W's: What, why, and probably even how.
A bot is a program that runs user defined tasks in an automated way.
For fun! For laughs. For productivity.
You can test code. You can use it to collect data. On Twitter, you can use a bot to post automated status updates. You can even use a bot to alert you when certain events happen (inside or outside of Twitter).
At the D-Lab, we pull training/workshop information from our calendar, generate a tweet, and post it to the @DLabAtBerkeley account. We're trying to add more functionality, such as including instructor usernames in the tweets or processing the descriptions and titles to come up with a short, descriptive summary.
A team of researchers led by Jacob Eisenstein used a bot to collect 107 million tweets, which were then used to track the diffusion of linguistic variants across the United States.
There are lots of people doing interesting things with bots on Twitter. For inspiration, see: http://qz.com/279139/the-17-best-bots-on-twitter/.
Both of the world's most famous vocab-limited characters have Twitter accounts that reply with their catchphrases to anyone who mentions their name.
Read on.
API is shorthand for Application Programming Interface, which is in turn computer-ese for a middleman.
Think about it this way. You have a bunch of things on your computer that you want other people to be able to look at. Some of them are static documents, some of them call programs in real time, and some of them are programs themselves.
You publish login credentials on the internet, and let anyone log into your computer
Problems:
People will need to know how each document and program works to be able to access their data
You don't want the world looking at your browser history
You paste everything into HTML and publish it on the internet
Problems:
This can be information overload
Making things dynamic can be tricky
You create a set of methods to act as an intermediary between the people you want to help and the things you want them to have access to.
Why this is the best solution:
People only access what you want them to have, in the way that you want them to have it
People use one language to get the things they want
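As a rough illustration of this third option, here is a toy sketch in Python. The class name `ComputerAPI`, its methods, and the sample data are all made up for this example; the point is only that visitors go through a small set of methods and never touch the underlying files directly.

```python
class ComputerAPI(object):
    """A toy 'middleman' between visitors and the things on your computer."""

    def __init__(self):
        # Everything on the machine (hypothetical sample data)
        self._files = {
            'workshop_notes.txt': 'Bots are fun.',
            'browser_history.txt': 'nothing to see here',
        }
        # Only the names listed here are exposed to the outside world
        self._public = ['workshop_notes.txt']

    def list_documents(self):
        """Return the names of the documents visitors may read."""
        return sorted(self._public)

    def get_document(self, name):
        """Return a public document's contents, or refuse."""
        if name not in self._public:
            raise KeyError(name + ' is not available through this API')
        return self._files[name]


api_demo = ComputerAPI()
print(api_demo.list_documents())
print(api_demo.get_document('workshop_notes.txt'))
```

Asking for `browser_history.txt` raises a `KeyError`, which is the whole idea: people only get what you decided to offer.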
Why this is still not Panglossian:
Currently 8th ranked website worldwide, 7th in the US
288 million users per month
500 million tweets per day
80% of users are on mobile devices
Support for 33 languages
American Twitter users are disproportionately from underrepresented communities
Fun Fact: the third most-searched for term leading to Twitter, after 'Twitter' and 'CNN' is the name of a porn actress
User histories
User (and tweet) location
User language
Tweet popularity
Tweet spread
Conversation chains
Mexico's government has been accused of using Twitter for false flag operations
GCHQ has a software library purportedly designed to modulate public opinion
Someone here used a Twitter bot to occupy all of State Bird Provisions' table reservations
Twitter's API does not return all tweets that match your search criteria
The sampling method is not published, and can change without notice
Location information is not necessarily provided by GPS
Two years ago, approximately 20 million Twitter accounts were advertising bots
Approximately 1/3 of any account's followers are not humans
Of the top ten accounts (by followers), eight are celebrities (the other two are YouTube and the current President)
Simple, right? 140 characters. Done.
In [ ]:
import json
with open('data/first_tweet.json','r') as f:
    a_tweet = json.loads(f.read())
In [ ]:
print(a_tweet['text'])
In [ ]:
from pprint import pprint
pprint(a_tweet)
You have access to more than just the text.
JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).
Sure. Think of it like Python's dict type. Keys and values, collectively referred to as items, are separated by a colon. Multiple items are separated by commas. Keys must be immutable and unique to a particular dict. Values can be of any type, including list or even another dict. Dictionaries are always surrounded by braces, {}.
In JSON, it's common practice to have nested or hierarchical dictionaries. For example, some Reddit endpoints return JSON objects that are six dictionaries deep.
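To make the nesting concrete, here is a small sketch. The JSON string below is a made-up, heavily trimmed stand-in for a tweet, not real API output; it only shows how `json.loads` turns nested JSON objects into nested dictionaries.

```python
import json

# Hypothetical, trimmed-down tweet; real tweets have many more fields
raw = '{"text": "hello", "user": {"screen_name": "example", "entities": {"url": {"urls": []}}}}'

sample = json.loads(raw)

# Each JSON object becomes a dict, each JSON array becomes a list
print(type(sample))
print(sample['user']['screen_name'])
print(sample['user']['entities']['url']['urls'])
```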
So, how can you access data within a Python dict? You first need the keys. To get the keys, you must look at the data or use the .keys() method, which returns a list of the key names in an arbitrary order.
In [ ]:
a_tweet.keys()
How about the values? Use the dictionary name along with the key in square brackets. We used this above to access the tweet's text. If you're interested in knowing when the tweet was created, use the following.
In [ ]:
a_tweet['user'].keys()
In [ ]:
a_tweet['created_at']
Note: This is given in UTC. The offset is shown by the +0000.
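If you want to do arithmetic with that timestamp, you can parse it into a `datetime`. This is a minimal sketch: the timestamp string is a made-up example in the same layout, and the format string treats `+0000` as literal text, so it assumes the offset is always UTC.

```python
from datetime import datetime

# Hypothetical value in the same layout as a_tweet['created_at']
created_at = 'Mon Mar 23 17:30:05 +0000 2015'

# '+0000' is matched literally, so this only works for UTC timestamps
parsed = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')

print(parsed.isoformat())
```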
The thing about JSON data or Python dictionaries is that they can have a nested structure. What if we want access to the values associated with the entities key?
In [ ]:
a_tweet['entities']
It's a dictionary. To access any of those values, use the appropriate key.
In [ ]:
a_tweet['entities']['hashtags']
Of course, there are no hashtags associated with this tweet, so it's just an empty list.
In [ ]:
type(a_tweet)
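Nested values can be missing or empty, in which case chained lookups like the ones above raise an error. One way to guard against that is a small helper; `dig` is a name invented for this sketch and the sample dictionary is made up.

```python
def dig(data, *keys, **kwargs):
    """Follow a chain of keys through nested dicts.

    Returns kwargs['default'] (None if not given) as soon as a key is missing.
    """
    default = kwargs.get('default')
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hypothetical tweet with no hashtags and no 'place' information
sample_tweet = {'entities': {'hashtags': []}, 'place': None}

print(dig(sample_tweet, 'entities', 'hashtags'))
print(dig(sample_tweet, 'place', 'full_name', default='unknown'))
```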
Before you proceed, you'll need four pieces of information: a consumer key, a consumer secret, an access token key, and an access token secret.
While signed in to your Twitter account, go to: https://apps.twitter.com/. Follow the prompts to generate your keys and access tokens. You'll need to have a phone number associated with your account.
So, how do we actually access the Twitter API? Well, there are several ways. To search for something, you can use the search URL, which looks like: https://api.twitter.com/1.1/search/tweets.json?q=%40twitterapi. The q is the query parameter. You can replace its value with anything you want. However, if you follow this link, you'll get an error because your request was not authenticated.
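The `%40` in that URL is just a percent-encoded `@`. As a sketch of how such a URL is put together, the standard library's `urlencode` can do the encoding for you (the import differs between Python 2 and 3, hence the `try`). This only builds the string; it does not authenticate or send anything.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

base_url = 'https://api.twitter.com/1.1/search/tweets.json'

# urlencode percent-encodes the '@' as %40
search_url = base_url + '?' + urlencode({'q': '@twitterapi'})

print(search_url)
```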
For more information on the REST APIs, end points, and terms, check out: https://dev.twitter.com/rest/public. For the Streaming APIs: https://dev.twitter.com/streaming/overview.
Instead, we'll use Jonas Geduldig's TwitterAPI module: https://github.com/geduldig/TwitterAPI. The nice thing about modules such as this one (and yes, there are others) is that they handle the OAuth for you. TwitterAPI supports both the REST and Streaming APIs.
To authenticate, run the following code.
In [ ]:
from TwitterAPI import TwitterAPI
consumer_key = '9cQ7SNtWsmTTfta8Gv5y8svWD'
consumer_secret = 'kjJllUPEJefFQ4Dfr6dBXDETiQaVWFXTt0zLSNMy8tY8F8IpqK'
access_token_key = '3129088320-dIfoDZOt5cIKVCFnJpS0krt3oCYPB13rk5ITavI'
access_token_secret = 'H41REM344zgKCvJenCGGsF1JbFSK8I1r1WvFrc8Fs74jg'
api = TwitterAPI(consumer_key, consumer_secret, access_token_key, access_token_secret)
We've created a Twitter account for this talk. Feel free to use the keys and access tokens above to familiarize yourself with the API. But be aware that Twitter imposes rate limits, and that these rate limits are different for different kinds of API interactions.
Search will be rate limited at 180 queries per 15 minute window for the time being, but we may adjust that over time.
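To see what that limit means in practice: 180 queries in a 15 minute window works out to one query every 5 seconds on average. A small sketch of that arithmetic, with `seconds_between_requests` being a helper name made up here:

```python
def seconds_between_requests(limit, window_minutes):
    """Average spacing (in seconds) that keeps you under a rate limit."""
    return window_minutes * 60.0 / limit


# 180 search queries per 15 minute window
pause = seconds_between_requests(180, 15)
print(pause)
```

Passing a value like this to `time.sleep()` between searches should keep a single script within the search limit; other end points have different limits.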
Notice that the end point is the same as in the URL example, search/tweets.
In [ ]:
r = api.request('search/tweets', {'q':'technology'})
for item in r:
    print(item)
The API supports what it calls query operators, which modify the search behavior. For example, if you want to search for tweets where a particular user is mentioned, include the at-sign, @, followed by the username. To search for tweets sent to a particular user, use to:username. For tweets from a particular user, from:username. For hashtags, use #hashtag.
For a complete set of options: https://dev.twitter.com/rest/public/search.
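Since these operators are just text inside the `q` parameter, you can assemble them with ordinary string handling. A minimal sketch; `build_query` and its argument names are invented for this example, and it covers only the four operators mentioned above.

```python
def build_query(words='', mention=None, to=None, from_user=None, hashtag=None):
    """Combine search words with a few query operators into one q string."""
    parts = []
    if words:
        parts.append(words)
    if mention:
        parts.append('@' + mention)
    if to:
        parts.append('to:' + to)
    if from_user:
        parts.append('from:' + from_user)
    if hashtag:
        parts.append('#' + hashtag)
    # Operators and words are separated by spaces
    return ' '.join(parts)


query = build_query(words='robots', from_user='Engadget', hashtag='tech')
print(query)
```

The result can then be used as the value of `q` in the parameters dictionary.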
To make things clearer, let's use variables.
In [ ]:
end_point = 'search/tweets'
parameters = {
    'q':'from:Engadget',
    'count':1
}
r = api.request(end_point, parameters)
for item in r:
    print(item['text'] + '\n')
You can also search user timelines. Notice the change in the end point and parameter values.
In [ ]:
end_point = 'statuses/user_timeline'
parameters = {
    'screen_name':'UCBerkeley',
    'count':5
}
r = api.request(end_point, parameters)
for item in r:
    print(item['text'])
In [ ]:
end_point = 'search/tweets'
parameters = {
    'q':'technology',
    'geocode':'37.871667,-122.272778,5km', # UC Berkeley
    'count':1
}
r = api.request(end_point, parameters)
for item in r:
    print(item['text'])
In [ ]:
end_point = 'search/tweets'
parameters = {
    'q':'*',
    'lang':'fr',
    'count':1
}
r = api.request(end_point, parameters)
for item in r:
    print(item['text'])
In [ ]:
end_point = 'statuses/filter'
parameters = {
    'q':'coding',
    'locations': '-180,-90,180,90' # bounding box covering the whole globe
}
r = api.request(end_point, parameters)
tweets = r.get_iterator()
# read the first 15 tweets from the stream
for i in range(15):
    t = next(tweets)
    print(t['place']['full_name'] + ', ' + t['place']['country'] + ': ' + t['text'] + '\n')
The other half of the game is posting.
In [ ]:
end_point = 'statuses/update'
parameters = {
    'status':'.IPA rettiwT eht tuoba nraeL'
}
r = api.request(end_point, parameters)
print(r.status_code)
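Status updates are limited to 140 characters (as noted earlier), so it can be worth checking the length before posting. A rough sketch: `shorten_status` is a hypothetical helper, and counting with `len()` ignores the way Twitter counts links, so treat it as an approximation.

```python
def shorten_status(status, limit=140):
    """Trim a status to at most `limit` characters, ending in '...' if cut."""
    if len(status) <= limit:
        return status
    return status[:limit - 3] + '...'


short = shorten_status('Learn about the Twitter API.')
long_status = shorten_status('x' * 200)

print(short)
print(len(long_status))
```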
Now that you know how to search for tweets, how about we save them?
In [ ]:
print(r.text)
In [ ]:
for item in r:
    filename = item['id_str'] + '.json'
    with open(filename,'w') as f:
        json.dump(item,f)
Note: if you are doing a lot of these, it will be faster and easier to use a non-relational database like MongoDB.
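Short of setting up a database, one middle ground is to append every tweet to a single file, one JSON object per line. This is a sketch of that idea, not what the workshop code does: `save_tweets` and `load_tweets` are hypothetical helpers, the file name is arbitrary, and the items are made-up stand-ins for tweets.

```python
import json
import os
import tempfile


def save_tweets(items, filename):
    """Append each item to filename as one line of JSON."""
    with open(filename, 'a') as f:
        for item in items:
            f.write(json.dumps(item) + '\n')


def load_tweets(filename):
    """Read the file back into a list of dicts."""
    with open(filename) as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up stand-ins for tweets returned by the API
fake_tweets = [{'id_str': '1', 'text': 'first'}, {'id_str': '2', 'text': 'second'}]

path = os.path.join(tempfile.mkdtemp(), 'tweets.jsonl')
save_tweets(fake_tweets, path)
restored = load_tweets(path)
print(len(restored))
```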
The real beauty of bots is that they are designed to work without interaction or oversight. Imagine a situation where you want to write a Twitter bot that replies 'HOORAY!' every time someone posts on Twitter that they were accepted to Cal. One option is to write a python script like this and call it by hand every minute.
In [ ]:
import time
r = api.request('search/tweets', {'q':'accepted berkeley'})
for item in r:
    username = item['user']['screen_name']
    parameters = {'status':'HOORAY! @' + username}
    # use a new name so the search response r is not overwritten mid-loop
    p = api.request('statuses/update', parameters)
    time.sleep(5)
    print(p.status_code)
But you are a human that needs to eat, sleep, and be social with other humans. Luckily, most UNIX-based systems have a time-based daemon called cron that will run scripts like this for you. The way that cron works is it reads in files where each line has a time followed by a job (these are called cronjobs). They look like this:
0 * * * * python twitter_bot.py
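This is telling cron to execute python twitter_bot.py at minute 0 of every hour, every day of the month, every month, and every day of the week, until the end of time. In other words, the script runs once an hour, on the hour. (The five fields are minute, hour, day of month, month, and day of week; cron has no seconds field.)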
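The five time fields are easier to read once they are labeled. This sketch splits a cron line into its standard fields (minute, hour, day of month, month, day of week) and the command; `parse_cron_line` is a hypothetical helper for illustration and does no validation.

```python
CRON_FIELDS = ['minute', 'hour', 'day_of_month', 'month', 'day_of_week']


def parse_cron_line(line):
    """Split a cron line into its five time fields plus the command."""
    parts = line.split(None, 5)  # five time fields, then the rest
    job = dict(zip(CRON_FIELDS, parts[:5]))
    job['command'] = parts[5]
    return job


job = parse_cron_line('0 * * * * python twitter_bot.py')
print(job['minute'])
print(job['hour'])
print(job['command'])
```

Here the minute field is `0` and every other field is `*`, so the job fires once an hour, on the hour.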
In [ ]:
# That thing after crontab is a lowercase L even though it looks like a 1
# This will execute directly through your shell, so use at your own risk
# Make sure you replace the <> with the file path
!crontab -l | { cat; echo "0 * * * * python <absolute path to>twitter_bot.py"; } | crontab -
If you are using a Mac (especially Mavericks or newer), Apple prefers that you use their init library, called launchd. launchd is a bit more complicated, and requires that you create an XML document that will be read by Apple's init service:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>twitter.bot</string>
<key>ProgramArguments</key>
<array>
<string>python</string>
<string>twitter_bot.py</string>
</array>
<key>StartCalendarInterval</key>
<dict>
<key>Minute</key>
<integer>00</integer>
</dict>
</dict>
</plist>
We won't be messing with launchd for this workshop.
To get you started, here is a template in python. You should modify the search parameters and post parameters to get the bot to act the way you want.
You might find it easier to copy this out of the notebook into a python script and run it from your terminal.
In [ ]:
from TwitterAPI import TwitterAPI
import time
consumer_key = '9cQ7SNtWsmTTfta8Gv5y8svWD'
consumer_secret = 'kjJllUPEJefFQ4Dfr6dBXDETiQaVWFXTt0zLSNMy8tY8F8IpqK'
access_token_key = '3129088320-dIfoDZOt5cIKVCFnJpS0krt3oCYPB13rk5ITavI'
access_token_secret = 'H41REM344zgKCvJenCGGsF1JbFSK8I1r1WvFrc8Fs74jg'
api = TwitterAPI(consumer_key, consumer_secret, access_token_key, access_token_secret)
request_parameters = {} #Enter your search parameters here
def main():
    while True: #You may want to set a condition here
        r = api.request('search/tweets', request_parameters)
        if r.status_code == 200:
            for item in r:
                if True: #You may want to set a condition here
                    post_parameters = {} #Enter your post parameters here
                    p = api.request('statuses/update', post_parameters)
                    print(p.status_code)
                    time.sleep(15)
        if r.status_code == 420: #If Twitter is throttling you
            break
        if r.status_code == 429: #If you are exceeding the rate limit
            time.sleep(60)

if __name__ == '__main__':
    main()
You can see an example of how to set conditions and search parameters in the code that powers the berkeleymood Twitter bot.
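One condition worth considering for the inner `if` in the template is "have I already replied to this tweet?", since the same tweets will keep showing up in repeated searches. A minimal sketch with made-up data; `unseen` is a name invented here, and in a real bot the set of ids would need to be saved between runs.

```python
def unseen(items, seen_ids):
    """Yield items whose id_str is not in seen_ids, recording them as seen."""
    for item in items:
        if item['id_str'] in seen_ids:
            continue
        seen_ids.add(item['id_str'])
        yield item


seen = set()

# Two made-up rounds of search results; the second overlaps with the first
first_run = [{'id_str': '10', 'text': 'I got in!'}]
second_run = [{'id_str': '10', 'text': 'I got in!'},
              {'id_str': '11', 'text': 'Accepted to Berkeley!'}]

replied_to = [t['id_str'] for t in unseen(first_run, seen)]
replied_to += [t['id_str'] for t in unseen(second_run, seen)]
print(replied_to)
```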
If you have tried to run this, or some of the earlier code in this notebook, you have probably encountered some of Twitter's error codes. Here are the most common, and why you are triggering them.
400 = bad request
- This means the API (middleman) doesn't like how you formatted your request. Check the API documentation to make sure you are doing things correctly.
401 = unauthorized
- This either means you entered your auth codes incorrectly, or those auth codes don't have permission to do what you're trying to do. It takes Twitter a while to assign posting rights to your auth tokens after you've given them your phone number. If you have just done this, wait five minutes, then try again.
403 = forbidden
- Twitter won't let you post what you are trying to post, most likely because you are trying to post the same tweet twice in a row within a few minutes of each other. Try changing your status update. If that doesn't fix it, then you are either:
A. Hitting Twitter's daily posting limit. They don't say what this is.
B. Trying to follow too many people, rapidly following and unfollowing the same person, or are otherwise making Twitter think you are a spambot
420 = enhance your calm
- Simultaneously a joke about San Rafael High students and Sylvester Stallone's prescient film about the future, it has been deprecated in favor of:
429 = too many requests
- This means that you have exceeded Twitter's rate limit for whatever it is you are trying to do. Increase your time.sleep() value.
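One way to act on these codes in a bot is to decide, per response, whether to carry on, wait, or stop. The sketch below is an assumption about reasonable behavior rather than anything Twitter prescribes; `next_action` and the wait times are made up for illustration.

```python
def next_action(status_code, attempt=0):
    """Map an HTTP status code to ('continue' | 'wait' | 'stop', seconds)."""
    if status_code == 200:
        return ('continue', 0)
    if status_code in (420, 429):
        # Back off longer each time we are told to slow down, capped at 15 min
        return ('wait', min(60 * 2 ** attempt, 900))
    if status_code == 403:
        # Possibly a duplicate status; pause briefly and move on
        return ('wait', 15)
    # 400 and 401 will not fix themselves, so stop and check the request
    return ('stop', 0)


for code in (200, 429, 403, 401):
    print(next_action(code))
```

The `attempt` argument counts how many times in a row you have been rate limited, so repeated 429s lead to progressively longer pauses.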