In this section we will see how to connect to the Twitter API to collect tweets and save them to a file.
"In computer programming, an Application Programming Interface (API) is a set of subroutine definitions, protocols, and tools for building application software." (Wikipedia)
The Twitter API is the tool we use to collect tweets from Twitter.
Twitter offers two different APIs:
The Streaming API (https://dev.twitter.com/streaming/public), which gives access to a sample (~1%) of the public data flowing through Twitter.
The REST API (https://dev.twitter.com/rest/public), which provides programmatic access to read and write Twitter data.
To use the Twitter API from Python, we will use the tweepy library, which simplifies access to the API.
To install it, run the following command in your terminal or execute the cell below:
pip install tweepy
In [ ]:
# this will install tweepy on your machine
!pip install tweepy
Create a Twitter app and find your consumer token and secret
1. Click "Create New App".
2. Open "manage keys and access tokens".
3. Click "Create my access token".
In [ ]:
consumer_key = 'xxx'
consumer_secret = 'xxx'
access_token = 'xxx'
access_token_secret = 'xxx'
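Hard-coding keys in a notebook makes it easy to leak them when the notebook is shared. As a minimal sketch, assuming you have exported the four values as environment variables, they could be loaded as shown here. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical; neither Twitter nor tweepy requires them.

```python
import os

# Hypothetical environment variable names; pick whatever names you like.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=None):
    """Return a dict of the four credentials, or raise if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in CREDENTIAL_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}
```

Calling `creds = load_credentials()` would then give you `creds['consumer_key']`, `creds['consumer_secret']`, and so on, which can be used in place of the placeholder strings.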
In [ ]:
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# create the api object that we will use to interact with Twitter
api = tweepy.API(auth)
In [ ]:
# example: post a status update (a tweet) from the authenticated account
tweet = api.update_status('Hello Twitter')
In [ ]:
# see all the information contained in a tweet:
print(tweet)
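The printed object is large. As a rough sketch, the helper below pulls a few commonly used fields out of a tweet's JSON dictionary. The field names (`id_str`, `created_at`, `text`, `user.screen_name`) follow the tweet object of the Twitter REST API v1.1; the function name `summarize_tweet` and the sample dictionary are made up for illustration. With tweepy 3.x, the raw dictionary of a status is usually available as `tweet._json`.

```python
def summarize_tweet(tweet_json):
    """Return a small dict with a few fields of a tweet's JSON payload."""
    user = tweet_json.get("user") or {}
    return {
        "id": tweet_json.get("id_str"),
        "created_at": tweet_json.get("created_at"),
        "author": user.get("screen_name"),
        "text": tweet_json.get("text"),
    }


# Made-up sample payload, shaped like a heavily truncated v1.1 tweet object.
sample = {
    "id_str": "1",
    "created_at": "Mon Jan 01 00:00:00 +0000 2018",
    "text": "Hello Twitter",
    "user": {"screen_name": "example_user"},
}
print(summarize_tweet(sample))
```

Missing fields come back as `None` rather than raising, which is convenient because not every message delivered by the API carries all fields.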
Source: http://tweepy.readthedocs.io/en/v3.5.0/streaming_how_to.html
This simple stream listener prints the status text. The on_data method of Tweepy’s StreamListener conveniently passes data from statuses to the on_status method. We create a class MyStreamListener that inherits from StreamListener and overrides on_status:
In [ ]:
# override tweepy.StreamListener to make it print tweet content when new data arrives
class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)
In [ ]:
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth = api.auth, listener=myStreamListener)
A number of Twitter streams are available through Tweepy. Most cases will use filter, the user_stream, or the sitestream. For more information on the capabilities and limitations of the different streams, see the Twitter Streaming API documentation.
In this example we will use filter to stream all tweets containing the phrase new york. The track parameter is a list of search terms to stream.
In [ ]:
myStream.filter(track=['new york'])
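According to Twitter's streaming documentation, each entry in `track` is a phrase: commas between phrases act as a logical OR, while spaces inside a phrase act as a logical AND, ignoring case and word order. The sketch below is a simplified local approximation of that rule, useful for reasoning about which tweets a filter would let through. It is not Twitter's actual matcher (which also looks at URLs, hashtags and mentions), and `matches_track` is a hypothetical helper name.

```python
import re


def matches_track(text, track):
    """Approximate the track rule: a phrase matches if all its terms appear."""
    words = set(re.findall(r"\w+", text.lower()))
    for phrase in track:
        terms = phrase.lower().split()
        if terms and all(term in words for term in terms):
            return True
    return False


print(matches_track("I love New York in the fall", ["new york"]))      # True
print(matches_track("York is a city in England", ["new york"]))        # False
print(matches_track("Learning Python today", ["new york", "python"]))  # True
```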
In [ ]:
myStream.disconnect()
In [ ]:
# each search term is a separate element of the track list
myStream.filter(track=['realdonaldtrump', 'trump'], languages=['en'])
In [ ]:
myStream.disconnect()
In [ ]:
# streaming tweets from a given location
# we need to provide a flat list of longitude, latitude pairs; every four values define one bounding box
# for example, for New York:
myStream.filter(locations=[-74,40,-73,41])
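The order of the four numbers is easy to get wrong: Twitter expects each box as its south-west corner followed by its north-east corner, with longitude before latitude. A minimal sketch of a helper that builds and validates such a list is shown here; `bounding_box` is a hypothetical name and is not part of tweepy.

```python
def bounding_box(west, south, east, north):
    """Return [west, south, east, north] after basic sanity checks.

    Simplification: boxes crossing the 180 degree meridian are rejected.
    """
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        raise ValueError("latitudes must be between -90 and 90")
    if west >= east or south >= north:
        raise ValueError("south-west corner must come before north-east corner")
    return [west, south, east, north]


new_york = bounding_box(west=-74, south=40, east=-73, north=41)
print(new_york)  # [-74, 40, -73, 41]
```

The resulting list can be passed directly as the `locations` argument of `filter`.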
In [ ]:
myStream.disconnect()
In [ ]:
# override tweepy.StreamListener to make it save data to a file
class StreamSaver(tweepy.StreamListener):
    def __init__(self, filename, max_num_tweets=2000, api=None):
        self.filename = filename
        self.num_tweets = 0
        self.max_num_tweets = max_num_tweets
        tweepy.StreamListener.__init__(self, api=api)

    def on_data(self, data):
        # append the raw JSON string to the file
        with open(self.filename, 'a') as tf:
            tf.write(data)
        self.num_tweets += 1
        print(self.num_tweets)
        # returning False stops the stream once enough tweets have been collected
        if self.num_tweets >= self.max_num_tweets:
            return False
        return True

    def on_error(self, status):
        # print the error status code returned by Twitter
        print(status)
In [ ]:
# create the new StreamListener and stream object that will save collected tweets to a file
saveStream = StreamSaver(filename='testTweets.txt')
mySaveStream = tweepy.Stream(auth = api.auth, listener=saveStream)
In [ ]:
# each search term is a separate element of the track list
mySaveStream.filter(track=['realdonaldtrump', 'trump'], languages=['en'])
In [ ]:
mySaveStream.disconnect()
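Once the file has been written, the tweets can be loaded back for analysis. The sketch below assumes that each non-empty line of the file holds one JSON object, which is how the streaming endpoint normally delimits messages; if your file was written differently, this will need adjusting. `load_tweets` is a hypothetical helper, and lines that are not valid JSON are silently skipped.

```python
import json


def load_tweets(filename):
    """Read a file of newline-delimited JSON and return a list of dicts."""
    tweets = []
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip truncated or corrupted lines
    return tweets
```

For the file created above, `tweets = load_tweets('testTweets.txt')` would return one dictionary per saved message, and `len(tweets)` gives the number that could be parsed.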