IPython keyboard shortcuts: http://ipython.org/ipython-doc/stable/interactive/notebook.html#keyboard-shortcuts
In [1]:
from __future__ import division, print_function, unicode_literals
%matplotlib inline
import os
import IPython.display
import numpy as np
import requests
import requests_oauthlib
import oauthlib
import arrow
import BeautifulSoup
import json_io
import yaml_io
import utilities
import twitter
The following links into Twitter's documentation are the ones I found most useful:
The page Help with the Search API has a helpful tidbit of information for when you expect a large number of tweets in return. In that case it is important to pay attention to iterating through the results:
Iterating in a result set: parameters such as count, until, since_id, and max_id allow us to control how we iterate through search results, since a query can match a large set of tweets. The 'Working with Timelines' documentation is a very rich and illustrative tutorial on how to use these parameters to process result sets efficiently and reliably.
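The paging pattern those docs describe can be sketched with a stubbed-out API call. Everything here is my own illustration: `fetch_page` is a stand-in for a real search request, and the only part that mirrors the documentation is the `max_id` bookkeeping (each request asks only for tweets older than the oldest one already seen).

```python
def fetch_page(tweets, max_id=None, count=2):
    # Stand-in for a real API call: returns up to `count` tweets with
    # id <= max_id, newest first.
    batch = [t for t in tweets if max_id is None or t['id'] <= max_id]
    return batch[:count]

def search_all(tweets):
    """Walk backwards through a result set using max_id paging."""
    max_id = None
    while True:
        page = fetch_page(tweets, max_id=max_id)
        if not page:
            break
        for tw in page:
            yield tw
        # Next request asks only for tweets older than the oldest seen.
        max_id = page[-1]['id'] - 1

data = [{'id': i} for i in range(9, 0, -1)]  # ids 9..1, newest first
result = [t['id'] for t in search_all(data)]
print(result)  # → [9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Using `max_id` rather than page offsets matters because new tweets keep arriving while you page: id-anchored requests never skip or duplicate results when the head of the timeline moves.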
I have written a module named twitter.py which contains useful functions and classes based on what I learned in the previous notebook. One of the first capabilities I added was a function to generate a session object from the requests package, authorized via OAuth-2.
The cell below demonstrates querying the Twitter API for information on my account's current rate limit status.
In [8]:
session = utilities.authenticate()
print('\nclient_id: {:s}'.format(session.client_id.client_id))
info = utilities.rate_limit_from_api(session)
print('\nRate Status')
print('-----------')
print('Limit: {:d}'.format(info['limit']))
print('Remaining: {:d}'.format(info['remaining']))
delta = arrow.get(info['reset']) - arrow.now()
seconds = delta.total_seconds()
minutes = seconds / 60.
print('Reset: {:02d}:{:04.1f}'.format(int(minutes), seconds - int(minutes)*60.))
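The minutes-and-seconds arithmetic at the end of that cell is easy to get subtly wrong (the minutes must be truncated, not rounded, before computing the leftover seconds). A small pure helper makes the countdown formatting testable; this is a sketch of my own, not a function in twitter.py:

```python
def format_countdown(total_seconds):
    """Render a non-negative number of seconds as MM:SS.s."""
    minutes = int(total_seconds // 60)        # truncate, never round
    seconds = total_seconds - minutes * 60    # leftover fractional seconds
    return '{:02d}:{:04.1f}'.format(minutes, seconds)

print(format_countdown(754.3))  # → 12:34.3
print(format_countdown(5))      # → 00:05.0
```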
Another important capability in twitter.py
is the ability to search for Tweets matching a specified text pattern. The primary interface to search is through the class Tweet_Search
. Calling the method run()
on an instance returns a generator, allowing for efficient retrieval of matching tweets.
The cell below shows a simple example. The query "grey hound dog" gets relatively few hits, roughly one per day, which makes it convenient for testing. If I comment out this query and instead try something like the title of a currently popular movie, I receive many tens of thousands of tweets. That is when it became clear to me that I needed another layer to manage larger volumes of tweets.
In [7]:
query = 'grey hound dog'
# query = 'hobbit desolation smaug'
# Output folder. Will be created if it does not already exist.
path_example = os.path.join(os.path.curdir, 'tweets_testing_one')
# Build a search object that knows how to talk to Twitter's API.
searcher = twitter.Search(session)
# Run a search for a specific query string, operates as a generator.
gen = searcher.run(query)
# Loop over returned results.
for k, tw in enumerate(gen):
    print('\n{:3d} | {:s} | {:s}'.format(k, str(tw.timestamp), tw.text))

    # Save Tweet to file.
    tw.serialize(path_example)

    # Stop the search if it goes on for too long.
    if k > 250:
        break
In [4]:
print(tw)
In [5]:
mgr_tweets = twitter.Tweet_Manager(path_example)
print(mgr_tweets.count)
print(mgr_tweets.min_id)
print(mgr_tweets.max_id)
print(mgr_tweets.min_timestamp)
print(mgr_tweets.max_timestamp)
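The summary properties printed above (count and id extremes) could plausibly be derived just by scanning the output folder. The toy class below is my own construction, not the actual Tweet_Manager implementation, and it assumes an on-disk layout of one tweet per .json file with an integer id field:

```python
import glob
import json
import os
import tempfile

class TweetFolderStats:
    """Toy stand-in for a tweet collection manager: scans a folder of
    one-tweet-per-file .json documents and derives summary statistics."""

    def __init__(self, path):
        self.tweets = []
        for fname in glob.glob(os.path.join(path, '*.json')):
            with open(fname) as f:
                self.tweets.append(json.load(f))

    @property
    def count(self):
        return len(self.tweets)

    @property
    def min_id(self):
        return min(t['id'] for t in self.tweets)

    @property
    def max_id(self):
        return max(t['id'] for t in self.tweets)

# Tiny demonstration with a throwaway folder.
demo = tempfile.mkdtemp()
for i in [3, 1, 2]:
    with open(os.path.join(demo, 'tweet_{}.json'.format(i)), 'w') as f:
        json.dump({'id': i}, f)

stats = TweetFolderStats(demo)
print(stats.count, stats.min_id, stats.max_id)  # → 3 1 3
```

Computing the extremes lazily from the files keeps the manager stateless between runs, which is exactly what makes resuming an interrupted search straightforward later on.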
In [6]:
# Output folder.
path_example = os.path.join(os.path.curdir, 'tweets_testing_two')
searcher = twitter.Search(session)
mgr_tweets = twitter.Tweet_Manager(path_example)
# Run a search for a specific query string, operates as a generator.
gen = searcher.run(query)
# Loop over returned results.
for k, tw in enumerate(gen):
    print('\n{:3d} | {:s} | {:s}'.format(k, str(tw.timestamp), tw.text))
    mgr_tweets.add_tweet_obj(tw)

    # Stop the search if it goes on for too long.
    if k > 250:
        break
print()
print(mgr_tweets.count)
print(mgr_tweets.min_id)
print(mgr_tweets.max_id)
print(mgr_tweets.min_timestamp)
print(mgr_tweets.max_timestamp)
In [7]:
for tw in mgr_tweets.tweets:
    print('{:d} | {:<10s} | {:s}'.format(tw.id_int, tw.source, tw.source_full))
    # print('{:d} | filter: {:5s}'.format(tw.id_int, str(twitter.filter(tw))))
    # print('{:d} --> {:s}'.format(tw.id_int, tw.text))
    # print('{:d} | {:s} | {:5s} | {:2d}'.format(tw.id_int, str(tw.timestamp), str(tw.retweet), tw.retweet_count))
Next up is a Search_Manager
to help search for new Tweets and add them to a new or existing collection described by a Tweet_Manager
. So far I have an easy way to serialize a Tweet to a .json file, but I may need to restart a search that was interrupted or halted by the rate limit, and I will also want to refresh a given search sometime in the future. Search_Manager
is implemented as a subclass of Tweet_Manager
and makes direct use of the Tweet_Search
class.
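The restart-and-refresh behavior described above boils down to since_id bookkeeping: on a later run, request only tweets newer than the largest id already collected. This sketch is hypothetical (the API call is stubbed out, and the class names are my own); it only illustrates the resume logic, not the real Search_Manager:

```python
def api_search(live_tweets, query, since_id=0):
    # Stand-in for a real Twitter search call: returns only tweets
    # newer than since_id.
    return [t for t in live_tweets if t['id'] > since_id]

class ResumableSearch:
    """Collects search results across runs without duplication."""

    def __init__(self, query):
        self.query = query
        self.collected = []

    @property
    def max_id(self):
        # Largest id seen so far; 0 means nothing collected yet.
        return max((t['id'] for t in self.collected), default=0)

    def refresh(self, live_tweets):
        """Fetch and store only tweets newer than anything on hand."""
        new = api_search(live_tweets, self.query, since_id=self.max_id)
        self.collected.extend(new)
        return len(new)

mgr = ResumableSearch('grey hound dog')
print(mgr.refresh([{'id': 1}, {'id': 2}]))              # → 2 (first run)
print(mgr.refresh([{'id': 1}, {'id': 2}, {'id': 3}]))   # → 1 (only id 3 is new)
```

Because the resume point is derived from the stored tweets themselves, an interrupted run needs no extra checkpoint file: re-running the search naturally picks up where it left off.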
In [13]:
path_example = os.path.join(os.path.curdir, 'tweets_testing_three')
query = 'grey hound dog'
# query = 'happy dog'
# query = 'hobbit desolation smaug'
manager = twitter.Search_Manager(session, query, path_example)
manager.search()
print(manager.count)
print(manager.min_timestamp)
print(manager.max_timestamp)
print(manager.api_remaining)
In [12]:
path_query = os.path.join(os.path.curdir, 'tweets')
# query = 'hobbit desolation smaug'
# query = 'Anchorman 2 The Legend Continues'
query = 'Anchorman'
manager = twitter.Search_Manager(session, query, path_query)
manager.search_continuous()
print(manager.api_remaining)