My thanks to Bill Howe of the University of Washington for his Coursera course Introduction to Data Analysis and Matthew A. Russell for his book Mining the Social Web for getting me started.
Note 1: I run the IPython Notebook in pylab=inline mode, which makes it look like MATLAB; this may mean that you have to explicitly import some modules that I take for granted. Alternatively, you could configure yourself for pylab, which I obviously think is better.
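If you are not running pylab, the modules I take for granted have to be imported by hand. A minimal sketch (the launch command in the comment is how I start the notebook; your IPython version may differ):

# launched as: ipython notebook --pylab=inline
# without pylab, pull in the usual suspects explicitly:
import numpy as np
import matplotlib.pyplot as plt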
Note 2: Twitter is very popular outside the United States, and many non-English languages use character sets that I don't handle properly, leading to program failures. Keep calm and carry on: at worst, you'll have to restart the kernel.
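If you would rather guard against the failure than restart, one crude pattern (my own habit, not part of this notebook's code) is to force text through UTF-8 with replacement before printing:

def safe(text):
    # coerce tweet text to a printable str, replacing any characters
    # the console codec would choke on
    if isinstance(text, unicode):
        return text.encode('utf-8', 'replace')
    return text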
I keep my Twitter credentials in a file named twitter_credentials.py; you need to create your own as follows:
def twitter_credentials():
    """
    Generate your own Twitter credentials here: https://apps.twitter.com/
    """
    api_key = " "
    api_secret = " "
    access_token_key = " "
    access_token_secret = " "
    return (api_key, api_secret, access_token_key, access_token_secret)
If you get a message similar to <twitter.api.Twitter object at 0x000000000E79AAC8>, you've successfully signed on.
In [1]:
import twitter_credentials
reload(twitter_credentials)
from twitter_credentials import twitter_credentials # my credentials

# the function returns (api_key, api_secret, access_token_key, access_token_secret);
# the api key/secret serve as the consumer key/secret
consumer_key, consumer_secret, token, token_secret = twitter_credentials()

import twitter

# twitter.oauth.OAuth expects (token, token_secret, consumer_key, consumer_secret)
auth = twitter.oauth.OAuth(token, token_secret, consumer_key, consumer_secret)
twitter_api = twitter.Twitter(auth=auth)
print twitter_api
In [2]:
%qtconsole
I have a file twitter_functions.py that contains a number of utility functions. Find it in my GitHub repo: grfiv/healthcare_twitter_analysis
You also need to have the AFINN-111.txt sentiment-word file installed.
A note on AFINN-111.txt: the file uses a tab delimiter to separate each word or phrase from its score. You can add your own n-grams as long as you retain that formatting. I strip out extra white space in phrases, so you don't have to be exact in that regard. The sentiment score is simply the sum of the scores of any n-grams found in the text after urls, hashtags and users have been removed.
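To make that rule concrete, here is a minimal sketch of the scoring step, assuming AFINN-111.txt sits in the working directory. This is not the parse_tweet_text code, and it matches single words only, not the multi-word phrases the real file also contains:

import re

def afinn_score(text, afinn_file="AFINN-111.txt"):
    # build {term: score} from the tab-delimited file
    scores = {}
    for line in open(afinn_file):
        term, score = line.strip().split('\t')
        scores[term] = int(score)
    # strip urls, @users and #hashtags before scoring
    cleaned = re.sub(r'(https?://\S+)|([@#]\S+)', '', text.lower())
    # sum the scores of whatever words remain
    return sum(scores.get(word, 0) for word in cleaned.split())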
Modify the word or phrase in q = " ... "; the search takes only one term. Try changing num_tweet_request, too.
In [3]:
import sys

num_tweet_request = 500
q = "Lung Cancer"

try:
    treturn = twitter_api.search.tweets(q=q, count=num_tweet_request)
except:
    print sys.exc_info()

initial_results = treturn['statuses']
The idea is that the best tweets are the ones that have been retweeted, so we will keep only those tweets with a retweet_count greater than some threshold. favorite_count is also supposed to have some bearing on importance, but it's usually zero in the tweets I've looked at.
In [4]:
retweet_threshold = 25

results = [tweet
           for tweet in initial_results
           if tweet['retweet_count'] > retweet_threshold]

num_tweets = len(results)

print "%d tweets received"%len(initial_results)
print "%d tweets after retweet filtering"%len(results)
In [ ]:
# I commented this out because the list is very long
# However, to do any serious Twitter munging, you need to know all about this thing
#print json.dumps(results[1], indent=4)
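If the full dump is overwhelming, a lighter-weight peek at the structure is to list just the keys:

# the top-level keys of one tweet, and of its embedded user object
print sorted(results[0].keys())
print sorted(results[0]['user'].keys())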
For each of the num_tweets responses returned, I use my utility function parse_tweet_text on the body of the tweet itself; it also returns the AFINN sentiment, for which you must have AFINN-111.txt installed. I parse the json directly for the other fields. A response of [] means nothing was returned for a particular 'entity'.
In [6]:
from twitter_functions import parse_tweet_text

senders = []
for i in range(num_tweets):
    senders.append(results[i]['user']['name'])
    tweet_text = results[i]['text']
    words, hashes, users, urls = parse_tweet_text(tweet_text)

    print "the actual text of results[%d]['text']\nretweet_count = %d\nfavorite_count = %d\n...................\n%s"%(i,results[i]['retweet_count'],results[i]['favorite_count'],tweet_text)
    print "\nthe list of words %s"%[word.encode('utf-8') for word in words]
    print "hashtags referenced %s"%[hash_.encode('utf-8') for hash_ in hashes]
    print "twitter users mentioned %s"%[user.encode('utf-8') for user in users]
    print "URLs %s"%[url.encode('utf-8') for url in urls]
    print "............................"
    print "description: %s"%results[i]['user']['description']
    print "followers_count: %d"%results[i]['user']['followers_count']
    print "friends_count: %d"%results[i]['user']['friends_count']
    print "favourites_count: %d"%results[i]['user']['favourites_count']
    print "location: %s"%results[i]['user']['location']
    print "time_zone: %s"%results[i]['user']['time_zone']
    print "screen_name: %s"%results[i]['user']['screen_name']
    print "name: %s"%results[i]['user']['name']
    if results[i]['place'] is not None:
        print "place: %s"%results[i]['place']
    print "====================================================================================\n"
In [9]:
tweets = ""
for i in range(num_tweets):
tweets = tweets + " " + results[i]['text']
words, hashes, users, urls = parse_tweet_text(tweets)
In [10]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop.append("it's")
stop.append('w/')
filtered_word_list = [word for word in words if word not in stop]
filtered_word_set = set(filtered_word_list)
print "%d total words"%len(filtered_word_list)
print "%d unique words"%len(filtered_word_set)
In [11]:
from collections import Counter
from prettytable import PrettyTable

for label, data in (('Word', filtered_word_list),
                    ('Senders', senders),
                    ('Users Mentioned', users),
                    ('Hashtag', hashes),
                    ('URL', urls)):
    pt = PrettyTable(field_names=[label, 'Count'])
    c = Counter(data)
    for kv in c.most_common()[:10]:
        pt.add_row(kv)
    pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment
    print pt
In [12]:
from twitter_functions import lexical_diversity, average_words
print "Lexical Diversity"
print "words %0.2f"%lexical_diversity(set(words), words) # words not filtered_word_list
print "hashes %0.2f"%lexical_diversity(set(hashes), hashes)
print "users %0.2f"%lexical_diversity(set(users), users)
print "urls %0.2f"%lexical_diversity(set(urls), urls)
print "\nAverage per tweet"
print "words %0.2f"%average_words(words, num_tweets)
print "filtered words %0.2f"%average_words(filtered_word_list, num_tweets)
print "hashes %0.2f"%average_words(hashes, num_tweets)
print "users %0.2f"%average_words(users, num_tweets)
print "urls %0.2f"%average_words(urls, num_tweets)
Note 1: foreign-language character sets often cause this to fail.
Note 2: the lexical density determines the look of the result: a few frequent words in a mass of many infrequent ones makes for a more attractive picture. Playing with the numerical parameter in the num_tags calculation also has an effect.
In [13]:
from pytagcloud import create_tag_image, make_tags
import IPython.display

# 'rt' is very common and useless
filter_ex_rt = list(filtered_word_list)
while 'rt' in filter_ex_rt:
    filter_ex_rt.remove('rt')

for item in [filter_ex_rt]:
    c = Counter(item)
    num_tags = min(150, len(c.most_common())-1)
    tags = make_tags(c.most_common()[:num_tags], maxsize=120)
    create_tag_image(tags, 'wordcloud.png', size=(900,900), fontname='Lobster')
    IPython.display.display(IPython.display.Image(filename='wordcloud.png'))
In [15]:
import json

# one json-encoded tweet per line
tweet_file = open("../files/bigtweet_file002.json", "r")
tweet_list = [json.loads(line) for line in tweet_file]
len(tweet_list)
Out[15]:
In [16]:
text_set = set()
popular_tweets = []
retweet_threshold = 199

for tweet in tweet_list:
    if tweet['retweet_count'] > retweet_threshold and tweet['text'] not in text_set:
        text_set.add(tweet['text'])
        popular_tweets.append(tweet)

len(popular_tweets)
Out[16]:
In [17]:
for tweet in popular_tweets:
    print "retweets: %d; user name: %s; screen name: %s\ntext: %s\n"%\
        (tweet['retweet_count'],tweet['user']['name'],tweet['user']['screen_name'],tweet['text'])
In [18]:
import sys

try:
    users = twitter_api.users.lookup(screen_name='WorldAIDSDayUS')
except:
    emsg = sys.exc_info()
    print emsg

user = users[0]
In [119]:
print user['name']
print user['description']
print user["location"]
print user["followers_count"]
print user["friends_count"]
print user["statuses_count"]
In [19]:
import sys
import json

try:
    followers = twitter_api.followers.ids(screen_name='WorldAIDSDayUS')
except:
    emsg = sys.exc_info()[1]
    print emsg.e
    print emsg.response_data

print json.dumps(followers,indent=4)
In [20]:
followers['ids']
Out[20]:
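followers.ids returns at most 5,000 ids per call; for accounts with more followers you walk the cursor. A sketch, subject to the usual rate limits:

all_ids = []
cursor = -1
while cursor != 0:
    # each call returns up to 5000 ids plus a next_cursor; 0 means done
    response = twitter_api.followers.ids(screen_name='WorldAIDSDayUS',
                                         cursor=cursor)
    all_ids.extend(response['ids'])
    cursor = response['next_cursor']
print "%d follower ids"%len(all_ids)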