Case Study 1: Collecting Data from Twitter

Due Date: February 10, before the class


TEAM Members:

Haley Huang
Helen Hong
Tom Meagher
Tyler Reese

Required Readings:

NOTE

  • Please don't forget to save the notebook frequently when working in IPython Notebook; otherwise, the changes you make may be lost.

Problem 1: Sampling Twitter Data with Streaming API about a certain topic

  • Select a topic that you are interested in, for example, "WPI" or "Lady Gaga"
  • Use the Twitter Streaming API to sample a collection of tweets about this topic in real time. (It is recommended that the number of tweets be larger than 200 but smaller than 1 million.)
  • Store the tweets you downloaded in a local file (txt or json file)

Our topic of interest is the New England Patriots. All Patriots fans ourselves, we were disappointed by their elimination from the post-season and unsure how to approach the upcoming Super Bowl. To sample how others may be feeling about the Patriots, we use the search term "Patriots" with the Twitter Streaming API.


In [25]:
# HELPER FUNCTIONS
import io
import json
import twitter


def oauth_login(token, token_secret, consumer_key, consumer_secret):
    """
    Snag an auth from Twitter
    """
    auth = twitter.oauth.OAuth(token, token_secret,
                               consumer_key, consumer_secret)
    return auth


def save_json(filename, data):
    """
    Save json data to a filename
    """
    print 'Saving data into {0}.json...'.format(filename)
    with io.open('{0}.json'.format(filename),
                 'w', encoding='utf-8') as f:
        f.write(unicode(json.dumps(data, ensure_ascii=False)))


def load_json(filename):
    """
    Load json data from a filename
    """
    print 'Loading data from {0}.json...'.format(filename)
    with open('{0}.json'.format(filename)) as f:
        return json.load(f)

In [26]:
# API CONSTANTS
CONSUMER_KEY = '92TpJf8O0c9AWN3ZJjcN8cYxs'
CONSUMER_SECRET ='dyeCqzI2w7apETbTUvPai1oCDL5oponvZhHSmYm5XZTQbeiygq'
OAUTH_TOKEN = '106590533-SEB5EGGoyJ8EsjOKN05YuOQYu2rg5muZgMDoNrqN'
OAUTH_TOKEN_SECRET = 'BficAky6uGyGfRzDGJqZYVKo0HS6G6Ex3ijYW3zy3kjNJ'

In [27]:
# CREATE AND CHECK API AND STREAM
# oauth_login expects (token, token_secret, consumer_key, consumer_secret);
# pass keyword arguments so the key pairs cannot be swapped.
auth = oauth_login(token=OAUTH_TOKEN, token_secret=OAUTH_TOKEN_SECRET,
                   consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET)
twitter_api = twitter.Twitter(auth=auth)
twitter_stream = twitter.TwitterStream(auth=auth)
if twitter_api and twitter_stream:
    print 'Bingo! API and stream set up!'
else:
    print 'Hmmmm, something is wrong here.'


Bingo! API and stream set up!

In [3]:
# COLLECT TWEETS FROM STREAM WITH TRACK, 'PATRIOTS'
# track = "Patriots"  # Tweets for Patriots

# TOTAL_TWEETS = 2500

# patriots = []
# patriots_counter = 0

# while patriots_counter < TOTAL_TWEETS:  # collect tweets until TOTAL_TWEETS have been gathered
#     # Create a stream instance
#     auth = oauth_login(consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET,
#                        token=OAUTH_TOKEN, token_secret=OAUTH_TOKEN_SECRET)
#     twitter_stream = twitter.TwitterStream(auth=auth)
#     stream = twitter_stream.statuses.filter(track=track)
#     counter = 0
#     for tweet in stream:
#         if patriots_counter == TOTAL_TWEETS:
#             print 'break'
#             break
#         elif counter % 500 == 0 and counter != 0:
#             print 'get new stream'
#             break
#         else:
#             patriots.append(tweet)
#             patriots_counter += 1
#             counter += 1
#             print patriots_counter, counter

# save_json('json/patriots', patriots)

In [29]:
# Use this code to load tweets that have already been collected
filename = "stream/json/patriots"
results = load_json(filename)
print 'Number of tweets loaded:', len(results)


Loading data from stream/json/patriots.json...
Number of tweets loaded: 2500

In [5]:
# Compute additional statistics about the tweets collected

# Determine the average number of words in the text of each tweet
def average_words(tweet_texts):
    total_words =  sum([len(s.split()) for s in tweet_texts])
    return 1.0*total_words/len(tweet_texts)

tweet_texts = [ tweet['text'] 
                 for tweet in results ]

print 'Average number of words:', average_words(tweet_texts)

# Calculate the lexical diversity of all words contained in the tweets
def lexical_diversity(tokens):
    return 1.0*len(set(tokens))/len(tokens)

words = [ word 
          for tweet in tweet_texts 
              for word in tweet.split() ]

print 'Lexical Diversity:', lexical_diversity(words)


Average number of words: 14.5536
Lexical Diversity: 0.190633245383

Report some statistics about the tweets you collected

  • The topic of interest: Patriots

  • The total number of tweets collected: 2,500

  • Average number of words per tweet: 14.5536

  • Lexical Diversity of all words contained in the collection of tweets: 0.1906 (about one unique word for every five words used)


Problem 2: Analyzing Tweets and Tweet Entities with Frequency Analysis

1. Word Count:

  • Use the tweets you collected in Problem 1, and compute the frequencies of the words being used in these tweets.
  • Plot a table of the top 30 words with their counts

In [6]:
from collections import Counter
from prettytable import PrettyTable
import nltk

tweet_texts = [ tweet['text'] 
                 for tweet in results ]
words = [ word 
          for tweet in tweet_texts 
              for word in tweet.split()
                 if word not in ['RT', '&amp;'] # filter out RT and ampersand
        ]

# Use the natural language toolkit to eliminate stop words

# nltk.download('stopwords') # download stop words if you do not have it
stop_words = nltk.corpus.stopwords.words('english')
non_stop_words = [w for w in words if w.lower() not in stop_words]

# frequency of words
count = Counter(non_stop_words).most_common()

# table of the top 30 words with their counts
pretty_table = PrettyTable(field_names=['Word', 'Count']) 
[ pretty_table.add_row(w) for w in count[:30] ]
pretty_table.align['Word'] = 'l'
pretty_table.align['Count'] = 'r'
print pretty_table


+-------------------------+-------+
| Word                    | Count |
+-------------------------+-------+
| Patriots                |   637 |
| @Patriots:              |   372 |
| patriots                |   346 |
| #NFLHonors              |   297 |
| Gronk.                  |   285 |
| Deion.                  |   285 |
| https://t.co/v762e2V5TF |   283 |
| @Patriots               |   234 |
| @USFreedomArmy:         |   177 |
| #Patriots               |   158 |
| Super                   |   132 |
| http://t.co/oSPeY3QMpH. |   129 |
| Bowl                    |   122 |
| Enlist                  |   121 |
| New                     |   118 |
| Join                    |   113 |
| England                 |   108 |
| Time                    |   103 |
| via                     |    95 |
| NFL                     |    95 |
| MT                      |    94 |
| Tom                     |    93 |
| behind                  |    93 |
| conservative            |    92 |
| #TLOT                   |    91 |
| unite                   |    91 |
| Cruz!                   |    90 |
| win                     |    90 |
| playing                 |    90 |
| @mericanrefugee:        |    90 |
+-------------------------+-------+
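
Note that the table mixes case and punctuation variants of the same token ('Patriots', 'patriots', '@Patriots:' vs. '@Patriots', 'Gronk.'). A minimal normalization sketch (not part of the assignment output) that lowercases each token and strips surrounding punctuation before counting:

In [ ]:
import string
from collections import Counter
from prettytable import PrettyTable

# Lowercase and strip surrounding punctuation so variants such as
# 'Patriots', 'patriots', '@Patriots:' and 'Gronk.' collapse into one entry.
normalized = [w.lower().strip(string.punctuation) for w in non_stop_words]
normalized = [w for w in normalized if w]  # drop tokens that were pure punctuation

pretty_table = PrettyTable(field_names=['Word', 'Count'])
[ pretty_table.add_row(w) for w in Counter(normalized).most_common()[:30] ]
pretty_table.align['Word'] = 'l'
pretty_table.align['Count'] = 'r'
print pretty_table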

2. Find the most popular tweets in your collection of tweets

Please plot a table of the top 10 tweets that are the most popular among your collection, i.e., the tweets with the largest number of retweet counts.


In [7]:
from collections import Counter
from prettytable import PrettyTable

# Create a list of all tweets with a retweeted_status key, and index the originator of that tweet and the text.
retweets = [
            (tweet['retweet_count'],
            tweet['retweeted_status']['user']['screen_name'],
            tweet['text'])
            
            # Ensure that a retweet exists
            for tweet in results
                if 'retweeted_status' in tweet
            ]

pretty_table = PrettyTable(field_names = ['Count','Screen Name','Text'])

# Sort tweets by descending number of retweets and display the top 10 results in a table.
[pretty_table.add_row(row) for row in sorted(retweets, reverse = True)[:10]]
pretty_table.max_width['Text'] = 50
pretty_table.align = 'l'
print pretty_table


+-------+-----------------+----------------------------------------------------+
| Count | Screen Name     | Text                                               |
+-------+-----------------+----------------------------------------------------+
| 0     | vlewey          | RT @vlewey: @jillarie85  https://t.co/5CCdYV2F3t   |
| 0     | vitorsergio     | RT @vitorsergio: E ainda levei uma sacaneada       |
|       |                 | retroativa ao dizer que torço para o Patriots...   |
| 0     | vaneessab3      | RT @vaneessab3: Me: I need to buy a Broncos shirt  |
|       |                 | Sara: ya I need to buy a patriots shirt            |
|       |                 |                                                    |
|       |                 | .....????                                          |
|       |                 | @SaraEchelberry                                    |
| 0     | usosports1      | RT @usosports1: @Patriots RB @joeyiosefa gives     |
|       |                 | back to community/spends time with local           |
|       |                 | elementary kids. #PatriotNation #NFL #NFLPA        |
|       |                 | https:/…                                           |
| 0     | usosports1      | RT @usosports1: @Patriots RB @joeyiosefa gives     |
|       |                 | back to community/spends time with local           |
|       |                 | elementary kids. #PatriotNation #NFL #NFLPA        |
|       |                 | https:/…                                           |
| 0     | usosports1      | RT @usosports1: @Patriots RB @joeyiosefa gives     |
|       |                 | back to community/spends time with local           |
|       |                 | elementary kids. #PatriotNation #NFL #NFLPA        |
|       |                 | https:/…                                           |
| 0     | uhatremblay     | RT @uhatremblay: @curtisbeast Yeah...everyone      |
|       |                 | knows he's a fraud and incompetent with no         |
|       |                 | integrity...unless he goes after Patriots. Then    |
|       |                 | he…                                                |
| 0     | tweet4upatriots | RT @tweet4upatriots: Whose watching the 8th        |
|       |                 | REPUBLICAN DEBATE #PJNET #TCOT #CCOT #VETS #USMIL  |
|       |                 | #PATRIOTS #CONSERVATIVE PRAYERS  2016 https:/…     |
| 0     | tweet4upatriots | RT @tweet4upatriots: VETERANS FIRST                |
|       |                 | https://t.co/Xx3vRwETp2                            |
| 0     | tweet4upatriots | RT @tweet4upatriots: Support military spouses and  |
|       |                 | veteran's spouses #C2GTHR #LNYHBT #PJNET #TCOT     |
|       |                 | #VETS #USMIL #PATRIOTS #CONSERVATIVE https…        |
+-------+-----------------+----------------------------------------------------+
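
Every retweet count above is 0: the Streaming API delivers each status the moment it is posted, before it has had time to accumulate retweets, so with all counts tied the sort falls back to screen name. A sketch of one workaround (not part of the original analysis) is to rank by the retweet count of the embedded original tweet, which the API reports in retweeted_status:

In [ ]:
from prettytable import PrettyTable

# Rank by the popularity of the ORIGINAL tweet each retweet points to.
# tweet['retweeted_status']['retweet_count'] is the original's retweet count
# at the moment our copy came through the stream.
original_retweets = [
            (tweet['retweeted_status']['retweet_count'],
             tweet['retweeted_status']['user']['screen_name'],
             tweet['text'])
            for tweet in results
                if 'retweeted_status' in tweet
            ]

pretty_table = PrettyTable(field_names = ['Count','Screen Name','Text'])
[pretty_table.add_row(row) for row in sorted(original_retweets, reverse = True)[:10]]
pretty_table.max_width['Text'] = 50
pretty_table.align = 'l'
print pretty_table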

Another measure of tweet "popularity" is the number of times a tweet has been favorited. The following calculates the top 10 tweets with the most "favorites" (like the retweet counts, these are mostly 0 for tweets captured in real time).


In [8]:
from prettytable import PrettyTable

# Determine the number of "favorites" for each tweet collected.

favorites = [
            (tweet['favorite_count'],
             tweet['text'])
            for tweet in results
            ]
            
pretty_table = PrettyTable(field_names = ['Count','Text'])

# Sort tweets by descending number of favorites and display the top 10 results in a table.
[pretty_table.add_row(row) for row in sorted(favorites, reverse = True)[:10]]
pretty_table.max_width['Text'] = 75
pretty_table.align = 'l'
print pretty_table


+-------+-----------------------------------------------------------------------------+
| Count | Text                                                                        |
+-------+-----------------------------------------------------------------------------+
| 0     | 😍💍😍💍😍💍😍💍 https://t.co/tXUtoysKwu                                    |
| 0     | 😍 https://t.co/iwMCVHKJXw                                                  |
| 0     | 🔥🔥🔥🔥🔥🔥                                                                |
|       |                                                                             |
|       | 25 retweets for late NBA                                                    |
|       |                                                                             |
|       | Patriots army STAND UP                                                      |
| 0     | 📷 townienews: It’s been 14 years since Brady, Belichick &amp; the          |
|       | @Patriots gave me the best sports day of... https://t.co/ECfkSKg03w         |
| 0     | 💯 RT @TomLandrysGhost: I enjoyed watching #Patriots lose, NE fans are      |
|       | going to enjoy watching Peyton Manning get smoked (again) in #SB50          |
| 0     | 👏👏 https://t.co/zD98qCzhdd                                                |
| 0     | 次の @YouTube 動画を高く評価しました: https://t.co/JFvmGUvRQc 観るMETAL GEAR SOLID 4 GUNS |
|       | OF THE PATRIOTS                                                             |
| 0     | ★★★ Patriots Who Dare... Join our fight to save America! ➠ Click Here       |
|       | https://t.co/x76SZaT6Gf #BB4SP https://t.co/nBeFAa7u9M                      |
| 0     | ★★★ Patriots Who Dare... Join our fight to save America! ➠ Click Here       |
|       | https://t.co/x76SZaT6Gf #BB4SP https://t.co/2Eafww5z8F                      |
| 0     | ★★★ Patriots Who Dare... Join our fight to save America! ➠ Click Here       |
|       | https://t.co/x76SZaBvhF #BB4SP https://t.co/r6NR5IKw02                      |
+-------+-----------------------------------------------------------------------------+

3. Find the most popular Tweet Entities in your collection of tweets

Please plot a table of the top 10 hashtags, top 10 user mentions that are the most popular in your collection of tweets.


In [9]:
from collections import Counter
from prettytable import PrettyTable

# Extract the screen names which appear among the collection of tweets
screen_names = [user_mention['screen_name']
               for tweet in results
                   for user_mention in tweet['entities']['user_mentions']]

# Extract the hashtags which appear among the collection of tweets
hashtags = [ hashtag['text']
           for tweet in results
               for hashtag in tweet['entities']['hashtags']]

# Simultaneously determine the frequency of screen names/hashtags, and display the top 10 most common in a table.
for label, data in (('Screen Name',screen_names),
                   ('Hashtag',hashtags)):
    pretty_table = PrettyTable(field_names =[label,'Count'])
    counter = Counter(data)
    [ pretty_table.add_row(entity) for entity in counter.most_common()[:10]]
    pretty_table.align[label] ='l'
    pretty_table.align['Count'] = 'r'
    print pretty_table


+----------------+-------+
| Screen Name    | Count |
+----------------+-------+
| Patriots       |   629 |
| USFreedomArmy  |   177 |
| mericanrefugee |    90 |
| realOBF        |    44 |
| NFL            |    31 |
| PatriotsExtra  |    31 |
| HouSuperBowl   |    31 |
| BethMinyard    |    29 |
| Seahawks       |    21 |
| feistyoldguy   |    20 |
+----------------+-------+
+---------------------+-------+
| Hashtag             | Count |
+---------------------+-------+
| NFLHonors           |   298 |
| Patriots            |   183 |
| PJNET               |   123 |
| CruzCrew            |    96 |
| TLOT                |    91 |
| SB49                |    68 |
| PATRIOTS            |    67 |
| NFL                 |    57 |
| EsuranceSweepstakes |    55 |
| SuperBowl50         |    53 |
+---------------------+-------+

Problem 3: Getting "All" friends and "All" followers of a popular user in twitter

  • choose a popular twitter user who has many followers, such as "ladygaga".
  • Get the list of all friends and all followers of the twitter user.
  • Plot 20 of the followers: plot their ID numbers and screen names in a table.
  • Plot 20 of the friends (if the user has more than 20 friends): plot their ID numbers and screen names in a table.

Our chosen Twitter user is @RobGronkowski, one of the Patriots' players.


In [10]:
#----------------------------------------------
import sys
import time
from urllib2 import URLError
from httplib import BadStatusLine
import json
from functools import partial
from sys import maxint

# The following is the "general-purpose API wrapper" presented in "Mining the Social Web" for making robust twitter requests.
# This function can be used to accompany any twitter API function.  It force-breaks after receiving more than max_errors
# error messages from the Twitter API.  It also sleeps and later retries when rate limits are enforced.

def make_twitter_request(twitter_api_func, max_errors = 10, *args, **kw):
    def handle_twitter_http_error(e, wait_period = 2, sleep_when_rate_limited = True):
        
        if wait_period > 3600:
            print >> sys.stderr, 'Too many retries. Quitting.'
            raise e
        
        if e.e.code == 401:
            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'
            return None
        
        elif e.e.code == 404:
            print >> sys.stderr, 'Encountered 404 Error (Not Found)'
            return None
        
        elif e.e.code == 429:
            print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'
            if sleep_when_rate_limited:
                print >> sys.stderr, "Retrying again in 15 Minutes...ZzZ..."
                sys.stderr.flush()
                time.sleep(60*15 + 5)
                print >> sys.stderr, '...ZzZ...Awake now and trying again.'
                return 2
            else:
                raise e
                
        elif e.e.code in (500,502,503,504):
            print >> sys.stderr, 'Encountered %i Error.  Retrying in %i seconds' % \
                (e.e.code, wait_period)
            time.sleep(wait_period)
            wait_period *= 1.5
            return wait_period
        
        else:
            raise e
            
    wait_period = 2
    error_count = 0
    
    while True:
        try:
            return twitter_api_func(*args,**kw)
        except twitter.api.TwitterHTTPError, e:
            error_count = 0
            wait_period = handle_twitter_http_error(e, wait_period)
            if wait_period is None:
                return
            
        except URLError, e:
            error_count += 1
            print >> sys.stderr, "URLError encountered.  Continuing"
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise
        
        except BadStatusLine, e:
            error_count += 1
            print >> sys.stderr, "BadStatusLineEncountered.  Continuing"
            if error_count > max_errors:
                print >> sys.stderr, "Too many consecutive errors...bailing out."
                raise

In [11]:
# This function uses the above robust request wrapper to retrieve all friends and followers of a given user.  This code
# can be found in Chapter 9, the "Twitter Cookbook", of "Mining the Social Web".

from functools import partial
from sys import maxint

def get_friends_followers_ids(twitter_api, screen_name = None, user_id = None, friends_limit = maxint, followers_limit = maxint):
    assert(screen_name != None) != (user_id != None), \
    "Must have screen_name or user_id, but not both"
    
    # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and
    # https://dev.twitter.com/docs/api/1.1/get/followers/ids for details
    # on API parameters
    
    get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, 
                              count=5000)
    get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, 
                                count=5000)

    friends_ids, followers_ids = [], []
    
    for twitter_api_func, limit, ids, label in [
                    [get_friends_ids, friends_limit, friends_ids, "friends"], 
                    [get_followers_ids, followers_limit, followers_ids, "followers"]
                ]:
        
        if limit == 0: continue
        
        cursor = -1
        while cursor != 0:
        
            # Use make_twitter_request via the partially bound callable...
            if screen_name: 
                response = twitter_api_func(screen_name=screen_name, cursor=cursor)
            else: # user_id
                response = twitter_api_func(user_id=user_id, cursor=cursor)

            if response is not None:
                ids += response['ids']
                cursor = response['next_cursor']
        
            print 'Fetched {0} total {1} ids for {2}'.format(len(ids), 
                                                    label, (user_id or screen_name))
        
                   
            if len(ids) >= limit or response is None:
                break
                
    return friends_ids[:friends_limit], followers_ids[:followers_limit]

Use the following code to retrieve all friends and followers of @RobGronkowski, one of the Patriots' players.


In [12]:
# Retrieve the friends and followers of a user, and save to a json file.

# screen_name = 'RobGronkowski'

# gronk_friends_ids, gronk_followers_ids = get_friends_followers_ids(twitter_api, screen_name = screen_name)

# filename = "json/gronk_friends"
# save_json(filename, gronk_friends_ids)

# filename = "json/gronk_followers"
# save_json(filename, gronk_followers_ids)

In [13]:
# Use this code to load the already-retrieved friends and followers from a json file.
gronk_followers_ids = load_json('json/gronk_followers')
gronk_friends_ids = load_json('json/gronk_friends')


Loading data from json/gronk_followers.json...
Loading data from json/gronk_friends.json...

In [14]:
# The following function retrieves the screen names of Twitter users, given their user IDs.  If a certain number of screen
# names is desired (for example, 20), max_ids limits the number retrieved.

def get_screen_names(twitter_api, user_ids = None, max_ids = None):

    response = []
    
    items = user_ids
    
    # Due to individual user security settings, not all user profiles can be obtained.  Iterate over all user IDs
    # to ensure at least (max_ids) screen names are obtained.
    
    while len(response) < max_ids:
        if not items:  # no more user IDs left to look up
            break
        items_str = ','.join([str(item) for item in items[:100]])
        items = items[100:]
        
        responses = make_twitter_request(twitter_api.users.lookup, user_id = items_str)
        
        response += responses or []  # make_twitter_request returns None on 401/404 errors
    
    items_to_info = {}
     
    # The above loop has retrieved all user information.    
    for user_info in response:
        items_to_info[user_info['id']] = user_info
    
    # Extract only the screen names obtained.  The keys of items_to_info are the user ID numbers.
    names = [items_to_info[number]['screen_name']
            for number in items_to_info.keys()
            ]

    numbers = [number for number in items_to_info.keys()]

    return names , numbers

In [15]:
from prettytable import PrettyTable

# Given a set of user ids, this function calls get_screen_names and plots a table of the first (max_ids) IDs and screen names.
def table_ids_screen_names(twitter_api, user_ids = None, max_ids = None):

    names, numbers = get_screen_names(twitter_api, user_ids = user_ids, max_ids = max_ids)

    ids_screen_names = zip(numbers, names)
    
    pretty_table = PrettyTable(field_names = ['User ID','Screen Name'])
    [ pretty_table.add_row (row) for row in ids_screen_names[:max_ids]]
    pretty_table.align = 'l'
    print pretty_table

In [16]:
# Given a list of friends_ids and followers_ids, this function counts and prints the size of each collection.
# It then plots tables of the first (max_ids) listed friends and followers.
def display_friends_followers(screen_name, friends_ids, followers_ids, max_ids = None):
    friends_ids_set, followers_ids_set = set(friends_ids),set(followers_ids)
    
    print
    print '{0} has {1} friends.  Here are {2}:'.format(screen_name, len(friends_ids_set),max_ids)
    print
    table_ids_screen_names(twitter_api, user_ids = friends_ids, max_ids = max_ids)
    print
    print '{0} has {1} followers.  Here are {2}:'.format(screen_name,len(followers_ids_set),max_ids)
    print
    table_ids_screen_names(twitter_api, user_ids = followers_ids, max_ids = max_ids)
    print

In [17]:
# screen_name was only assigned in the commented-out collection cell above, so set it here.
screen_name = 'RobGronkowski'
display_friends_followers(screen_name = screen_name, friends_ids = gronk_friends_ids, followers_ids = gronk_followers_ids, max_ids = 20)


RobGronkowski has 384 friends.  Here are 20:

+------------+-----------------+
| User ID    | Screen Name     |
+------------+-----------------+
| 20575752   | marcelluswiley  |
| 890891     | BleacherReport  |
| 2992537113 | GronkPartyBus   |
| 2966774301 | uninterrupted   |
| 218748456  | Drubnation      |
| 25367082   | samanthapeszek  |
| 3020277803 | ninko50         |
| 67381805   | StaffordBros    |
| 63253045   | MonsterEnergy   |
| 123276343  | BarstoolBigCat  |
| 743044668  | opendorse       |
| 343546941  | CatherinVaritek |
| 2227768384 | goon356         |
| 26053643   | jimmykimmel     |
| 229293125  | RontezMiles     |
| 34461255   | ImDJHollywood   |
| 21111883   | ddlovato        |
| 1683163405 | CaseyMuhtadi    |
| 829673054  | DannyAmendola   |
| 30274144   | hollyrpeete     |
+------------+-----------------+

RobGronkowski has 1263239 followers.  Here are 20:

+--------------------+-----------------+
| User ID            | Screen Name     |
+--------------------+-----------------+
| 694989692870205440 | stormtrooperivy |
| 695277623270887424 | WayneWinooski   |
| 695296100056457218 | beebeelee123    |
| 4100003847         | tejay_johnson   |
| 4876726803         | mgoddard44      |
| 4876873751         | Tsumugirii      |
| 228148249          | walkerrp02      |
| 4876646427         | Adamthegreatist |
| 4876717402         | DontChaseJustin |
| 519660066          | Jeremy_Ketch    |
| 887664163          | cerda011        |
| 4876770340         | PaPiCee17       |
| 1025779262         | y_yohn          |
| 3021470789         | penelopecg24h   |
| 4876845639         | BarrieaultFund  |
| 2153522761         | ohpatriotsgirl  |
| 2879975514         | RayMcP3         |
| 713746011          | Javiii_Castro15 |
| 4876856416         | ecosurfinc1     |
| 3334787685         | poppy_carlton   |
+--------------------+-----------------+

  • Compute the mutual friends within the two groups, i.e., the users who are in both the friend list and the follower list, and plot their ID numbers and screen names in a table

In [18]:
# Given a list of friends_ids and followers_ids, this function uses set intersection to find the mutual friends.
# It then plots a table of the first (max_ids) listed mutual friends.

def display_mutual_friends(screen_name, friends_ids, followers_ids, max_ids = None):
    friends_ids_set, followers_ids_set = set(friends_ids),set(followers_ids)
    
    print
    print '{0} has {1} mutual friends.  Here are {2}:'.format(screen_name, len(friends_ids_set.intersection(followers_ids_set)),max_ids)
    print
    mutual_friends_ids = list(friends_ids_set.intersection(followers_ids_set))
    table_ids_screen_names(twitter_api, user_ids = mutual_friends_ids, max_ids = max_ids)

In [19]:
display_mutual_friends(screen_name = screen_name, friends_ids = gronk_friends_ids, followers_ids = gronk_followers_ids, max_ids = 20)


RobGronkowski has 335 mutual friends.  Here are 20:

+------------+-----------------+
| User ID    | Screen Name     |
+------------+-----------------+
| 1059194370 | kobebryant      |
| 191650646  | viccarucci      |
| 314298886  | Simzy18         |
| 17587207   | boburnham       |
| 20575752   | marcelluswiley  |
| 84451338   | QuintonAaron    |
| 26053643   | jimmykimmel     |
| 101852687  | ZIMMERWIZ       |
| 145745936  | RobinMeade      |
| 22938645   | EricStangel     |
| 142364694  | Shandrewpr      |
| 29653015   | MarcusSmith_    |
| 128102424  | Chan95Jones     |
| 2992537113 | GronkPartyBus   |
| 41293339   | ckreiswirthESPN |
| 2966774301 | uninterrupted   |
| 61604894   | DWXXIII         |
| 198735903  | GordieGronk     |
| 25880097   | GronkDreams87   |
| 207923746  | Timbaland       |
+------------+-----------------+

Problem 4: Explore the data

Run some additional experiments with your data to gain familiarity with the Twitter data and the Twitter API.

The following code was used to collect all Twitter followers of the Patriots, Broncos, and Panthers. Once collected, the followers were saved to files.


In [20]:
# ## PATRIOTS
# patriots_friends_ids, patriots_followers_ids = get_friends_followers_ids(twitter_api, screen_name = 'Patriots')
# save_json('json/Patriots_Followers',patriots_followers_ids)
# save_json('json/Patriots_Friends', patriots_friends_ids)

# ## BRONCOS
# broncos_friends_ids, broncos_followers_ids = get_friends_followers_ids(twitter_api, screen_name = 'Broncos')
# save_json('json/Broncos_Followers',broncos_followers_ids)
# save_json('json/Broncos_Friends', broncos_friends_ids)

# ## PANTHERS
# panthers_friends_ids, panthers_followers_ids = get_friends_followers_ids(twitter_api, screen_name = 'Panthers')
# save_json('json/Panthers_Followers',panthers_followers_ids)
# save_json('json/Panthers_Friends', panthers_friends_ids)

This code loads the followers collected above. It then draws a Venn diagram comparing the mutual followers of the three teams.


In [30]:
patriots_followers_ids = load_json('json/Patriots_Followers')
broncos_followers_ids = load_json('json/Broncos_Followers')
panthers_followers_ids = load_json('json/Panthers_Followers')


Loading data from json/Patriots_Followers.json...
Loading data from json/Broncos_Followers.json...
Loading data from json/Panthers_Followers.json...

In [24]:
%matplotlib inline
from matplotlib_venn import venn3

patriots_followers_set = set(patriots_followers_ids)
broncos_followers_set = set(broncos_followers_ids)
panthers_followers_set = set(panthers_followers_ids)

venn3([patriots_followers_set, broncos_followers_set, panthers_followers_set], ('Patriots Followers', 'Broncos Followers', 
                                                                                'Panthers Followers'))


Out[24]:
<matplotlib_venn._common.VennDiagram instance at 0x11e9355f0>

Next we wanted to estimate the popularity of the Broncos and Panthers (the two remaining Super Bowl teams) in the Boston area. Our chosen metric of "popularity" is the speed at which tweets are generated. The following periodically collects tweets (constrained to the Boston geo zone) filtered for "Broncos" and "Panthers." It tracks the number of such tweets collected in each time window, allowing us to estimate a tweets-per-minute ratio (a sketch of that estimate appears after the counts table below).


In [31]:
# COLLECT TWEETS FROM STREAM WITH BRONCOS AND PANTHERS IN TWEET TEXT FROM BOSTON GEO ZONE

# from datetime import timedelta, datetime
# from time import sleep
# from twitter import TwitterStream

# track = "Broncos, Panthers"  # Tweets for Broncos OR Panthers
# locations = '-73.313057,41.236511,-68.826305,44.933163'  # New England / Boston geo zone

# NUMBER_OF_COLLECTIONS = 5 # number of times to collect tweets from stream
# COLLECTION_TIME = 2.5  # length of each collection in minutes
# WAIT_TIME = 10  # sleep time in between collections in minutes

# date_format = '%m/%d/%Y %H:%M:%S' # i.e. 1/1/2016 13:00:00

# broncos, panthers, counts = [], [], []
# for counter in range(1, NUMBER_OF_COLLECTIONS + 1):
#     print '------------------------------------------'
#     print 'COLLECTION NUMBER %s out of %s' % (counter, NUMBER_OF_COLLECTIONS)
#     broncos_counter, panthers_counter = 0, 0 # set the internal counter for Broncos and Panthers to 0
#     count_dict = {'start_time': datetime.now().strftime(format=date_format)} # add collection start time

#     # Create a new stream instance every collection to avoid rate limits
#     auth = oauth_login(consumer_key=CONSUMER_KEY, consumer_secret=CONSUMER_SECRET, token=OAUTH_TOKEN, token_secret=OAUTH_TOKEN_SECRET)
#     twitter_stream = TwitterStream(auth=auth)
#     stream = twitter_stream.statuses.filter(track=track, locations=locations)

#     endTime = datetime.now() + timedelta(minutes=COLLECTION_TIME)
#     while datetime.now() <= endTime:  # collect tweets while current time is less than endTime
#         for tweet in stream:
#             if 'text' in tweet.keys(): # check to see if tweet contains text
#                 if datetime.now() > endTime:
#                     break # if the collection time is up, break out of the loop
#                 elif 'Broncos' in tweet['text'] and 'Panthers' in tweet['text']:
#                     broncos.append(tweet), panthers.append(tweet) # if a tweet contains both Broncos and Panthers, add the tweet to both arrays
#                     broncos_counter += 1
#                     panthers_counter += 1
#                     print 'Panthers: %s, Broncos: %s' % (panthers_counter, broncos_counter)
#                 elif 'Broncos' in tweet['text']:
#                     broncos.append(tweet)
#                     broncos_counter += 1
#                     print 'Broncos: %s' % broncos_counter
#                 elif 'Panthers' in tweet['text']:
#                     panthers.append(tweet)
#                     panthers_counter += 1
#                     print 'Panthers: %s' % panthers_counter
#                 else:
#                     print 'continue' # if the tweet text does not match 'Panthers' or 'Broncos', keep going
#                     continue
#         count_dict['broncos'] = broncos_counter
#         count_dict['panthers'] = panthers_counter
#         count_dict['end_time'] = datetime.now().strftime(format=date_format) # add collection end time 
#         counts.append(count_dict)

#     print counts
#     if counter != NUMBER_OF_COLLECTIONS:
#         print 'Sleeping until %s' % (datetime.now() + timedelta(minutes=WAIT_TIME))
#         sleep(WAIT_TIME * 60) # sleep for WAIT_TIME
#     else:
#         print '------------------------------------------'

# # Save arrays to files
# save_json('stream/json/counts', counts)
# save_json('stream/json/broncos', broncos)
# save_json('stream/json/panthers', panthers)

In [56]:
# LOAD JSON FOR BRONCOS AND PANTHERS
broncos = load_json('stream/json/broncos')
panthers = load_json('stream/json/panthers')
counts = load_json('stream/json/counts')

pretty_table = PrettyTable(field_names =['Broncos Tweets', 'Collection Start Time', 'Panthers Tweets','Collection End Time'])
# Add rows by key so the column order does not depend on dict ordering.
[ pretty_table.add_row([row['broncos'], row['start_time'], row['panthers'], row['end_time']]) for row in counts]
pretty_table.align = 'l'
print pretty_table

print 'TOTALS – Broncos: %s, Panthers: %s' % (len(broncos), len(panthers))


Loading data from stream/json/broncos.json...
Loading data from stream/json/panthers.json...
Loading data from stream/json/counts.json...
+----------------+-----------------------+-----------------+---------------------+
| Broncos Tweets | Collection Start Time | Panthers Tweets | Collection End Time |
+----------------+-----------------------+-----------------+---------------------+
|      336       |  02/06/2016 15:27:52  |       358       | 02/06/2016 15:30:23 |
|      332       |  02/06/2016 15:40:23  |       310       | 02/06/2016 15:42:53 |
|      300       |  02/06/2016 15:52:53  |       266       | 02/06/2016 15:55:25 |
|      392       |  02/06/2016 16:05:25  |       278       | 02/06/2016 16:07:56 |
|      311       |  02/06/2016 16:17:56  |       317       | 02/06/2016 16:20:27 |
+----------------+-----------------------+-----------------+---------------------+
TOTALS – Broncos: 1671, Panthers: 1529
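
The counts above can be turned into the tweets-per-minute ratio described earlier. A minimal sketch, assuming the count_dict keys ('broncos', 'panthers', 'start_time', 'end_time') and the date format recorded by the collection loop above:

In [ ]:
from datetime import datetime

date_format = '%m/%d/%Y %H:%M:%S'  # format used when the counts were recorded

# Estimate tweets per minute (TPM) for each collection window.
for row in counts:
    start = datetime.strptime(row['start_time'], date_format)
    end = datetime.strptime(row['end_time'], date_format)
    minutes = (end - start).total_seconds() / 60.0
    print 'Broncos: %.1f TPM, Panthers: %.1f TPM' % (row['broncos'] / minutes,
                                                     row['panthers'] / minutes)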

In [101]:
%matplotlib inline
# CUMULATIVE TWEET DISTRIBUTION FOR BRONCOS AND PANTHERS
import numpy as np
import matplotlib.pyplot as plt

# create numpy arrays for Broncos and Panthers tweets
broncos_tweets = np.array([row['broncos'] for row in counts])
panthers_tweets = np.array([row['panthers'] for row in counts])
bins = len(counts) * 10

# evaluate histogram
broncos_values, broncos_base = np.histogram(broncos_tweets, bins=bins)
panthers_values, panthers_base = np.histogram(panthers_tweets, bins=bins)

# evaluate cumulative function
broncos_cumulative = np.cumsum(broncos_values)
panthers_cumulative = np.cumsum(panthers_values)

# plot cumulative function
plt.plot(broncos_base[:-1], broncos_cumulative, c='darkorange')
plt.plot(panthers_base[:-1], panthers_cumulative, c='blue')

plt.grid(True)
plt.title('Cumulative Distribution of Broncos & Panthers Tweets')
plt.xlabel('Tweets per collection window')
plt.ylabel('Cumulative number of collections')

plt.show()



Done

All set!

What do you need to submit?

  • Notebook File: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.
  • PPT Slides: please prepare PPT slides (for a 10-minute talk) presenting the case study. Two randomly selected teams will present their case studies in class.

  • Report: please prepare a report (fewer than 10 pages) describing what you found in the data.

    • What data did you collect?
    • Why is this topic interesting or important to you? (Motivations)
    • How did you analyse the data?
    • What did you find in the data?

      (please include figures or tables in the report, but no source code)

Please compress all the files into a single zipped file.

How to submit:

    Please submit through myWPI, in the Assignment "Case Study 1".

Note: Each team needs to make only one submission in myWPI.

Grading Criteria:

Total Points: 120


Notebook: Points: 80

-----------------------------------
Question 1:
Points: 20
-----------------------------------

(1) Select a topic that you are interested in.
Points: 6 

(2) Use the Twitter Streaming API to sample a collection of tweets about this topic in real time. (It is recommended that the number of tweets be larger than 200 but smaller than 1 million.) Please check that the total number of tweets collected is larger than 200.
Points: 10 


(3) Store the tweets you downloaded into a local file (txt file or json file)
Points: 4 


-----------------------------------
Question 2:
Points: 20
-----------------------------------

1. Word Count

(1) Use the tweets you collected in Problem 1, and compute the frequencies of the words being used in these tweets.
Points: 4 

(2) Plot a table of the top 30 words with their counts 
Points: 4 

2. Find the most popular tweets in your collection of tweets
plot a table of the top 10 tweets that are the most popular among your collection, i.e., the tweets with the largest number of retweet counts.
Points: 4 

3. Find the most popular Tweet Entities in your collection of tweets

(1) plot a table of the top 10 hashtags, 
Points: 4 

(2) top 10 user mentions that are the most popular in your collection of tweets.
Points: 4 


-----------------------------------
Question 3:
Points: 20
-----------------------------------

(1) choose a popular twitter user who has many followers, such as "ladygaga".
Points: 4 

(2) Get the list of all friends and all followers of the twitter user.
Points: 4 

(3) Plot 20 of the followers: plot their ID numbers and screen names in a table.
Points: 4 

(4) Plot 20 of the friends (if the user has more than 20 friends): plot their ID numbers and screen names in a table.
Points: 4 

(5) Compute the mutual friends within the two groups, i.e., the users who are in both the friend list and the follower list, and plot their ID numbers and screen names in a table
Points: 4 

-----------------------------------
Question 4:  Explore the data
Points: 20
-----------------------------------
    Novelty: 10
    Interestingness: 10
-----------------------------------
Run some additional experiments with your data to gain familiarity with the twitter data and twitter API





Report: communicate the results Points: 20

(1) What data did you collect? Points: 5

(2) Why is this topic interesting or important to you? (Motivations) Points: 5

(3) How did you analyse the data? Points: 5

(4) What did you find in the data? (please include figures or tables in the report, but no source code) Points: 5


Slides (for 10 minutes of presentation): Story-telling Points: 20

  1. Motivation about the data collection, why the topic is interesting to you. Points: 5

  2. Communicating Results (figure/table) Points: 10

  3. Story telling (How all the parts (data, analysis, result) fit together as a story?) Points: 5


In [ ]: