Part 1: tweetharvest Example Analysis

This is an example notebook demonstrating how to establish a connection to a database of tweets collected using tweetharvest. It presupposes that all the setup instructions have been completed (see the README file for that repository) and that the MongoDB server is running as described there. We start by importing the PyMongo package, the official Python driver for MongoDB.


In [1]:
import pymongo

Next we establish a link with the database. We know that the database created by tweetharvest is called tweets_db, and within it is a collection of tweets named after the project, in this example: emotweets.


In [2]:
db = pymongo.MongoClient().tweets_db
coll = db.emotweets
coll


Out[2]:
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')

We now have an object, coll, that gives full access to the MongoDB API, through which we can analyse the data in the collected tweets. For instance, in our small example collection, we can count the number of tweets:


In [3]:
coll.count()


Out[3]:
10598

Or we can count the number of tweets that are geolocated, i.e. that carry a field with the latitude and longitude of the user at the moment the tweet was sent. We construct a MongoDB query that looks for a non-empty field called coordinates.


In [4]:
query = {'coordinates': {'$ne': None}}
coll.find(query).count()


Out[4]:
607

Or how many tweets had the hashtag #happy in them?


In [5]:
query = {'hashtags': {'$in': ['happy']}}
coll.find(query).count()


Out[5]:
8258

Prerequisites for Analysis

In order to perform these analyses there are a few things one needs to know:

  1. At the risk of stating the obvious: how to code in Python (there is also an excellent tutorial). Please note that the current version of tweetharvest uses Python 2.7, not Python 3.
  2. How to perform MongoDB queries, including aggregation, counting, and grouping of subsets of data. There is a most effective short introduction (The Little MongoDB Book by Karl Seguin), as well as extremely rich documentation on the MongoDB website; a small aggregation sketch follows this list.
  3. How to use PyMongo to interface with the MongoDB API.
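
For instance, a minimal aggregation sketch (assuming PyMongo 3, where .aggregate returns a cursor over result documents) could count the tweets in our collection grouped by language:

pipeline = [
    {'$group': {'_id': '$lang', 'n': {'$sum': 1}}},  # one bucket per language
    {'$sort': {'n': -1}}                             # most frequent first
]
for doc in coll.aggregate(pipeline):
    print '{}: {}'.format(doc['_id'], doc['n'])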

Apart from these skills, one needs to know how each status is stored in the database. Here is an easy way to look at the data structure of one tweet.


In [6]:
coll.find_one()


Out[6]:
{u'_id': 610008194618757121L,
 u'contributors': None,
 u'coordinates': None,
 u'created_at': datetime.datetime(2015, 6, 14, 8, 57, 41),
 u'entities': {u'hashtags': [{u'indices': [0, 4], u'text': u'sad'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 2,
 u'favorited': False,
 u'geo': None,
 u'hashtags': [u'sad'],
 u'id_str': u'610008194618757121',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'is_quote_status': False,
 u'lang': u'und',
 u'metadata': {u'iso_language_code': u'und', u'result_type': u'recent'},
 u'place': None,
 u'retweet_count': 1,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'#sad',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': datetime.datetime(2012, 10, 26, 5, 15, 26),
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'xvii // not subtle',
  u'entities': {u'description': {u'urls': []}},
  u'favourites_count': 10565,
  u'follow_request_sent': None,
  u'followers_count': 683,
  u'following': None,
  u'friends_count': 374,
  u'geo_enabled': True,
  u'id': 905331738,
  u'id_str': u'905331738',
  u'is_translation_enabled': False,
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'ceb, phl ',
  u'name': u'Kim \u2743',
  u'notifications': None,
  u'profile_background_color': u'FFFFFF',
  u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png',
  u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png',
  u'profile_background_tile': True,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/905331738/1431745819',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg',
  u'profile_link_color': u'94D487',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'BADB7C',
  u'profile_text_color': u'E69DF2',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'kimjereza',
  u'statuses_count': 30050,
  u'time_zone': u'Beijing',
  u'url': None,
  u'utc_offset': 28800,
  u'verified': False}}

This JSON data structure is documented on the Twitter API website, where each field is described in detail. That description is worth studying in order to understand how to construct valid queries.

tweetharvest is faithful to the core structure of the tweets as described in that documentation, but with two minor differences introduced for convenience:

  1. All date fields are stored as MongoDB Date objects and returned as Python datetime objects. This makes it easier to work with date ranges, sort by date, and perform other date- and time-related manipulations (see the sketch after this list).
  2. A hashtags field is added for convenience. It contains a flat array of all the hashtags in a particular tweet and can be queried directly, instead of digging for tags inside dictionaries nested in the entities field. It is included for ease of querying but may be ignored if one prefers.
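
As an illustration of the first point, here is a minimal sketch of a date-range query against the same coll collection as above, built from ordinary Python datetime values (the boundary dates are made up for the example):

from datetime import datetime

# created_at is stored as a Date object, so it can be compared against
# Python datetime values directly (these boundaries are illustrative).
query = {'created_at': {'$gte': datetime(2015, 6, 14, 0, 0),
                        '$lt': datetime(2015, 6, 14, 12, 0)}}
print coll.find(query).count()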

Next Steps

This notebook establishes how you can connect to the database of tweets that you have harvested and how you can use the power of Python and MongoDB to access and analyse your collections. Good luck!

Part 2: tweetharvest Further Analysis

Assuming we want to do some more advanced work on the dataset we have collected, below are some sample analyses to dip our toes in the water.

The examples below further illustrate how to use our dataset with standard Python modules used in data science. The typical idiom is to query MongoDB for a cursor on our dataset, import the results into an analytic tool such as Pandas, and then produce the analysis; a minimal sketch of this pattern follows the list below. The analyses below require that a few packages are installed on our system:

  • matplotlib: a python 2D plotting library (documentation)
  • pandas: "an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools" (documentation)
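
Here is the query-to-DataFrame pattern in its smallest form, as a sketch assuming the same emotweets collection as in Part 1 (the projected fields are chosen purely for illustration):

import pymongo
import pandas as pd

coll = pymongo.MongoClient().tweets_db.emotweets

# 1. Query MongoDB for a cursor over the documents of interest.
cursor = coll.find({}, {'lang': 1, 'retweet_count': 1, '_id': 0})

# 2. Materialise the cursor into a Pandas DataFrame.
df = pd.DataFrame(list(cursor))

# 3. Analyse with ordinary Pandas operations.
print df.groupby('lang')['retweet_count'].mean()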

Important Note

The dataset used in this notebook is not published on the GitHub repository. If you want to experiment with your own data, you need to install the tweetharvest package, harvest some tweets to replicate the emotweets project embedded there, and then run the notebook. This example notebook is intended simply as an illustration of the type of analysis you might want to perform with your own tools.


In [7]:
%matplotlib inline

In [8]:
import pymongo  # in case we have run Part 1 above
import pandas as pd  # for data manipulation and analysis

import matplotlib.pyplot as plt



In [9]:
db = pymongo.MongoClient().tweets_db
COLL = db.emotweets
COLL


Out[9]:
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')

Descriptive Statistics

Number of Tweets in Dataset


In [10]:
COLL.count()


Out[10]:
10598

In [11]:
def count_by_tag(coll, hashtag):
    query = {'hashtags': {'$in': [hashtag]}}
    count = coll.find(query).count()
    return count

print 'Number of #happy tweets: {}'.format(count_by_tag(COLL, 'happy'))
print 'Number of #sad tweets: {}'.format(count_by_tag(COLL, 'sad'))


Number of #happy tweets: 8258
Number of #sad tweets: 2403

Number of Geolocated Tweets


In [12]:
query = {'coordinates': {'$ne': None}}
COLL.find(query).count()


Out[12]:
607

Range of Creation Times for Tweets


In [13]:
# return a cursor that iterates over all documents and returns the creation date
cursor = COLL.find({}, {'created_at': 1, '_id': 0})

# list all the creation times and convert to Pandas DataFrame
times = pd.DataFrame(list(cursor))
times = pd.to_datetime(times.created_at)

earliest_timestamp = min(times)
latest_timestamp = max(times)

print 'Creation time for EARLIEST tweet in dataset: {}'.format(earliest_timestamp)
print 'Creation time for LATEST tweet in dataset: {}'.format(latest_timestamp)


Creation time for EARLIEST tweet in dataset: 2015-06-13 07:24:40
Creation time for LATEST tweet in dataset: 2015-06-14 09:29:21

Plot Tweets per Hour


In [14]:
query = {}  # empty query means find all documents

# return just two columns, the date of creation and the id of each document
projection = {'created_at': 1}

df = pd.DataFrame(list(COLL.find(query, projection)))
times = pd.to_datetime(df.created_at)
df.set_index(times, inplace=True)
df.drop('created_at', axis=1, inplace=True)
tweets_all = df.resample('60Min', how='count')

tweets_all.plot(figsize=[12, 7], title='Number of Tweets per Hour', legend=None);


More Complex Query

As an example of a more complex query, the following demonstrates how to extract all tweets that are not retweets, contain the hashtag #happy as well as at least one other hashtag, and are written in English. These conditions are passed to the .find method as a dictionary; note that 'hashtags.1': {'$exists': True} matches documents whose hashtags array has an element at index 1, i.e. at least two hashtags.

The hashtags of the first ten tweets meeting this specification are then printed out.


In [15]:
query = {                                # find all documents that: 
        'hashtags': {'$in': ['happy']},  # contain #happy hashtag
        'retweeted_status': None,        # are not retweets
        'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
        'lang': 'en'                     # written in English
        }
projection = {'hashtags': 1, '_id': 0}
cursor = COLL.find(query, projection)

for tags in cursor[:10]:
    print tags['hashtags']


[u'rains', u'drenched', u'happy', u'kids', u'birds', u'animals', u'tatasky', u'home', u'sad', u'life']
[u'quote', u'wisdom', u'sad', u'happy']
[u'truro', u'nightout', u'drunk', u'nationalginday', u'happy', u'fun', u'cornwall', u'girlsnight', u'zafiros']
[u'happy', u'positivity']
[u'vaghar', u'cook', u'ghee', u'colzaoil', u'spices', u'love', u'happy', u'digestion', u'ayurveda', u'intuitive']
[u'happy', u'yay']
[u'kinderscout', u'peakdistrict', u'darkpeaks', u'happy']
[u'ichoisehappy', u'life', u'happy', u'quote', u'instaphoto']
[u'streetartthrowdown', u'me', u'myself', u'wacky', u'pretty', u'cute', u'nice', u'awesome', u'cool', u'smile', u'happy', u'selfie', u'selca']
[u'brothers', u'love', u'forever', u'heart', u'bless', u'live', u'family', u'happy', u'proud']

Build a Network of Hashtags

We could use this method to produce a network of hashtags. The following illustrates this by:

  • creating a generator function that yields every possible combination of two hashtags from each tweet
  • adding these pairs of tags as edges in a NetworkX graph
  • deleting the node happy (since it is connected to all the others by definition)
  • deleting those edges that are below a threshold weight
  • plotting the result

In order to run this, we need to install the NetworkX package (pip install networkx, documentation) and import it as well as the combinations function from Python's standard library itertools module.


In [16]:
from itertools import combinations

import networkx as nx

Generate list of all pairs of hashtags


In [17]:
def gen_edges(coll, hashtag):
    query = {                            # find all documents that: 
        'hashtags': {'$in': [hashtag]},  # contain hashtag of interest
        'retweeted_status': None,        # are not retweets
        'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
        'lang': 'en'                     # written in English
        }
    projection = {'hashtags': 1, '_id': 0}
    cursor = coll.find(query, projection)
    
    for tags in cursor:
        hashtags = tags['hashtags']
        for edge in combinations(hashtags, 2):
            yield edge

Build graph with weighted edges between hashtags


In [18]:
def build_graph(coll, hashtag, remove_node=True):
    g = nx.Graph()
    for u,v in gen_edges(coll, hashtag):
        if g.has_edge(u,v):
            # add 1 to weight attribute of this edge
            g[u][v]['weight'] = g[u][v]['weight'] + 1
        else:
            # create new edge of weight 1
            g.add_edge(u, v, weight=1)
    if remove_node:
        # since hashtag is connected to every other node, 
        # it adds no information to this graph; remove it.
        g.remove_node(hashtag)
    return g

In [19]:
G = build_graph(COLL, 'happy')

Remove rarer edges

Finally we remove rare edges (defined here, arbitrarily, as edges with a weight of 25 or less), then print a table of the remaining edges sorted in descending order by weight.


In [20]:
def trim_edges(g, weight=1):
    # function from http://shop.oreilly.com/product/0636920020424.do
    g2 = nx.Graph()
    for u, v, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(u, v, **edata)  # copy the edge with its attributes
    return g2

View as Table


In [21]:
G2 = trim_edges(G, weight=25)

df = pd.DataFrame([(u, v, edata['weight'])
                   for u, v, edata in G2.edges(data=True)],
                  columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df


Out[21]:
   from           to             weight
7  love           me             78
1  cute           love           74
14 love           follow         74
11 love           instagood      72
17 love           photooftheday  64
48 me             instagood      63
43 photooftheday  instagood      63
31 follow         instagood      63
4  cute           follow         63
0  cute           me             62
29 follow         me             60
3  cute           instagood      60
41 photooftheday  me             59
33 follow         photooftheday  58
32 follow         followme       58
30 follow         tbt            57
47 tbt            instagood      57
46 tbt            me             57
37 followme       me             57
10 love           tbt            57
16 love           followme       57
6  cute           photooftheday  56
2  cute           tbt            56
39 followme       instagood      56
42 photooftheday  tbt            55
38 followme       tbt            52
5  cute           followme       51
40 followme       photooftheday  50
12 love           smile          37
9  love           family         33
34 happiness      truth          31
24 allah          happiness      31
23 allah          truth          31
13 love           fun            29
8  love           life           29
20 allah          lifegoals      28
44 good           prophet        28
21 allah          prophet        28
35 happiness      lifegoals      28
22 allah          promise        28
28 promise        good           28
27 promise        prophet        28
19 allah          good           28
45 lifegoals      truth          28
18 selfie         smile          27
25 i_am           positive       27
15 love           friends        27
36 positive       affirmation    27
26 i_am           affirmation    27

Plot the Network


In [22]:
G3 = trim_edges(G, weight=35)

pos = nx.circular_layout(G3)  # positions for all nodes

# nodes
nx.draw_networkx_nodes(G3, pos, node_size=700,
                       linewidths=0, node_color='#cccccc')

edge_list = [(u, v) for u, v in G3.edges()]
weight_list = [edata['weight']/5.0 for u, v, edata in G3.edges(data=True)]

# edges
nx.draw_networkx_edges(G3, pos,
                       edgelist=edge_list,
                       width=weight_list,
                       alpha=0.4, edge_color='b')

# labels
nx.draw_networkx_labels(G3, pos, font_size=20,
                        font_family='sans-serif', font_weight='bold')

fig = plt.gcf()
fig.set_size_inches(10, 10)
plt.axis('off');


Repeat for #sad


In [23]:
G_SAD = build_graph(COLL, 'sad')

In [24]:
G2S = trim_edges(G_SAD, weight=5)

df = pd.DataFrame([(u, v, edata['weight'])
                   for u, v, edata in G2S.edges(data=True)],
                  columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df


Out[24]:
   from       to           weight
7  pathetic   rude         36
13 depressed  quote        19
18 quote      quotes       15
23 funy       all_sms_pkg  13
8  stylish    all_sms_pkg  13
9  stylish    funy         13
20 quote      happy        11
17 quote      quote        10
12 depressed  quotes       10
2  info       stylish      8
5  info       just         8
10 stylish    just         8
14 depressed  depression   8
22 just       funy         7
21 just       all_sms_pkg  7
0  teen       sadgirl      7
15 depressed  happy        7
6  suicide    suicidal     7
4  info       funy         7
3  info       all_sms_pkg  7
16 quote      love         6
1  teen       v            6
19 quote      depression   6
11 depressed  alone        6
24 quotes     love         6
25 quotes     happy        6
26 v          sadgirl      6

The graph is drawn with a spring layout to bring out the disconnected sub-graphs more clearly.


In [25]:
G3S = trim_edges(G_SAD, weight=5)

pos = nx.spring_layout(G3S)  # positions for all nodes

# nodes
nx.draw_networkx_nodes(G3S, pos, node_size=700,
                       linewidths=0, node_color='#cccccc')

edge_list = [(u, v) for u, v in G3S.edges()]
weight_list = [edata['weight'] for u, v, edata in G3S.edges(data=True)]

# edges
nx.draw_networkx_edges(G3S, pos,
                       edgelist=edge_list,
                       width=weight_list,
                       alpha=0.4, edge_color='b')

# labels
nx.draw_networkx_labels(G3S, pos, font_size=12,
                        font_family='sans-serif', font_weight='bold')

fig = plt.gcf()
fig.set_size_inches(13, 13)
plt.axis('off');