Part 1: tweetharvest Example Analysis

This is an example notebook demonstrating how to establish a connection to a database of tweets collected using tweetharvest. It presupposes that all the setup instructions have been completed (see the README file for that repository) and that the MongoDB server is running as described there. We start by importing the PyMongo package, the official Python driver for MongoDB.


In [1]:
import pymongo

Next we establish a link with the database. We know that the database created by tweetharvest is called tweets_db, and within it is a collection of tweets named after the project, in this example: emotweets.


In [2]:
db = pymongo.MongoClient().tweets_db
coll = db.emotweets
coll


Out[2]:
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')

We now have an object, coll, that gives full access to the MongoDB API, through which we can analyse the data in the collected tweets. For instance, in our small example collection, we can count the number of tweets:


In [3]:
coll.count()


Out[3]:
10598

Or we can count the number of tweets that are geolocated, i.e. that carry a field with the latitude and longitude of the user at the moment the tweet was sent. We construct a MongoDB query that looks for a non-empty field called coordinates.


In [4]:
query = {'coordinates': {'$ne': None}}
coll.find(query).count()


Out[4]:
607

Or how many tweets had the hashtag #happy in them?


In [5]:
query = {'hashtags': {'$in': ['happy']}}
coll.find(query).count()


Out[5]:
8258

Prerequisites for Analysis

In order to perform these analyses there are a few things one needs to know:

  1. At the risk of stating the obvious: how to code in Python (there is also an excellent tutorial). Please note that the current version of tweetharvest uses Python 2.7, not Python 3.
  2. How to perform MongoDB queries, including aggregation, counting, and grouping of subsets of data. There is a most effective short introduction (The Little MongoDB Book by Karl Seguin), as well as extremely rich documentation on the MongoDB website; a small aggregation sketch follows this list.
  3. How to use PyMongo to interface with the MongoDB API.
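
For instance, a minimal aggregation sketch (assuming PyMongo 3, where .aggregate returns a cursor over result documents) could count the tweets in our collection grouped by language:

pipeline = [
    {'$group': {'_id': '$lang', 'n': {'$sum': 1}}},  # one bucket per language
    {'$sort': {'n': -1}}                             # most frequent first
]
for doc in coll.aggregate(pipeline):
    print '{}: {}'.format(doc['_id'], doc['n'])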

Apart from these skills, one needs to know how each status is stored in the database. Here is an easy way to look at the data structure of one tweet.


In [6]:
coll.find_one()


Out[6]:
{u'_id': 610008194618757121L,
 u'contributors': None,
 u'coordinates': None,
 u'created_at': datetime.datetime(2015, 6, 14, 8, 57, 41),
 u'entities': {u'hashtags': [{u'indices': [0, 4], u'text': u'sad'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 2,
 u'favorited': False,
 u'geo': None,
 u'hashtags': [u'sad'],
 u'id_str': u'610008194618757121',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'is_quote_status': False,
 u'lang': u'und',
 u'metadata': {u'iso_language_code': u'und', u'result_type': u'recent'},
 u'place': None,
 u'retweet_count': 1,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'#sad',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': datetime.datetime(2012, 10, 26, 5, 15, 26),
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'xvii // not subtle',
  u'entities': {u'description': {u'urls': []}},
  u'favourites_count': 10565,
  u'follow_request_sent': None,
  u'followers_count': 683,
  u'following': None,
  u'friends_count': 374,
  u'geo_enabled': True,
  u'id': 905331738,
  u'id_str': u'905331738',
  u'is_translation_enabled': False,
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'ceb, phl ',
  u'name': u'Kim \u2743',
  u'notifications': None,
  u'profile_background_color': u'FFFFFF',
  u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png',
  u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/438273422448009216/1OKtL--y.png',
  u'profile_background_tile': True,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/905331738/1431745819',
  u'profile_image_url': u'http://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg',
  u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/607543257296310272/U5Yflc4l_normal.jpg',
  u'profile_link_color': u'94D487',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'BADB7C',
  u'profile_text_color': u'E69DF2',
  u'profile_use_background_image': True,
  u'protected': False,
  u'screen_name': u'kimjereza',
  u'statuses_count': 30050,
  u'time_zone': u'Beijing',
  u'url': None,
  u'utc_offset': 28800,
  u'verified': False}}

This JSON data structure is documented on the Twitter API website, where each field is described in detail. That description is worth studying in order to understand how to construct valid queries.

tweetharvest is faithful to the core structure of the tweets as described in that documentation, but with two minor differences introduced for convenience:

  1. All date fields are stored as MongoDB Date objects and returned as Python datetime objects. This makes it easier to work with date ranges, sort by date, and perform other date- and time-related manipulations (see the sketch after this list).
  2. A hashtags field is added for convenience. It contains a flat array of all the hashtags in a particular tweet and can be queried directly, instead of digging for tags inside dictionaries nested in the entities field. It is included for ease of querying but may be ignored if one prefers.
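
As an illustration of the first point, here is a minimal sketch of a date-range query against the same coll collection as above, built from ordinary Python datetime values (the boundary dates are made up for the example):

from datetime import datetime

# created_at is stored as a Date object, so it can be compared against
# Python datetime values directly (these boundaries are illustrative).
query = {'created_at': {'$gte': datetime(2015, 6, 14, 0, 0),
                        '$lt': datetime(2015, 6, 14, 12, 0)}}
print coll.find(query).count()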

Next Steps

This notebook establishes how you can connect to the database of tweets that you have harvested and how you can use the power of Python and MongoDB to access and analyse your collections. Good luck!

Part 2: tweetharvest Further Analysis

Assuming we want to do some more advanced work on the dataset we have collected, below are some sample analyses to dip our toes in the water.

The examples below further illustrate how to use our dataset with standard Python modules used in data science. The typical idiom is to query MongoDB for a cursor on our dataset, import the results into an analytic tool such as Pandas, and then produce the analysis; a minimal sketch of this pattern follows the list below. The analyses below require that a few packages are installed on our system:

  • matplotlib: a python 2D plotting library (documentation)
  • pandas: "an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools" (documentation)
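
Here is the query-to-DataFrame pattern in its smallest form, as a sketch assuming the same emotweets collection as in Part 1 (the projected fields are chosen purely for illustration):

import pymongo
import pandas as pd

coll = pymongo.MongoClient().tweets_db.emotweets

# 1. Query MongoDB for a cursor over the documents of interest.
cursor = coll.find({}, {'lang': 1, 'retweet_count': 1, '_id': 0})

# 2. Materialise the cursor into a Pandas DataFrame.
df = pd.DataFrame(list(cursor))

# 3. Analyse with ordinary Pandas operations.
print df.groupby('lang')['retweet_count'].mean()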

Important Note

The dataset used in this notebook is not published on the GitHub repository. If you want to experiment with your own data, you need to install the tweetharvest package, harvest some tweets to replicate the emotweets project embedded there, and then run the notebook. This example notebook is intended simply as an illustration of the type of analysis you might want to perform with your own tools.


In [7]:
%matplotlib inline

In [8]:
import pymongo  # in case we have run Part 1 above
import pandas as pd  # for data manipulation and analysis

import matplotlib.pyplot as plt



In [9]:
db = pymongo.MongoClient().tweets_db
COLL = db.emotweets
COLL


Out[9]:
Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')

Descriptive Statistics

Number of Tweets in Dataset


In [10]:
COLL.count()


Out[10]:
10598

In [11]:
def count_by_tag(coll, hashtag):
    query = {'hashtags': {'$in': [hashtag]}}
    count = coll.find(query).count()
    return count

print 'Number of #happy tweets: {}'.format(count_by_tag(COLL, 'happy'))
print 'Number of #sad tweets: {}'.format(count_by_tag(COLL, 'sad'))


Number of #happy tweets: 8258
Number of #sad tweets: 2403

Number of Geolocated Tweets


In [12]:
query = {'coordinates': {'$ne': None}}
COLL.find(query).count()


Out[12]:
607

Range of Creation Times for Tweets


In [13]:
# return a cursor that iterates over all documents and returns the creation date
cursor = COLL.find({}, {'created_at': 1, '_id': 0})

# list all the creation times and convert to Pandas DataFrame
times = pd.DataFrame(list(cursor))
times = pd.to_datetime(times.created_at)

earliest_timestamp = min(times)
latest_timestamp = max(times)

print 'Creation time for EARLIEST tweet in dataset: {}'.format(earliest_timestamp)
print 'Creation time for LATEST tweet in dataset: {}'.format(latest_timestamp)


Creation time for EARLIEST tweet in dataset: 2015-06-13 07:24:40
Creation time for LATEST tweet in dataset: 2015-06-14 09:29:21

Plot Tweets per Hour


In [14]:
query = {}  # empty query means find all documents

# return just two columns, the date of creation and the id of each document
projection = {'created_at': 1}

df = pd.DataFrame(list(COLL.find(query, projection)))
times = pd.to_datetime(df.created_at)
df.set_index(times, inplace=True)
df.drop('created_at', axis=1, inplace=True)
tweets_all = df.resample('60Min', how='count')

tweets_all.plot(figsize=[12, 7], title='Number of Tweets per Hour', legend=None);


More Complex Query

As an example of a more complex query, the following demonstrates how to extract all tweets that are not retweets, contain the hashtag #happy as well as at least one other hashtag, and are written in English. These conditions are passed to the .find method as a dictionary; note that 'hashtags.1': {'$exists': True} matches documents whose hashtags array has an element at index 1, i.e. at least two hashtags.

The hashtags of the first ten tweets meeting this specification are then printed out.


In [15]:
query = {                                # find all documents that: 
        'hashtags': {'$in': ['happy']},  # contain #happy hashtag
        'retweeted_status': None,        # are not retweets
        'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
        'lang': 'en'                     # written in English
        }
projection = {'hashtags': 1, '_id': 0}
cursor = COLL.find(query, projection)

for tags in cursor[:10]:
    print tags['hashtags']


[u'rains', u'drenched', u'happy', u'kids', u'birds', u'animals', u'tatasky', u'home', u'sad', u'life']
[u'quote', u'wisdom', u'sad', u'happy']
[u'truro', u'nightout', u'drunk', u'nationalginday', u'happy', u'fun', u'cornwall', u'girlsnight', u'zafiros']
[u'happy', u'positivity']
[u'vaghar', u'cook', u'ghee', u'colzaoil', u'spices', u'love', u'happy', u'digestion', u'ayurveda', u'intuitive']
[u'happy', u'yay']
[u'kinderscout', u'peakdistrict', u'darkpeaks', u'happy']
[u'ichoisehappy', u'life', u'happy', u'quote', u'instaphoto']
[u'streetartthrowdown', u'me', u'myself', u'wacky', u'pretty', u'cute', u'nice', u'awesome', u'cool', u'smile', u'happy', u'selfie', u'selca']
[u'brothers', u'love', u'forever', u'heart', u'bless', u'live', u'family', u'happy', u'proud']

Build a Network of Hashtags

We could use this method to produce a network of hashtags. The following illustrates this by:

  • creating a generator function that yields every possible combination of two hashtags from each tweet
  • adding these pairs of tags as edges in a NetworkX graph
  • deleting the node happy (since it is connected to all the others by definition)
  • deleting those edges that are below a threshold weight
  • plotting the result

In order to run this, we need to install the NetworkX package (pip install networkx, documentation) and import it as well as the combinations function from Python's standard library itertools module.


In [16]:
from itertools import combinations

import networkx as nx

Generate list of all pairs of hashtags


In [17]:
def gen_edges(coll, hashtag):
    query = {                            # find all documents that: 
        'hashtags': {'$in': [hashtag]},  # contain hashtag of interest
        'retweeted_status': None,        # are not retweets
        'hashtags.1': {'$exists': True}, # and have more than 1 hashtag
        'lang': 'en'                     # written in English
        }
    projection = {'hashtags': 1, '_id': 0}
    cursor = coll.find(query, projection)
    
    for tags in cursor:
        hashtags = tags['hashtags']
        for edge in combinations(hashtags, 2):
            yield edge

Build graph with weighted edges between hashtags


In [18]:
def build_graph(coll, hashtag, remove_node=True):
    g = nx.Graph()
    for u,v in gen_edges(coll, hashtag):
        if g.has_edge(u,v):
            # add 1 to weight attribute of this edge
            g[u][v]['weight'] = g[u][v]['weight'] + 1
        else:
            # create new edge of weight 1
            g.add_edge(u, v, weight=1)
    if remove_node:
        # since hashtag is connected to every other node, 
        # it adds no information to this graph; remove it.
        g.remove_node(hashtag)
    return g

In [19]:
G = build_graph(COLL, 'happy')

Remove rarer edges

Finally we remove rare edges (defined here, arbitrarily, as edges with a weight of 25 or less), then print a table of the remaining edges sorted in descending order by weight.


In [20]:
def trim_edges(g, weight=1):
    # function from http://shop.oreilly.com/product/0636920020424.do
    g2 = nx.Graph()
    for u, v, edata in g.edges(data=True):
        if edata['weight'] > weight:
            g2.add_edge(u, v, **edata)  # copy the edge with its attributes
    return g2

View as Table


In [21]:
G2 = trim_edges(G, weight=25)

df = pd.DataFrame([(u, v, edata['weight'])
                   for u, v, edata in G2.edges(data=True)],
                  columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df


Out[21]:
   from           to             weight
7  love           me             78
1  cute           love           74
14 love           follow         74
11 love           instagood      72
17 love           photooftheday  64
48 me             instagood      63
43 photooftheday  instagood      63
31 follow         instagood      63
4  cute           follow         63
0  cute           me             62
29 follow         me             60
3  cute           instagood      60
41 photooftheday  me             59
33 follow         photooftheday  58
32 follow         followme       58
30 follow         tbt            57
47 tbt            instagood      57
46 tbt            me             57
37 followme       me             57
10 love           tbt            57
16 love           followme       57
6  cute           photooftheday  56
2  cute           tbt            56
39 followme       instagood      56
42 photooftheday  tbt            55
38 followme       tbt            52
5  cute           followme       51
40 followme       photooftheday  50
12 love           smile          37
9  love           family         33
34 happiness      truth          31
24 allah          happiness      31
23 allah          truth          31
13 love           fun            29
8  love           life           29
20 allah          lifegoals      28
44 good           prophet        28
21 allah          prophet        28
35 happiness      lifegoals      28
22 allah          promise        28
28 promise        good           28
27 promise        prophet        28
19 allah          good           28
45 lifegoals      truth          28
18 selfie         smile          27
25 i_am           positive       27
15 love           friends        27
36 positive       affirmation    27
26 i_am           affirmation    27

Plot the Network


In [22]:
G3 = trim_edges(G, weight=35)

pos = nx.circular_layout(G3)  # positions for all nodes

# nodes
nx.draw_networkx_nodes(G3, pos, node_size=700,
                       linewidths=0, node_color='#cccccc')

edge_list = [(u, v) for u, v in G3.edges()]
weight_list = [edata['weight']/5.0 for u, v, edata in G3.edges(data=True)]

# edges
nx.draw_networkx_edges(G3, pos,
                       edgelist=edge_list,
                       width=weight_list,
                       alpha=0.4, edge_color='b')

# labels
nx.draw_networkx_labels(G3, pos, font_size=20,
                        font_family='sans-serif', font_weight='bold')

fig = plt.gcf()
fig.set_size_inches(10, 10)
plt.axis('off');


Repeat for #sad


In [23]:
G_SAD = build_graph(COLL, 'sad')

In [24]:
G2S = trim_edges(G_SAD, weight=5)

df = pd.DataFrame([(u, v, edata['weight'])
                   for u, v, edata in G2S.edges(data=True)],
                  columns = ['from', 'to', 'weight'])
df.sort(['weight'], ascending=False, inplace=True)
df


Out[24]:
   from       to           weight
7  pathetic   rude         36
13 depressed  quote        19
18 quote      quotes       15
23 funy       all_sms_pkg  13
8  stylish    all_sms_pkg  13
9  stylish    funy         13
20 quote      happy        11
17 quote      quote        10
12 depressed  quotes       10
2  info       stylish      8
5  info       just         8
10 stylish    just         8
14 depressed  depression   8
22 just       funy         7
21 just       all_sms_pkg  7
0  teen       sadgirl      7
15 depressed  happy        7
6  suicide    suicidal     7
4  info       funy         7
3  info       all_sms_pkg  7
16 quote      love         6
1  teen       v            6
19 quote      depression   6
11 depressed  alone        6
24 quotes     love         6
25 quotes     happy        6
26 v          sadgirl      6

The graph is drawn with a spring layout to bring out the disconnected sub-graphs more clearly.


In [25]:
G3S = trim_edges(G_SAD, weight=5)

pos = nx.spring_layout(G3S)  # positions for all nodes

# nodes
nx.draw_networkx_nodes(G3S, pos, node_size=700,
                       linewidths=0, node_color='#cccccc')

edge_list = [(u, v) for u, v in G3S.edges()]
weight_list = [edata['weight'] for u, v, edata in G3S.edges(data=True)]

# edges
nx.draw_networkx_edges(G3S, pos,
                       edgelist=edge_list,
                       width=weight_list,
                       alpha=0.4, edge_color='b')

# labels
nx.draw_networkx_labels(G3S, pos, font_size=12,
                        font_family='sans-serif', font_weight='bold')

fig = plt.gcf()
fig.set_size_inches(13, 13)
plt.axis('off');