GitHubArchive consumer

Loading data from GitHub Archive

We're going to use GitHubArchive to retrieve a large amount of data from the GitHub activity stream. GitHubArchive provides one file per hour, named http://data.githubarchive.org/{year}-{month}-{day}-{hour}.json.gz, that we are going to retrieve. GitHubArchive has provided those files since 2011-02-12, but the file format changed at the beginning of 2015.

We first generate the list of links we are interested in, say from 2013-01-01 (included) to 2013-01-05 (excluded).


In [ ]:
from dateutil import rrule
from datetime import datetime, timedelta

start_date = datetime(2013, 1, 1)
end_date = datetime(2013, 1, 5)

# rrule treats `until` as inclusive, so we stop one hour before
# end_date in order to exclude 2013-01-05 entirely.
date_list = rrule.rrule(rrule.HOURLY, dtstart=start_date, until=end_date - timedelta(hours=1))

# GitHubArchive zero-pads months and days in its file names, but not hours.
link_format = 'http://data.githubarchive.org/{year}-{month:0>2}-{day:0>2}-{hour}.json.gz'

links = [link_format.format(year=d.year, month=d.month, day=d.day, hour=d.hour) for d in date_list]

print '\n'.join(links)

The easiest way to retrieve several files is to use wget (or the requests module in Python). Assuming you stored the list of links in a file links.txt:

wget -i links.txt -nc -c
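
If you prefer to stay in Python, here is a minimal sketch of the same download loop using requests; it assumes the links list built above and, like wget -nc, skips files that were already retrieved:


In [ ]:
import os
import requests

def download(link):
    """
    Retrieve the archive at link into the current directory, skipping
    files that are already present.
    """
    filename = link.rsplit('/', 1)[1]
    if os.path.exists(filename):
        return
    response = requests.get(link, stream=True)
    if response.status_code == 200:
        with open(filename, 'wb') as output:
            for chunk in response.iter_content(4096):
                output.write(chunk)

for link in links:
    download(link)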

Those files contain gzipped, newline-delimited JSON (one event per line). Their content can easily be retrieved using the following function:


In [ ]:
import json
import gzip

def get_content_from_file(filepath):
    """
    Return a list of JSON structures that are contained in the file
    described by filepath. This function expects that the file is 
    gzipped.
    """
    
    with gzip.GzipFile(filepath) as f:
        try:
            return map(json.loads, f.readlines())
        except ValueError:
            # Skip files that contain malformed JSON (UnicodeError and
            # other ValueError subclasses) instead of failing.
            return []

Say we are interested in getting the content of every file we downloaded:


In [ ]:
import os

# Assuming we are inside the right directory and that there are no
# other files in it.
filename_list = os.listdir('.')

# 'activity' will store the entire activity stream
activity = []
for filename in filename_list:
    activity += get_content_from_file(filename)

# Now, 'activity' contains the entire activity stream
print activity[0]

This content can then be put in a database (relational or not). An event looks like this:

{u'actor': u'lastr2d2',
 u'actor_attributes': {u'email': u'lastr2d2@gmail.com',
  u'gravatar_id': u'2a8a2ef556894cb1b6945a8c471bc4e9',
  u'login': u'lastr2d2',
  u'name': u'Wayne Wang',
  u'type': u'User'},
 u'created_at': u'2014-01-01T01:01:58-08:00',
 u'payload': {u'head': u'afa9b3ac304d6ab92fd7689d1604f240b8f4ae38',
  u'ref': u'refs/heads/master',
  u'shas': [[u'afa9b3ac304d6ab92fd7689d1604f240b8f4ae38',
    u'lastr2d2@gmail.com',
    u'updated minifized version',
    u'Wayne Wang',
    True]],
  u'size': 1},
 u'public': True,
 u'repository': {u'created_at': u'2013-11-19T00:01:51-08:00',
  u'description': u'My userscript for douban.fm',
  u'fork': False,
  u'forks': 0,
  u'has_downloads': True,
  u'has_issues': True,
  u'has_wiki': True,
  u'id': 14517966,
  u'language': u'JavaScript',
  u'master_branch': u'master',
  u'name': u'scripts-doubanfm',
  u'open_issues': 0,
  u'owner': u'lastr2d2',
  u'private': False,
  u'pushed_at': u'2014-01-01T01:01:57-08:00',
  u'size': 128,
  u'stargazers': 0,
  u'url': u'https://github.com/lastr2d2/scripts-doubanfm',
  u'watchers': 0},
 u'type': u'PushEvent',
 u'url': u'https://github.com/lastr2d2/scripts-doubanfm/compare/ba4d721b3d...afa9b3ac30'}

Notice that payload does not have a fixed schema: its structure depends on the type of the event.
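
Since the schema varies from one event type to another, a schemaless document store is convenient. As a minimal sketch of the loading step, assuming a local MongoDB instance and the pymongo 2.x API used later in this notebook (the text below explains that only R-related events were actually kept):


In [ ]:
import pymongo

# Assuming we are locally running a MongoDB instance.
db = pymongo.MongoClient().r

# Bulk-insert the raw events (pymongo 2.x API; pymongo 3+ would use
# insert_many instead).
db.events.insert(activity)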

Filtering R packages

We filtered the events (and by extension their repositories) to keep those related to the R language. This can easily be done with the repository.language key, as sketched below. With such a list of R repositories, we are interested in identifying which ones are R packages. To do this, we kept the repositories that contain a DESCRIPTION file at their root.
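
A minimal sketch of that first filtering step, assuming the activity list built above (.get is used because not every event embeds a repository):


In [ ]:
# Keep only the events whose repository is identified as R.
r_events = [e for e in activity
            if e.get('repository', {}).get('language') == 'R']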

If repository is the full name of the repository (owner/name), then this file, if it exists, can be retrieved from https://raw.githubusercontent.com/{repository}/master/DESCRIPTION (assuming master is the default branch).


In [ ]:
import requests

url = 'https://raw.githubusercontent.com/{repository}/master/DESCRIPTION'

def get_description_file_for(repository):
    """
    Given the full name of a github repository, return the DESCRIPTION
    file if it exists, or None otherwise.
    """

    result = requests.get(url.format(repository=repository))
    if result.status_code == 200:
        return result.content
    else:
        return None

We put all the data collected for 2013 and 2014 in a MongoDB datastore. Our MongoDB contains a collection events with every event from GitHub Archive related to R. We spread the data over several collections: events contains the raw events, repository contains event.repository, payload contains event.payload, and so on for every subdocument contained in each event.
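
A minimal sketch of that splitting step, reusing the db handle from above (keying each subdocument by the _id of its event is our own bookkeeping, not something prescribed by the original setup):


In [ ]:
# Spread each stored event over one collection per subdocument.
for event in db.events.find():
    for key in ('repository', 'payload', 'actor_attributes'):
        subdocument = event.get(key)
        if subdocument is not None:
            # Reuse the event id so the pieces can be joined back.
            subdocument['_id'] = event['_id']
            db[key].insert(subdocument)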

Here are the results of some queries, FYI:

> db.events.count()
1016423
> db.events.distinct('repository.url').length
121385
> db.events.distinct('repository.id').length
118675
> db.events.distinct('repository.name').length
43164

Notice that, at the same time, https://github.com/search?utf8=%E2%9C%93&q=language%3AR&type=Repositories&ref=searchresults shows 67275 repositories. This can be explained by the fact that a large majority of the 121385 repositories we collected have been deleted since the events were archived.

Moreover, we added a collection descriptionfile. This collection contains documents of the form {_id: URL, file: CONTENT} where URL is the URL of an R repository and CONTENT is the content of its DESCRIPTION file, if any.
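
A minimal sketch of how this collection could be populated with the get_description_file_for function defined above (extracting the owner/name pair from the stored URL is an assumption):


In [ ]:
for repo_url in db.events.distinct('repository.url'):
    document = {'_id': repo_url}
    # The full name 'owner/name' is the path part of the URL.
    content = get_description_file_for(repo_url.split('github.com/')[1])
    if content is not None:
        document['file'] = content
    db.descriptionfile.insert(document)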

> db.descriptionfile.find({file: {$exists: true}}).length()
19052

That is, we collected and identified 19052 packages related to R. This count includes redundant repositories, like rpkg/*, which is an alias for cran/*.

> db.descriptionfile.find({_id: { $regex: /^https:\/\/github.com\/cran/ } } ).length()
6007
> db.descriptionfile.find({_id: { $regex: /^https:\/\/github.com\/rpkg/ } } ).length()
4423

Identifying R packages from CRAN, RPKG and the others

Say we are interested in identifying which R packages are hosted by CRAN on GitHub:


In [ ]:
import pymongo

# Assuming we are locally running a MongoDB instance
db = pymongo.MongoClient().r

# Contains 3-uples ('https://github.com', OWNER_NAME, REPOSITORY_NAME).
# ('fields' is the pymongo 2.x projection argument.)
packages = [doc['_id'].rsplit('/', 2)
            for doc in db.descriptionfile.find({'file': {'$exists': True}}, fields=['_id'])]

# Filter CRAN, RPKG and the others
cran_names = [name for site, owner, name in packages if owner == 'cran']
rpkg_names = [name for site, owner, name in packages if owner == 'rpkg']
other_names = [name for site, owner, name in packages if owner not in ('cran', 'rpkg')]

# len(other_names) == 8643

cran_set = set(cran_names)
rpkg_set = set(rpkg_names)
other_set = set(other_names)

# Names that are in neither cran_set nor rpkg_set: 5068 items
outside_only = other_set.difference(cran_set).difference(rpkg_set)

# Names that are also in cran_set: 1210 items
cran_too = other_set.intersection(cran_set)

# Names that are also in rpkg_set: 753 items
rpkg_too = other_set.intersection(rpkg_set)

# Names from cran that are in rpkg too: 4345 items (rpkg has 4410 items!)
rpkg_cran = rpkg_set.intersection(cran_set)
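
As a quick check of the figures quoted in the comments above:


In [ ]:
print len(outside_only), len(cran_too), len(rpkg_too), len(rpkg_cran)
# 5068 1210 753 4345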