We extracted from GitHubArchive a list of GitHub events for 2013 and 2014, covering all repositories tagged with "Language = R". Based on this list, we checked on February 17th, for every repository, whether it still exists and whether it contains a DESCRIPTION file at its root.
The results were collected (among other data) in ../../IWSECO2015/data/github-RPackage-repository.csv.
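The check itself is not part of this notebook. As a rough sketch of what such a test could look like, the snippet below builds the raw-content URL of a root-level DESCRIPTION file and treats an HTTP 200 response as "present". The function names, the `master` default branch and the `fetch_status` callable are all assumptions made for illustration, not the method actually used to produce the CSV file.

```python
def description_url(owner, repository, branch="master"):
    """Build the raw-content URL of a root-level DESCRIPTION file.

    The default branch name is an assumption; repositories may use another one.
    """
    return "https://raw.githubusercontent.com/{}/{}/{}/DESCRIPTION".format(
        owner, repository, branch)


def has_description(owner, repository, fetch_status, branch="master"):
    """Return True if the DESCRIPTION file appears to be reachable.

    fetch_status is a hypothetical callable that maps a URL to an HTTP
    status code, so the network layer can be swapped out or mocked.
    """
    return fetch_status(description_url(owner, repository, branch)) == 200
```

Passing the status lookup as a parameter keeps the sketch independent of any particular HTTP library and makes it easy to try without network access.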
In [7]:
import pandas
from datetime import date
In [8]:
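# DataFrame.from_csv was removed from recent pandas; these read_csv arguments reproduce its defaults.
df = pandas.read_csv('../../IWSECO2015/data/github-RPackage-repository.csv', index_col=0, parse_dates=True)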
We first keep only the repositories in this list that are not forks. If two (or more) repositories contain the same package, we keep the data from the oldest one. We also filter out repositories that belong to cran or rpkg, as those are only (partial) mirrors of CRAN.
In [9]:
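# Drop forks and CRAN mirrors, then sort by creation date so that drop_duplicates keeps the oldest repository per package.
df = df.query('fork == False and owner != "cran" and owner != "rpkg"').sort_values('creation').drop_duplicates('Package')
print(len(df), 'packages')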
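To make the effect of this filtering step concrete, here is a small self-contained sketch on made-up data. The owners, package names and dates are invented for illustration; only the column names (owner, fork, Package, creation) come from the cell above. Dates are given as ISO strings, which sort chronologically when compared as text.

```python
import pandas

# Hypothetical toy data: three repositories claim package "foo".
toy = pandas.DataFrame({
    'owner': ['alice', 'bob', 'cran', 'carol', 'dave'],
    'fork': [False, False, False, True, False],
    'Package': ['foo', 'foo', 'foo', 'bar', 'baz'],
    'creation': ['2014-03-01', '2013-06-15', '2012-01-01',
                 '2013-01-01', '2014-09-30'],
})

kept = (
    toy[~toy['fork'] & ~toy['owner'].isin(['cran', 'rpkg'])]  # no forks, no mirrors
    .sort_values('creation')      # oldest repository first
    .drop_duplicates('Package')   # keeps the first (oldest) row per package
)
print(kept[['owner', 'Package', 'creation']])
```

In this toy table, the cran copy of foo is discarded even though it is the oldest, carol's fork of bar is dropped, and bob's repository wins over alice's for foo because it was created earlier.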
In [11]:
df.to_csv('../data/github-repositories-{date}.csv'.format(date=date.today().isoformat()))