The file ../data/github-repositories-2015-02-17.csv contains a list of GitHub repositories that are candidates to store a package related to R. Those candidates were collected from the activity on GitHub between 2013 and 2014 (inclusive). They all contain a DESCRIPTION file at the root of the repository.
We git clone-ed each of those repositories. This notebook parses those git repositories and extracts the DESCRIPTION file of each commit.
In [1]:
import pandas
from datetime import date
We will make use of the following commands:
git clone <url> <path>
to clone each repository, then
git log --follow --format="%H/%ci" <path>
where <path> is DESCRIPTION. The output of this command is a list of (commit SHA, commit date) pairs, one per commit that touched the file. Finally,
git show <commit>:<path>
where <path> is DESCRIPTION. This command outputs the content of the file at the given commit.
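For instance, assuming a repository has already been cloned under /tmp/example-repo (a hypothetical path), the last two commands can be driven from Python with subprocess. This is a minimal sketch of what the extraction function below does; here we pass cwd= to subprocess instead of chdir-ing into the repository, but the effect is the same.
import subprocess

REPO = '/tmp/example-repo'  # hypothetical clone of a candidate repository

# List the commits that touched DESCRIPTION: one 'sha/date' line per commit
commits = subprocess.check_output(
    ['git', 'log', '--follow', '--format=%H/%ci', '--', 'DESCRIPTION'],
    cwd=REPO)
for line in commits.splitlines():
    if not line.strip():
        continue
    sha, date = [part.strip() for part in line.split('/')]
    # Retrieve the DESCRIPTION file as it was at this commit
    content = subprocess.check_output(['git', 'show', sha + ':DESCRIPTION'],
                                      cwd=REPO)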
In [2]:
github = pandas.DataFrame.from_csv('../data/github-repositories-2015-02-17.csv')
repositories = github[['owner', 'repository']]
# Date is 2015-05-04 because we love Star Wars and because we cloned the repositories on that date
FILENAME = '../data/github-raw-{date}.csv'.format(date='2015-05-04')
# Root of the directory where the repositories were collected
GIT_DIR = '/data/github/'
Since we will retrieve a lot of data, we can benefit from IPython's parallel computing tools.
To use this notebook, you need to either configure your IPController or start a cluster of IPython nodes, using e.g. ipcluster start -n 4
. See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.
Recent versions of IPython Notebook can also start a cluster directly from the web interface, under the Clusters tab.
In [3]:
from IPython import parallel
clients = parallel.Client()
clients.block = False # asynchronous computations
print 'Clients:', str(clients.ids)
In [4]:
def get_data_from((owner, repository)):
    # Move to the target directory
    try:
        os.chdir(os.path.join(GIT_DIR, owner, repository))
    except OSError as e:
        # Happens when the directory does not exist
        return []
    data_list = []
    # Get the commits that touched DESCRIPTION
    try:
        commits = subprocess.check_output(['git', 'log', '--format=%H/%ci', '--', 'DESCRIPTION'])
    except subprocess.CalledProcessError as e:
        # Should not happen!?
        raise Exception(owner + ' ' + repository + '/ log : ' + e.output)
    for commit in [x for x in commits.split('\n') if len(x.strip()) != 0]:
        commit_sha, date = map(lambda x: x.strip(), commit.split('/'))
        # Get the file content at this commit
        try:
            content = subprocess.check_output(['git', 'show', '{id}:{path}'.format(id=commit_sha, path='DESCRIPTION')])
        except subprocess.CalledProcessError as e:
            # Happens when DESCRIPTION does not exist at this commit
            # (e.g. the commit removed it). Silently ignore.
            continue
        try:
            metadata = deb822.Deb822(content.split('\n'))
        except Exception as e:
            # I don't know which exceptions Deb822 may throw!
            continue  # Skip this commit
        data = {}
        for md in ['Package', 'Version', 'License', 'Imports', 'Suggests', 'Depends']:
            data[md] = metadata.get(md, '')
        data['CommitDate'] = date
        data['Owner'] = owner
        data['Repository'] = repository
        data_list.append(data)
    # Return to the root directory
    os.chdir(GIT_DIR)
    return data_list
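To make the Deb822 step above concrete: DESCRIPTION files use the same "Field: value" layout as Debian control files, which is why python-debian can parse them. A minimal sketch on made-up DESCRIPTION content, mirroring the fields extracted above:
from debian import deb822

# Made-up DESCRIPTION content, for illustration only
sample = '''Package: mypackage
Version: 0.1
License: MIT
Imports: ggplot2, dplyr
'''
metadata = deb822.Deb822(sample.split('\n'))
print metadata.get('Package', '')  # mypackage
print metadata.get('Imports', '')  # ggplot2, dplyr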
In [5]:
data = []
clients[:].execute('import subprocess, os')
clients[:].execute('from debian import deb822')
clients[:]['GIT_DIR'] = GIT_DIR
balanced = clients.load_balanced_view()
items = [(owner, repo) for idx, (owner, repo) in repositories.iterrows()]
print len(items), 'items'
res = balanced.map(get_data_from, items, ordered=False, timeout=15)
import time
while not res.ready():
    time.sleep(5)
    print res.progress, ' ',
for result in res.result:
    data.extend(result)
In [7]:
df = pandas.DataFrame.from_records(data)
df.to_csv(FILENAME, encoding='utf-8')
print len(df), 'items'
print len(df.drop_duplicates(['Package'])), 'packages'
print len(df.drop_duplicates(['Owner', 'Repository'])), 'repositories'
print len(df.drop_duplicates(['Package', 'Version'])), 'pairs (package, version)'
In [ ]:
df