This notebook retrieves and collects all the available (useful) data from BioConductor. In particular, it collects the R packages' metadata for a given set of versions.
In [8]:
import pandas
import requests
import json
import BeautifulSoup as bs
from datetime import date
from itertools import repeat
Since we will retrieve a lot of data, we can benefit from IPython's parallel computing tools.
To use this notebook, you need either to configure your IPController or to start a cluster of IPython nodes, for example with ipcluster start -n 4.
See https://ipython.org/ipython-doc/dev/parallel/parallel_process.html for more information.
Recent versions of IPython Notebook can also start a cluster directly from the web interface, under the Clusters tab.
In [9]:
from IPython import parallel
clients = parallel.Client()
clients.block = True # synchronous computations
print 'Clients:', str(clients.ids)
We first define a set of constants that store several URLs. As BioConductor splits its packages into three categories (Software, AnnotationData and ExperimentData), each dictionary stores a distinct URL template per category.
In [10]:
CATEGORIES = ['Software', 'AnnotationData', 'ExperimentData']

# Base list, not used in this notebook. Useful for a human to browse the
# packages in a readable format.
BASE_LIST = {
    'Software': 'http://bioconductor.org/packages/{version}/BiocViews.html#___Software',
    'AnnotationData': 'http://bioconductor.org/packages/{version}/BiocViews.html#___AnnotationData',
    'ExperimentData': 'http://bioconductor.org/packages/{version}/BiocViews.html#___ExperimentData'
}

# Lists that will be parsed. These are used by BioConductor to populate BASE_LIST.
JSON_LIST = {
    'Software': 'http://bioconductor.org/packages/json/{version}/bioc/packages.js',
    'AnnotationData': 'http://bioconductor.org/packages/json/{version}/data/annotation/packages.js',
    'ExperimentData': 'http://bioconductor.org/packages/json/{version}/data/experiment/packages.js'
}

# Details page for every package.
PACKAGE_DETAILS = {
    'Software': 'http://bioconductor.org/packages/{version}/bioc/html/{name}.html',
    'AnnotationData': 'http://bioconductor.org/packages/{version}/data/annotation/html/{name}.html',
    'ExperimentData': 'http://bioconductor.org/packages/{version}/data/experiment/html/{name}.html'
}

# Available versions and the corresponding release date.
VERSIONS = [
    ('1.6', '2005-05-18'),
    ('1.7', '2005-10-14'),
    ('1.8', '2006-04-27'),
    ('1.9', '2006-10-04'),
    ('2.0', '2007-04-26'),
    ('2.1', '2007-10-08'),
    ('2.2', '2008-05-01'),
    ('2.3', '2008-10-22'),
    ('2.4', '2009-04-21'),
    ('2.5', '2009-10-28'),
    ('2.6', '2010-04-23'),
    ('2.7', '2010-10-18'),
    ('2.8', '2011-04-14'),
    ('2.9', '2011-11-01'),
    ('2.10', '2012-04-02'),
    ('2.11', '2012-10-03'),
    ('2.12', '2013-04-04'),
    ('2.13', '2013-10-15'),
    ('2.14', '2014-04-14'),
    ('3.0', '2014-10-14'),
    # ('3.1', '2015-04-17'),
]

# The pages of versions <2.5 do not have the same structure.
VERSIONS = filter(lambda x: x[1] >= '2009-10-28', VERSIONS)

# Metadata we're interested in.
METADATA = ['Version', 'License', 'Depends', 'Imports', 'Suggests']

# Output file
FILENAME = '../data/bioconductor-{date}.csv'.format(date=date.today().isoformat())
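Two quick sanity checks on these constants (with a shortened, hand-picked version list): ISO-8601 date strings compare correctly as plain strings, which is why the filter on VERSIONS can use a simple string comparison, and the URL templates expand with str.format.

```python
# ISO-8601 dates sort lexicographically, so string comparison keeps
# exactly the releases from 2.5 (2009-10-28) onwards.
VERSIONS = [('2.4', '2009-04-21'), ('2.5', '2009-10-28'), ('2.6', '2010-04-23')]
kept = [v for v in VERSIONS if v[1] >= '2009-10-28']
assert kept == [('2.5', '2009-10-28'), ('2.6', '2010-04-23')]

# The URL templates expand with str.format; limma is just an example name.
template = 'http://bioconductor.org/packages/{version}/bioc/html/{name}.html'
url = template.format(version='3.0', name='limma')
assert url == 'http://bioconductor.org/packages/3.0/bioc/html/limma.html'
```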
In [11]:
def metadata_for_packages(category, name, version):
    """
    Return a subset of the metadata that is available for this package.
    The subset is built upon the items in METADATA.
    """
    try:
        content = requests.get(PACKAGE_DETAILS[category].format(version=version, name=name)).content
        soup = bs.BeautifulSoup(content)
        table = soup.find(name='table', attrs={'class': 'details'})
        data = {}
        for row in table.findChildren('tr'):
            key, value = row.findChildren('td')
            if key.text in METADATA:
                data[key.text] = value.text
        return data
    except Exception:
        print 'Exception while working on', name, version, 'in', category
        raise
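The table-walking logic above is easy to check without the network. The sketch below applies the same key/value extraction to a hand-made sample of the "details" table, using the standard library's ElementTree instead of BeautifulSoup (the real pages are parsed with BeautifulSoup because they are not guaranteed to be well-formed XML; the sample here is).

```python
import xml.etree.ElementTree as ET

# Hand-made, well-formed sample of a package details table.
SAMPLE = """
<table class="details">
  <tr><td>Version</td><td>1.2.3</td></tr>
  <tr><td>License</td><td>Artistic-2.0</td></tr>
  <tr><td>Maintainer</td><td>Someone</td></tr>
</table>
"""
METADATA = ['Version', 'License', 'Depends', 'Imports', 'Suggests']

table = ET.fromstring(SAMPLE)
data = {}
for row in table.iter('tr'):
    key, value = row.findall('td')
    # Keep only the metadata fields we care about; 'Maintainer' is dropped.
    if key.text in METADATA:
        data[key.text] = value.text

assert data == {'Version': '1.2.3', 'License': 'Artistic-2.0'}
```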
In [12]:
def packages_list(category, version):
    """
    Return the list of available packages for the given version in the given category.
    """
    content = requests.get(JSON_LIST[category].format(version=version)).content
    # Strip the trailing semicolon, then drop the JavaScript variable declaration.
    _, content = content[:-1].split(' = ', 1)
    content = json.loads(content)
    return map(lambda x: x[0], content['content'])
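The packages.js files are JavaScript assignments rather than bare JSON, which is why the function strips everything up to the first ' = ' and the trailing semicolon before calling json.loads. A sketch with a hand-made payload that mimics their shape (the variable name and package entries are illustrative, not the real file contents):

```python
import json

# Hand-made stand-in for a packages.js payload: a JS assignment whose
# right-hand side is JSON, terminated by a semicolon.
raw = 'var full_packages_list = {"content": [["limma", "..."], ["edgeR", "..."]]};'

# Same stripping as packages_list: drop ';', split off the declaration.
_, payload = raw[:-1].split(' = ', 1)
packages = [row[0] for row in json.loads(payload)['content']]
assert packages == ['limma', 'edgeR']
```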
We now have everything we need to retrieve the data from BioConductor: packages_list returns the list of package names for every category in CATEGORIES and every version in VERSIONS, and metadata_for_packages then fetches the metadata of each of those packages.
In [13]:
def get_data_for(category, date, version, package):
    pkg_data = metadata_for_packages(category, package, version)
    pkg_data['Package'] = package
    pkg_data['BiocVersion'] = version
    pkg_data['BiocDate'] = date
    pkg_data['BiocCategory'] = category
    return pkg_data

data = []

# Make the required modules and objects available on every engine.
clients[:].execute('import requests')
clients[:].execute('import BeautifulSoup as bs')
export = ['metadata_for_packages', 'PACKAGE_DETAILS', 'METADATA']
for name in export:
    clients[:][name] = eval(name)

balanced = clients.load_balanced_view()

for version, date in VERSIONS:
    print 'BioConductor version', version
    for category in CATEGORIES:
        packages = packages_list(category, version)
        n = len(packages)
        print 'Version', version, '-', n, 'items retrieved for', category
        new_data = balanced.map(get_data_for, repeat(category, n), repeat(date, n), repeat(version, n), packages)
        data += new_data
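The balanced.map call pairs each package name with the constant category, date and version via itertools.repeat. The serial sketch below shows the same pairing with the built-in map and a toy stand-in for get_data_for (fake_get_data_for is hypothetical, it skips the HTTP request).

```python
from itertools import repeat

# Toy stand-in for get_data_for: builds the record without scraping.
def fake_get_data_for(category, date, version, package):
    return {'BiocCategory': category, 'BiocDate': date,
            'BiocVersion': version, 'Package': package}

packages = ['limma', 'edgeR']
n = len(packages)

# repeat() pins the constant arguments while the package names vary,
# exactly as in the balanced.map call above.
rows = list(map(fake_get_data_for,
                repeat('Software', n), repeat('2014-10-14', n),
                repeat('3.0', n), packages))

assert [r['Package'] for r in rows] == ['limma', 'edgeR']
assert all(r['BiocVersion'] == '3.0' for r in rows)
```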
In [14]:
# Save in .csv file using pandas
df = pandas.DataFrame(data)
df = df[['Package'] + METADATA + ['BiocCategory', 'BiocVersion', 'BiocDate']]
df.to_csv(FILENAME)
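One convenient property of building the DataFrame from a list of dicts: pandas aligns the records by key, so packages for which some METADATA fields were absent from the details page simply get NaN in those columns. A sketch with toy records standing in for the scraped data:

```python
import pandas

# Toy records: the second package has no 'License' entry scraped.
records = [
    {'Package': 'limma', 'Version': '3.22.1', 'License': 'GPL (>=2)'},
    {'Package': 'edgeR', 'Version': '3.8.2'},
]

# Missing keys become NaN; column selection fixes the output order.
df = pandas.DataFrame(records)[['Package', 'Version', 'License']]
assert int(df['License'].isnull().sum()) == 1
```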