Data Merging

We collected data from GitHub, BioConductor and CRAN. In this notebook, we merge those datasets into a single file that will be used for our analysis. The expected format of the file is:

  • Package : name of the package
  • Version : version for the meta-data
  • Source : github, cran or bioconductor
  • Date : either the value of CommitDate, SnapshotFirstDate or BiocDate, based on the value of Source
  • License : the license
  • Suggests : a white-space separated list of suggested dependencies
  • Imports : a white-space separated list of imported packages
  • Depends : a white-space separated list of dependencies
  • Owner : (github only) name of the owner of the repository
  • Repository : (github only) name of the repository
  • CommitDate : (github only) date of the commit containing the meta-data
  • CRANRelease : (cran only) file server date of the file containing the meta-data
  • SnapshotFirstDate : (cran only) date of the first snapshot containing this version
  • SnapshotLastDate : (cran only) date of the last snapshot containing this version
  • BiocDate : (bioc only) date of the BioConductor release including this version
  • BiocVersion : (bioc only) version of this BioConductor release
  • BiocCategory : (bioc only) either Software, Annotation Data or Experiment Data

In [9]:
fields = ['Package', 'Version', 'Source', 'Date', 'License', 'Suggests', 'Imports', 'Depends', 'Owner', 
 'Repository', 'CommitDate', 'CRANRelease', 'SnapshotFirstDate', 'SnapshotLastDate', 'BiocDate',
 'BiocVersion', 'BiocCategory']

OUTPUT = '../data/github-cran-bioc-alldata.csv'

In [10]:
import pandas

github = pandas.read_csv('../data/github-raw-2015-05-04.csv', index_col=0)
cran = pandas.read_csv('../data/cran-deps-history-2015-04-20.csv')
bioc = pandas.read_csv('../data/bioconductor-2015-05-05.csv', index_col=0)

Note: the following step is currently disabled. For data coming from GitHub, we could do some preprocessing: if a pair (package, version) has many instances, we keep the oldest one.


In [11]:
# github = github.sort_values('CommitDate')
# github = github.drop_duplicates(['Package', 'Version'], keep='first')

The same disabled step would apply to BioConductor.


In [12]:
# bioc = bioc.sort_values('BiocDate')
# bioc = bioc.drop_duplicates(['Package', 'Version'], keep='first')

The following function parses a dependency string and returns a list of package names.


In [13]:
def parse_dependencies(str_list, ignored=()):
    """
    Return a list of strings where each string is a package name not in `ignored`.
    The input is a comma-separated dependency list as contained in a DESCRIPTION file.
    """
    # NaN never compares equal to itself, so a check like `str_list != pandas.np.nan`
    # would always be True; use pandas.isnull to detect missing values instead.
    if pandas.isnull(str_list):
        str_list = ''

    # Strip version constraints such as "pkg (>= 1.0)", then drop empty
    # and ignored entries
    items = [dep.split('(')[0].strip() for dep in str_list.split(',')]
    return [item for item in items if item and item not in ignored]
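As a quick sanity check, the splitting rules can be exercised in isolation (a self-contained re-implementation of the same logic, with a hypothetical name `parse_deps`):

```python
def parse_deps(value, ignored=()):
    """Split a DESCRIPTION-style dependency string into bare package names."""
    if value is None or value != value:  # NaN is the only value that differs from itself
        value = ''
    # Drop version constraints such as "(>= 3.0.0)" and surrounding whitespace
    names = [part.split('(')[0].strip() for part in value.split(',')]
    return [name for name in names if name and name not in ignored]

print(parse_deps('R (>= 3.0.0), methods, utils'))  # ['R', 'methods', 'utils']
print(parse_deps(float('nan')))                    # []
print(parse_deps('R, stats', ignored=('R',)))      # ['stats']
```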

We now merge the three datasets into one big dataset, and apply some processing (parse_dependencies).


In [14]:
cran['Source'] = 'cran'
cran['Date'] = cran['SnapshotFirstDate']
github['Source'] = 'github'
github['Date'] = github['CommitDate']
bioc['Source'] = 'bioc'
bioc['Date'] = bioc['BiocDate']

# Merge
packages = pandas.concat([cran, github, bioc])

# Deal with dependencies lists
dependencies_formatter = lambda x: ' '.join(parse_dependencies(x))
for field in ['Suggests', 'Imports', 'Depends']:
    packages[field] = packages[field].fillna(value='').apply(dependencies_formatter)

# Convert date
packages['Date'] = pandas.to_datetime(packages['Date'])

# Remove useless packages (see http://cran.r-project.org/doc/manuals/r-release/R-exts.html#Creating-R-packages)
# The mandatory ‘Package’ field gives the name of the package. 
# This should contain only (ASCII) letters, numbers and dot, have at least two characters and 
# start with a letter and not end in a dot. 
packages = packages.dropna(subset=['Version', 'Package', 'Date'])
packages = packages[packages.Package.str.match(r'^[a-zA-Z][a-zA-Z0-9.]*[a-zA-Z0-9]$')]

output = packages[fields].sort_values('Package')
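The naming rule quoted from the manual (ASCII letters, digits and dot, at least two characters, starting with a letter and not ending in a dot) can be checked on a few examples; `NAME_RE` is a hypothetical name for a regex implementing that rule:

```python
import re

# Package-name rule from "Writing R Extensions": ASCII letters, digits and dot,
# at least two characters, starts with a letter, does not end in a dot.
NAME_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9.]*[a-zA-Z0-9]$')

valid = ['ggplot2', 'data.table', 'MASS']
invalid = ['x', '1pkg', 'bad.', 'has space']

print([bool(NAME_RE.match(n)) for n in valid])    # [True, True, True]
print([bool(NAME_RE.match(n)) for n in invalid])  # [False, False, False, False]
```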

In [15]:
output.to_csv(OUTPUT, encoding='utf-8')