Evolution of package counts over time across the different sources

This notebook gives an overview of how the number of packages in each of the sources considered evolves over time.



In [1]:

    
%matplotlib inline
from IPython.display import set_matplotlib_formats
import matplotlib.pyplot as plt
#set_matplotlib_formats('svg')

import pandas

We start by loading the data from the different sources. The file R-Packages.csv contains the data needed for GitHub: the list of packages present on GitHub (column github == 1), together with the creation date of each repository.

The file cran-number-packages.csv contains, for each date, the number of packages present in that release.

Finally, the file bioconductor-number-packages.csv contains, for each release/date, the number of packages present in that release for each category (software, experiment and annotation).



In [2]:

    
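# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 and
# parse_dates=True reproduces its defaults.
github_ = pandas.read_csv('../data/R-Packages.csv', index_col=0, parse_dates=True)
bioconductor_ = pandas.read_csv('../data/bioconductor-number-packages.csv', header=None, index_col=0, parse_dates=True)
cran_ = pandas.read_csv('../data/cran-number-packages.csv', index_col=0, parse_dates=True)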



In [3]:

    
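# Packages present on both CRAN and GitHub (rows flagged canonical == 1).
# .copy() avoids assigning into a view of github_.
github_cran = github_.query('cran == 1 and github == 1 and canonical == 1')[['creation']].copy()
github_cran['creation'] = pandas.to_datetime(github_cran['creation'])
github_cran = github_cran.set_index('creation')
# One row per package, so the cumulative sum over sorted dates is a running count.
github_cran['cran github'] = 1
github_cran = github_cran.sort_index()
github_cran = github_cran.cumsum()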



In [4]:

    
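# Packages present on GitHub only: neither on CRAN nor on Bioconductor.
# .copy() avoids assigning into a view of github_.
github = github_.query('github == 1 and cran != 1 and bioconductor != 1 and canonical == 1')[['creation']].copy()
github['creation'] = pandas.to_datetime(github['creation'])
github = github.set_index('creation')
# One row per package, so the cumulative sum over sorted dates is a running count.
github['github'] = 1
github = github.sort_index()
github = github.cumsum()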
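
The two cells above share one pattern for turning a list of repository creation dates into a running package count: index by date, add a column of ones, sort chronologically, and take the cumulative sum. The following is a minimal sketch of that pattern on made-up dates; the `toy` frame and its values are purely illustrative and are not taken from R-Packages.csv.

```python
import pandas

# Hypothetical creation dates, deliberately out of order.
toy = pandas.DataFrame({'creation': ['2014-03-01', '2013-11-15', '2014-01-20']})
toy['creation'] = pandas.to_datetime(toy['creation'])

# Index by date, count one package per row, then accumulate chronologically.
toy = toy.set_index('creation')
toy['github'] = 1
toy = toy.sort_index().cumsum()

# After sorting, each row holds the number of packages created up to that date.
running_counts = toy['github'].tolist()
print(toy)
```

Sorting before `cumsum()` matters: without it the running total would follow the file order rather than the timeline.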



In [5]:

    
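bioconductor = bioconductor_.rename(columns={1: 'BiocSoft', 2: 'BiocAnnotation', 3: 'BiocExperiment', 4: 'date'})
# Only software packages are counted; the dataset categories are left out.
# bioconductor['BiocDatasets'] = bioconductor['BiocAnnotation'] + bioconductor['BiocExperiment']
bioconductor['Bioconductor'] = bioconductor['BiocSoft'] # + bioconductor['BiocDatasets']
# Assumption: column 4 holds date strings. Parsing them gives a DatetimeIndex
# that can be joined with the other series.
bioconductor['date'] = pandas.to_datetime(bioconductor['date'])
bioconductor = bioconductor.set_index('date')
bioconductor = bioconductor.sort_index()[['Bioconductor']]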



In [6]:

    
cran = cran_[['cran']]
cran = cran.sort_index()

Now we merge the data from all sources so that it can be plotted.



In [8]:

    
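# Outer-join all series on their date index, then take monthly values and carry
# the last known count forward. Recent pandas no longer accepts
# resample(..., fill_method='pad'); mean() followed by ffill() is assumed here
# to match the old default behaviour. Newer releases spell '1M' as '1ME'.
packages = (cran.join(bioconductor, how='outer')
                .join(github, how='outer')
                .join(github_cran, how='outer')
                .resample('1M').mean().ffill())
t = packages.loc['2013-09-03':'2014-12-31', ['cran', 'github', 'cran github']].plot(
    style=['blue', 'red', 'purple'], logy=True, figsize=(7,5))

t.set_xlabel('date')
t.set_ylabel('number of packages (logarithmic scale)')
t.grid(True, "both")

# Raw strings keep the backslashes intact in the set-difference and mathtext labels.
t.legend((r'CRAN', r'GitHub \ CRAN', r'GitHub $\cap$ CRAN'), bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.savefig("packages_number.svg", bbox_inches="tight")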
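
The merge in the cell above relies on two steps: an outer join, which keeps every date from every source, and a forward fill, which carries the last known count into the gaps. A minimal sketch of those two steps on made-up numbers may help; the `cran_toy` and `github_toy` frames are hypothetical and stand in for the real series.

```python
import pandas

# Hypothetical monthly CRAN counts and a sparser GitHub series.
cran_toy = pandas.DataFrame(
    {'cran': [100, 110, 120]},
    index=pandas.to_datetime(['2014-01-31', '2014-02-28', '2014-03-31']))
github_toy = pandas.DataFrame(
    {'github': [5, 9]},
    index=pandas.to_datetime(['2014-01-31', '2014-03-31']))

# The outer join keeps every date from either side and leaves NaN
# where a source has no observation for that date.
merged = cran_toy.join(github_toy, how='outer')

# Forward filling carries the last known count into those gaps.
filled = merged.ffill()
print(filled)
```

Because the counts are cumulative, reusing the previous value is a reasonable stand-in for a missing observation. An inner join would instead drop every date not shared by all sources.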


