"Popular" R-related repositories

Using the PushEvent events associated with R-related repositories, we identify the most "popular" repositories (by activity, by stars, by forks, etc.).


In [1]:
import pandas
import pymongo
import dateutil.parser

conn = pymongo.MongoClient()

def _from_json_to_flat(e):
    # Flatten a raw event document into a single-level dict.
    repo = e['repository']
    return {'created_at': dateutil.parser.parse(e['created_at']),
            'owner': repo['owner'],
            'repository': repo['name'],
            'is_a_fork': repo['fork'],
            'forks': repo['forks'],
            'starred': repo['stargazers'],
            'watchers': repo['watchers']}

# Lazily flatten every PushEvent, fetching only the fields used above.
# Python 3's built-in map is lazy, like itertools.imap was in Python 2.
# PyMongo 3+ calls this argument `projection` (it was `fields` in PyMongo 2).
events = map(_from_json_to_flat,
             conn.r.events.find({'type': 'PushEvent'},
                                projection=['created_at', 'repository.name', 'repository.owner',
                                            'repository.fork', 'repository.forks',
                                            'repository.stargazers', 'repository.watchers']))
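
To make the flattening step concrete, here is a minimal sketch that applies the same mapping to a single hand-written event, without any database. The event below is hypothetical: its field values and the helper name `flatten_event` are assumptions made for illustration, and only the field names come from the cell above.

```python
import dateutil.parser

def flatten_event(e):
    # Same mapping as _from_json_to_flat: one nested event -> one flat record.
    repo = e['repository']
    return {'created_at': dateutil.parser.parse(e['created_at']),
            'owner': repo['owner'],
            'repository': repo['name'],
            'is_a_fork': repo['fork'],
            'forks': repo['forks'],
            'starred': repo['stargazers'],
            'watchers': repo['watchers']}

# Hypothetical document, shaped the way the query is assumed to return it.
sample_event = {
    'type': 'PushEvent',
    'created_at': '2014-01-15T10:30:00Z',
    'repository': {'owner': 'someowner', 'name': 'somerepo', 'fork': False,
                   'forks': 3, 'stargazers': 12, 'watchers': 12},
}

flat = flatten_event(sample_event)
print(flat['owner'], flat['repository'], flat['created_at'].date(), flat['starred'])
```

Each flat record becomes one row of `df_events`, so every column name used later in the notebook is a key of this dict.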

In [2]:
# This should take some time (expect 683425 items)
df_events = pandas.DataFrame(list(events))

In [3]:
# Which ones are R packages?
packages = pandas.read_csv('../data/github-description.csv').query('key == "Package"')
packages = packages.rename(columns={'value': 'package_name'})
packages = packages[['owner', 'repository', 'package_name']].copy()

# Add a flag
packages['package'] = True
# Left join: every event is kept; non-package repositories get NaN
df = df_events.merge(packages, how='left', on=['owner', 'repository'])

# Complete flag
df = df.fillna({'package': False})
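
The flagging logic can be checked on a toy example. This is only a sketch: the two small frames below are invented for illustration and stand in for `df_events` and the package list read from the CSV file.

```python
import pandas

# Invented stand-in for df_events
toy_events = pandas.DataFrame({
    'owner': ['alice', 'bob'],
    'repository': ['pkgA', 'notes'],
    'forks': [5, 2],
})

# Invented stand-in for the package list; only alice/pkgA is a package
toy_packages = pandas.DataFrame({
    'owner': ['alice'],
    'repository': ['pkgA'],
    'package_name': ['pkgA'],
})
toy_packages['package'] = True

# Left join keeps both events; bob/notes has no match, so its flag is NaN
merged = toy_events.merge(toy_packages, how='left', on=['owner', 'repository'])

# Turn the missing flags into False, leaving package_name as NaN
merged = merged.fillna({'package': False})
print(merged)
```

Passing a dict to `fillna` restricts the fill to the `package` column, which is why `package_name` stays NaN for repositories that are not packages.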

In [4]:
# Aggregate by month
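# Map every event to the first day of its month.
# (pandas.datetime and pandas.np are no longer available in recent pandas.)
df['date_agg'] = df['created_at'].apply(lambda d: pandas.Timestamp(d.year, d.month, 1))
# Group keys must be a list; a tuple is read as a single key in recent pandas.
groups = df.groupby(['date_agg', 'owner', 'repository'], sort=False)
# Keep the highest value seen in the month for every column.
df_agg = groups.aggregate('max').drop('created_at', axis=1)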
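
As an illustration of what the monthly aggregation produces, the sketch below runs the same steps on three invented events for a single repository. The data are made up; only the column names follow the notebook.

```python
import pandas

toy = pandas.DataFrame({
    'created_at': pandas.to_datetime(['2014-01-03', '2014-01-20', '2014-02-02']),
    'owner': ['alice', 'alice', 'alice'],
    'repository': ['pkgA', 'pkgA', 'pkgA'],
    'forks': [5, 7, 8],
})

# First day of the month of each event
toy['date_agg'] = toy['created_at'].apply(lambda d: pandas.Timestamp(d.year, d.month, 1))

# One row per (month, owner, repository), holding the monthly maximum
toy_agg = (toy.groupby(['date_agg', 'owner', 'repository'], sort=False)
              .aggregate('max')
              .drop('created_at', axis=1))
print(toy_agg)
```

The two January events collapse into one row with 7 forks, the largest value observed that month. Since counters such as forks mostly grow over time, the monthly maximum is a reasonable proxy for the value at the end of the month.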

In [5]:
def topmost(df, date, criterion, limit=10):
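    # Rows for the given month, ordered by the criterion, best first.
    # (DataFrame.sort was removed from pandas; sort_values replaces it.)
    return df.loc[date].sort_values(criterion, ascending=False)[:limit]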

The topmost function can now be used to rank the repositories.

For example, topmost(df_agg, '2014-01-01', 'forks') returns the 10 most-forked repositories at the end of January 2014. Restricting the ranking to packages is straightforward: topmost(df_agg[df_agg['package'] == True], '2014-01-01', 'forks').

In particular, let us see whether some rankings change at key dates (as a reminder, each date is the first day of a month and stands for the values reached by the end of that month). There are essentially three key moments in package activity: a very small peak in November/December 2013, and two increases in activity, in February/March 2014 and in October 2014.

The dates retained are therefore:
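
The two calls described above can be tried on a small invented table. This is a sketch under stated assumptions: the rows are made up, `topmost` is re-declared with `sort_values` so that the block runs on its own, and the date is passed as a `pandas.Timestamp` so the example does not depend on how string keys are resolved against the index.

```python
import pandas

def topmost(df, date, criterion, limit=10):
    # Rows of the given month, ordered by the criterion, best first
    return df.loc[date].sort_values(criterion, ascending=False)[:limit]

# Invented monthly aggregates, indexed like df_agg
rows = [
    ('2014-01-01', 'alice', 'pkgA', 'pkgA', True, 50),
    ('2014-01-01', 'bob', 'notes', None, False, 80),
    ('2014-01-01', 'carol', 'pkgC', 'pkgC', True, 20),
    ('2014-02-01', 'alice', 'pkgA', 'pkgA', True, 60),
]
toy_agg = pandas.DataFrame(rows, columns=['date_agg', 'owner', 'repository',
                                          'package_name', 'package', 'forks'])
toy_agg['date_agg'] = pandas.to_datetime(toy_agg['date_agg'])
toy_agg = toy_agg.set_index(['date_agg', 'owner', 'repository'])

january = pandas.Timestamp('2014-01-01')

# Ranking over all repositories, then over packages only
top_all = topmost(toy_agg, january, 'forks', limit=2)
top_packages = topmost(toy_agg[toy_agg['package'] == True], january, 'forks', limit=2)

print(top_all[['package_name', 'forks']])
print(top_packages[['package_name', 'forks']])
```

Selecting one date drops the date level from the index, so the result is indexed by (owner, repository) only, as in the tables printed by the notebook.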


In [6]:
from IPython.display import display

dates_of_interest = ['2013-11-01', '2014-01-01', '2014-03-01', '2014-09-01', '2014-12-01']
criteria = ['forks', 'watchers', 'starred']

# First pass: all repositories; second pass: packages only
for _df in [df_agg, df_agg[df_agg['package'] == True]]:
    for criterion in criteria:
        print('=====', criterion)
        for date in dates_of_interest:
            print('---', date)
            display(topmost(_df, date, criterion, 5)[['package_name', criterion]])


===== forks
--- 2013-11-01
owner               repository       package_name        forks
yihui               knitr            knitr                 177
hadley              devtools         devtools              135
systematicinvestor  SIT              NaN                    83
yihui               knitr-examples   NaN                    78
hadley              adv-r            NaN                    71
--- 2014-01-01
owner               repository       package_name        forks
yihui               knitr            knitr                 195
hadley              devtools         devtools              149
hadley              adv-r            NaN                   111
rstudio             shiny            shiny                 109
systematicinvestor  SIT              NaN                    91
--- 2014-03-01
owner               repository       package_name        forks
yihui               knitr            knitr                 222
hadley              ggplot2          ggplot2               171
hadley              devtools         devtools              156
hadley              adv-r            NaN                   135
rstudio             shiny            shiny                 123
--- 2014-09-01
owner               repository       package_name        forks
swirldev            swirl_courses    NaN                   700
hadley              httr             httr                  383
rstudio             shiny            shiny                 317
yihui               knitr            knitr                 314
genomicsclass       labs             NaN                   270
--- 2014-12-01
owner               repository       package_name        forks
swirldev            swirl_courses    NaN                  1228
hadley              httr             httr                  517
rstudio             shiny            shiny                 403
yihui               knitr            knitr                 352
genomicsclass       labs             NaN                   287
===== watchers
--- 2013-11-01
owner               repository       package_name     watchers
hadley              devtools         devtools              722
yihui               knitr            knitr                 508
hadley              plyr             plyr                  246
red                 red              NaN                   242
ramnathv            slidify          slidify               239
--- 2014-01-01
owner               repository       package_name     watchers
hadley              devtools         devtools              753
yihui               knitr            knitr                 547
hyperq              hyperq           NaN                   469
rstudio             shiny            shiny                 467
red                 red              NaN                   281
--- 2014-03-01
owner               repository       package_name     watchers
hadley              ggplot2          ggplot2               803
hadley              devtools         devtools              775
yihui               knitr            knitr                 595
rstudio             shiny            shiny                 550
red                 red              NaN                   355
--- 2014-09-01
owner               repository       package_name     watchers
hadley              devtools         devtools              866
rstudio             shiny            shiny                 811
yihui               knitr            knitr                 754
hadley              dplyr            dplyr                 471
swirldev            swirl_courses    NaN                   367
--- 2014-12-01
owner               repository       package_name     watchers
fivethirtyeight     data             NaN                   998
rstudio             shiny            shiny                 942
hadley              devtools         devtools              915
yihui               knitr            knitr                 817
hadley              dplyr            dplyr                 591
===== starred
--- 2013-11-01
owner               repository       package_name      starred
hadley              devtools         devtools              722
yihui               knitr            knitr                 508
hadley              plyr             plyr                  246
red                 red              NaN                   242
ramnathv            slidify          slidify               239
--- 2014-01-01
owner               repository       package_name      starred
hadley              devtools         devtools              753
yihui               knitr            knitr                 547
hyperq              hyperq           NaN                   469
rstudio             shiny            shiny                 467
red                 red              NaN                   281
--- 2014-03-01
owner               repository       package_name      starred
hadley              ggplot2          ggplot2               803
hadley              devtools         devtools              775
yihui               knitr            knitr                 595
rstudio             shiny            shiny                 550
red                 red              NaN                   355
--- 2014-09-01
owner               repository       package_name      starred
hadley              devtools         devtools              866
rstudio             shiny            shiny                 811
yihui               knitr            knitr                 754
hadley              dplyr            dplyr                 471
swirldev            swirl_courses    NaN                   367
--- 2014-12-01
owner               repository       package_name      starred
fivethirtyeight     data             NaN                   998
rstudio             shiny            shiny                 942
hadley              devtools         devtools              915
yihui               knitr            knitr                 817
hadley              dplyr            dplyr                 591
===== forks
--- 2013-11-01
owner               repository       package_name        forks
yihui               knitr            knitr                 177
hadley              devtools         devtools              135
ramnathv            slidify          slidify                57
hadley              plyr             plyr                   42
hadley              testthat         testthat               37
--- 2014-01-01
owner               repository       package_name        forks
yihui               knitr            knitr                 195
hadley              devtools         devtools              149
rstudio             shiny            shiny                 109
hadley              testthat         testthat               46
hadley              plyr             plyr                   43
--- 2014-03-01
owner               repository       package_name        forks
yihui               knitr            knitr                 222
hadley              ggplot2          ggplot2               171
hadley              devtools         devtools              156
rstudio             shiny            shiny                 123
hadley              testthat         testthat               54
--- 2014-09-01
owner               repository       package_name        forks
hadley              httr             httr                  383
rstudio             shiny            shiny                 317
yihui               knitr            knitr                 314
hadley              devtools         devtools              212
hadley              dplyr            dplyr                 131
--- 2014-12-01
owner               repository       package_name        forks
hadley              httr             httr                  517
rstudio             shiny            shiny                 403
yihui               knitr            knitr                 352
hadley              devtools         devtools              239
hadley              dplyr            dplyr                 199
===== watchers
--- 2013-11-01
owner               repository       package_name     watchers
hadley              devtools         devtools              722
yihui               knitr            knitr                 508
hadley              plyr             plyr                  246
ramnathv            slidify          slidify               239
johnmyleswhite      ProjectTemplate  ProjectTemplate       208
--- 2014-01-01
owner               repository       package_name     watchers
hadley              devtools         devtools              753
yihui               knitr            knitr                 547
rstudio             shiny            shiny                 467
hadley              plyr             plyr                  262
johnmyleswhite      ProjectTemplate  ProjectTemplate       220
--- 2014-03-01
owner               repository       package_name     watchers
hadley              ggplot2          ggplot2               803
hadley              devtools         devtools              775
yihui               knitr            knitr                 595
rstudio             shiny            shiny                 550
hadley              dplyr            dplyr                 265
--- 2014-09-01
owner               repository       package_name     watchers
hadley              devtools         devtools              866
rstudio             shiny            shiny                 811
yihui               knitr            knitr                 754
hadley              dplyr            dplyr                 471
rstudio             ggvis            ggvis                 305
--- 2014-12-01
owner               repository       package_name     watchers
rstudio             shiny            shiny                 942
hadley              devtools         devtools              915
yihui               knitr            knitr                 817
hadley              dplyr            dplyr                 591
hadley              plyr             plyr                  346
===== starred
--- 2013-11-01
owner               repository       package_name      starred
hadley              devtools         devtools              722
yihui               knitr            knitr                 508
hadley              plyr             plyr                  246
ramnathv            slidify          slidify               239
johnmyleswhite      ProjectTemplate  ProjectTemplate       208
--- 2014-01-01
owner               repository       package_name      starred
hadley              devtools         devtools              753
yihui               knitr            knitr                 547
rstudio             shiny            shiny                 467
hadley              plyr             plyr                  262
johnmyleswhite      ProjectTemplate  ProjectTemplate       220
--- 2014-03-01
owner               repository       package_name      starred
hadley              ggplot2          ggplot2               803
hadley              devtools         devtools              775
yihui               knitr            knitr                 595
rstudio             shiny            shiny                 550
hadley              dplyr            dplyr                 265
--- 2014-09-01
owner               repository       package_name      starred
hadley              devtools         devtools              866
rstudio             shiny            shiny                 811
yihui               knitr            knitr                 754
hadley              dplyr            dplyr                 471
rstudio             ggvis            ggvis                 305
--- 2014-12-01
owner               repository       package_name      starred
rstudio             shiny            shiny                 942
hadley              devtools         devtools              915
yihui               knitr            knitr                 817
hadley              dplyr            dplyr                 591
hadley              plyr             plyr                  346
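
Comparing the tables by eye is tedious, so a small helper can list which repositories enter or leave a top 5 between two dates. The sketch below is not part of the original notebook: the function name `ranking_changes` is an assumption, and the two lists are copied from the forks rankings over all repositories for 2014-03-01 and 2014-09-01, with the owner of hadley/devtools and hadley/adv-r read from the preceding row of the printed output.

```python
def ranking_changes(before, after):
    # Repositories present in only one of the two rankings, in ranking order
    entered = [repo for repo in after if repo not in before]
    left = [repo for repo in before if repo not in after]
    return entered, left

# Top 5 by forks, all repositories, copied from the output above
top_march = [('yihui', 'knitr'), ('hadley', 'ggplot2'), ('hadley', 'devtools'),
             ('hadley', 'adv-r'), ('rstudio', 'shiny')]
top_september = [('swirldev', 'swirl_courses'), ('hadley', 'httr'), ('rstudio', 'shiny'),
                 ('yihui', 'knitr'), ('genomicsclass', 'labs')]

entered, left = ranking_changes(top_march, top_september)
print('entered:', entered)
print('left:', left)
```

With the real data, the same comparison could be made from `list(topmost(df_agg, date, 'forks', 5).index)` for each pair of consecutive dates.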