Problem statement

Looking at repositories on GitHub, can we predict the success of a single repository using a model derived from a sample of the entire collection of repositories?

Which attributes of a repository are correlated with success, defined here as "popularity": the number of users following a repository?

  • programming language choice
  • project management attributes such as issue tracking
  • patterns of changes to the repository over time

Data selection

Working with the GitHub API, collect metadata about several thousand code repositories.

The GitHub API is a clean, easy-to-use source of data to analyze. Or at least it should be.
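
For a sense of what "working with the GitHub API" involves: the v3 API pages through all public repositories by id, and each listing entry needs a follow-up request to get the interesting counts. Here is a minimal sketch using the requests library; the token handling and function name are illustrative assumptions, not an excerpt from the actual script.

import requests

# Hypothetical token; the real script's auth setup may differ.
HEADERS = {'Authorization': 'token YOUR_TOKEN_HERE'}

def fetch_repo_metadata(since_id=0, count=100):
    """Walk GitHub's public-repository listing, pulling full metadata
    for each entry. /repositories pages by repository id via `since`."""
    repos = []
    while len(repos) < count:
        listing = requests.get('https://api.github.com/repositories',
                               params={'since': since_id}, headers=HEADERS)
        listing.raise_for_status()
        for stub in listing.json():
            # Listing entries are abbreviated; counts like stargazers_count
            # require a second, per-repository request.
            repos.append(requests.get(stub['url'], headers=HEADERS).json())
            since_id = stub['id']
            if len(repos) >= count:
                break
    return repos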

Listen carefully to Colby and Wen

They know of what they speak! I was originally going to work with them, and they had emphasized that the data source they selected had already been cleaned and prepared for analysis. When I told them about this other project idea, they warned me that this would be a lot of work.

Did I listen?

Did I?

Several days working on a data fetch script

Key issues I ran into

Reviewing the code I wrote, I found the data fetch script had to account for all of the following (a sketch of the retry logic follows the list):

  • rate limiting
  • delayed responses (HTTP code 202)
  • empty repositories
  • result repository time skew
  • subquery paging
  • recovering from errors
  • long overall runtime

    This is mostly working now; the script is still running and is approaching 5,000 repositories.
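
To make those bullets concrete, here is a hedged sketch of the retry loop they imply. It is simplified and not lifted from the actual fetch script, but the 202 responses and rate-limit headers are real GitHub API behavior.

import time
import requests

def get_with_retries(url, headers, max_tries=5):
    """GET a GitHub API URL, honoring rate limits and 202 responses."""
    for attempt in range(max_tries):
        r = requests.get(url, headers=headers)
        if r.status_code == 202:
            # GitHub answers 202 while it computes statistics (commit
            # activity, contributors) in the background; wait, then retry.
            time.sleep(2 ** attempt)
            continue
        if r.status_code == 403 and r.headers.get('X-RateLimit-Remaining') == '0':
            # Rate limited: sleep until the limit window resets.
            reset = int(r.headers['X-RateLimit-Reset'])
            time.sleep(max(reset - time.time(), 0) + 1)
            continue
        r.raise_for_status()  # anything else unexpected: fail loudly
        return r
    raise RuntimeError('gave up on %s after %d tries' % (url, max_tries))

A wrapper like this has to sit around every API call, which is a big part of why the fetch takes so long.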

Data pre-processing

Iterate dozens of times over the available data, selecting attributes that might be interesting. Understand those data structures and decide what level of detail to gather or summarize. The processing script is available at https://github.com/dchud/substar/blob/master/process.py

Key details to handle and prep for SAS (illustrative sketches follow the list):

  • booleans, dates, computed values
  • data model for languages (many dozens!)
  • summarizing commit frequency and owner percentage
  • missing value representation
  • file format - tab-delimited .txt seems to work best
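
As a rough illustration of the booleans/dates, missing-value, and file-format bullets, here is a hypothetical helper in the spirit of process.py; the field handling is an assumption, not an excerpt.

import csv
from datetime import datetime

def prep_row(repo, out_fields):
    """Flatten one repository's metadata into analysis-friendly values:
    booleans as 0/1, ISO timestamps parsed to dates, missing values empty."""
    row = {}
    for field in out_fields:
        value = repo.get(field)
        if isinstance(value, bool):
            row[field] = int(value)            # True/False -> 1/0
        elif field.endswith('_at') and value:
            # GitHub timestamps look like '2013-04-01T12:34:56Z'
            row[field] = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ').date()
        elif value is None:
            row[field] = ''                    # missing value -> empty cell
        else:
            row[field] = value
    return row

def write_tsv(path, rows, out_fields):
    """Tab-delimited .txt: one header row, then one line per repository."""
    with open(path, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=out_fields, dialect='excel-tab')
        writer.writeheader()
        writer.writerows(rows)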

Still iterating over this, prepping computed values to serve different models. Having more recent data will help broaden the variety of values and repo types.
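
For the computed values, the general shape is summarizing the weekly commit series that GitHub's stats endpoints return. A hedged sketch, assuming all_weekly and owner_weekly are lists of weekly commit counts for all contributors and for the owner; the real script may compute these differently.

import math

def commit_summaries(all_weekly, owner_weekly):
    """Summarize weekly commit counts into the columns used below:
    totals, owner share, and mean/std commits per week."""
    n = len(all_weekly)
    total = sum(all_weekly)
    mean = float(total) / n if n else 0.0
    var = sum((c - mean) ** 2 for c in all_weekly) / n if n else 0.0
    return {
        'all_commits': total,
        'owner_commits': sum(owner_weekly),
        'owner_commits_percentage':
            100.0 * sum(owner_weekly) / total if total else 0.0,
        'mean_commits_per_week': mean,
        'std_commits_per_week': math.sqrt(var),
    }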

You can explore the data yourself at https://github-demo.statwing.com/p0/datasets/dat_TTYqDcC6HdWXRJPwna5yHbReJgrnJPyx#workspaces/2178 (a new thing I just found).

Simple data review

A whirlwind tour of the data I've collected so far (still in progress!).


In [9]:
%load_ext rmagic


The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

In [10]:
%%R
library(data.table)
input <- read.csv('/Users/dchud/projects/substar/stripe10000.txt', sep='\t')
dt = data.table(input)
nrow(dt)

In [11]:
%R names(dt)


Out[11]:
array(['id', 'owner', 'name', 'size', 'forks_count', 'network_count',
       'stargazers_count', 'subscribers_count', 'watchers_count',
       'open_issues_count', 'has_downloads', 'has_issues', 'has_wiki',
       'fork', 'created_at', 'updated_at', 'pushed_at', 'num_contributors',
       'num_weeks', 'lines_added', 'lines_added_per_week',
       'lines_subtracted', 'lines_subtracted_per_week',
       'num_weeks_since_change', 'all_commits', 'owner_commits',
       'owner_commits_percentage', 'mean_commits_per_week',
       'std_commits_per_week'], 
      dtype='|S25')

In [12]:
%%R
attach(dt)


The following objects are masked from dt (position 3):

    all_commits, created_at, fork, forks_count, has_downloads,
    has_issues, has_wiki, id, lines_added, lines_added_per_week,
    lines_subtracted, lines_subtracted_per_week, mean_commits_per_week,
    name, network_count, num_contributors, num_weeks,
    num_weeks_since_change, open_issues_count, owner, owner_commits,
    owner_commits_percentage, pushed_at, size, stargazers_count,
    std_commits_per_week, subscribers_count, updated_at, watchers_count

In [13]:
%R summary(stargazers_count)


Out[13]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   14.44    3.00 7778.00

In [14]:
%%R
# log(0) is -Inf for repos with zero stars; hist() drops non-finite values.
# grid() must run in the same cell as hist(), or it finds no open plot.
hist(log(stargazers_count), col='steelblue', xlab='log(stargazers_count)')
grid()

In [15]:
%R sum(has_downloads)


Out[15]:
array([9501], dtype=int32)

In [21]:
%%R 
bools <- c(sum(has_downloads), sum(has_wiki), sum(has_issues), sum(fork))
barplot(bools / nrow(dt), names=c('Downloads', 'Wiki', 'Issues', 'Fork'), ylim=c(0,1), col='steelblue', ylab='Proportion of repositories', main='Overall percentages')



In [17]:
%R hist(log(size), xlab='log(size)', col='steelblue')



In [18]:
%%R
par(mfrow=c(2, 2))
hist(log(network_count), col='steelblue')
hist(log(forks_count), col='steelblue')
hist(log(stargazers_count), col='steelblue')
hist(log(subscribers_count), col='steelblue')



In [19]:
%R pairs(~ network_count + forks_count + stargazers_count + subscribers_count)



In [20]:
%R pairs(~ network_count + mean_commits_per_week + num_weeks_since_change + num_contributors)



In [23]:
%R pairs(~ stargazers_count + std_commits_per_week + lines_added_per_week + num_weeks_since_change)


