Problem statement

Looking at repositories on GitHub, can we predict the success of a single repository using a model derived from a sample of the entire collection of repositories?

Which attributes of a repository are correlated with success, defined here as "popularity": the number of users following a repository?

  • programming language choice
  • project management attributes such as issue tracking
  • patterns of changes to the repository over time

Data selection

Working with the GitHub API, collect metadata about several thousand code repositories.

The GitHub API is a clean, easy-to-use source of data to analyze. Or at least it should be.
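
For a sense of what "working with the GitHub API" involves: the v3 API pages through all public repositories by id, and each listing entry needs a follow-up request to get the interesting counts. Here is a minimal sketch using the requests library; the token handling and function name are illustrative assumptions, not an excerpt from the actual script.

import requests

# Hypothetical token; the real script's auth setup may differ.
HEADERS = {'Authorization': 'token YOUR_TOKEN_HERE'}

def fetch_repo_metadata(since_id=0, count=100):
    """Walk GitHub's public-repository listing, pulling full metadata
    for each entry. /repositories pages by repository id via `since`."""
    repos = []
    while len(repos) < count:
        listing = requests.get('https://api.github.com/repositories',
                               params={'since': since_id}, headers=HEADERS)
        listing.raise_for_status()
        for stub in listing.json():
            # Listing entries are abbreviated; counts like stargazers_count
            # require a second, per-repository request.
            repos.append(requests.get(stub['url'], headers=HEADERS).json())
            since_id = stub['id']
            if len(repos) >= count:
                break
    return repos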

Listen carefully to Colby and Wen

They know of what they speak! I was originally going to work with them, and they had emphasized that the data source they selected had already been cleaned and prepared for analysis. When I told them about this other project idea, they warned me that this would be a lot of work.

Did I listen?

Did I?

Several days working on a data fetch script

Key issues I ran into

Reviewing the code I wrote, I found the data fetch script had to account for all of the following (a sketch of the retry logic follows the list):

  • rate limiting
  • delayed responses (HTTP code 202)
  • empty repositories
  • result repository time skew
  • subquery paging
  • recovering from errors
  • long overall runtime

    This is mostly working now; the script is still running and is approaching 5,000 repositories.
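
To make those bullets concrete, here is a hedged sketch of the retry loop they imply. It is simplified and not lifted from the actual fetch script, but the 202 responses and rate-limit headers are real GitHub API behavior.

import time
import requests

def get_with_retries(url, headers, max_tries=5):
    """GET a GitHub API URL, honoring rate limits and 202 responses."""
    for attempt in range(max_tries):
        r = requests.get(url, headers=headers)
        if r.status_code == 202:
            # GitHub answers 202 while it computes statistics (commit
            # activity, contributors) in the background; wait, then retry.
            time.sleep(2 ** attempt)
            continue
        if r.status_code == 403 and r.headers.get('X-RateLimit-Remaining') == '0':
            # Rate limited: sleep until the limit window resets.
            reset = int(r.headers['X-RateLimit-Reset'])
            time.sleep(max(reset - time.time(), 0) + 1)
            continue
        r.raise_for_status()  # anything else unexpected: fail loudly
        return r
    raise RuntimeError('gave up on %s after %d tries' % (url, max_tries))

A wrapper like this has to sit around every API call, which is a big part of why the fetch takes so long.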

Data pre-processing

Iterate dozens of times over the available data, selecting attributes that might be interesting. Understand those data structures and decide what level of detail to gather or summarize. The processing script is available at https://github.com/dchud/substar/blob/master/process.py

Key details to handle and prep for SAS (illustrative sketches follow the list):

  • booleans, dates, computed values
  • data model for languages (many dozens!)
  • summarizing commit frequency and owner percentage
  • missing value representation
  • file format - tab-delimited .txt seems to work best
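
As a rough illustration of the booleans/dates, missing-value, and file-format bullets, here is a hypothetical helper in the spirit of process.py; the field handling is an assumption, not an excerpt.

import csv
from datetime import datetime

def prep_row(repo, out_fields):
    """Flatten one repository's metadata into analysis-friendly values:
    booleans as 0/1, ISO timestamps parsed to dates, missing values empty."""
    row = {}
    for field in out_fields:
        value = repo.get(field)
        if isinstance(value, bool):
            row[field] = int(value)            # True/False -> 1/0
        elif field.endswith('_at') and value:
            # GitHub timestamps look like '2013-04-01T12:34:56Z'
            row[field] = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ').date()
        elif value is None:
            row[field] = ''                    # missing value -> empty cell
        else:
            row[field] = value
    return row

def write_tsv(path, rows, out_fields):
    """Tab-delimited .txt: one header row, then one line per repository."""
    with open(path, 'w') as f:
        writer = csv.DictWriter(f, fieldnames=out_fields, dialect='excel-tab')
        writer.writeheader()
        writer.writerows(rows)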

Still iterating over this, prepping computed values to serve different models. Having more recent data will help broaden the variety of values and repo types.
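
For the computed values, the general shape is summarizing the weekly commit series that GitHub's stats endpoints return. A hedged sketch, assuming all_weekly and owner_weekly are lists of weekly commit counts for all contributors and for the owner; the real script may compute these differently.

import math

def commit_summaries(all_weekly, owner_weekly):
    """Summarize weekly commit counts into the columns used below:
    totals, owner share, and mean/std commits per week."""
    n = len(all_weekly)
    total = sum(all_weekly)
    mean = float(total) / n if n else 0.0
    var = sum((c - mean) ** 2 for c in all_weekly) / n if n else 0.0
    return {
        'all_commits': total,
        'owner_commits': sum(owner_weekly),
        'owner_commits_percentage':
            100.0 * sum(owner_weekly) / total if total else 0.0,
        'mean_commits_per_week': mean,
        'std_commits_per_week': math.sqrt(var),
    }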

You can explore the data yourself at https://github-demo.statwing.com/p0/datasets/dat_TTYqDcC6HdWXRJPwna5yHbReJgrnJPyx#workspaces/2178 (a new thing I just found).

Simple data review

A whirlwind tour of the data I've collected so far (still in progress!).


In [9]:
%load_ext rmagic


The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

In [10]:
%%R
library(data.table)
input <- read.csv('/Users/dchud/projects/substar/stripe10000.txt', sep='\t')
dt = data.table(input)
nrow(dt)

In [11]:
%R names(dt)


Out[11]:
array(['id', 'owner', 'name', 'size', 'forks_count', 'network_count',
       'stargazers_count', 'subscribers_count', 'watchers_count',
       'open_issues_count', 'has_downloads', 'has_issues', 'has_wiki',
       'fork', 'created_at', 'updated_at', 'pushed_at', 'num_contributors',
       'num_weeks', 'lines_added', 'lines_added_per_week',
       'lines_subtracted', 'lines_subtracted_per_week',
       'num_weeks_since_change', 'all_commits', 'owner_commits',
       'owner_commits_percentage', 'mean_commits_per_week',
       'std_commits_per_week'], 
      dtype='|S25')

In [12]:
%%R
attach(dt)


The following objects are masked from dt (position 3):

    all_commits, created_at, fork, forks_count, has_downloads,
    has_issues, has_wiki, id, lines_added, lines_added_per_week,
    lines_subtracted, lines_subtracted_per_week, mean_commits_per_week,
    name, network_count, num_contributors, num_weeks,
    num_weeks_since_change, open_issues_count, owner, owner_commits,
    owner_commits_percentage, pushed_at, size, stargazers_count,
    std_commits_per_week, subscribers_count, updated_at, watchers_count

In [13]:
%R summary(stargazers_count)


Out[13]:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    1.00    1.00   14.44    3.00 7778.00

In [14]:
%%R
# log(0) is -Inf for repos with zero stars; hist() drops non-finite values.
# grid() must run in the same cell as hist(), or it finds no open plot.
hist(log(stargazers_count), col='steelblue', xlab='log(stargazers_count)')
grid()

In [15]:
%R sum(has_downloads)


Out[15]:
array([9501], dtype=int32)

In [21]:
%%R 
bools <- c(sum(has_downloads), sum(has_wiki), sum(has_issues), sum(fork))
barplot(bools / nrow(dt), names=c('Downloads', 'Wiki', 'Issues', 'Fork'), ylim=c(0,1), col='steelblue', ylab='Proportion of repositories', main='Overall percentages')



In [17]:
%R hist(log(size), xlab='log(size)', col='steelblue')



In [18]:
%%R
par(mfrow=c(2, 2))
hist(log(network_count), col='steelblue')
hist(log(forks_count), col='steelblue')
hist(log(stargazers_count), col='steelblue')
hist(log(subscribers_count), col='steelblue')



In [19]:
%R pairs(~ network_count + forks_count + stargazers_count + subscribers_count)



In [20]:
%R pairs(~ network_count + mean_commits_per_week + num_weeks_since_change + num_contributors)



In [23]:
%R pairs(~ stargazers_count + std_commits_per_week + lines_added_per_week + num_weeks_since_change)


