Looking at repositories on GitHub, can we predict the success of a single repository using a model derived from a sample of the entire collection of repositories?
Which attributes of a repository are correlated with success, as defined by "popularity", i.e. the number of users starring a repository?
Working with the GitHub API, collect metadata about several thousand code repositories.
The GitHub API is a clean, easy-to-use source of data to analyze. Or at least it should be.
They know of what they speak! I was originally going to work with them, and they had emphasized that the data source they selected had already been cleaned and prepared for analysis. When I told them about this other project idea, they warned me that this would be a lot of work.
Did I listen?
Did I?
Current working version is at: https://github.com/dchud/substar/blob/master/fetch.py
Reviewing the code I wrote, I see that the data fetch script had to account for the following:
it takes a long time
This is mostly working now; the script is still running and is approaching 5,000 repositories.
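The single biggest reason it takes a long time is GitHub's API rate limiting. As a hedged sketch (not the actual fetch.py), here is the shape of a quota-aware fetch loop in Python; the header names come from GitHub's documented v3 REST API, while `fetch_repo` and the one-second cushion are illustrative choices.

```python
import time
import urllib.request

API_ROOT = "https://api.github.com"

def seconds_to_wait(headers, now=None):
    """Given GitHub rate-limit response headers, return how long to sleep.

    GitHub's v3 API reports X-RateLimit-Remaining (calls left in the
    current window) and X-RateLimit-Reset (epoch seconds when the
    window resets) on every response.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", now))
    if remaining > 0:
        return 0.0
    # Out of quota: sleep until the reset time, plus a small cushion.
    return max(0.0, reset_at - now) + 1.0

def fetch_repo(full_name):
    """Fetch one repository's metadata, pausing when the quota runs out."""
    # GitHub requires a User-Agent header on all API requests.
    req = urllib.request.Request("%s/repos/%s" % (API_ROOT, full_name),
                                 headers={"User-Agent": "substar-sketch"})
    with urllib.request.urlopen(req) as resp:
        time.sleep(seconds_to_wait(resp.headers))
        return resp.read()
```

With the unauthenticated limit of 60 requests per hour (5,000 when authenticated), a loop like this spends most of its life asleep, which matches the experience above.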
Iterate dozens of times over the available data, selecting out attributes that might be interesting. Understand those data structures and decide what level of detail to gather or summarize. The processing script is available at https://github.com/dchud/substar/blob/master/process.py
Key details to handle and prep for SAS:
Still iterating over this, prepping computed values to serve different models. Having more recent data will improve the variety of values and repo types.
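To make the prep step concrete, here is a hedged sketch (not the real process.py) of how one repo's nested API JSON might be flattened into the tab-separated columns analyzed below. The boolean and count fields are real attributes of the `/repos/:owner/:name` response; the computed fields like `mean_commits_per_week` are my assumption about what gets derived from the weekly commit-activity stats.

```python
import statistics

def flatten_repo(repo, weekly_commits):
    """Reduce one repo's nested API JSON to a flat row of model-ready values.

    `repo` is the dict returned by GitHub's /repos/:owner/:name endpoint;
    `weekly_commits` is a list of commit counts per week (e.g. from the
    commit activity stats). Field names mirror the columns used below.
    """
    return {
        "stargazers_count": repo.get("stargazers_count", 0),
        "forks_count": repo.get("forks_count", 0),
        "has_wiki": int(repo.get("has_wiki", False)),
        "has_issues": int(repo.get("has_issues", False)),
        "fork": int(repo.get("fork", False)),
        "size": repo.get("size", 0),
        # Computed summaries of weekly commit activity (assumed derivations).
        "mean_commits_per_week": statistics.mean(weekly_commits) if weekly_commits else 0.0,
        "std_commits_per_week": statistics.pstdev(weekly_commits) if weekly_commits else 0.0,
    }

def to_tsv_row(row, columns):
    """Serialize one flattened repo as a tab-separated line for R's read.csv."""
    return "\t".join(str(row[c]) for c in columns)
```

Booleans become 0/1 here so that `sum(has_wiki)` and friends work directly in R without coercion.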
You can explore the data yourself at https://github-demo.statwing.com/p0/datasets/dat_TTYqDcC6HdWXRJPwna5yHbReJgrnJPyx#workspaces/2178 (a new thing I just found).
A whirlwind tour of the data I've collected so far (still in progress!).
In [9]:
%load_ext rmagic
In [10]:
%%R
library(data.table)
# load the tab-separated repository metadata dump
input <- read.csv('/Users/dchud/projects/substar/stripe10000.txt', sep='\t')
dt <- data.table(input)
nrow(dt)
In [11]:
%R names(dt)
Out[11]:
In [12]:
%%R
attach(dt)
In [13]:
%R summary(stargazers_count)
Out[13]:
In [14]:
%%R
hist(log(stargazers_count), col='steelblue', xlab='log(stargazers_count)')
grid()
In [15]:
%R sum(has_downloads)
Out[15]:
In [21]:
%%R
# 10096 = total number of repositories fetched so far
bools <- c(sum(has_downloads), sum(has_wiki), sum(has_issues), sum(fork))
barplot(bools / 10096, names.arg=c('Downloads', 'Wiki', 'Issues', 'Fork'), ylim=c(0,1), col='steelblue', ylab='% of repositories', main='Overall percentages')
In [17]:
%R hist(log(size), xlab='log(size)', col='steelblue')
Out[17]:
In [18]:
%%R
par(mfrow=c(2, 2))
hist(log(network_count), col='steelblue')
hist(log(forks_count), col='steelblue')
hist(log(stargazers_count), col='steelblue')
hist(log(subscribers_count), col='steelblue')
In [19]:
%R pairs(~ network_count + forks_count + stargazers_count + subscribers_count)
In [20]:
%R pairs(~ network_count + mean_commits_per_week + num_weeks_since_change + num_contributors)
In [23]:
%R pairs(~ stargazers_count + std_commits_per_week + lines_added_per_week + num_weeks_since_change)