Aravindan Balan (aravindan@cs.ucla.edu)
Aravind Ganapathy (ag1990@cs.ucla.edu)
Gautham Badhrinathan (gotemb@cs.ucla.edu)
Manoj Thakur (manojrthakur@cs.ucla.edu)
Praphull Kumar (praphull@cs.ucla.edu)
GitHub is a rich source of skillful developers for your next startup; it is also a good place to figure out which language you would want to use for it.
But the questions to ask are:
Developers
To gauge a developer's skill in a language, we looked at his commits individually, summing the stars of each repository he contributed to, weighted by the proportion of that repository written in the language. The idea is that contributions to a highly starred repository must meet some standard of quality, or they would be rolled back. Using these metrics we address the questions posed above for developers.
This metric indicates how much a developer has contributed to a particular language.
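As a minimal sketch of this computation on toy data (the column names here are illustrative, not our actual schema):

library(plyr)
# Hypothetical per-commit rows: stars of the repository committed to,
# and the fraction of that repository written in the language
commits <- data.frame(
    author     = c("alice", "alice", "bob"),
    language   = c("Ruby", "Ruby", "Java"),
    repo_stars = c(200, 50, 120),
    lang_share = c(0.8, 1.0, 0.25)
)
# skill score: stars weighted by language share, summed per developer/language
ddply(commits, c("author", "language"), summarize,
      skill = sum(repo_stars * lang_share))
# alice/Ruby: 200*0.8 + 50*1.0 = 210 ; bob/Java: 120*0.25 = 30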
We also analyze the relationship between a repository's popularity and its attributes. This analysis indicates which factors contribute to a high number of followers, forks, pull requests and watchers for a repository.
We analyzed data from GitHub, collected using the GitHub APIs [1]. The data is stored in the following format:
The project is divided into three main stages:
The data collected contains three main parts:
Apart from the entities mentioned above, we require additional information derived from those collections; it is used in the analysis stage.
We used several packages for reading and manipulating the data.
RMongo and rmongodb are packages used to read data from MongoDB:
rmongodb - http://cran.r-project.org/web/packages/rmongodb/rmongodb.pdf
Rmongo - http://cran.r-project.org/web/packages/RMongo/RMongo.pdf
We also used the plyr package in R to perform split, apply and combine operations on the data frames we created.
We found plyr very useful and powerful for performing those operations on our large dataset.
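As a small illustration of the split-apply-combine pattern on toy data:

library(plyr)
toy <- data.frame(language = c("R", "R", "Java"), stars = c(10, 20, 5))
# split by language, apply mean, combine the results into one data frame
ddply(toy, "language", summarize, avg_stars = mean(stars))
#   language avg_stars
# 1     Java         5
# 2        R        15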
In [14]:
%load_ext rmagic
import rpy2 as Rpy
In [15]:
%%R
###################### libraries used ##############################################
not.installed = function(package_name) !is.element(package_name, installed.packages()[,1])
# install any missing CRAN packages ("graphics" and "MASS" ship with R and are excluded)
for (pkg in c("rmongodb", "reshape2", "data.table", "plyr", "ggplot2",
              "RMongo", "hash", "fpc", "cluster", "stringr", "rjson", "HSAUR"))
    if (not.installed(pkg))
        install.packages(pkg)
library("fpc")
library("graphics")
library("stringr")
library("rjson")
library("ggplot2")
library("rmongodb")
library("reshape2")
library("MASS")
library("plyr")
library("RMongo")
library("hash")
library("data.table")
library("cluster")
library("HSAUR")
In [16]:
%%R
##################### Utility Functions ###########################
# Look up the weighted star score (column 8, popularity) for a given
# repository/language pair in the ddply_sum_lineCount frame built below.
find_stars_lang <- function(df, repo, lang)
{
    lang <- tolower(lang)
    ret <- 0
    for(i in 1:nrow(df))
    {
        if(toString(df[i,1]) == repo && lang == tolower(toString(df[i,5])))
        {
            ret <- df[i,8]
        }
    }
    ret
}
#find_forks_lang(ddply_sum_lineCount, "537fc5b7280ef15170b56d3b", "js")
# Analogous lookup for the weighted fork score (column 9, accessibility);
# the language comparison is lower-cased here too, for consistency.
find_forks_lang <- function(df, repo, lang)
{
    lang <- tolower(lang)
    ret <- 0
    for(i in 1:nrow(df))
    {
        if(toString(df[i,1]) == repo && lang == tolower(toString(df[i,5])))
        {
            ret <- df[i,9]
        }
    }
    ret
}
# Extract a developer's contribution ratio for the given language from a
# contributionratios BSON document.
find_lang_contrib_ratio <- function(bs, lang)
{
lang = tolower(lang)
c_ratios <- mongo.bson.value(bs,"contribution_ratio")
ratio <- 0
for(i in 1:length(c_ratios))
{
if(!is.null(c_ratios[i][[1]]$language) && tolower(c_ratios[i][[1]]$language) == lang)
{
ratio <- c_ratios[i][[1]]$ratio
}
}
ratio
}
#################################################################################
x=1 # dummy assignment to suppress printed output from this cell
The analysis for a particular language is based on two main aspects:
Plotting languages in a 2-dimensional space, where one dimension is avg. popularity and the other is avg. accessibility, gives an indication of how effective a particular language is.
In [17]:
%%R -w 700 -h 500
##find all unique languages
get_all_lang <- function()
{
mongo <- mongoDbConnect("dataset", host="localhost")
output <- dbAggregate(mongo, "commits", c('{ "$unwind" : "$changes"}',
' { "$project" : { "lang" : "$changes.language" } } ',
' { "$group" : { "_id" : "$lang" } } '))
Languages_All <- c()
for(i in 1:length(output))
{
bson <- mongo.bson.from.JSON(output[i])
lang <- mongo.bson.value(bson,"_id")
if(!is.null(lang) && nchar(lang) > 0)
Languages_All <- c(Languages_All, lang)
}
unique_languages <- unique(Languages_All)
unique_languages
}
#######
# contains all the languages
unique_languages <- get_all_lang()
print(unique_languages)
print(length(unique_languages))
We create a data frame containing the repository and the language information (stars, forks, etc.) for each repository and language pair. Then we use split, apply and combine operations to extract useful information from the data. We calculate the average popularity and average accessibility of each language and plot them.
In [25]:
%%R
########################### How effective is a language ? ##############################
users = data.frame(stringsAsFactors = FALSE)
repos = data.frame(stringsAsFactors = FALSE)
mongo_rmongodb <- mongo.create(host = "localhost")
mongo_rmongo <- mongoDbConnect("dataset", host="localhost")
DBNS_users = "dataset.users"
DBNS_repos = "dataset.repositories"
if (mongo.is.connected(mongo_rmongodb)) {
repos = mongo.find.all(mongo_rmongodb, DBNS_repos, list(done=TRUE))
}
## create an empty dataframe and populate the same with repos and language data
repos_by_languages <- data.frame(stringsAsFactors = FALSE)
for(i in 1: nrow(repos))
{
repos_one <- repos[i,]
languages_list <- repos_one$languages
totalLineCount <- 0
if(!is.na(repos_one$stars) && repos_one$stars > 5 && !is.na(length(languages_list)) && class(repos_one$stars)!= 'mongo.oid')
{
for(j in 1:length(languages_list))
{
totalLineCount = totalLineCount + languages_list[[j]][['lineCount']]
}
for(j in 1:length(languages_list))
{
repos_by_languages <- rbind(repos_by_languages,
    data.frame(id = mongo.oid.to.string(repos[i,1][[1]]),
        name = repos_one$fullName,
        stars = repos_one$stars,
        forks = repos_one$forks,
        language = languages_list[[j]]['language'],
        linecount = languages_list[[j]][['lineCount']],
        totalLineCount = totalLineCount))
}
}
}
cat("Table data : \n\n\n")
print(head(repos_by_languages))
cat("\n\n\n")
cat("Dimensions : \n")
print(dim(repos_by_languages))
##### Use split, apply and combine to find the popularity and accessibility and proportional line count
ddply_sum_lineCount = ddply( repos_by_languages,
c("id","name", "stars","forks","language","linecount"),
summarize,
prop_lineCount = linecount/totalLineCount,
popularity = stars * prop_lineCount,
accessibility = forks * prop_lineCount
)
cat("Table data : \n\n\n")
print(head(ddply_sum_lineCount))
cat("\n\n\n")
cat("Dimensions : \n")
print(dim(ddply_sum_lineCount))
##### Use split, apply and combine to find the average popularity and accessibility
ddply_Overall = ddply( ddply_sum_lineCount,
c("language"),
summarize,
avg_popularity = mean(popularity),
avg_accessibility = mean(accessibility)
)
cat("Table data : \n\n\n")
print(head(ddply_Overall))
cat("\n\n\n")
cat("Dimensions : \n")
print(dim(ddply_Overall))
#ggplot(ddply_Overall,aes(x= avg_popularity, y = avg_accessibility, colour = language)) + geom_point() +
# geom_text(aes(label=language),hjust=0, vjust=0)
x=1 # dummy assignment to suppress printed output from this cell
Here we attempt to quantify a language by its accessibility as well as its effectiveness. As mentioned before, we use forks as a measure of accessibility and stars as a measure of quality/effectiveness. For each language we assigned an avg. accessibility and avg. effectiveness score; these were computed by analyzing the repositories at least partly written in that language, using the proportion of the repository written in that language as a weight, and taking the product of that weight with the number of forks/stars for accessibility and effectiveness respectively. We then averaged these across repositories.
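To make the weighting concrete, here is a minimal sketch on made-up numbers, mirroring the repos_by_languages and ddply_sum_lineCount computation above:

library(plyr)
toy <- data.frame(
    id = c("r1", "r1", "r2"),
    language = c("Ruby", "Java", "Ruby"),
    stars = c(100, 100, 40), forks = c(10, 10, 4),
    linecount = c(750, 250, 500), totalLineCount = c(1000, 1000, 500)
)
weighted <- ddply(toy, c("id", "language", "stars", "forks"), summarize,
    w = linecount / totalLineCount,   # proportion of the repo in this language
    popularity = stars * w,
    accessibility = forks * w)
# one score per language, averaged across repositories
ddply(weighted, "language", summarize,
    avg_popularity = mean(popularity),
    avg_accessibility = mean(accessibility))
# e.g. Ruby: popularity (75 + 40)/2 = 57.5, accessibility (7.5 + 4)/2 = 5.75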
In [ ]:
%%R
ggplot(ddply_Overall,aes(x= log(avg_popularity), y = log(avg_accessibility), colour = language)) + geom_point() +
geom_text(aes(label=language),hjust=0, vjust=0)
We define the domain to contain two languages, namely Java and Ruby.
In [ ]:
%%R
###############utility functions############################
# add `value` to the running total stored under `key` in hash `h`
update_hash_value <- function(h, key, value){
if(!has.key(key,h))
hash:::.set(h,keys=key,values=value)
else
{
temp <- values(h, keys=key)
hash:::.set(h,keys=key,values=(value+temp))
}
}
###########################################################
mongo <- mongoDbConnect("dataset", host="localhost")
contribution_df <- dbGetQuery(mongo, "contributionratios", '{$or:[{"contribution_ratio": {$elemMatch:{"language":"Java"}}},{"contribution_ratio": {$elemMatch:{"language":"Ruby"}}}]}', skip=0, limit=100000)
domain <- c("Java","Ruby")
# define a threshold per language; keep only contribution ratios above it
threshold <- c(3.622832e-012, 3.622832e-012)
contribution_ratios <- contribution_df[,c("author","contribution_ratio"), drop=FALSE]
user_contribution_list <- c()
for(i in 1:nrow(contribution_ratios))
{
Contribution_jsonObject <- fromJSON( contribution_ratios[2][[1]][i])
for(j in 1:length(Contribution_jsonObject))
{
if("language" %in% names(Contribution_jsonObject[[j]]))
{
curr_language <- Contribution_jsonObject[[j]]$language
if(tolower(curr_language) %in% tolower(domain))
{
Language_index <- match(tolower(curr_language),tolower(domain))
threshold_language <- threshold[Language_index]
if(Contribution_jsonObject[[j]]$ratio > threshold_language)
{
user_contribution_list <- c(user_contribution_list,contribution_ratios[1][[1]][i] )
}
}
}
}
}
user_contribution_unique_list <- unique(user_contribution_list)
author_ratios <- data.frame(stringsAsFactors = FALSE)
h_stars_ruby <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
h_forks_ruby <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
h_stars_java <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
h_forks_java <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
for(authorId in user_contribution_unique_list)
{
cursor <- mongo.find(mongo_rmongodb, ns = "dataset.usercommitactvities", query=list(author=mongo.oid.from.string(authorId), language=list('$in'=c('Java','Ruby'))), fields=list(repository=1L,language=1L) )
user_contributions_ratio <- mongo.find(mongo_rmongodb, ns = "dataset.contributionratios", query=list(author=mongo.oid.from.string(authorId)),fields=list(contribution_ratio=1L))
ratio_lang <- 0
ratio_ruby <- 0
ratio_java <- 0
if (mongo.cursor.next(user_contributions_ratio))
{
ratio <- mongo.cursor.value(user_contributions_ratio)
##ratio <- mongo.bson.value( ratio, "contribution_ratio")
ratio_ruby <- find_lang_contrib_ratio(ratio, "Ruby")
ratio_java <- find_lang_contrib_ratio(ratio, "Java")
}
while (mongo.cursor.next(cursor))
{
item <- mongo.cursor.value(cursor)
oid <- mongo.bson.value(item, "repository")
lang <- mongo.bson.value(item, "language")
## get stars and forks from the user commit activities
stars <- find_stars_lang(ddply_sum_lineCount, mongo.oid.to.string(oid) , lang)
forks <- find_forks_lang(ddply_sum_lineCount, mongo.oid.to.string(oid) , lang)
if(lang == "Ruby")
{
stars <- stars * ratio_ruby
forks <- forks * ratio_ruby
update_hash_value(h_forks_ruby, authorId, forks)
update_hash_value(h_stars_ruby, authorId, stars)
}
else if(lang == "Java")
{
stars <- stars * ratio_java
forks <- forks * ratio_java
update_hash_value(h_forks_java, authorId, forks)
update_hash_value(h_stars_java, authorId, stars)
}
#cat(authorId, "-",mongo.oid.to.string(oid),"-",stars,"-",forks,"-",lang,"\n")
cat(".")
}
mongo.cursor.destroy(cursor)
}
forks_lang_ruby <- c()
for(vals in values(h_forks_ruby))
forks_lang_ruby <- c(forks_lang_ruby, vals)
stars_lang_ruby <- c()
for(vals in values(h_stars_ruby))
stars_lang_ruby <- c(stars_lang_ruby, vals)
forks_lang_java <- c()
for(vals in values(h_forks_java))
forks_lang_java <- c(forks_lang_java, vals)
stars_lang_java <- c()
for(vals in values(h_stars_java))
stars_lang_java <- c(stars_lang_java, vals)
plot(stars_lang_ruby,forks_lang_ruby,ylim =c(0,25),xlim=c(0,100),xlab="stars",ylab="forks", col="red")
par(new=TRUE)
plot(stars_lang_java,forks_lang_java,ylim =c(0,25),xlim=c(0,100),xlab="stars",ylab="forks", col="blue")
l <- legend( "topright", inset = c(0,0.4)
, cex = 1.5
, bty = "n"
, legend = c("Ruby", "Java")
, text.col = c("red", "blue")
, pt.bg = c("red","blue")
, pch = c(21,22)
)
title("Measure of forks vs stars for both Java and Ruby")
plot(stars_lang_ruby,stars_lang_java,ylim =c(0,25),xlim=c(0,100),xlab="Stars for Ruby",ylab="Stars for Java", col="red")
title("Measure of Stars for Ruby vs Stars for Java ")
Next we took a developer-centric approach, trying to determine the quality and accessibility of code written by developers in a certain domain. To link the stars/forks of repositories to the contributions of individual developers we used the contribution-ratio metric: we took the ratio of a developer's contribution in a language to the overall amount of code he has written, multiplied it by the proportion of code written in that language in a repository, and then by the repository's stars/forks. For a given developer and language we sum these products across the repositories he has contributed to in that language.
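Schematically, for one developer and one language (made-up numbers, illustrative names):

# contribution ratio: developer's Ruby code / all code he has written
contrib_ratio <- 0.6
repo_stars <- c(100, 40)          # stars of repos he contributed Ruby code to
repo_lang_share <- c(0.75, 1.0)   # fraction of each repo written in Ruby
# weighted sum across repositories, as described above
dev_ruby_stars <- sum(repo_stars * repo_lang_share * contrib_ratio)
print(dev_ruby_stars)             # 100*0.75*0.6 + 40*1.0*0.6 = 69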
In [ ]:
print("User contributed to either JS or CSS")
print(length(unique(user_contribution_list)))
print("User contributed to both")
print(length((user_contribution_list)) - length(unique(user_contribution_list)))
print("537ff216280ef15170b59ba5" %in% user_contribution_list)
user_contribution_unique_list <- unique(user_contribution_list)
Either JavaScript or CSS : 9553 users
Both JavaScript and CSS : 1872 users
Similarly, for Java and Ruby:
Either Java or Ruby : 11313 users
Both Java and Ruby : 163 users
In [ ]:
%%R
###### forks vs stars for all languages
# For every repository: average the stored developer weights of its
# contributors and pair that with the repository's stars and forks.
find_avg_userweight_for_repr <- function() {
cursor <- mongo.find(mongo_rmongodb, ns = "dataset.useravgweights", fields=list(author=1L,weight=1L))
h <- hash(keys=1,values=1)
while (mongo.cursor.next(cursor))
{
item <- mongo.cursor.value(cursor)
author <- mongo.bson.value(item, "author")
weight <- mongo.bson.value(item, "weight")
#if(!has.key(author, h))
#{
hash:::.set(h, keys=author,values=weight)
#}
}
cursor <- mongo.find(mongo_rmongodb, ns = "dataset.repositories", fields=list("_id"=1L,"contributors.user"=1L,stars=1L,forks=1L))
stars_arr <- c()
forks_arr <- c()
sum_arr <- c()
id_arr <- c()
while (mongo.cursor.next(cursor))
{
item <- mongo.cursor.value(cursor)
repo <- mongo.bson.value(item, "_id")
contrib <- mongo.bson.value(item, "contributors")
stars <- mongo.bson.value(item, "stars")
forks <- mongo.bson.value(item, "forks")
stars_arr <- c(stars_arr, stars)
forks_arr <- c(forks_arr, forks)
sum <- 0
count <- 0
for(i in 1:length(contrib))
{
sum = sum + values(h, keys=contrib[i][[1]]$user)[[1]]
count = count + 1
}
sum_arr <- c(sum_arr, sum/count)
id_arr <- c(id_arr, mongo.oid.to.string(repo))
}
df <- data.frame(id_arr, stars_arr, forks_arr, sum_arr)
df
}
df <- find_avg_userweight_for_repr()
fit <- lm(forks_arr ~ stars_arr, data=df)
# regress forks on stars and overlay the fitted line on the scatter plot
ggplot(df, aes(x=stars_arr, y=forks_arr)) + geom_point() +
    stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE, colour = "black")
Time-series analysis of contributions to various languages reveals interesting trends in the way languages evolve over time. The data required for this analysis is partly retrieved using the map-reduce framework in MongoDB and partly using code written in R. The map-reduce code groups commits by language and emits, for each language, a time series of (date, line changes) pairs, which we then sort and plot in R.
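The map-reduce itself runs inside MongoDB; the following is only a rough R-side sketch, on toy data with illustrative names, of the per-language, per-month aggregation it produces:

library(plyr)
commits <- data.frame(
    language = c("CoffeeScript", "CoffeeScript", "Java"),
    date = as.Date(c("2013-01-15", "2013-01-20", "2012-06-01")),
    changes = c(120, 80, 300)
)
commits$month <- format(commits$date, "%b,%Y")  # same month buckets as the plot below
ddply(commits, c("language", "month"), summarize,
    total_changes = sum(changes))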
Some interesting trends:
While trying to identify trends in the commit activity for languages over time, we found an interesting trend for CoffeeScript: its commit activity was initially low, but there was a sudden peak in 2013, the year CoffeeScript was ranked 29th among languages. The plots below show the trends in commit activity over time for all languages, for the major programming languages, and for CoffeeScript. We plot the number of commits for each language over time.
In [ ]:
%%R
###
#
# Analysis of language evolution over time
#
###
temporal_lang_performance <- function()
{
cursor <- mongo.find(mongo_rmongodb, ns = "dataset.temporalcommits_new", sort=list("value.temporal.$.date"=1L) ,fields=list(value=1L))
h <- hash(keys=1,values=1)
while (mongo.cursor.next(cursor))
{
item <- mongo.cursor.value(cursor)
val <- mongo.bson.value(item, "value")
lang <- val$lang
if(!is.null(lang))
{
temporal <- val$temporal
changes_arr <- c()
date_arr <- c()
str_arr <- c()
if(length(temporal) == 0)
next
for(i in 1:length(temporal))
{
dt <- strptime(temporal[[i]]$date,format="%a, %d %b %Y %H:%M:%S GMT")
changes <- temporal[[i]]$changes
date_arr <- append(date_arr , as.Date(dt)) #format(as.Date(dt), "%b,%Y")
changes_arr <- append(changes_arr, changes)
str_arr <- append(str_arr, format(as.Date(dt), "%b,%Y"))
}
print(class(date_arr))
df <- data.frame(date_arr , changes_arr, str_arr)
hash:::.set(h,keys=lang, values=df)
}
}
h
}
ret <- temporal_lang_performance()
p <- ggplot()
mat_global <- data.frame()
lang <- keys(ret)
for(key in 1:length(lang)) {
if(!(key %in% c(19)))   # keep only one selected language index; drop this filter to plot all languages
next
if(length(values(ret, keys=lang[key])) < 3)
next
x <- values(ret, keys=lang[key])[[1]]
y <- values(ret, keys=lang[key])[[2]]
z <- values(ret, keys=lang[key])[[3]]
w <- rep(lang[key], length(x))
mat <- data.frame(x,y,z,w)
mat <- mat[order(mat[,1]), ]
mat <- data.frame(mat)
mat_global <- rbind(mat_global, mat)
}
mat_global <- mat_global[order(mat_global[,1]), ]
library(grid)
ggplot(mat_global, aes(x=x, y=y,
group=interaction(w),
colour=w)) +
geom_line() + scale_y_continuous(limits=c(0, 10000)) + theme(legend.key.size = unit(0.4, "cm"))
### End temporal analysis
We plot the number of commits for the most heavily used languages over time.
We have plotted the temporal data for CoffeeScript and see a peak in 2013. As of 2013, CoffeeScript was the 12th most popular language on GitHub in terms of project count [5]; in the same year it was also ranked 29th among languages by the number of questions tagged on Stack Overflow.
We performed linear regression on the data to find the relationship between stars and forks. The results were not conclusive, as we could not find a correlation between the two entities. The results are below:
In [ ]:
%%R
## find_avg_userweight_for_repr() is defined in the earlier cell; reuse it here
df <- find_avg_userweight_for_repr()
fit <- lm(forks_arr ~ stars_arr, data=df)
print(summary(fit))
In [ ]:
%%R
library("RMongo")
library("rmongodb")
library("plyr")
library("ggplot2")
#unique_languages <- get_all_lang()   # get_all_lang() is defined in an earlier cell
#print(length(unique_languages))
mongo_rmongodb <- mongo.create(host = "localhost")
user_commit_activity <- data.frame(stringsAsFactors = FALSE)
user_commit_act <- mongo.find(mongo_rmongodb, ns="dataset.usercommitactvities", fields=list(language=1L,repository=1L,author=1L,changes=1L))
while (mongo.cursor.next(user_commit_act))
{
item <- mongo.cursor.value(user_commit_act)
repository <- mongo.oid.to.string(mongo.bson.value(item, "repository"))
author <- mongo.oid.to.string(mongo.bson.value(item, "author"))
language <- mongo.bson.value(item, "language")
changes <- mongo.bson.value(item, "changes")
if (class(repository) != "NULL" && class(author) != "NULL" && class(language) != "NULL" && class(changes) != "NULL") {
user_commit_activity <- rbind(user_commit_activity, data.frame(author=author,repository=repository,language=language,changes=changes))
}
}
ddply_lang_project = ddply(user_commit_activity,
c("language", "repository"),
summarize,
temp = length(author)
)
print(head(ddply_lang_project))
ddply_projects_per_lang = ddply( ddply_lang_project,
c("language"),
summarize,
repo_count = length(repository)
)
print(head(ddply_projects_per_lang))
########
ddply_changes_per_user = ddply( user_commit_activity,
c("language", "author"),
summarize,
total_changes = sum(changes)
)
print(head(ddply_changes_per_user))
ddply_changes_per_user_lang = ddply( ddply_changes_per_user,
c("language"),
summarize,
avg_commits = mean(total_changes)
)
print(head(ddply_changes_per_user_lang))
########
ddply_dev_lang = ddply( user_commit_activity,
c("language", "author"),
summarize,
temp=length(repository)
)
print(head(ddply_dev_lang))
ddply_dev_per_lang = ddply( ddply_dev_lang,
c("language"),
summarize,
dev_count = length(author)
)
print(head(ddply_dev_per_lang))
#######
plot(x=ddply_projects_per_lang$repo_count,y=ddply_projects_per_lang$language,type="l")
plot(ddply_changes_per_user_lang)
plot(ddply_dev_per_lang)
par(las=2)
par(mar=c(5,8,4,2))
ddply_projects_per_lang_matrix <- as.matrix(ddply_projects_per_lang[, "repo_count"])
row.names(ddply_projects_per_lang_matrix) <- ddply_projects_per_lang[, "language"]
barplot(t(ddply_projects_per_lang_matrix),horiz=TRUE,cex.names=0.8,xlim=c(0,700),xlab="Number of Repositories", main="Number of Repositories for each Language")
grid(nx=7,ny=1,lty=1)
###############
par(las=2)
par(mar=c(5,8,4,2))
ddply_changes_per_user_lang_matrix <- as.matrix(ddply_changes_per_user_lang[, "avg_commits"])
row.names(ddply_changes_per_user_lang_matrix) <- ddply_changes_per_user_lang[, "language"]
barplot(t(ddply_changes_per_user_lang_matrix),horiz=TRUE,xlim=c(0,150000),cex.names=0.8,xlab="Average Line Changes", main="Average Line Changes for each Language")
grid(nx=5,ny=1,lty=1)
##########
par(las=2)
par(mar=c(5,8,4,2))
ddply_dev_per_lang_matrix <- as.matrix(ddply_dev_per_lang[, "dev_count"])
row.names(ddply_dev_per_lang_matrix) <- ddply_dev_per_lang[, "language"]
barplot(t(ddply_dev_per_lang_matrix),horiz=TRUE,xlim=c(0,5000),cex.names=0.8,xlab="Number of Developers", main="Number of Developers for each Language")
grid(nx=5,ny=1,lty=1)
In [ ]:
%%R
# Resolve a user ObjectId to the stored GitHub username
getUserName <- function(userId)
{
username <- ""
users <- mongo.find(mongo_rmongodb, ns = "dataset.users", query=list("_id"=mongo.oid.from.string(userId)),fields=list(username=1L))
while (mongo.cursor.next(users))
{
item <- mongo.cursor.value(users)
username <- mongo.bson.value(item, "username")
}
username
}
# Resolve a repository ObjectId to its full name
getRepoName <- function(repoId)
{
reponame <- ""
repos <- mongo.find(mongo_rmongodb, ns = "dataset.repositories", query=list("_id"=mongo.oid.from.string(repoId)),fields=list(fullName=1L))
while (mongo.cursor.next(repos))
{
item <- mongo.cursor.value(repos)
reponame <- mongo.bson.value(item, "fullName")
}
reponame
}
#################
ddply_auth_project <- data.frame(stringsAsFactors = FALSE)
ddply_auth_project = ddply(user_commit_activity,
c("author", "repository"),
summarize,
temp = length(language)
)
print(head(ddply_auth_project))
ddply_projects_per_author = ddply( ddply_auth_project,
c("author"),
summarize,
repo_count = length(repository)
)
print(head(ddply_projects_per_author))
par(las=2)
par(mar=c(5,8,4,2))
ddply_projects_per_author_matrix <- as.matrix(ddply_projects_per_author[, "repo_count"])
#row.names(ddply_projects_per_author_matrix) <- getUserName(ddply_projects_per_author[, "author"])
barplot(t(ddply_projects_per_author_matrix),horiz=TRUE,cex.names=1.0,xlim=c(0,25),xlab="Number of Repositories", main="Number of Repositories contributed by each user")
grid(nx=7,ny=1,lty=1)
####
ddply_auth_project = ddply(user_commit_activity,
c("author", "repository"),
summarize,
temp = length(language)
)
ddply_auth_project$temp <- factor(ddply_auth_project$temp)
print(head(ddply_auth_project))
ddply_projects_per_repos = ddply( ddply_auth_project,
c("repository"),
summarize,
author_count = length(author)
)
print(head(ddply_projects_per_repos))
par(las=2)
par(mar=c(5,8,4,2))
ddply_projects_per_repos_matrix <- as.matrix(ddply_projects_per_repos[, "author_count"])
#row.names(ddply_projects_per_author_matrix) <- getUserName(ddply_projects_per_author[, "author"])
barplot(t(ddply_projects_per_repos_matrix),horiz=TRUE,cex.names=1.0,xlim=c(0,500),xlab="Number of Contributors", main="Number of Contributors for each Repository")
grid(nx=7,ny=1,lty=1)
We clustered users based on their contributions to languages. The figure shows the various clusters with their cluster ids.
In [ ]:
%%R
user_contribution_cursor <- mongo.find(mongo_rmongodb, ns = "dataset.contributionratios" )
user_lang_contribution_df <- matrix(0,nrow=1,ncol=length(unique_languages))
author_id_vector <-c()
while (mongo.cursor.next(user_contribution_cursor))
{
item <- mongo.cursor.value(user_contribution_cursor)
authorId <- mongo.bson.value(item, "author")
author_id_vector <- c(author_id_vector, mongo.oid.to.string(authorId))
contributions_list <- mongo.bson.value(item, "contribution_ratio")
#print(class(contributions_list))
language_zeros <- matrix(0,nrow=1,ncol=length(unique_languages))
#print(length(language_zeros))
for(contribution in contributions_list)
{
#print((contribution$language))
#print ("------")
#print (contribution)
if(class(contribution$language) != "NULL")
{
# print(match(tolower(contribution$language),tolower(unique_languages)))
#print (contribution$language)
#print (contribution$ratio)
if (class(contribution$ratio) != "NULL") {
language_zeros[match(tolower(contribution$language),tolower(unique_languages))] <- contribution$ratio
}
}
}
#print(language_zeros)
user_lang_contribution_df <- rbind(user_lang_contribution_df,language_zeros)
#user_lang_contribution_df <- rbind(user_lang_contribution_df, data.frame(author=mongo.oid.to.string(authorId),languages=language_zeros))
}
user_lang_contribution_df <- user_lang_contribution_df[-1,]
user_lang_contribution_dataframe <- data.frame(user_lang_contribution_df)
row.names(user_lang_contribution_dataframe) <- author_id_vector
print(rownames(user_lang_contribution_dataframe))
print(dim(user_lang_contribution_dataframe))
cl <- kmeans(user_lang_contribution_dataframe, 5)
plotcluster(user_lang_contribution_df, cl$cluster, pointsbyclvecd = TRUE)
cluster_vector <- cl$cluster
y <- which(cluster_vector==3, arr.ind=TRUE)
print(length(y))
print(cl$size)
Analysis: IPython notebook construction effort
Data analysis code: Balan, Ganapathy
Data analysis methods: Badhrinathan, Balan, Thakur, Kumar
Documentation:
Presentation effort: Thakur, Balan
Project summary effort: Badhrinathan, Balan, Thakur, Kumar
Issues with data collection:
The dataset was obtained mainly through the GitHub APIs. A major limitation of these APIs is that they are rate limited, so we can only make a limited number of HTTP requests to the server. Since our data collection involves multiple API requests, we had to work around this problem. We developed a new library called job queue [6], which works on the producer-consumer model. The consumers are rate limited, which keeps the data collection within the limit, and all consumers are authenticated with the GitHub server.
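The library itself is written in CoffeeScript [6]; the following is only an illustrative R sketch of the rate-limiting idea, with hypothetical job functions:

# Minimal sketch of a rate-limited consumer draining a queue of jobs.
# Each job is a function issuing one authenticated API request (hypothetical).
run_rate_limited <- function(jobs, requests_per_hour = 5000) {
    delay <- 3600 / requests_per_hour   # seconds to wait between requests
    results <- vector("list", length(jobs))
    for (i in seq_along(jobs)) {
        results[[i]] <- jobs[[i]]()     # consume one job from the queue
        Sys.sleep(delay)                # stay under the rate limit
    }
    results
}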
The data we get from the APIs was not in a format suitable for the analysis we intended to perform, so we introduced a preprocessing stage with the following transformations: 1) MongoDB map-reduce, 2) MongoDB aggregation, 3) R ddply. The information extracted by these steps is described in the data schema section.
Since the scale of the data required was high and the collection and preprocessing were complicated, we faced high CPU utilization while running our code, machines running out of memory, and storage space limitations on the SEASnet servers. As a result we had to parallelize our code and run it across multiple machines, and each of these instances took a long time to run.
Suitable libraries for mongoDB in R:
Even though libraries exist for extracting information from MongoDB, we faced issues with complicated functionality and queries where we had to find documents by `ObjectId`. The `RMongo` library also had limitations in extracting required fields, so we had to parse the returned strings ourselves. The documentation was insufficient, and figuring out how to use the libraries in R took a lot of time.
We initially attempted to run the data extraction code, written in CoffeeScript, from the IPython notebook, but since the notebook does not support that language we had to run this code separately.
We faced challenges identifying the programming language of a given file name. We used the `linguist` library for this.
Logistics:
1) Based on the amount of time and effort spent on the initial data collection and preprocessing, we realized that data extraction is as important and as challenging as data analysis. We intend to develop a framework that eases the pain of data collection and standardizes its practices.
2) Solving the rate-limit issue led us to weigh various ways of attacking a problem: synchronization, parallel programming, scale, etc.
3) Analyzing a smaller dataset before moving to the full data saved time, since we could detect early whether a particular analysis made sense and whether we could use it.
4) We had to do a lot of reading and brainstorming, consider all possible features, and analyze the ones that made the most sense. Unlike the usual case where the features are given, we had to come up with features and evaluate them ourselves. This was a good learning experience.
5) Experimenting with libraries was another good learning experience. We tried many analysis and data extraction libraries, which gave us a good picture of what exists beyond what we already knew.
Trends in dataset:
1) The volume of commits has increased across all languages over time.
2) Fluctuations in the commit activity of a particular language characterize its evolution, as in the case of CoffeeScript.
3) The data suggests that age plays an important role in the number of stars and forks a repository gains. For example, 'eggert/tz' is a very important repository: it hosts the Time Zone Database (often called tz or zoneinfo), which contains code and data representing the history of local time for many representative locations around the globe. Yet it has very few stars and forks, which can plausibly be attributed to the year it was created: older repositories tend to attract less attention and accessibility than newer repositories that may be far less critical.
4) Very few languages have low accessibility and high popularity, or vice versa, indicating a direct proportionality between forks and stars.
5) Average activity and the number of committers are higher for popular languages.
[1] GitHub API reference, https://developer.github.com/v3/
[2] The impact of language choice on github projects, http://corte.si/posts/code/devsurvey/
[3] GitHub data analysis, http://www.r-bloggers.com/github-data-analysis/
[4] Mining the Social Web, Chapter 7 - Mining GitHub
[5] CoffeeScript (Wikipedia, 2013) - http://en.wikipedia.org/wiki/CoffeeScript
[6] Job Queue - https://github.com/GotEmB/job-queue
GitHub repository - https://github.com/GotEmB/potential-archer