Course Project Notebook

Team Members

Project Goal and Motivation

GitHub is a rich source of skilled developers for your next startup; it is also a good place to work out which language to use for it.

But the questions to ask are:


  • How do we identify potential useful developers for a particular skillset?

  • Languages
  • How effective is a language?
  • How accessible is a language?
  • How impactful is a language?

    The goal of the project was to analyze a subset of GitHub repositories: the trends in activity and the impact of that activity on a particular repository. The analysis is performed per developer and per programming language (the language the developer is actively working in).

  • Language perspective Analysis:

  • Forks :
    We took forks as a measure of accessibility: users fork repositories they feel they can actually contribute to. Consequently, we judged languages that make up a high proportion of highly forked repositories to be more accessible, since they correspond to "easier" projects and learning them allows developers to make an immediate contribution.

    Stars :
    Similarly, we judged languages used in highly starred repositories to be more effective. We assumed that more stars meant higher project quality, and therefore that certain languages are better mediums for creating high-quality projects.

    For developers we sought metrics to determine their experience and quality. As a proxy for experience in a particular language we came up with the metric "contribution ratio"; this was the number of lines contributed by a developer over all lines contributed in that language.

  • Developer Analysis:
  • For a developer's skill in a language we looked at their commits individually, summing the stars of each repository the developer contributed to, weighted by the proportion of the repository written in that language. The idea was that contributions to a highly starred repository would have to meet some standard of quality, or else they would be rolled back. Using these metrics we try to address the questions posed above for developers.

    This metric indicates the amount of contribution of a developer towards a particular language.

    We also analyze the relationship between repository popularity and its attributes. This analysis indicates which factors contribute to a high number of followers, forks, pull requests and watchers for a repository.

    Github DataSet

    We have analyzed the data from GitHub. The data was collected using the GitHub APIs [1]. The data is stored in the following format:

    • Developer contribution information : For each developer we store the repositories they have contributed to. For each of these repositories we store activity information: the number of lines of code modified by the user, the timestamp of the change, and the programming language the change was made in. This data helped us identify the contribution and impact of a developer for a particular programming language.

    • Repository information : For each repository we store attributes like the programming languages involved, number of contributors, stars, number of forks. This information helped us to infer popular and easy languages. This information can enable an individual to decide on which programming language to use for a particular project.

    Project Stages

    The project is divided into three main stages:

    • Data Collection Stage: This stage retrieves data from the GitHub APIs. The code for this stage is written in CoffeeScript.
    • Data Pre-processing Stage: This stage creates additional collections in MongoDB from the collections obtained in the previous stage. The collections created in this stage are used in the data analysis stage. The code for this stage is also written in CoffeeScript.
    • Data Analysis Stage: This stage analyzes the collected information to find trends, especially those related to programming languages and developers. Most of the analysis code is written in R.

    Github Dataset Schema

    The data collected contains three main parts:


    Users correspond to developers who have committed to the repositories that exist in our data set

    Schema {
      username:
        type: String,
        index: true,
      languages: [
        language: String,
        temporal: [
          period: String,
          changes: Number,
        ]
      ],
      starsEarned: Number,
      followers: Number
    }

    Count: 43750 users


    Repositories (projects) for which we have information in our dataset

    Schema {
      fullName:
        type: String,
        index: true,
      stars: Number,
      forks: Number,
      contributors: [
        user:
          type: mongoose.Schema.ObjectId,
          ref: "User",
        weight: Number
      ]
      languages: [
        language: String,
        lineCount: Number
      ],
      (*) done: Boolean,
      (*) instanceId: String,
      (*) serverError: Boolean
    }

    Count: 805 repositories


    Commits made by developers to the repositories that exist in our dataset

    Schema {
      sha:
        type: String,
        index: true,
      author:
        type: mongoose.Schema.ObjectId,
        ref: "User",
      repository:
        type: mongoose.Schema.ObjectId,
        ref: "Repository",
      changes: [
        language: String,
        count: Number
      ],
      timestamp: Date
    }

    Count: 1385028 commits

    Derived Collections

    Apart from the entities mentioned above we require additional information that is derived from the above mentioned collections. These are used in the analysis stage.

    User Average Weight

    Average contribution of a user across all the repositories that he/she contributes to

    Schema: {
      author:
        type: mongoose.Schema.ObjectId,
        ref: "User",
      weight: Number
    }

    User Commit Activity

    Total contribution of a user to a particular repository for a particular language

    Schema: {
      author:
        type: mongoose.Schema.ObjectId,
        ref: "User",
      repository:
        type: mongoose.Schema.ObjectId,
        ref: "Repository",
      language: String,
      changes: Number
    }

    Repository Commit Activity

    Total changes made to a repository for each language

    Schema: {
      repository:
        type: mongoose.Schema.ObjectId,
        ref: "Repository",
      language: String,
      changes: Number
    }
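Both commit-activity collections are sums of changes over commits, grouped by different keys. The following is a minimal sketch in base R, assuming the commits have already been unwound into one row per (commit, language) change. The data frame `commit_changes` and its values are invented for illustration; the project itself builds these collections in MongoDB during the pre-processing stage.

```r
# Hypothetical unwound commits: one row per (commit, language) change.
commit_changes <- data.frame(
  author     = c("u1", "u1", "u2", "u2", "u2"),
  repository = c("r1", "r1", "r1", "r2", "r2"),
  language   = c("Ruby", "Ruby", "Ruby", "Java", "Ruby"),
  count      = c(10, 5, 20, 7, 3),
  stringsAsFactors = FALSE
)

# User Commit Activity: total changes per (author, repository, language).
user_commit_activity <- aggregate(
  count ~ author + repository + language, data = commit_changes, FUN = sum
)
names(user_commit_activity)[names(user_commit_activity) == "count"] <- "changes"

# Repository Commit Activity: total changes per (repository, language).
repo_commit_activity <- aggregate(
  count ~ repository + language, data = commit_changes, FUN = sum
)
names(repo_commit_activity)[names(repo_commit_activity) == "count"] <- "changes"
```

The second table can also be obtained by summing the first over authors, which is a useful consistency check between the two collections.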

    Contribution Ratio

    Contribution score (contribution ratio) of each user for each language that the user has contributed to

    ContributionRatio {
      author:
        type: mongoose.Schema.ObjectId,
        ref: "User",
      contribution_ratio: [
        language: String,
        ratio: Number
      ]
    }
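The contribution ratio is a user's changes in a language divided by all changes made in that language. A minimal sketch in base R, assuming per-user, per-language totals are already available; the data frame `activity` and its values are hypothetical and not taken from the dataset:

```r
# Hypothetical per-user, per-language change totals (not real data).
activity <- data.frame(
  author   = c("u1", "u2", "u3", "u1"),
  language = c("Ruby", "Ruby", "Ruby", "Java"),
  changes  = c(30, 60, 10, 50),
  stringsAsFactors = FALSE
)

# Total changes per language, aligned row by row with `activity`.
language_total <- ave(activity$changes, activity$language, FUN = sum)

# Contribution ratio: a user's changes in a language over all changes in it.
activity$ratio <- activity$changes / language_total
```

By construction the ratios for one language sum to 1, so with tens of thousands of users most individual ratios are very small.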

    Function Stack

    Libraries Used

    We used several packages for reading and manipulating the data.

    RMongo and rmongodb are the packages used to read data from MongoDB.

    rmongodb -

    Rmongo -

    We also used the "plyr" package in R to perform split, apply and combine operations on the data frames we created.

    We found plyr very useful for performing those operations on our large dataset.

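    ##################### Utility Functions ###########################
    # Look up column 8 of df (popularity: stars weighted by the language's share
    # of the repository) for a repository id and language; 0 if nothing matches.
    find_stars_lang <- function(df, repo, lang) {
      lang <- tolower(lang)
      ret <- 0
      for (i in seq_len(nrow(df))) {
        if (toString(df[i,1]) == repo && lang == tolower(toString(df[i,5]))) {
          ret <- df[i,8]
        }
      }
      ret
    }
    # Same lookup for column 9 (accessibility: forks weighted the same way).
    # Unlike find_stars_lang, the language comparison here is case-sensitive.
    #find_forks_lang(ddply_sum_lineCount, "537fc5b7280ef15170b56d3b", "js")
    find_forks_lang <- function(df, repo, lang) {
      ret <- 0
      for (i in seq_len(nrow(df))) {
        if (toString(df[i,1]) == repo && lang == toString(df[i,5])) {
          ret <- df[i,9]
        }
      }
      ret
    }
    # Contribution ratio for one language, read from a BSON document of the
    # contributionratios collection; 0 if the language is absent.
    find_lang_contrib_ratio <- function(bs, lang) {
      lang <- tolower(lang)
      c_ratios <- mongo.bson.value(bs,"contribution_ratio")
      ratio <- 0
      for (i in seq_along(c_ratios)) {
        if (!is.null(c_ratios[i][[1]]$language) && tolower(c_ratios[i][[1]]$language) == lang) {
          ratio <- c_ratios[i][[1]]$ratio
        }
      }
      ratio
    }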

    Languages Analysis

    The analysis for a particular language is based on two main aspects:

    • Average popularity: signifies how popular a programming language is. It is calculated as follows:
      • For each language, find all repositories that have code in that language
      • For each of these repositories, find what proportion of the code is written in that language
      • Multiply the stars earned by that repository by the proportion found above
      • Average the values calculated in the previous step over those repositories
    • Average accessibility: signifies how accessible, or how widely adopted for coding, a programming language is. It is calculated as follows:
      Follow the same steps as above, but use the forks instead of the stars earned by the repository.
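The steps above can be sketched on a tiny table. This is an illustration only: it uses base R (`ave` and `aggregate`) rather than the plyr calls of the notebook, and the data frame `repo_lang` with its values is hypothetical.

```r
# Hypothetical (repository, language) rows; values are illustrative only.
repo_lang <- data.frame(
  repo      = c("a", "a", "b", "b"),
  stars     = c(100, 100, 40, 40),
  forks     = c(20, 20, 10, 10),
  language  = c("CSS", "JavaScript", "JavaScript", "Ruby"),
  linecount = c(25, 75, 50, 50),
  stringsAsFactors = FALSE
)

# Share of each repository written in each language.
total_lines <- ave(repo_lang$linecount, repo_lang$repo, FUN = sum)
repo_lang$prop <- repo_lang$linecount / total_lines

# Weight stars and forks by that share.
repo_lang$popularity    <- repo_lang$stars * repo_lang$prop
repo_lang$accessibility <- repo_lang$forks * repo_lang$prop

# Average the weighted values per language.
lang_scores <- aggregate(
  cbind(popularity, accessibility) ~ language, data = repo_lang, FUN = mean
)
```

In this toy table JavaScript appears in both repositories, so its scores are the means of two weighted values (75 and 20 stars, 15 and 5 forks), while CSS and Ruby each come from a single repository.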

    Plotting languages in a 2-dimensional space, where one dimension is average popularity and the other is average accessibility, gives an indication of how effective a particular language is.

    Number of Unique Languages : 145

    %%R -w 700 -h 500
    ## find all unique languages that appear in any commit
    get_all_lang <- function() {
      mongo <- mongoDbConnect("dataset", host="localhost")
      output <- dbAggregate(mongo, "commits", c('{ "$unwind" : "$changes"}',
                                                ' { "$project" : { "lang" : "$changes.language" } } ',
                                                ' { "$group" : { "_id" : "$lang" } } '))
      Languages_All <- c()
      for (i in seq_along(output)) {
        bson <- mongo.bson.from.JSON(output[i])
        lang <- mongo.bson.value(bson,"_id")
        # skip missing or empty language names
        if (!is.null(lang) && nchar(lang) > 0) {
          Languages_All <- c(Languages_All, lang)
        }
      }
      unique(Languages_All)
    }
    # contains all the languages
    unique_languages <- get_all_lang()
      [1] "Boo"                   "Brainfuck"             "Slash"                
      [4] "Processing"            "MediaWiki"             "FLUX"                 
      [7] "CLIPS"                 "Gentoo Ebuild"         "Ada"                  
     [10] "Volt"                  "mupad"                 "Shell"                
     [13] "Io"                    "Vala"                  "Common Lisp"          
     [16] "FORTRAN"               "GLSL"                  "ActionScript"         
     [19] "JSONiq"                "XS"                    "QML"                  
     [22] "Frege"                 "Creole"                "ColdFusion"           
     [25] "Pascal"                "Rust"                  "Julia"                
     [28] "PowerShell"            "CSS"                   "INI"                  
     [31] "AutoHotkey"            "Prolog"                "Protocol Buffer"      
     [34] "Kotlin"                "Diff"                  "Cython"               
     [37] "Haxe"                  "MoonScript"            "Haskell"              
     [40] "UnrealScript"          "VHDL"                  "XQuery"               
     [43] "Lua"                   "Gettext Catalog"       "C++"                  
     [46] "Turing"                "Matlab"                "AppleScript"          
     [49] "AspectJ"               "Harbour"               "LLVM"                 
     [52] "OpenCL"                "OCaml"                 "Tcl"                  
     [55] "Dart"                  "VimL"                  "Smarty"               
     [58] "Puppet"                "RHTML"                 "Groovy"               
     [61] "CoffeeScript"          "Stylus"                "Elixir"               
     [64] "Java Server Pages"     "Perl"                  "Ragel in Ruby Host"   
     [67] "Markdown"              "Fancy"                 "Awk"                  
     [70] "ASP"                   "XSLT"                  "TOML"                 
     [73] "ABAP"                  "Go"                    "F#"                   
     [76] "YAML"                  "D"                     "HTML"                 
     [79] "Org"                   "HTTP"                  "GAS"                  
     [82] "RDoc"                  "Sass"                  "R"                    
     [85] "HTML+ERB"              "Erlang"                "TypeScript"           
     [88] "CMake"                 "Standard ML"           "Objective-C++"        
     [91] "Textile"               "reStructuredText"      "SQL"                  
     [94] "Rebol"                 "Pod"                   "Objective-C"          
     [97] "Cucumber"              "fish"                  "C"                    
    [100] "Objective-J"           "Inno Setup"            "Tcsh"                 
    [103] "Literate CoffeeScript" "Scheme"                "Emacs Lisp"           
    [106] "Mask"                  "TeX"                   "LiveScript"           
    [109] "Stata"                 "Handlebars"            "C#"                   
    [112] "Raw token data"        "AsciiDoc"              "PHP"                  
    [115] "Batchfile"             "DOT"                   "Verilog"              
    [118] "JSON"                  "SCSS"                  "eC"                   
    [121] "Assembly"              "Python"                "Visual Basic"         
    [124] "Scala"                 "Gnuplot"               "PostScript"           
    [127] "IDL"                   "Haml"                  "XML"                  
    [130] "NSIS"                  "Cuda"                  "Twig"                 
    [133] "Less"                  "Makefile"              "Java"                 
    [136] "Arduino"               "HTML+Django"           "Clojure"              
    [139] "GAP"                   "HTML+PHP"              "JavaScript"           
    [142] "Liquid"                "Groff"                 "Jade"                 
    [145] "Ruby"                 
    [1] 145

    How effective is a language?

    We create a data frame containing the repository and language information (stars, forks, etc.) for each repository and language pair. We then apply split, apply and combine operations to this data frame to calculate the average popularity and average accessibility of each language, and plot them.

    ########################### How effective is a language ? ##############################
    library(rmongodb)  # mongo.create, mongo.find.all
    library(RMongo)    # mongoDbConnect
    library(plyr)      # ddply, summarise
    users = data.frame(stringsAsFactors = FALSE) 
    repos = data.frame(stringsAsFactors = FALSE)
    mongo_rmongodb <- mongo.create(host = "localhost") 
    mongo_rmongo <- mongoDbConnect("dataset", host="localhost")
    DBNS_users = "dataset.users"
    DBNS_repos = "dataset.repositories"
    # Guard assumed to be a connection check; the condition is missing in the source
    if (mongo.is.connected(mongo_rmongodb)) {
      repos = mongo.find.all(mongo_rmongodb, DBNS_repos, list(done=TRUE)) 
    }
    ## create an empty data frame and fill it with one row per (repository, language)
    repos_by_languages <- data.frame(stringsAsFactors = FALSE)
    for (i in seq_len(nrow(repos))) {
      repos_one <-  repos[i,]
      languages_list <- repos_one$languages
      totalLineCount <- 0
      # Keep repositories with more than 5 stars (one further condition is missing in the source)
      if (!is.null(repos_one$stars) && repos_one$stars > 5 && class(repos_one$stars) != 'mongo.oid') {
        for (j in seq_along(languages_list)) {
          totalLineCount = totalLineCount + languages_list[[j]][['lineCount']]
        }
        for (j in seq_along(languages_list)) {
          # id conversion assumed to be mongo.oid.to.string(); the call is missing in the source
          repos_by_languages <- rbind(repos_by_languages, data.frame(id = mongo.oid.to.string(repos[i,1][[1]]), name=repos_one$fullName, stars = repos_one$stars, forks = repos_one$forks, language = languages_list[[j]]['language'], linecount = languages_list[[j]][['lineCount']], totalLineCount = totalLineCount))
        }
      }
    }
    cat("Table data : \n\n\n")
    print(head(repos_by_languages))
    cat("Dimensions : \n")
    print(dim(repos_by_languages))
    ##### Use split, apply and combine to find the proportional line count, popularity and accessibility
    ddply_sum_lineCount = ddply( repos_by_languages,
                                 c("id","name", "stars","forks","language","linecount"), 
                                 summarise,
                                 prop_lineCount = linecount/totalLineCount, 
                                 popularity = stars * prop_lineCount,
                                 accessibility = forks * prop_lineCount)
    cat("Table data : \n\n\n")
    print(head(ddply_sum_lineCount))
    cat("Dimensions : \n")
    print(dim(ddply_sum_lineCount))
    ##### Use split, apply and combine to find the average popularity and accessibility per language
    ddply_Overall = ddply( ddply_sum_lineCount,
                           c("language"),
                           summarise,
                           avg_popularity = mean(popularity),
                           avg_accessibility = mean(accessibility))
    cat("Table data : \n\n\n")
    print(head(ddply_Overall))
    cat("Dimensions : \n")
    print(dim(ddply_Overall))
    #ggplot(ddply_Overall,aes(x= avg_popularity, y = avg_accessibility, colour = language)) + geom_point() + 
    #  geom_text(aes(label=language),hjust=0, vjust=0)
    Table data : 
                            id              name stars forks   language linecount
    1 537fc5b7280ef15170b56d3b bartaz/impress.js 20366  4150        CSS     21141
    2 537fc5b7280ef15170b56d3b bartaz/impress.js 20366  4150 JavaScript     32840
    3 538006aa280ef15170b5a296 Prinzhorn/skrollr  8881  1689 JavaScript    103132
    4 538006aa280ef15170b5a296 Prinzhorn/skrollr  8881  1689        CSS      8434
    5 5380183c280ef15170b5a7c3   nnnick/Chart.js  8025  2638 JavaScript    175411
    6 5380183c280ef15170b5a7c3   nnnick/Chart.js  8025  2638        CSS     14218
    1          53981
    2          53981
    3         111566
    4         111566
    5         189629
    6         189629
    Dimensions : 
    [1] 2553    7
    Table data : 
                            id              name stars forks   language linecount
    1 537fc5b7280ef15170b56d3b bartaz/impress.js 20366  4150        CSS     21141
    2 537fc5b7280ef15170b56d3b bartaz/impress.js 20366  4150 JavaScript     32840
    3 538006aa280ef15170b5a296 Prinzhorn/skrollr  8881  1689 JavaScript    103132
    4 538006aa280ef15170b5a296 Prinzhorn/skrollr  8881  1689        CSS      8434
    5 5380183c280ef15170b5a7c3   nnnick/Chart.js  8025  2638 JavaScript    175411
    6 5380183c280ef15170b5a7c3   nnnick/Chart.js  8025  2638        CSS     14218
      prop_lineCount popularity accessibility
    1     0.39163780  7976.0954     1625.2969
    2     0.60836220 12389.9046     2524.7031
    3     0.92440349  8209.6274     1561.3175
    4     0.07559651   671.3726      127.6825
    5     0.92502202  7423.3017     2440.2081
    6     0.07497798   601.6983      197.7919
    Dimensions : 
    [1] 2553    9
    Table data : 
          language avg_popularity avg_accessibility
    1          CSS      1066.3730         279.25479
    2   JavaScript      2947.1455         555.51786
    3         Ruby      1626.4341         356.97164
    4          SQL       116.6489          96.33535
    5 CoffeeScript       964.0486         143.56952
    6        Shell       319.6098          57.37411
    Dimensions : 
    [1] 86  3

    Average Popularity(Stars) vs Average Accessibility (forks)

    Here we attempt to quantify a language by its accessibility as well as its effectiveness. As mentioned before, we use forks as a measure of accessibility and stars as a measure of quality/effectiveness. For each language we assigned an average accessibility and an average effectiveness score. These were computed by taking the repositories at least partly written in that language, using the proportion of each repository written in that language as a weight, and multiplying the weight by the number of forks (for accessibility) or stars (for effectiveness). We then average these products across repositories.

    Log-Log Effectiveness of Languages

    ggplot(ddply_Overall,aes(x= log(avg_popularity), y = log(avg_accessibility), colour = language)) + geom_point() + 
      geom_text(aes(label=language),hjust=0, vjust=0)

    Domain Specific Analysis of Languages

    We define a domain as a set of languages; here the domain contains two languages, Java and Ruby.

    ###############utility functions############################
    library(hash)  # hash(), values(), keys()
    # Add value to the entry stored under key. The body is truncated in the
    # source; accumulation is assumed here because the hashes start at 0.
    update_hash_value <- function(h, key, value){
        temp <- values(h, keys=key)
        hash:::.set(h, keys=key, values=temp[[1]] + value)
    }
    mongo <- mongoDbConnect("dataset", host="localhost")
    contribution_df <- dbGetQuery(mongo, "contributionratios", '{$or:[{"contribution_ratio": {$elemMatch:{"language":"Java"}}},{"contribution_ratio": {$elemMatch:{"language":"Ruby"}}}]}', skip=0, limit=100000)
    domain <- c("Java","ruby")
    # define a threshold for each language to pick only those contribution ratios above this threshold
    threshold <- c(3.622832e-012, 3.622832e-012) 
    contribution_ratios <- contribution_df[,c("author","contribution_ratio"), drop=FALSE]
    # one entry per (author, domain language) pair whose ratio passes the threshold
    user_contribution_list <- c()
    for (i in seq_len(nrow(contribution_ratios))) {
      Contribution_jsonObject <- fromJSON( contribution_ratios[2][[1]][i])
      for (j in seq_along(Contribution_jsonObject)) {
        if ("language" %in% names(Contribution_jsonObject[[j]])) {
          curr_language <- Contribution_jsonObject[[j]]$language
          if (tolower(curr_language) %in% tolower(domain)) {
            Language_index <- match(tolower(curr_language),tolower(domain))
            threshold_language <- threshold[Language_index]
            if (Contribution_jsonObject[[j]]$ratio > threshold_language) {
              user_contribution_list <- c(user_contribution_list,contribution_ratios[1][[1]][i] )
            }
          }
        }
      }
    }
    # defined here because the hashes below are keyed by unique author id
    user_contribution_unique_list <- unique(user_contribution_list)
    author_ratios <- data.frame(stringsAsFactors = FALSE)
    h_stars_ruby <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
    h_forks_ruby <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
    h_stars_java <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
    h_forks_java <- hash(keys=user_contribution_unique_list, values= rep(0, length(user_contribution_unique_list)))
    for (authorId in user_contribution_unique_list) {
      cursor <- mongo.find(mongo_rmongodb, ns = "dataset.usercommitactvities", query=list(author=mongo.oid.from.string(authorId), language=list('$in'=c('Java','Ruby'))), fields=list(repository=1L,language=1L) )
      user_contributions_ratio <- mongo.find(mongo_rmongodb, ns = "dataset.contributionratios", query=list(author=mongo.oid.from.string(authorId)),fields=list(contribution_ratio=1L))
      ratio_lang <- 0
      ratio_ruby <- 0
      ratio_java <- 0
      # cursor conditions below are assumed to be mongo.cursor.next(); they are missing in the source
      if (mongo.cursor.next(user_contributions_ratio)) {
        ratio <- mongo.cursor.value(user_contributions_ratio)
        ##ratio <- mongo.bson.value( ratio, "contribution_ratio")
        ratio_ruby <- find_lang_contrib_ratio(ratio, "Ruby")
        ratio_java <- find_lang_contrib_ratio(ratio, "Java")
      }
      while (mongo.cursor.next(cursor)) {
        item <- mongo.cursor.value(cursor)
        oid <- mongo.bson.value(item, "repository")
        lang <- mongo.bson.value(item, "language")
        ## language-weighted stars and forks of the repository (id argument assumed to be the oid as a string)
        stars <- find_stars_lang(ddply_sum_lineCount, mongo.oid.to.string(oid), lang)
        forks <- find_forks_lang(ddply_sum_lineCount, mongo.oid.to.string(oid), lang)
        if (lang == "Ruby") {
            stars  <- stars * ratio_ruby
            forks  <- forks * ratio_ruby
            update_hash_value(h_forks_ruby, authorId, forks) 
            update_hash_value(h_stars_ruby, authorId, stars) 
        } else if (lang == "Java") {
            stars  <- stars * ratio_java
            forks  <- forks * ratio_java
            update_hash_value(h_forks_java, authorId, forks) 
            update_hash_value(h_stars_java, authorId, stars) 
        }
      }
    }
    # copy the per-author totals out of the hashes into plain vectors
    forks_lang_ruby <- c()
    for (vals in values(h_forks_ruby)) forks_lang_ruby <- c(forks_lang_ruby, vals)
    stars_lang_ruby <- c()
    for (vals in values(h_stars_ruby)) stars_lang_ruby <- c(stars_lang_ruby, vals)
    forks_lang_java <- c()
    for (vals in values(h_forks_java)) forks_lang_java <- c(forks_lang_java, vals)
    stars_lang_java <- c()
    for (vals in values(h_stars_java)) stars_lang_java <- c(stars_lang_java, vals)
    # draw Ruby first, then overlay Java on the same axes so the legend covers both
    plot(stars_lang_ruby,forks_lang_ruby,ylim =c(0,25),xlim=c(0,100),xlab="stars",ylab="forks", col="red", pch=21)
    points(stars_lang_java,forks_lang_java, col="blue", pch=22)
    l <- legend( "topright", inset = c(0,0.4) 
                 , cex = 1.5
                 , bty = "n"
                 , legend = c("Ruby", "Java")
                 , text.col = c("red", "blue")
                 , pt.bg = c("red","blue")  # argument name assumed; it is missing in the source
                 , pch = c(21,22)
    )
    title("Measure of forks vs stars for both Java and Ruby")
    plot(stars_lang_ruby,stars_lang_java,ylim =c(0,25),xlim=c(0,100),xlab="Stars for Ruby",ylab="Stars for Java", col="red")
    title("Measure of Stars for Ruby vs Stars for Java ")

    a) Contribution for Forks vs Contribution for Stars for a Domain - Java and Ruby

    Here we took a developer-centric approach, trying to determine the quality and accessibility of code written by developers in a certain domain. To link the stars/forks of the repositories to the contributions of individual developers we used the contribution ratio metric. We took the ratio of a developer's contribution in a language to the overall amount of code written in it, multiplied it by the proportion of the repository written in that language, and multiplied the result by the repository's stars/forks. For a given developer and language we sum these values across the repositories the developer has contributed to in that language.
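The per-developer score described above is a sum of three-way products. A minimal sketch in base R, assuming the contribution ratio and the language share of each repository have already been computed; the data frame `contribs`, its column names and its values are hypothetical:

```r
# Hypothetical inputs (illustrative values, not from the dataset).
# One row per repository a developer touched in a given language.
contribs <- data.frame(
  author   = c("u1", "u1", "u2"),
  language = c("Ruby", "Ruby", "Ruby"),
  stars    = c(100, 40, 100),   # stars of the repository
  forks    = c(20, 10, 20),     # forks of the repository
  prop     = c(0.5, 1.0, 0.5),  # share of the repository written in `language`
  ratio    = c(0.2, 0.2, 0.1),  # developer's contribution ratio for `language`
  stringsAsFactors = FALSE
)

# Per-repository contribution to the developer's score.
contribs$star_score <- contribs$ratio * contribs$prop * contribs$stars
contribs$fork_score <- contribs$ratio * contribs$prop * contribs$forks

# Sum over the repositories of each (developer, language) pair.
dev_scores <- aggregate(
  cbind(star_score, fork_score) ~ author + language, data = contribs, FUN = sum
)
```

For developer u1 the star score is 0.2 × 0.5 × 100 + 0.2 × 1.0 × 40 = 18, and the fork score is 2 + 2 = 4.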

    b) Number of Users who have contributed to JavaScript and CSS, Java and Ruby

    print("User contributed to either JS or CSS")
    print(length(unique(user_contribution_list)))
    print("User contributed to both")
    print(length((user_contribution_list)) - length(unique(user_contribution_list)))
    print("537ff216280ef15170b59ba5" %in% user_contribution_list)
    user_contribution_unique_list <- unique(user_contribution_list)

    Either JavaScript or CSS : 9553 users
    Both JavaScript and CSS : 1872 users

    Similarly, for Java and Ruby:

    Either Java or Ruby : 11313 users
    Both Java and Ruby : 163 users
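The "either" and "both" counts follow from the way the author list is built: an author id is appended once for every domain language that passes the threshold, so an author present in both languages appears twice. A small sketch with made-up ids, assuming each author appears at most once per language:

```r
# Hypothetical author ids, appended once per domain language that passed
# the threshold (illustrative values only).
user_contribution_list <- c("u1", "u2", "u2", "u3", "u4", "u4")

# Authors who contributed to at least one of the two languages.
either <- length(unique(user_contribution_list))

# Authors who contributed to both: each contributes exactly one duplicate.
both <- length(user_contribution_list) - either

# Equivalent count using duplicated(), valid under the same assumption.
both_check <- sum(duplicated(user_contribution_list))
```

Here `either` is 4 and `both` is 2. With more than two languages in the domain the subtraction would overcount, and a per-author tally would be needed instead.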

    c) Forks vs Stars across all languages

    ###### forks vs stars for all languages
    # Returns one row per repository: id, stars, forks and the mean
    # average-weight of its contributors.
    find_avg_userweight_for_repr <- function() {
      cursor <- mongo.find(mongo_rmongodb, ns = "dataset.useravgweights", fields=list(author=1L,weight=1L))
      h <- hash(keys=1,values=1)
      # loop conditions are assumed to be mongo.cursor.next(); they are missing in the source
      while (mongo.cursor.next(cursor)) {
        item <- mongo.cursor.value(cursor)
        author <- mongo.bson.value(item, "author")
        weight <- mongo.bson.value(item, "weight")
        #if(!has.key(author, h))
        hash:::.set(h, keys=author,values=weight)
      }
      cursor <- mongo.find(mongo_rmongodb, ns = "dataset.repositories", fields=list("_id"=1L,"contributors.user"=1L,stars=1L,forks=1L))
      stars_arr <- c()
      forks_arr <- c()
      sum_arr <- c()
      id_arr <- c()
      while (mongo.cursor.next(cursor)) {
        item <- mongo.cursor.value(cursor)
        repo <- mongo.bson.value(item, "_id")
        contrib <- mongo.bson.value(item, "contributors")
        stars <- mongo.bson.value(item, "stars")
        forks <- mongo.bson.value(item, "forks")
        stars_arr <- c(stars_arr, stars)
        forks_arr <- c(forks_arr, forks)
        sum <- 0
        count <- 0
        for (i in seq_along(contrib)) {
          sum = sum + values(h, keys=contrib[i][[1]]$user)[[1]]
          count = count + 1
        }
        sum_arr <- c(sum_arr, sum/count)
        # id conversion assumed to be mongo.oid.to.string(); the call is missing in the source
        id_arr <- c(id_arr, mongo.oid.to.string(repo))
      }
      data.frame(id_arr, stars_arr, forks_arr, sum_arr)
    }
    df <- find_avg_userweight_for_repr()
    # regress stars on forks and overlay the fitted line on the scatter plot
    fit <- lm(stars_arr ~ forks_arr, data=df)
    ggplot(df, aes(x=forks_arr, y=stars_arr)) + geom_point() + stat_smooth(method = "lm", formula = y ~ x, size = 1, se = FALSE, colour = "black")

    Temporal Language Analysis

    Time series analysis of contributions to various languages reveals interesting trends in the way languages evolve over time. The data required for this analysis is retrieved partly with the map-reduce framework in MongoDB and partly with code written in R. The map-reduce code performs the following steps:

    1. Group all commits made to all the repositories by language and by month and year
    2. For each of these groups find the sum of all the changes made

    Some interesting trends:
    While looking for trends in the commit activity of languages over time, we found an interesting pattern for CoffeeScript. The commit activity for this scripting language was initially low, but there was a sudden peak in 2013, the year in which CoffeeScript was ranked 29th among languages by number of questions tagged on Stack Overflow.

    The plots below show the trends in commit activity over time for all languages, for major programming languages and for CoffeeScript.

    a) Temporal analysis for all languages over time

    We plot the number of commits for each language over time.

    # Analysis of language evolution over time
    # Returns a hash that maps each language to a data frame of (date, changes, month label).
    temporal_lang_performance <- function() {
      cursor <- mongo.find(mongo_rmongodb, ns = "dataset.temporalcommits_new", sort=list("value.temporal.$.date"=1L) ,fields=list(value=1L))
      h  <- hash(keys=1,values=1)
      # loop condition assumed to be mongo.cursor.next(); it is missing in the source
      while (mongo.cursor.next(cursor)) {
        item <- mongo.cursor.value(cursor)
        val <- mongo.bson.value(item, "value")
        lang <- val$lang
        temporal <- val$temporal
        changes_arr <- c()
        date_arr <- c()
        str_arr <- c()
        # skip languages without any temporal entries
        if (length(temporal) == 0) {
          next
        }
        for (i in seq_along(temporal)) {
          dt <- strptime(temporal[[i]]$date,format="%a, %d %b %Y %H:%M:%S GMT")
          changes <- temporal[[i]]$changes
          date_arr <- append(date_arr , as.Date(dt)) #format(as.Date(dt), "%b,%Y")
          changes_arr <- append(changes_arr, changes)
          str_arr <- append(str_arr, format(as.Date(dt), "%b,%Y"))
        }
        df <- data.frame(date_arr , changes_arr, str_arr)
        hash:::.set(h,keys=lang, values=df)
      }
      h
    }
    ret <- temporal_lang_performance()
    p <- ggplot()
    mat_global <- data.frame()
    lang <- keys(ret)
    for (key in seq_along(lang)) {
      # entry 19 is excluded, as in the original condition
      if (!(key %in% c(19))) {
        # skip entries that do not hold all three columns
        if (length(values(ret, keys=lang[key])) < 3) {
          next
        }
        x <- values(ret, keys=lang[key])[[1]]
        y <- values(ret, keys=lang[key])[[2]]
        z <- values(ret, keys=lang[key])[[3]]
        w <- rep(lang[key], length(x))
        mat <- data.frame(x,y,z,w)
        mat <- mat[order(mat[,1]), ]
        mat <- data.frame(mat)
        mat_global <- rbind(mat_global, mat)
      }
    }
    mat_global <- mat_global[order(mat_global[,1]), ]
    ggplot(mat_global, aes(x=x, y=y,
                          colour=w)) +
    geom_line() + scale_y_continuous(limits=c(0, 10000)) + theme(legend.key.size = unit(0.4, "cm"))
    ### End temporal analysis

    b) Temporal analysis for major programming languages

    We plot the number of commits for the major languages over time.

    c) Temporal analysis for CoffeeScript

    We plotted the temporal data for CoffeeScript and see a peak in 2013. As of 2013, CoffeeScript was the 12th most popular language on GitHub in terms of project count [5]. In 2013 it was also ranked 29th among languages, based on the number of questions tagged on Stack Overflow.

    Relationship between Stars and Forks - Linear Regression

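    We performed a linear regression of stars on forks to find the relationship between the two. The fit is only moderate: the slope is statistically significant, but forks explain only about half of the variance in stars (R-squared of about 0.51), so forks alone do not predict stars well. The results are below: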

    df <- find_avg_userweight_for_repr()
    fit <- lm(df[,3]~df[,2], data=df)
    Call:
    lm(formula = stars_arr ~ forks_arr, data = df)

    Residuals:
       Min     1Q Median     3Q    Max
    -41683  -1062   -503    536  21504

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.850e+03  1.166e+02   24.45   <2e-16 ***
    forks_arr   1.758e+00  6.089e-02   28.88   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 2868 on 803 degrees of freedom
    Multiple R-squared:  0.5095,  Adjusted R-squared:  0.5088
    F-statistic:   834 on 1 and 803 DF,  p-value: < 2.2e-16
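To make the quantities in the summary above concrete (intercept, slope, multiple R-squared), here is a minimal ordinary least squares sketch in plain Python. It is an illustration only, not the notebook's R code: the helper names `fit_line` and `r_squared` are hypothetical and the `forks`/`stars` numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up values for illustration, not project data.
forks = [10, 20, 30, 40, 50]
stars = [25, 35, 70, 65, 110]

intercept, slope = fit_line(forks, stars)
r2 = r_squared(forks, stars, intercept, slope)
print(intercept, slope, round(r2, 4))
```

Read the same way, the coefficients reported above predict roughly 2850 + 1.758 × 1000 ≈ 4608 stars for a repository with 1000 forks, while the residual standard error of 2868 shows how loose that prediction is.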

    General Data Analysis on GitHub

    Plot of Number of Repositories per Language

    library(RMongo)    # mongoDbConnect, dbAggregate
    library(rmongodb)  # mongo.create, mongo.find, mongo.cursor.*, mongo.bson.*
    library(plyr)      # ddply, summarise

    # Returns the distinct languages that appear in the commits collection.
    get_all_lang <- function() {
      mongo <- mongoDbConnect("dataset", host="localhost")
      output <- dbAggregate(mongo, "commits", c('{ "$unwind" : "$changes"}',
                                                ' { "$project" : { "lang" : "$changes.language" } } ',
                                                ' { "$group" : { "_id" : "$lang" } } '))
      Languages_All <- c()
      for (i in 1:length(output)) {
        bson <- mongo.bson.from.JSON(output[i])
        lang <- mongo.bson.value(bson,"_id")
        if (!is.null(lang) && nchar(lang) > 0) {
          Languages_All <- c(Languages_All, lang)
        }
      }
      unique(Languages_All)
    }
    #unique_languages <- get_all_lang()
    mongo_rmongodb <- mongo.create(host = "localhost")
    user_commit_activity <- data.frame(stringsAsFactors = FALSE)
    user_commit_act <- mongo.find(mongo_rmongodb, ns="dataset.usercommitactvities", fields=list(language=1L,repository=1L,author=1L,changes=1L))
    while (mongo.cursor.next(user_commit_act)) {
      item <- mongo.cursor.value(user_commit_act)
      # The calls on the next two lines were partly lost in the original;
      # mongo.oid.to.string(mongo.bson.value(...)) is an assumed reconstruction.
      repository <- mongo.oid.to.string(mongo.bson.value(item, "repository"))
      author <- mongo.oid.to.string(mongo.bson.value(item, "author"))
      language <- mongo.bson.value(item, "language")
      changes <- mongo.bson.value(item, "changes")
      if (class(repository) != "NULL" && class(author) != "NULL" && class(language) != "NULL" && class(changes) != "NULL") {
        user_commit_activity <- rbind(user_commit_activity, data.frame(author=author,repository=repository,language=language,changes=changes))
      }
    }
    # Repositories per language: one row per (language, repository), then count rows per language.
    ddply_lang_project = ddply(user_commit_activity,
                               c("language", "repository"),
                               summarise,
                               temp = length(author))
    ddply_projects_per_lang = ddply( ddply_lang_project,
                                     c("language"),
                                     summarise,
                                     repo_count = length(repository))
    # Average line changes per developer for each language.
    ddply_changes_per_user = ddply( user_commit_activity,
                                    c("language", "author"), 
                                    summarise,
                                    total_changes = sum(changes))
    ddply_changes_per_user_lang = ddply( ddply_changes_per_user,
                                         c("language"),
                                         summarise,
                                         avg_commits = mean(total_changes))
    # Developers per language. The summary column of the first call was lost
    # in the original; temp is an assumed placeholder that is not used later.
    ddply_dev_lang = ddply( user_commit_activity,
                            c("language", "author"), 
                            summarise,
                            temp = length(repository))
    ddply_dev_per_lang = ddply( ddply_dev_lang,
                                c("language"),
                                summarise,
                                dev_count = length(author))
    ddply_projects_per_lang_matrix <- as.matrix(ddply_projects_per_lang[, "repo_count"])
    row.names(ddply_projects_per_lang_matrix) <- ddply_projects_per_lang[, "language"]
    barplot(t(ddply_projects_per_lang_matrix),horiz=TRUE,cex.names=0.8,xlim=c(0,700),xlab="Number of Repositories", main="Number of Repositories for each Language")
    ddply_changes_per_user_lang_matrix <- as.matrix(ddply_changes_per_user_lang[, "avg_commits"])
    row.names(ddply_changes_per_user_lang_matrix) <- ddply_changes_per_user_lang[, "language"]
    barplot(t(ddply_changes_per_user_lang_matrix),horiz=TRUE,xlim=c(0,150000),cex.names=0.8,xlab="Average Line Changes", main="Average Line Changes for each Language")
    ddply_dev_per_lang_matrix <- as.matrix(ddply_dev_per_lang[, "dev_count"])
    row.names(ddply_dev_per_lang_matrix) <- ddply_dev_per_lang[, "language"]
    barplot(t(ddply_dev_per_lang_matrix),horiz=TRUE,xlim=c(0,5000),cex.names=0.8,xlab="Number of Developers", main="Number of Developers for each Language")

    Plot for Number of Changes made per User per Language

    Plot for Number of Developers contributing to each language
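The three per-language summaries plotted in this section (repositories, average line changes per developer, developers) can be sketched in one pass in plain Python. This is an illustration under assumptions, not the notebook's R code: the row keys mirror the `author`, `repository`, `language` and `changes` fields used above, while the function name `summarise_by_language` and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarise_by_language(rows):
    """Summarise commit-activity rows per language.

    Each row is a dict with author, repository, language and changes keys.
    """
    repos = defaultdict(set)
    authors = defaultdict(set)
    changes = defaultdict(lambda: defaultdict(int))
    for row in rows:
        lang = row["language"]
        repos[lang].add(row["repository"])
        authors[lang].add(row["author"])
        changes[lang][row["author"]] += row["changes"]

    summary = {}
    for lang in repos:
        per_author = list(changes[lang].values())
        summary[lang] = {
            "repo_count": len(repos[lang]),      # distinct repositories
            "dev_count": len(authors[lang]),     # distinct developers
            "avg_changes": sum(per_author) / len(per_author),
        }
    return summary


# Made-up rows for illustration, not project data.
rows = [
    {"author": "a1", "repository": "r1", "language": "Python", "changes": 100},
    {"author": "a1", "repository": "r2", "language": "Python", "changes": 50},
    {"author": "a2", "repository": "r1", "language": "Python", "changes": 30},
    {"author": "a2", "repository": "r3", "language": "Ruby", "changes": 20},
]
summary = summarise_by_language(rows)
print(summary)
```

The average is taken over developers' total changes, so one very active developer does not count more than once per language.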

    Developer Analysis

    # Uses mongo_rmongodb and user_commit_activity as built in the General Data Analysis code.

    # Looks up a username by the string form of the user's ObjectId.
    getUserName <- function(userId) {
      username <- ""
      users <- mongo.find(mongo_rmongodb, ns = "dataset.users", query=list("_id"=mongo.oid.from.string(userId)),fields=list(username=1L))
      while (mongo.cursor.next(users)) {
        item <- mongo.cursor.value(users)
        username <- mongo.bson.value(item, "username")
      }
      username
    }
    # Looks up a repository's full name by the string form of its ObjectId.
    getRepoName <- function(repoId) {
      reponame <- ""
      repos <- mongo.find(mongo_rmongodb, ns = "dataset.repositories", query=list("_id"=mongo.oid.from.string(repoId)),fields=list(fullName=1L))
      while (mongo.cursor.next(repos)) {
        item <- mongo.cursor.value(repos)
        reponame <- mongo.bson.value(item, "fullName")
      }
      reponame
    }
    # Repositories per author: one row per (author, repository), then count rows per author.
    ddply_auth_project = ddply(user_commit_activity,
                               c("author", "repository"),
                               summarise,
                               temp = length(language))
    ddply_projects_per_author = ddply( ddply_auth_project,
                                       c("author"),
                                       summarise,
                                       repo_count = length(repository))
    ddply_projects_per_author_matrix <- as.matrix(ddply_projects_per_author[, "repo_count"])
    #row.names(ddply_projects_per_author_matrix) <- getUserName(ddply_projects_per_author[, "author"])
    barplot(t(ddply_projects_per_author_matrix),horiz=TRUE,cex.names=1.0,xlim=c(0,25),xlab="Number of Repositories", main="Number of Repositories contributed by each user")
    # Contributors per repository.
    ddply_auth_project = ddply(user_commit_activity,
                               c("author", "repository"),
                               summarise,
                               temp = length(language))
    # The column name on this line was lost in the original; repository is assumed.
    ddply_auth_project$repository <- factor(ddply_auth_project$repository)
    ddply_projects_per_repos = ddply( ddply_auth_project,
                                      c("repository"),
                                      summarise,
                                      author_count = length(author))
    ddply_projects_per_repos_matrix <- as.matrix(ddply_projects_per_repos[, "author_count"])
    #row.names(ddply_projects_per_author_matrix) <- getUserName(ddply_projects_per_author[, "author"])
    barplot(t(ddply_projects_per_repos_matrix),horiz=TRUE,cex.names=1.0,xlim=c(0,500),xlab="Number of Contributors", main="Number of Contributors for each Repository")

    Repository Analysis

    Stars versus Average Contribution Ratios for Developers for each repository

    Forks versus Average Contribution Ratios for Developers for each repository

    K Means Clustering on Developers Based on their Contribution Ratios to Languages

    Number of Clusters : 5

    We clustered users based on their contribution ratios across languages. The figure shows the resulting clusters, labeled by cluster id.

    library(fpc)  # plotcluster

    # Assumes unique_languages holds the language names (see get_all_lang).
    user_contribution_cursor <- mongo.find(mongo_rmongodb, ns = "dataset.contributionratios" )
    user_lang_contribution_df <- matrix(0,nrow=1,ncol=length(unique_languages))
    author_id_vector <-c()
    while (mongo.cursor.next(user_contribution_cursor)) {
        item <- mongo.cursor.value(user_contribution_cursor)
        authorId <- mongo.bson.value(item, "author")
        # The rest of this call was lost in the original; converting the
        # ObjectId to a string is an assumption.
        author_id_vector <- c(author_id_vector, mongo.oid.to.string(authorId))
        contributions_list <- mongo.bson.value(item, "contribution_ratio")
        # One row per user: a ratio for each language, zero where the user has no contribution.
        language_zeros <- matrix(0,nrow=1,ncol=length(unique_languages))
        if (!is.null(contributions_list)) {
            for (contribution in contributions_list) {
                if (class(contribution$language) != "NULL" && class(contribution$ratio) != "NULL") {
                    idx <- match(tolower(contribution$language),tolower(unique_languages))
                    # Skip languages that are not in unique_languages.
                    if (!is.na(idx)) {
                        language_zeros[idx] <- contribution$ratio
                    }
                }
            }
        }
        user_lang_contribution_df <- rbind(user_lang_contribution_df,language_zeros)
    }
    # Drop the all-zero placeholder row used to initialise the matrix.
    user_lang_contribution_df <- user_lang_contribution_df[-1,]
    user_lang_contribution_dataframe <- data.frame(user_lang_contribution_df)
    row.names(user_lang_contribution_dataframe) <- author_id_vector
    cl <- kmeans(user_lang_contribution_dataframe, 5)
    plotcluster(user_lang_contribution_df, cl$cluster, pointsbyclvecd = TRUE)
    cluster_vector <- cl$cluster
    # Any further arguments to which() were lost in the original.
    y <- which(cluster_vector==3)
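The clustering step can be illustrated with a small, deterministic version of k-means (Lloyd's algorithm) in plain Python. This is a sketch under assumptions, not the R `kmeans` call used above, which picks its own starting centres. The helper names, the two-language sample and the hand-picked starting centroids are all hypothetical, and only two clusters are used to keep the example readable.

```python
def build_matrix(contributions, languages):
    """One row per author and one column per language; 0.0 where nothing was contributed."""
    index = {lang.lower(): col for col, lang in enumerate(languages)}
    authors = sorted(contributions)
    rows = []
    for author in authors:
        row = [0.0] * len(languages)
        for lang, ratio in contributions[author].items():
            col = index.get(lang.lower())  # case-insensitive language match
            if col is not None:
                row[col] = ratio
        rows.append(row)
    return authors, rows


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, centroids, max_iterations=50):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    labels = [nearest(p, centroids) for p in points]
    for _ in range(max_iterations):
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        centroids = updated
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Made-up contribution ratios for illustration, not project data.
languages = ["Python", "Ruby"]
contributions = {
    "u1": {"python": 0.9, "Ruby": 0.1},
    "u2": {"Python": 0.8},
    "u3": {"Ruby": 0.7},
    "u4": {"Ruby": 0.9, "Python": 0.05},
}
authors, rows = build_matrix(contributions, languages)
labels, centroids = kmeans(rows, [rows[0], rows[2]])
print(dict(zip(authors, labels)))
```

Users whose contributions concentrate on the same languages end up in the same cluster, which is what the cluster plot visualises.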

    Team Contributions

    • Data:
      Data acquisition effort, Data novelty: Badhrinathan, Kumar and Thakur

    • Analysis: IPython notebook construction effort
      Data analysis Code: Balan, Ganapathy
      Data analysis methods: Badhrinathan, Balan, Thakur, Kumar

    • Documentation:
      Presentation Effort: Thakur, Balan
      Project summary effort: Badhrinathan, Balan, Thakur, Kumar

    • Overall: project methods, project creativity, project difficulty:
      Badhrinathan, Kumar, Balan, Thakur, Ganapathy

    Problems Faced

    Issues with data collection:

    The dataset was obtained mainly from the GitHub APIs. A major limitation of these APIs is that they are rate limited, so we can make only a limited number of HTTP requests to the server. Since our data collection involves multiple API requests, we had to find a way around this problem. We developed a new library called Job Queue, which is based on the producer-consumer model. The consumers are rate limited, which keeps data collection within the API limits. We also had to ensure that all consumers are authenticated with the GitHub server.
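The producer-consumer idea with rate-limited consumers can be sketched in plain Python as follows. This is not the Job Queue library itself: the names `RateLimiter`, `run_jobs` and `fake_fetch` are hypothetical, the interval is arbitrary, and no network request or authentication is performed.

```python
import queue
import threading
import time


class RateLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay > 0:
            time.sleep(delay)


def run_jobs(jobs, handler, consumers=2, min_interval=0.01):
    """Put jobs on a queue and let rate-limited consumer threads process them."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()
    limiter = RateLimiter(min_interval)

    def consume():
        while True:
            job = work.get()
            if job is None:  # sentinel: no more work for this consumer
                break
            limiter.wait()
            outcome = handler(job)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=consume) for _ in range(consumers)]
    for thread in threads:
        thread.start()
    for job in jobs:  # producer side
        work.put(job)
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return results


def fake_fetch(page):
    """Stand-in for one API request; no network call is made."""
    return "page-%d" % page


pages = sorted(run_jobs(range(5), fake_fetch))
print(pages)
```

Sharing one limiter between all consumers caps the total request rate regardless of how many consumer threads run, which is the property a rate-limited API requires.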

    Dirty data:

    The data returned by the APIs was not in a format suitable for the analysis we intended to perform, so we introduced a preprocessing stage that transforms the data using: 1) MongoDB Map-Reduce 2) MongoDB Aggregation 3) R ddply. The information extracted by these steps is described in the data schema section.
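As an illustration of what an aggregation step does, the sketch below mimics in plain Python the unwind, project and group stages used by `get_all_lang` to list distinct languages. The `changes` array with a `language` field follows that pipeline; the `sha` and `file` fields, the function names and the sample documents are hypothetical.

```python
def unwind(documents, field):
    """Yield one copy of each document per element of its list `field`.

    Documents that lack the field produce nothing.
    """
    for doc in documents:
        for element in doc.get(field, []):
            flat = dict(doc)
            flat[field] = element
            yield flat


def distinct_languages(commits):
    """Collect each non-empty language once, in first-seen order."""
    seen = []
    for doc in unwind(commits, "changes"):
        lang = doc["changes"].get("language")
        if lang and lang not in seen:
            seen.append(lang)
    return seen


# Made-up commit documents for illustration, not project data.
commits = [
    {"sha": "c1", "changes": [{"file": "a.py", "language": "Python"},
                              {"file": "b.rb", "language": "Ruby"}]},
    {"sha": "c2", "changes": [{"file": "c.py", "language": "Python"},
                              {"file": "README", "language": ""}]},
    {"sha": "c3"},
]
print(distinct_languages(commits))
```

Flattening the nested `changes` array first is what lets the later grouping treat every changed file as its own record.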

    Memory and CPU limitations:

    Since the volume of data required was large and the data collection and preprocessing were complicated, we faced high CPU utilization while running our code. The machine running the code also ran out of memory, and we ran into storage space limits on the SeasNet servers. As a result we had to parallelize our code and run it across multiple machines, and each of these instances took a long time to run.

    Suitable libraries for mongoDB in R:

    Even though libraries exist for extracting information from MongoDB, we had trouble with more complicated functionality, such as queries that find documents by `ObjectId`. The `RMongo` library also had limitations in extracting the required fields, so we had to parse the strings it returned. The documentation was insufficient, and figuring out how to use the library in R took a lot of time.

    IPython doesn't support Node.js and CoffeeScript:

    We initially attempted to run the data extraction code, written in CoffeeScript, from the IPython notebook, but since the notebook doesn't support these languages we had to run this code separately.

    Language identification from file names and content:

    Identifying the programming language of a file from its name was challenging. We used the `linguist` library for this.
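To show why file names alone are only a first approximation, here is a minimal extension lookup in plain Python. It is not how `linguist` works internally and does not use it; the tables and the function name `guess_language` are hypothetical, and a real classifier also has to inspect file content for ambiguous extensions such as `.h`.

```python
import os

# Hypothetical lookup tables covering only a handful of cases.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".rb": "Ruby",
    ".coffee": "CoffeeScript",
    ".js": "JavaScript",
    ".r": "R",
    ".c": "C",
}
FILENAME_TO_LANGUAGE = {
    "makefile": "Makefile",
    "rakefile": "Ruby",
}


def guess_language(file_name):
    """Guess a language from a file name; return None when unknown."""
    base = os.path.basename(file_name).lower()
    if base in FILENAME_TO_LANGUAGE:
        return FILENAME_TO_LANGUAGE[base]
    _, extension = os.path.splitext(base)
    return EXTENSION_TO_LANGUAGE.get(extension)


for name in ["src/app.coffee", "lib/Rakefile", "analysis.R", "README"]:
    print(name, "->", guess_language(name))
```

Files that fall through to `None` are the ones that need content-based detection, which is the harder part of the problem described above.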

    Summary, Experiences and lessons learnt


    1) Based on the time and effort spent on the initial data collection and preprocessing, we realized that data extraction is as important and challenging as data analysis. We intend to develop a framework for data analysis that eases the pain of data collection and standardizes data collection practices.

    2) Solving the rate limit issue led us to analyze various ways of attacking a problem, such as synchronization, parallel programming and scaling.

    3) Running the analysis on a smaller data set before moving to the full data saved us time, since we could detect early whether a particular analysis made sense and whether we could use it.

    4) We had to do a lot of reading and brainstorming, consider all possible features, and analyze the ones that made the most sense. Unlike the normal case where the features are given, we had to come up with features and evaluate them. This was a good learning experience.

    5) Experimenting with libraries was another good learning experience. We tried many analysis libraries as well as data extraction libraries, which gave us a good idea of what exists beyond what we already knew.

    Trends in dataset:

    1) The volume of commits has increased across all languages over time.

    2) The fluctuations in commit activity for a particular language characterize the evolution of that language, as in the case of CoffeeScript.

    3) The data suggests that time plays an important role in the number of stars and forks a repository gains. For example, 'eggert/tz' is a very important repository: it holds the Time Zone Database (often called tz or zoneinfo), which contains code and data that represent the history of local time for many representative locations around the globe. This repository nevertheless has very few stars and forks, which could be attributed to the year in which it was created. It seems that older repositories tend to gather less attention and accessibility than recent repositories that might not be as critical.

    4) There are very few languages with low accessibility and high popularity or vice versa, which indicates a roughly proportional relationship between forks and stars.

    5) Average activity and the number of committers are higher for a popular language.


    [1] GitHub API reference,
    [2] The impact of language choice on github projects,
    [3] GitHub data analysis,
    [4] Mining the Social Web, Chapter 7 - Mining GitHub
    [5] CoffeeScript Data Analysis (2013) -
    [6] Job Queue -

    Github Repository -