Programming Language Correlation

This sample notebook demonstrates working with GitHub activity data, made available through the publicly accessible GitHub Timeline dataset in the BigQuery Sample Tables.

This notebook tackles the question: "How likely are you to program in X, if you program in Y?" For example, the answer might feed into a repository exploration, recommendation, or search tool that personalizes results based on your own contributions.

It is based on an example published at http://datahackermd.com/2013/language-use-on-github/. It counts pushes or commits made by all users across all repositories on GitHub and their associated repository languages to determine the correlation between languages.


In [1]:
import google.datalab.bigquery as bq
import matplotlib.pyplot as plot
import numpy as np
import pandas as pd

Understanding the GitHub Timeline

We're going to work with the GitHub Archive project data. It contains all GitHub events (commits, pushes, forks, watches, etc.) along with metadata about the events (e.g., user, time, place). The schema and sample data will help us understand this dataset further.


In [ ]:
%%bq tables describe --name "publicdata.samples.github_timeline"

The GitHub timeline is a large dataset. A quick lookup of table metadata gives us the row count.


In [3]:
table = bq.Table('publicdata.samples.github_timeline')
table.metadata.rows


Out[3]:
6219749

With over 6 million events, it is important to be able to sample the data. Passing a sampling option when executing a query lets us sample tables or queries.


In [4]:
bq.Query.from_table(table).execute(sampling=bq.Sampling.default(
    fields=['repository_name',
            'repository_language',
            'created_at',
            'type'])).result()


Out[4]:
repository_name repository_language created_at type
mongo-php-driver C 2012-04-02 16:21:58 ForkEvent
php-src C 2012-04-01 07:06:57 ForkEvent
zerorpc-python Python 2012-03-27 15:54:49 ForkEvent
jquery-tokeninput JavaScript 2012-03-22 13:53:17 ForkEvent
pdf.js JavaScript 2012-03-21 04:35:38 ForkEvent

(rows: 5, time: 0.9s, cached, job: job_X9uodioFyk46brx3_xIpIbLwOKs)

Querying the Data

The first step in our analysis to correlate languages is retrieving the appropriate slice of data.

We'll need to retrieve the list of PushEvents from the timeline. This is a large list of events, and there are several ways to get a more manageable result set:

  • Limiting the analysis to the top 25 languages (the remaining long tail of languages simply adds noise).
  • Limiting the analysis to pushes made during a one-year time window; we will use 2012.
  • Further sampling to get a small, but still interesting, sample set to analyze for correlation.
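
To make the first two reductions concrete before turning to SQL, here is a rough pandas sketch on a tiny, made-up events table. The `events` frame, its rows, and `top_n = 1` are assumptions for illustration only; the column names mirror those used in the queries, and the real filtering happens in BigQuery.

```python
import pandas as pd

# Hypothetical miniature of the timeline table (made-up rows).
events = pd.DataFrame({
    'actor': ['ann', 'ann', 'bob', 'bob', 'cat', 'cat', 'cat'],
    'repository_language': ['Python', 'Python', 'Ruby', 'Go', 'Python', 'Python', ''],
    'type': ['PushEvent', 'PushEvent', 'PushEvent', 'ForkEvent',
             'PushEvent', 'PushEvent', 'PushEvent'],
    'created_at': pd.to_datetime(['2012-03-01', '2012-06-01', '2012-07-15', '2012-08-01',
                                  '2011-12-31', '2012-09-09', '2012-02-02']),
})

# Keep only pushes from 2012 that have a repository language.
in_2012 = ((events['created_at'] >= pd.Timestamp('2012-01-01'))
           & (events['created_at'] < pd.Timestamp('2013-01-01')))
pushes_2012 = events[(events['type'] == 'PushEvent')
                     & (events['repository_language'] != '')
                     & in_2012]

# Keep only the most pushed-to languages (the notebook uses 25; 1 here).
top_n = 1
popular = pushes_2012['repository_language'].value_counts().head(top_n).index

# One row per (user, language) pair with its push count.
toy_pushes = (pushes_2012[pushes_2012['repository_language'].isin(popular)]
              .groupby(['actor', 'repository_language'])
              .size()
              .reset_index(name='push_count'))
print(toy_pushes)
```

The third reduction, sampling, is left out here because it depends on how users are hashed.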

In [5]:
%%bq query --name popular_languages
SELECT repository_language AS language, COUNT(repository_language) as pushes
FROM `publicdata.samples.github_timeline`
WHERE type = 'PushEvent'
  AND repository_language != ''
  AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01")
  AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01")
GROUP BY language
ORDER BY pushes DESC
LIMIT 25

In [6]:
%%bq query --name pushes --subqueries popular_languages
SELECT timeline.actor AS user,
       timeline.repository_language AS language,
       COUNT(timeline.repository_language) AS push_count
FROM `publicdata.samples.github_timeline` AS timeline
JOIN popular_languages AS languages
  ON timeline.repository_language = languages.language
WHERE type = 'PushEvent'
  AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01") 
  AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01") 
GROUP BY user, language

In [7]:
%%bq query --name pushes_sample --subqueries popular_languages pushes
SELECT user, language, push_count
FROM pushes
WHERE MOD(ABS(FARM_FINGERPRINT(user)), 100) < 5
ORDER BY push_count DESC
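
The sampling query keeps a user when a hash of the user name, taken modulo 100, is below 5. Because the hash depends only on the name, a user's rows are either all kept or all dropped, and re-running the query selects the same users. The sketch below illustrates the idea in plain Python. It uses SHA-256 as a stand-in hash (an assumption; it is not FarmHash, so the selected users will differ from BigQuery's), and `in_sample` and the generated user names are hypothetical.

```python
import hashlib

def in_sample(user, percent=5):
    """Return True if `user` hashes into one of the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user.encode('utf-8')).digest()
    # Interpret the first 8 bytes as an unsigned integer and bucket it.
    bucket = int.from_bytes(digest[:8], 'big') % 100
    return bucket < percent

# Made-up user names, just to see roughly what fraction survives.
users = ['user%d' % i for i in range(10000)]
sampled = [u for u in users if in_sample(u)]
print('kept %d of %d users' % (len(sampled), len(users)))
```

The kept fraction should land near 5%, though not exactly, since it depends on how the names happen to hash.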

Checking the Results


In [8]:
popular_languages.execute().result()


Out[8]:
language pushes
JavaScript 455158
Java 341750
Ruby 324837
Python 261187
PHP 246018
C++ 163494
C 161677
Shell 75076
C# 60039
Objective-C 45619
VimL 41959
Perl 36578
Scala 20502
Emacs Lisp 16672
CoffeeScript 16164
Haskell 16063
Clojure 11781
Lua 11209
ActionScript 8789
Go 8087
Erlang 7944
Groovy 7277
R 6842
Matlab 6826
Puppet 5128

(rows: 25, time: 0.2s, cached, job: job_iyBPYSthdPRua2Ng2CeP3Nkrlk4)

In [9]:
query = pushes_sample.execute()

query.result()


Out[9]:
user language push_count
clayyount JavaScript 647
radar Ruby 520
bhearsum Python 470
DamonOehlman JavaScript 430
thatch45 Python 426
kraih Perl 406
zolex Java 381
mjg C 371
capensis JavaScript 322
miltontony Python 296
smichr Python 288
fluxer Shell 267
jeffreykegler Perl 260
buildserver-neo4j Java 255
athanatos C++ 248
tatsh JavaScript 247
beijingyoung JavaScript 244
andreasronge Ruby 240
0xd34df00d C++ 235
buggtb Puppet 235
freundlich C++ 225
chris-taylor Haskell 222
rlrosa Matlab 219
impleri PHP 216
ksevelyar Haskell 212

(rows: 9793, time: 0.2s, cached, job: job_vpcZ-FjEszyr8G_lEHNZuR2uqPk)

Analyzing the Data

The next step is to integrate the BigQuery SQL queries with the analysis capabilities provided by Python and pandas. The query defined earlier can easily be materialized into a pandas dataframe.


In [10]:
df = query.result().to_dataframe()

Great! We've successfully populated a pandas dataframe with our dataset. Let's dig a bit further into it using the dataframe to see if our data makes sense.


In [11]:
df[:10]


Out[11]:
user language push_count
0 clayyount JavaScript 647
1 radar Ruby 520
2 bhearsum Python 470
3 DamonOehlman JavaScript 430
4 thatch45 Python 426
5 kraih Perl 406
6 zolex Java 381
7 mjg C 371
8 capensis JavaScript 322
9 miltontony Python 296

In [12]:
summary = df['user'].describe()

print('DataFrame contains %d rows with %d unique users' % (summary['count'], summary['unique']))


DataFrame contains 9793 rows with 7249 unique users

Let's see who is the most polyglot user of the mix.


In [13]:
print('%s has contributions in %d languages' % (summary['top'], summary['freq']))

df[df['user'] == summary['top']]


narkisr has contributions in 9 languages
Out[13]:
user language push_count
472 narkisr Clojure 46
670 narkisr Ruby 36
850 narkisr JavaScript 30
4665 narkisr Groovy 4
5618 narkisr VimL 3
6423 narkisr Java 2
6997 narkisr Shell 2
8958 narkisr Haskell 1
9321 narkisr Puppet 1
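
Note that `describe()` reports the most frequent user name and how many rows it appears in. That equals the number of languages only because the query groups by user and language, so each pair appears exactly once. A more explicit way to get the same answer is sketched below on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

# Made-up rows in the same shape as the query result.
toy = pd.DataFrame({
    'user': ['ann', 'ann', 'ann', 'bob', 'cat', 'cat'],
    'language': ['Python', 'Ruby', 'Go', 'Java', 'C', 'C++'],
    'push_count': [10, 4, 1, 7, 3, 2],
})

# Count distinct languages per user, then pick the user with the most.
languages_per_user = toy.groupby('user')['language'].nunique()
most_polyglot = languages_per_user.idxmax()
print('%s has contributions in %d languages'
      % (most_polyglot, languages_per_user[most_polyglot]))

# With one row per (user, language) pair, describe() agrees.
summary = toy['user'].describe()
print(summary['top'], summary['freq'])
```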

Reshaping the Data

So far, our results have multiple rows for each user -- specifically, one per language. The next step is to pivot that data, so that we have one row per user, and one column per language. The resulting matrix will be extremely sparse. We'll just fill in 0 (no pushes) for user/language pairs that have no data.

Pandas offers a built-in pivot() method, which helps here.


In [14]:
dfp = df.pivot(index = 'user', columns = 'language', values = 'push_count').fillna(0)
dfp


Out[14]:
language ActionScript C C# C++ Clojure CoffeeScript Emacs Lisp Erlang Go Groovy ... Objective-C PHP Perl Puppet Python R Ruby Scala Shell VimL
user
0li 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0xPr0xy 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 24.0 0.0 0.0 0.0 0.0 0.0
0xd34df00d 0.0 0.0 0.0 235.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
100kV 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
123ndy 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1ntello 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21studios 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2bt 0.0 30.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30abc3f4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0 0.0 0.0
32kda 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3rddog 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4141done 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 7.0 0.0 0.0 0.0
46Bit 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 26.0 0.0 0.0 0.0
4ndr3j 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4np 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5263 0.0 0.0 0.0 45.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
640774n6 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
84zume 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8ozStudios 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
99ko-project 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AAS 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
ADITYAJAIN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AHinMaine 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
AJCStriker 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AKurilin 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AZiegler71 0.0 0.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AaronKenny 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
AaronVoelker 0.0 0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 11.0 0.0 0.0 0.0 0.0 0.0
AbelShen 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
AccaliaDeElementia 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 ... 0.0 0.0 0.0 0.0 6.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zhouqt 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 8.0 0.0 0.0 0.0 0.0 0.0
zhuolij 0.0 0.0 0.0 14.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zielmicha 0.0 13.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zigorou 0.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zillakot 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zilongchang 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zivel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
ziykon 0.0 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zmack 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 2.0
zmaril 0.0 0.0 0.0 0.0 4.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zolex 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 20.0 0.0 0.0 0.0 0.0 34.0 0.0 0.0 0.0
zolitch 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zoltankiss 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0
zoomika 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zorgoz 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zortness 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zory 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zrusilla 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zshao4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zsiciarz 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
zstone 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0
zucaritas 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zueblin 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zvolsky 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zxv 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0
zy-sunshine 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0
zybee 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zyll 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
zyv 0.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
zzamboni 0.0 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 7.0 4.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0

7249 rows × 25 columns
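
For readers new to pivoting, the same reshaping on a three-row, made-up frame may be easier to follow. The second half shows `pivot_table()`, which could be used instead if the input contained duplicate user/language pairs; plain `pivot()` raises an error in that case. The `toy` and `dupes` frames are hypothetical.

```python
import pandas as pd

# Long format: one row per (user, language) pair.
toy = pd.DataFrame({
    'user': ['ann', 'ann', 'bob'],
    'language': ['Python', 'Ruby', 'Python'],
    'push_count': [10, 4, 7],
})

# Wide format: one row per user, one column per language, 0 where there is no data.
wide = toy.pivot(index='user', columns='language', values='push_count').fillna(0)
print(wide)

# If a pair could appear more than once, aggregate while pivoting.
dupes = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
wide_summed = dupes.pivot_table(index='user', columns='language',
                                values='push_count', aggfunc='sum', fill_value=0)
print(wide_summed)
```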

Now, compute the correlation for each pair of languages (again, built into the pandas library).


In [15]:
corr = dfp.corr(method = 'spearman')
corr


Out[15]:
language ActionScript C C# C++ Clojure CoffeeScript Emacs Lisp Erlang Go Groovy ... Objective-C PHP Perl Puppet Python R Ruby Scala Shell VimL
language
ActionScript 1.000000 -0.019668 -0.008849 -0.018644 0.013340 -0.008979 -0.009148 -0.005480 -0.004690 -0.006332 ... 0.002602 -0.021518 -0.000706 -0.005197 -0.021662 -0.004899 -0.031604 -0.009477 -0.021437 -0.011152
C -0.019668 1.000000 -0.041722 0.091977 -0.018635 -0.022726 -0.008592 0.019320 0.001470 -0.022530 ... -0.027976 -0.078497 0.031357 -0.010882 -0.018370 -0.017433 -0.092800 -0.006119 0.043795 0.004249
C# -0.008849 -0.041722 1.000000 -0.022566 -0.016604 -0.014728 -0.022120 -0.002714 0.000944 -0.006062 ... -0.026598 -0.055277 -0.029467 -0.012568 -0.059672 -0.011846 -0.060994 -0.022916 -0.025267 -0.036163
C++ -0.018644 0.091977 -0.022566 1.000000 -0.017171 -0.010940 -0.007680 0.013005 -0.006752 -0.021634 ... -0.033094 -0.089876 -0.012894 -0.009807 -0.023949 -0.007493 -0.109523 -0.022772 0.026046 -0.026096
Clojure 0.013340 -0.018635 -0.016604 -0.017171 1.000000 0.007950 0.072216 -0.005208 0.026872 0.017342 ... -0.014913 -0.020316 0.012820 0.023381 -0.008344 -0.004656 0.006351 -0.009006 0.009091 0.070577
CoffeeScript -0.008979 -0.022726 -0.014728 -0.010940 0.007950 1.000000 -0.011368 -0.006810 -0.005828 0.009858 ... -0.004298 -0.023941 -0.015144 -0.006459 -0.012874 -0.006088 0.030429 0.000235 0.023398 0.026067
Emacs Lisp -0.009148 -0.008592 -0.022120 -0.007680 0.072216 -0.011368 1.000000 -0.006938 -0.005938 -0.008016 ... -0.005141 -0.037006 0.021519 -0.006580 -0.000071 0.016456 0.012437 0.023586 0.039857 -0.005877
Erlang -0.005480 0.019320 -0.002714 0.013005 -0.005208 -0.006810 -0.006938 1.000000 -0.003557 -0.004802 ... -0.011901 -0.017671 -0.009242 -0.003942 -0.013889 -0.003715 0.001382 -0.007187 0.019488 0.025912
Go -0.004690 0.001470 0.000944 -0.006752 0.026872 -0.005828 -0.005938 -0.003557 1.000000 -0.004110 ... -0.010186 -0.020947 -0.007910 -0.003374 -0.001109 -0.003180 -0.018608 -0.006151 -0.013915 -0.012348
Groovy -0.006332 -0.022530 -0.006062 -0.021634 0.017342 0.009858 -0.008016 -0.004802 -0.004110 1.000000 ... -0.013752 -0.028280 -0.010679 0.025982 -0.025058 -0.004293 -0.019047 0.008412 0.004796 0.018871
Haskell -0.008454 0.005140 -0.013467 -0.007964 0.044069 0.002895 0.028593 -0.006411 -0.005487 0.011370 ... -0.010224 -0.037758 -0.004153 0.016686 -0.007016 -0.005732 -0.014206 0.014172 0.028374 0.037552
Java -0.026229 -0.076359 -0.061377 -0.080997 -0.000788 -0.030200 -0.029789 -0.019908 -0.025931 0.039547 ... -0.052355 -0.133123 -0.039648 -0.018439 -0.107891 -0.008624 -0.151245 0.028767 -0.042609 -0.062348
JavaScript -0.019008 -0.110587 -0.050463 -0.097483 -0.015414 0.082814 -0.014227 -0.012577 -0.032162 -0.003042 ... -0.050843 -0.051829 -0.039723 -0.013652 -0.050252 -0.023377 -0.053072 -0.007367 -0.021918 -0.004030
Lua -0.007084 0.035958 -0.008337 0.001122 0.013818 0.007217 0.006817 -0.005372 -0.004598 -0.006208 ... -0.015385 -0.021945 -0.000031 -0.005096 -0.005381 -0.004803 -0.010234 0.005963 0.055387 0.035812
Matlab -0.005660 -0.012189 -0.003458 0.012106 -0.005379 -0.007034 0.031925 -0.004293 -0.003674 -0.004960 ... 0.010975 -0.025280 -0.009546 -0.004071 0.021949 -0.003838 -0.021423 -0.007424 0.000724 0.004906
Objective-C 0.002602 -0.027976 -0.026598 -0.033094 -0.014913 -0.004298 -0.005141 -0.011901 -0.010186 -0.013752 ... 1.000000 -0.056056 -0.009972 -0.011288 -0.048862 -0.010640 -0.044589 -0.020582 -0.033890 -0.011872
PHP -0.021518 -0.078497 -0.055277 -0.089876 -0.020316 -0.023941 -0.037006 -0.017671 -0.020947 -0.028280 ... -0.056056 1.000000 -0.016039 -0.009324 -0.102239 -0.015379 -0.123634 -0.034899 -0.043961 -0.028203
Perl -0.000706 0.031357 -0.029467 -0.012894 0.012820 -0.015144 0.021519 -0.009242 -0.007910 -0.010679 ... -0.009972 -0.016039 1.000000 0.039654 -0.010676 -0.008263 -0.032850 -0.007067 0.023131 0.028883
Puppet -0.005197 -0.010882 -0.012568 -0.009807 0.023381 -0.006459 -0.006580 -0.003942 -0.003374 0.025982 ... -0.011288 -0.009324 0.039654 1.000000 0.007248 -0.003524 0.039798 -0.006817 0.080931 0.018086
Python -0.021662 -0.018370 -0.059672 -0.023949 -0.008344 -0.012874 -0.000071 -0.013889 -0.001109 -0.025058 ... -0.048862 -0.102239 -0.010676 0.007248 1.000000 0.007830 -0.118334 -0.022362 0.043741 0.019852
R -0.004899 -0.017433 -0.011846 -0.007493 -0.004656 -0.006088 0.016456 -0.003715 -0.003180 -0.004293 ... -0.010640 -0.015379 -0.008263 -0.003524 0.007830 1.000000 -0.027271 -0.006426 0.015470 -0.001684
Ruby -0.031604 -0.092800 -0.060994 -0.109523 0.006351 0.030429 0.012437 0.001382 -0.018608 -0.019047 ... -0.044589 -0.123634 -0.032850 0.039798 -0.118334 -0.027271 1.000000 -0.024751 -0.002014 0.052680
Scala -0.009477 -0.006119 -0.022916 -0.022772 -0.009006 0.000235 0.023586 -0.007187 -0.006151 0.008412 ... -0.020582 -0.034899 -0.007067 -0.006817 -0.022362 -0.006426 -0.024751 1.000000 0.025294 -0.001377
Shell -0.021437 0.043795 -0.025267 0.026046 0.009091 0.023398 0.039857 0.019488 -0.013915 0.004796 ... -0.033890 -0.043961 0.023131 0.080931 0.043741 0.015470 -0.002014 0.025294 1.000000 0.090571
VimL -0.011152 0.004249 -0.036163 -0.026096 0.070577 0.026067 -0.005877 0.025912 -0.012348 0.018871 ... -0.011872 -0.028203 0.028883 0.018086 0.019852 -0.001684 0.052680 -0.001377 0.090571 1.000000

25 rows × 25 columns
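
The Spearman method correlates the ranks of the values rather than the values themselves, which makes it less sensitive to a handful of users with very large push counts. As a small check of that idea, the sketch below (on made-up push counts) compares pandas' Spearman result with a Pearson correlation computed on ranked data:

```python
import numpy as np
import pandas as pd

# Made-up push counts for six users across three languages.
toy = pd.DataFrame({
    'C':   [0, 3, 30, 0, 1, 7],
    'C++': [0, 8, 14, 45, 0, 2],
    'PHP': [11, 0, 0, 0, 6, 0],
})

spearman = toy.corr(method='spearman')

# Rank each column (ties get the average rank), then use ordinary Pearson.
pearson_of_ranks = toy.rank().corr(method='pearson')

same = np.allclose(spearman, pearson_of_ranks)
print(same)
```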

Visualizing the Results

The correlation table above contains the results, but isn't very telling. A plot will make the data speak a lot louder, highlighting both the most positively correlated and the most negatively correlated languages.


In [16]:
# Plotting helper function
def plot_correlation(data):
  min_value = 0
  max_value = 0

  for i in range(len(data.columns)):
    for j in range(len(data.columns)):
      if i != j:
        min_value = min(min_value, data.iloc[i, j])
        max_value = max(max_value, data.iloc[i, j])
  span = max(abs(min_value), abs(max_value))
  span = round(span + .05, 1)

  items = data.columns.tolist()
  ticks = np.arange(0.5, len(items) + 0.5)

  plot.figure(figsize = (11, 7))
  plot.pcolor(data.values, cmap = 'RdBu', vmin = -span, vmax = span)
  plot.colorbar().set_label('correlation')
  plot.xticks(ticks, items, rotation = 'vertical')
  plot.yticks(ticks, items)
  plot.show()
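
The nested loops in the helper find the largest absolute off-diagonal value, so that the color scale is symmetric around zero and the diagonal of 1.0 does not wash out everything else. A vectorized variant of that step might look like the following sketch; `off_diagonal_span` is a hypothetical name, and it assumes a square matrix with at least two rows.

```python
import numpy as np
import pandas as pd

def off_diagonal_span(data):
    """Largest absolute off-diagonal value, padded and rounded as in the helper above."""
    values = data.values.astype(float)
    # Boolean mask that is False on the diagonal and True elsewhere.
    mask = ~np.eye(len(values), dtype=bool)
    largest = np.abs(values[mask]).max()
    return round(largest + .05, 1)

# Made-up 3x3 correlation matrix.
toy_corr = pd.DataFrame([[1.00, 0.09, -0.12],
                         [0.09, 1.00, 0.03],
                         [-0.12, 0.03, 1.00]],
                        index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
print(off_diagonal_span(toy_corr))
```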

In [17]:
plot_correlation(corr)
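
Alongside the plot, it can help to list the strongest pairs numerically. The sketch below is one possible way to do that; `strongest_pairs` is a hypothetical helper, and the small matrix reuses three values from the correlation table above (C, C++ and Ruby).

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, n=3):
    """Return the n language pairs with the largest absolute correlation."""
    # Keep only the upper triangle so each pair is listed once, without the diagonal.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

names = ['C', 'C++', 'Ruby']
small = pd.DataFrame([[1.000000, 0.091977, -0.092800],
                      [0.091977, 1.000000, -0.109523],
                      [-0.092800, -0.109523, 1.000000]],
                     index=names, columns=names)
top = strongest_pairs(small)
print(top)
```

In the notebook itself, this could be called as `strongest_pairs(corr, n=10)`.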


Takeaways and Observations

Among the largest positive correlations shown in the table are C and C++ (about 0.09), Shell and VimL (about 0.09), and JavaScript and CoffeeScript (about 0.08).

  • No surprise ... C goes with C++, and JavaScript with CoffeeScript. These are just good sanity checks.
  • A bit surprising ... Java has some strong negative correlations, with JavaScript, PHP and Ruby ... static vs. dynamic languages?
  • Also surprising ... R seems barely correlated with Python?
  • Go has no strong negative correlation with any language, even if it is not highly correlated with any either. Go programmers do not exclusively program in Go.
  • And PHP is negatively correlated with many languages. Maybe a different developer persona altogether?