This sample notebook demonstrates working with GitHub activity data, made publicly accessible through the GitHub Timeline dataset in the BigQuery Sample Tables.
Here is the question that this notebook tackles: "How likely are you to program in language X if you program in language Y?" For example, the answer might be an input into a repository exploration/recommendation/search tool that personalizes results based on your own contributions.
It is based on an example published at http://datahackermd.com/2013/language-use-on-github/. It counts pushes made by users across all repositories on GitHub, along with each repository's language, to determine the correlation between languages.
In [1]:
import google.datalab.bigquery as bq
import matplotlib.pyplot as plot
import numpy as np
import pandas as pd
We're going to work with data from the GitHub Archive project. It contains all GitHub events (commits, pushes, forks, watches, etc.) along with metadata about each event (e.g., user, time, place). The schema and sample data will help us understand this dataset.
In [ ]:
%%bq tables describe --name "publicdata.samples.github_timeline"
The GitHub timeline is a large dataset. A quick lookup of table metadata gives us the row count.
In [3]:
table = bq.Table('publicdata.samples.github_timeline')
table.metadata.rows
Out[3]:
With over 290 million events, it is important to be able to sample the data. The sampling argument to execute() allows us to sample tables or queries.
In [4]:
bq.Query.from_table(table).execute(
    sampling=bq.Sampling.default(fields=['repository_name',
                                         'repository_language',
                                         'created_at',
                                         'type'])).result()
Out[4]:
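Sampling.default takes the first few rows of just the listed fields. Other strategies are available too; for instance, hashed sampling deterministically keeps a percentage of rows based on a field's hash. A minimal sketch, assuming the bq.Sampling.hashed(field_name, percent) signature from the google.datalab API (verify against your installed version):
In [ ]:
# Deterministically sample ~1% of rows by hashing repository_name.
# (bq.Sampling.hashed and its percent parameter are assumed from the
# google.datalab API; check your installed version.)
bq.Query.from_table(table).execute(
    sampling=bq.Sampling.hashed('repository_name', percent=1)).result()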
The first step in our language-correlation analysis is retrieving the appropriate slice of data.
We'll need to retrieve the list of PushEvents from the timeline. This is a large list of events, and the queries below use several ways to get a more manageable resultset: keeping only the 25 most popular languages, restricting to pushes made during 2012, and sampling a small fraction of users.
In [5]:
%%bq query --name popular_languages
SELECT repository_language AS language, COUNT(repository_language) AS pushes
FROM `publicdata.samples.github_timeline`
WHERE type = 'PushEvent'
AND repository_language != ''
AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01")
AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01")
GROUP BY language
ORDER BY pushes DESC
LIMIT 25
In [6]:
%%bq query --name pushes --subqueries popular_languages
SELECT timeline.actor AS user,
timeline.repository_language AS language,
COUNT(timeline.repository_language) AS push_count
FROM `publicdata.samples.github_timeline` AS timeline
JOIN popular_languages AS languages
ON timeline.repository_language = languages.language
WHERE type = 'PushEvent'
AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01")
AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01")
GROUP BY user, language
In [7]:
%%bq query --name pushes_sample --subqueries popular_languages pushes
SELECT user, language, push_count
FROM pushes
-- Deterministically keep ~5% of users by hashing the user name
WHERE MOD(ABS(FARM_FINGERPRINT(user)), 100) < 5
ORDER BY push_count DESC
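The --subqueries flag stitches the named queries into the main query as WITH clauses. For reference, pushes_sample is roughly equivalent to the following standalone standard SQL (a sketch of the composed form, not necessarily the exact text Datalab generates):
In [ ]:
%%bq query
WITH popular_languages AS (
  SELECT repository_language AS language, COUNT(repository_language) AS pushes
  FROM `publicdata.samples.github_timeline`
  WHERE type = 'PushEvent' AND repository_language != ''
    AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01")
    AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01")
  GROUP BY language ORDER BY pushes DESC LIMIT 25
),
pushes AS (
  SELECT timeline.actor AS user, timeline.repository_language AS language,
    COUNT(timeline.repository_language) AS push_count
  FROM `publicdata.samples.github_timeline` AS timeline
  JOIN popular_languages AS languages
    ON timeline.repository_language = languages.language
  WHERE type = 'PushEvent'
    AND CAST(created_at AS TIMESTAMP) >= TIMESTAMP("2012-01-01")
    AND CAST(created_at AS TIMESTAMP) < TIMESTAMP("2013-01-01")
  GROUP BY user, language
)
SELECT user, language, push_count
FROM pushes
WHERE MOD(ABS(FARM_FINGERPRINT(user)), 100) < 5
ORDER BY push_count DESC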
In [8]:
popular_languages.execute().result()
Out[8]:
In [9]:
query = pushes_sample.execute()
query.result()
Out[9]:
The next step is to integrate the BigQuery SQL queries with the analysis capabilities provided by Python and pandas. The query defined earlier can easily be materialized into a pandas dataframe.
In [10]:
df = query.result().to_dataframe()
Great! We've successfully populated a pandas dataframe with our dataset. Let's dig into the data a bit further using the dataframe to see if it makes sense.
In [11]:
df[:10]
Out[11]:
In [12]:
summary = df['user'].describe()
print('DataFrame contains %d rows with %d unique users' % (summary['count'], summary['unique']))
Let's see who is the most polyglot user in the mix.
In [13]:
print('%s has contributions in %d languages' % (summary['top'], summary['freq']))
df[df['user'] == summary['top']]
Out[13]:
So far, our results have multiple rows for each user -- specifically, one per language. The next step is to pivot that data, so that we have one row per user, and one column per language. The resulting matrix will be extremely sparse. We'll just fill in 0 (no pushes) for user/language pairs that have no data.
Pandas offers a built-in pivot() method, which helps here.
In [14]:
dfp = df.pivot(index = 'user', columns = 'language', values = 'push_count').fillna(0)
dfp
Out[14]:
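As a quick sanity check, the pivot should yield one row per unique user and one column per language, and nearly all cells should be zero (the exact numbers depend on which users landed in our sample):
In [ ]:
# One row per sampled user, one column per language; most cells are 0 (no pushes).
print(dfp.shape)
print('Fraction of zero cells: %.2f' % (dfp == 0).values.mean())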
Now, compute the correlation for each pair of languages (again, built into the pandas library).
In [15]:
corr = dfp.corr(method = 'spearman')
corr
Out[15]:
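As an aside, Spearman correlation is simply Pearson correlation computed on rank-transformed data, which makes it robust to the heavily skewed push counts. A minimal check of that equivalence, assuming pandas' default average ranking matches the one used by corr(method='spearman'):
In [ ]:
# Spearman == Pearson on ranks; this should print True.
ranked = dfp.rank()
print(np.allclose(corr.values, ranked.corr(method = 'pearson').values))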
The correlation table, above, contains the results, but isn't very telling. A plot will make the data speak a lot louder, and highlight the highly correlated languages, as well as the highly uncorrelated languages.
In [16]:
# Plotting helper function
def plot_correlation(data):
    # Compute a color scale that is symmetric around zero, based on the
    # extreme off-diagonal values (the diagonal is always 1.0, so skip it).
    min_value = 0
    max_value = 0
    for i in range(len(data.columns)):
        for j in range(len(data.columns)):
            if i != j:
                min_value = min(min_value, data.iloc[i, j])
                max_value = max(max_value, data.iloc[i, j])
    span = max(abs(min_value), abs(max_value))
    span = round(span + .05, 1)
    items = data.columns.tolist()
    ticks = np.arange(0.5, len(items) + 0.5)
    plot.figure(figsize = (11, 7))
    # Diverging colormap centered at zero: red for negative correlations,
    # blue for positive ones.
    plot.pcolor(data.values, cmap = 'RdBu', vmin = -span, vmax = span)
    plot.colorbar().set_label('correlation')
    plot.xticks(ticks, items, rotation = 'vertical')
    plot.yticks(ticks, items)
    plot.show()
In [17]:
plot_correlation(corr)
We can see that JavaScript and CSS pushes have a strong positive correlation, as do C and C++ (a good sanity check).