Dirk Loss / @dloss, v1.0, 2014-07-11
The LibreSSL project has a git mirror of their CVS repository. Let's clone it and see if we can use it to answer some simple questions.
In [1]:
%time !git clone https://github.com/libressl-portable/openbsd.git
In [2]:
cd openbsd/
In [3]:
!git log --reverse | head -10
In [4]:
!git log -1
So we have commits from 1995 to today.
In [5]:
!git log --oneline | wc -l
First let's see how much space the current checkout (excluding the .git repo) takes:
In [6]:
!du -hs -I\.git
In [7]:
!cloc .
I'll save the commit timestamps, authors and hashes as a CSV file, which can be imported and analysed using the excellent pandas library:
In [8]:
!git log --format=format:"%ai,%an,%H" > ../commits
In [9]:
cd ..
In [10]:
import pandas as pd
In [11]:
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
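# Sort chronologically by the timestamp index (df.sort() was removed from newer pandas versions)
df.sort_index(ascending=True, inplace=True)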
df.head()
Out[11]:
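One caveat with the comma-separated format written above is that an author name may itself contain a comma. As a minimal sketch, assuming each line has exactly the "%ai,%an,%H" layout, the fields can be split from both ends using only the standard library. The function name `parse_commit_line` and the sample line are made up for illustration.

```python
from datetime import datetime


def parse_commit_line(line):
    """Split one '%ai,%an,%H' line into (timestamp, author, commit id)."""
    # The timestamp comes first and the hash last; neither contains a comma,
    # so whatever remains in the middle is the author name.
    time_str, rest = line.split(",", 1)
    author, commit_id = rest.rsplit(",", 1)
    timestamp = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S %z")
    return timestamp, author, commit_id


# Made-up sample line; the author name and the hash are placeholders.
sample = "2014-04-13 20:41:07 +0200,Doe, Jane,0123abcd"
timestamp, author, commit_id = parse_commit_line(sample)
print(timestamp.isoformat(), "|", author, "|", commit_id)
```

For the repository at hand, the plain `pd.read_csv` call is probably sufficient; the sketch only matters if a name with a comma shows up.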
We are only interested in the commits since the OpenSSL valhalla rampage started. That was in April 2014:
In [12]:
df = df["2014-04-01":]
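The slice above relies on pandas accepting date strings as slice bounds on a sorted datetime index. A small sketch with a made-up three-row frame (the names `toy` and `recent` are hypothetical) shows the behaviour, using the more explicit `.loc` spelling:

```python
import pandas as pd

# Made-up timestamps around the cut-off date.
index = pd.to_datetime(
    ["2014-03-31 23:59", "2014-04-01 00:00", "2014-04-13 12:00"]
)
toy = pd.DataFrame({"author": ["alice", "bob", "carol"]}, index=index)

# Keep everything from 2014-04-01 (inclusive) onwards.
recent = toy.loc["2014-04-01":]
print(recent)
```

The row from 2014-03-31 is dropped, while the one at midnight on 2014-04-01 is kept.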
Pandas provides a convenience function that shows how often each value occurs in a given column:
In [13]:
commits_per_author=df.author.value_counts()
commits_per_author
Out[13]:
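To illustrate what `value_counts` returns, here is a tiny sketch on a made-up series of author names (not taken from the repository). The result is itself a Series, indexed by the distinct values and sorted by descending count:

```python
import pandas as pd

# Toy stand-in for df.author; the names are invented for illustration.
toy_authors = pd.Series(["alice", "bob", "alice", "carol", "alice", "bob"])

counts = toy_authors.value_counts()
print(counts)
```

Because the most frequent value comes first, `counts.index` doubles as a ranking of the authors.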
Let's visualize the commit counts with Matplotlib. But first we import seaborn, which gives us much prettier graphics:
In [14]:
import seaborn as sns
In [15]:
%matplotlib inline
In [16]:
commits_per_author.plot(kind="bar", figsize=(10,6))
Out[16]:
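To plot the total number of commits over time, we introduce a counter column that is 1 for every commit and take its cumulative sum: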
In [17]:
df["c"]=1 # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time
Out[17]:
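The counter trick can be sketched on a made-up index of four commit timestamps: a column of ones, summed cumulatively, gives the running number of commits at each point in time. The names `toy` and `cumulative` are hypothetical.

```python
import pandas as pd

# Four made-up commit timestamps, already in chronological order.
times = pd.to_datetime(
    ["2014-04-13 09:00", "2014-04-14 10:00", "2014-04-14 11:30", "2014-04-16 08:15"]
)
toy = pd.DataFrame({"c": 1}, index=times)  # one row per commit, counter = 1

cumulative = toy["c"].cumsum()
print(cumulative)
```

The last value of `cumulative` equals the total number of commits, which is why the data must be sorted by time first.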
In [18]:
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)
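The loop above builds one column per author that holds the counter where the author matches and NaN everywhere else. A small sketch on made-up data (hypothetical names `toy` and `toy_timelines`) shows why this works well with `cumsum`, which skips the NaN entries:

```python
import pandas as pd

index = pd.to_datetime(
    ["2014-04-13 09:00", "2014-04-13 10:00", "2014-04-14 11:00"]
)
toy = pd.DataFrame({"author": ["alice", "bob", "alice"], "c": [1, 1, 1]}, index=index)

toy_timelines = pd.DataFrame(index=toy.index)
for name in toy["author"].unique():
    # Keep the counter for this author's commits, NaN for everyone else's.
    toy_timelines[name] = toy["c"].where(toy["author"] == name)

print(toy_timelines)
print(toy_timelines.cumsum())
```

In the cumulative sum, rows belonging to other authors stay NaN, so each author's plot only gets a marker where that author actually committed.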
In [19]:
default_palette = sns.color_palette()
In [20]:
top = 10
sns.set_palette("Set1", top)
top_authors=authors[:top]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10), title="Commit activity of the Top%s authors to LibreSSL" % top)
Out[20]:
In [21]:
sns.set_palette(default_palette)
Let's see how many authors were active together during a given period, here one day at a time:
In [22]:
# Sum the per-author counters per day (the how= argument was removed from newer pandas versions)
per_months=timelines.resample("1D").sum()
per_months["nauthors"]=per_months.applymap(lambda x: min(x, 1)).sum(axis=1)
per_months["nauthors"].plot(kind="bar", figsize=(20,5))
Out[22]:
Seems like the valhalla rampage started on 2014-04-13.
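The per-day author count can be sketched on a made-up timeline with two authors and three commits. This is a simplified variant, not the exact cell above: it counts the columns with at least one commit per day by comparing against zero, and all names are hypothetical.

```python
import pandas as pd

nan = float("nan")
index = pd.to_datetime(
    ["2014-04-13 10:00", "2014-04-13 12:00", "2014-04-14 09:00"]
)
# 1.0 marks a commit by that author, NaN marks someone else's commit.
toy_timelines = pd.DataFrame(
    {"alice": [1.0, nan, 1.0], "bob": [nan, 1.0, nan]}, index=index
)

daily = toy_timelines.resample("1D").sum()   # commits per author per day
active_authors = (daily > 0).sum(axis=1)     # authors with >= 1 commit that day
print(active_authors)
```

Changing `"1D"` to another frequency string would give the same count for longer periods.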
In [23]:
cd openbsd/
For now we just count the number of files:
In [24]:
%%time
filecounts = []
for commit in df["id"]:
    cfiles = !git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))
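The loop above uses IPython's `!` syntax, which only works inside a notebook or IPython shell. As a rough sketch of the same idea in plain Python, assuming `git` is on the PATH and the working directory is the cloned repository, one could call `git ls-tree` through `subprocess`. The helper names `count_lines` and `count_files` are made up for this example.

```python
import subprocess


def count_lines(output):
    """Number of lines in a command's output, i.e. one per listed file."""
    return len(output.splitlines())


def count_files(commit, repo="."):
    """Count the files tracked in `commit` (requires git and a repository)."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", commit],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return count_lines(result.stdout)


# The line counting can be checked without a repository (made-up paths):
print(count_lines("README\nsrc/a.c\nsrc/a.h\n"))
```

Inside the repository, `count_files(commit)` could then replace the two lines in the loop body.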
In [25]:
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))
Out[25]:
The idea for the following git command comes from Gary Bernhardt's gitchurn. We can simplify it though, because we have Python and pandas:
In [26]:
file_changes =! git log --all -M -C --name-only --since "2014-04-01" --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()
Out[26]:
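To make the churn count concrete, here is a small sketch on a made-up list of changed paths, standing in for the output of the git command above. Each path appears once per commit that touched it, so counting occurrences gives the number of changes per file. The paths and variable names are hypothetical.

```python
import pandas as pd

# Made-up paths; in the notebook these come from `git log --name-only`.
changed = pd.Series([
    "lib/foo.c", "include/foo.h", "lib/foo.c",
    "lib/bar.c", "lib/foo.c", "include/foo.h",
])

churn = changed.value_counts()
print(churn)

# Restrict the count to C source files via a boolean mask.
c_churn = changed[changed.str.endswith(".c")].value_counts()
print(c_churn)
```

Boolean indexing drops the non-matching rows outright, whereas `where` turns them into NaN, which `value_counts` ignores by default; the resulting counts are the same.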
In [27]:
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes
Out[27]:
In [28]:
c_changes.plot()
Out[28]:
As expected, a few files are changed very often and most files are changed infrequently.
What about header files?
In [29]:
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes
Out[29]:
To be continued... ;-)