Dirk Loss / @dloss, v1.0, 2014-04-17
The OpenSSL project has a public git repository. Let's clone it and see if we can use it to answer some simple questions.
In [1]:
%time !git clone git://git.openssl.org/openssl.git
In [2]:
from IPython.display import IFrame
In [3]:
IFrame("http://en.wikipedia.org/wiki/OpenSSL#History_of_the_OpenSSL_project", 800, 400)
Out[3]:
So the official start of the OpenSSL project was on December 23, 1998. Now let's see what we have in our repository:
In [4]:
cd openssl/
In [5]:
!git log --reverse | head -40
In [6]:
!git log -1
So we have commits from two days before the official start up to today: more than 15 years of history. Good.
In [7]:
!git log --oneline | wc -l
About twelve thousand commits.
First let's see how much space the current checkout (excluding the .git repo) takes:
In [8]:
!du -hs -I\.git
For a deeper analysis, we use David Wheeler's SLOCCount:
In [9]:
!sloccount .
So we have nearly 430kSLOC -- mostly C as expected, but roughly a quarter is Perl. And we have nearly 10000 lines of assembler code.
I'll save the commit timestamps, authors and hashes as a CSV file that can be imported and analysed with the excellent pandas library:
In [10]:
!git log --format=format:"%ai,%an,%H" > ../commits
In [11]:
cd ..
In [12]:
import pandas as pd
In [13]:
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
df.sort_index(ascending=True, inplace=True)  # sort by commit time; DataFrame.sort() is gone from current pandas
df.head()
Out[13]:
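One caveat with the CSV above: an author name that contains a comma would shift the columns. Here is a minimal sketch of a more robust variant, assuming we let git separate the fields with tabs (`%x09` in the format string) instead. The two sample lines, including names and hashes, are made up for illustration and are not real OpenSSL commits.

```python
import io
import pandas as pd

# Hypothetical output of: git log --format=format:"%ai%x09%an%x09%H"
# (names and hashes are invented for this example)
sample = (
    "2014-04-16 12:00:00 +0100\tDoe, Jane\tabc123\n"
    "1998-12-21 10:52:47 +0000\tJohn Roe\tdef456\n"
)

commits = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                       # tabs, so commas in names are harmless
    header=None,
    names=["time", "author", "id"],
)

# The timestamps carry different UTC offsets; convert them all to UTC
# so that the index has a single, sortable dtype.
commits["time"] = pd.to_datetime(commits["time"], utc=True)
commits = commits.set_index("time").sort_index()
print(commits)
```

Converting to UTC loses the author's local time of day, so this only fits if we care about ordering and not about when in their day people commit.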
Pandas provides a convenience function that shows how often each value occurs in a given column:
In [14]:
commits_per_author=df.author.value_counts()
commits_per_author
Out[14]:
So we have 10 people with more than 100 commits. Not a lot. But no surprises, either: The top 11 committers are exactly the current development team mentioned on the OpenSSL homepage.
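Instead of counting rows by eye, the "more than N commits" check can be done with a boolean filter on the result of `value_counts()`. A small sketch on a made-up author column (the names and the threshold of 1 are placeholders; on the real data the threshold would be 100):

```python
import pandas as pd

# Made-up stand-in for df.author
authors = pd.Series(["alice", "bob", "alice", "carol", "alice", "bob"])

counts = authors.value_counts()      # most frequent author first
threshold = 1                        # placeholder; the text above uses 100
frequent = counts[counts > threshold]

print(frequent)
print(len(frequent), "authors with more than", threshold, "commits")
```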
Let's visualize the commit counts with Matplotlib. But first import seaborn, which gives us much prettier graphics:
In [15]:
import seaborn as sns
In [16]:
%matplotlib inline
In [17]:
commits_per_author.plot(kind="bar", figsize=(10,6))
Out[17]:
Dr. Stephen Henson clearly dominates.
To plot the cumulative number of commits over time, we add a counter column that is 1 for every commit:
In [18]:
df["c"]=1 # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time
Out[18]:
In [19]:
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)
timelines.head()
Out[19]:
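To make the `where` trick above easier to follow, here is the same idea on a toy frame with four made-up commits. Each author column holds a 1 at that author's commits and NaN everywhere else, so a cumulative sum gives a running per-author count that only has values at that author's own commits.

```python
import pandas as pd

# Toy data: four invented commits by two invented authors
idx = pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03", "2014-01-04"])
toy = pd.DataFrame({"author": ["alice", "bob", "alice", "alice"]}, index=idx)
toy["c"] = 1  # counter: one per commit

per_author = pd.DataFrame(index=toy.index)
for name in toy["author"].unique():
    # keep the 1 where the commit is by `name`, NaN elsewhere
    per_author[name] = toy["c"].where(toy["author"] == name)

# cumsum() skips NaN: the gaps stay NaN, the running total carries on
running = per_author.cumsum()
print(running)
```

The NaN gaps are why a point style such as `style="o"` suits the plot: a line plot would break at every gap.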
In [20]:
default_palette = sns.color_palette()
In [21]:
sns.set_palette("Set1")
top_authors=authors[:10]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10))
Out[21]:
In [22]:
sns.set_palette(default_palette)
Let's see how many authors were active together, e.g. during a 3-month period:
In [23]:
per_months=timelines.resample("3M").sum()  # current pandas: call .sum() on the resampler instead of how="sum"
per_months["nauthors"]=per_months.applymap(lambda x: min(x, 1)).sum(axis=1)
per_months["nauthors"].plot(kind="bar", figsize=(20,5))
Out[23]:
So there have been 3 to 13 authors per quarter year.
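A different way to get at the same number is to group the commits by period and count distinct authors. The sketch below assumes calendar quarters, which is not quite the same binning as the 3-month resampling above, and it runs on made-up toy data:

```python
import pandas as pd

# Toy data: five invented commits across two calendar quarters
idx = pd.to_datetime(
    ["2014-01-05", "2014-02-10", "2014-03-20", "2014-04-02", "2014-05-15"]
)
toy = pd.DataFrame(
    {"author": ["alice", "bob", "alice", "carol", "carol"]}, index=idx
)

# Map every commit timestamp to its calendar quarter
quarters = toy.index.to_period("Q")

# Number of distinct authors with at least one commit per quarter
active = toy.groupby(quarters)["author"].nunique()
print(active)
```

Quarters without any commit do not show up at all here, whereas the resampling approach keeps them as zero rows.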
To see how the code base has grown, for now we just count the number of files at each commit:
In [24]:
cd openssl/
In [25]:
%%time
filecounts = []
for commit in df["id"]:
    cfiles = !git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))
In [26]:
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))
Out[26]:
As we have seen before, at the beginning code was imported from SSLeay, so the graph starts with more than 1000 files.
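The `!git ...` syntax only works inside IPython. Outside the notebook, the same count could be sketched with `subprocess`. The helper names `count_paths` and `files_in_commit` are made up for this example. The counting is split off from the git call so that it can be tried on canned output without a repository.

```python
import subprocess


def count_paths(ls_tree_output: str) -> int:
    """Count the non-empty lines, i.e. one per tracked file."""
    return sum(1 for line in ls_tree_output.splitlines() if line)


def files_in_commit(commit: str, repo: str = ".") -> int:
    """Run `git ls-tree -r --name-only <commit>` in `repo` and count the paths."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", commit],
        cwd=repo,
        capture_output=True,
        text=True,
        check=True,  # raise if git exits with an error
    )
    return count_paths(result.stdout)


# Canned output with made-up paths, so no repository is needed here:
print(count_paths("Makefile\ncrypto/aes/aes.c\nssl/ssl.h\n"))
```

With a clone at hand, `files_in_commit("HEAD", repo="openssl")` would give the file count of the current checkout. Like the loop above, this starts one git process per commit, so it will be slow for twelve thousand commits.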
The idea for the following git command comes from Gary Bernhardt's git-churn script. We can simplify it though, because we have Python and pandas:
In [27]:
file_changes =! git log --all -M -C --name-only --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()
Out[27]:
In [28]:
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes
Out[28]:
In [29]:
c_changes.plot()
Out[29]:
As expected, a few files are changed very often and most files are changed infrequently.
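To put a number on "a few files are changed very often", we can ask what share of all changes goes to the most-changed file. A minimal sketch with the standard library's `Counter`, on a made-up list of changed paths (the counts are invented and say nothing about the real repository):

```python
from collections import Counter

# Invented stand-in for the list of changed file names from git log
changed = [
    "ssl/s3_lib.c",
    "ssl/s3_lib.c",
    "ssl/s3_lib.c",
    "apps/s_client.c",
    "ssl/s3_lib.c",
    "crypto/bn/bn_lib.c",
]

churn = Counter(changed)
ranked = churn.most_common()          # (path, count) pairs, highest count first

total_changes = sum(churn.values())
top_path, top_count = ranked[0]
top_share = top_count / total_changes

print(top_path, top_count, round(top_share, 2))
```

On the real data the same calculation works directly on the pandas result, since `value_counts()` is already sorted by count.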
What about header files?
In [30]:
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes
Out[30]:
To be continued... ;-)