LibreSSL Repository Mining

Dirk Loss / @dloss, v1.0, 2014-07-11

The LibreSSL project has a git mirror of their CVS repository. Let's clone it and see if we can use it to answer some simple questions.


In [1]:
%time !git clone https://github.com/libressl-portable/openbsd.git


Cloning into 'openbsd'...
remote: Counting objects: 46494, done.
remote: Compressing objects: 100% (8988/8988), done.
remote: Total 46494 (delta 33293), reused 46490 (delta 33289)
Receiving objects: 100% (46494/46494), 21.28 MiB | 759.00 KiB/s, done.
Resolving deltas: 100% (33293/33293), done.
Checking connectivity... done.
CPU times: user 259 ms, sys: 80 ms, total: 339 ms
Wall time: 34.3 s

In [2]:
cd openbsd/


/Users/dirk/projekte/repo-libressl/openbsd

In [3]:
!git log --reverse | head -10


commit dcac718930cb87c958bb05ea34b0ef6284f5e10b
Author: deraadt <>
Date:   Wed Oct 18 08:42:23 1995 +0000

    initial import of NetBSD tree

commit bf056200690ad2990feba19909f02f57687666fb
Author: deraadt <>
Date:   Wed Oct 18 08:49:34 1995 +0000


In [4]:
!git log -1


commit ea6dfc7cc887f405dc17af99e705e3082ea177e9
Author: beck <>
Date:   Fri Jul 11 17:18:11 2014 +0000

    formatting
    ok bcook@

So we have commits from 1995 to today.


In [5]:
!git log --oneline | wc -l


    3002

How large is the code base?

First let's see how much space the current checkout (excluding the .git repo) takes:


In [6]:
!du -hs -I\.git


 19M	.
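
For readers without a `du` that supports `-I`, the same measurement can be approximated in Python. This is only a sketch: `tree_size` and its `skip_dirs` parameter are names made up here, and it sums apparent file sizes, whereas `du` reports allocated disk blocks, so the numbers will differ somewhat.

```python
import os


def tree_size(root, skip_dirs=(".git",)):
    """Sum the apparent sizes (in bytes) of all regular files below root.

    Directories whose name is in skip_dirs are not descended into.
    """
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not enter the skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling `tree_size(".")` from inside the checkout would give a byte count that can be compared with the `du` figure above.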

For a deeper analysis, we use CLOC. Compared with SLOCCount it has nicer output, and is well maintained.


In [7]:
!cloc .


    1933 text files.
    1865 unique files.                              
     595 files ignored.

http://cloc.sourceforge.net v 1.60  T=15.03 s (88.2 files/s, 28688.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              939          28728          63620         200170
Perl                           116           7275           6966          60669
C/C++ Header                   138           4781          11844          26481
Assembly                        13           1223           1350           9027
Bourne Shell                    25            481            406           2901
make                            90            376            217           2331
m4                               1            514              0           1585
C++                              3             24             48             55
-------------------------------------------------------------------------------
SUM:                          1325          43402          84451         303219
-------------------------------------------------------------------------------

Who contributed?

I'll save the commit authors, timestamps and hashes as a CSV file, which can be imported and analysed using the excellent pandas library:


In [8]:
!git log --format=format:"%ai,%an,%H" > ../commits

In [9]:
cd ..


/Users/dirk/projekte/repo-libressl

In [10]:
import pandas as pd

In [11]:
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
df.sort_index(ascending=True, inplace=True)   # oldest commit first
df.head()


Out[11]:
                      author                                        id
time
1995-10-18 08:42:23  deraadt  dcac718930cb87c958bb05ea34b0ef6284f5e10b
1995-10-18 08:49:34  deraadt  bf056200690ad2990feba19909f02f57687666fb
1995-11-01 16:43:27  deraadt  cb274af2d57ba262232e24a6b348a75e4ddf39c3
1995-12-14 02:16:48  deraadt  8eb715a636f0c8677814890f93cd9fbd1dc2d0cb
1995-12-15 01:46:48  deraadt  0e76731307f064af3c9f5201066a17f720f5935d
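
The loading step can be tried without the repository. The sketch below feeds three lines in the `%ai,%an,%H` layout (taken from the log above, deliberately out of order) through pandas. Parsing the timestamps explicitly with `utc=True` is a choice made for this example, not something the cell above does, and `sample` / `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Three lines in the "%ai,%an,%H" layout, deliberately not in time order.
sample = """\
1995-11-01 16:43:27 +0000,deraadt,cb274af2d57ba262232e24a6b348a75e4ddf39c3
1995-10-18 08:42:23 +0000,deraadt,dcac718930cb87c958bb05ea34b0ef6284f5e10b
1995-10-18 08:49:34 +0000,deraadt,bf056200690ad2990feba19909f02f57687666fb
"""

sample_df = pd.read_csv(io.StringIO(sample), header=None,
                        names=["time", "author", "id"])
# Parse the timestamps (including the "+0000" offset) into UTC datetimes.
sample_df["time"] = pd.to_datetime(sample_df["time"], utc=True)
# Use the timestamp as index and put the oldest commit first.
sample_df = sample_df.set_index("time").sort_index()
```

Note that this simple layout assumes author names contain no commas, which holds for the short CVS user names in this repository.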

We are only interested in the commits since the OpenSSL valhalla rampage started. That was in April 2014:


In [12]:
df = df["2014-04-01":]

Pandas provides a convenience function that shows how often each value occurs in a given column:


In [13]:
commits_per_author=df.author.value_counts()
commits_per_author


Out[13]:
jsing       384
miod        265
tedu        201
deraadt     141
beck         80
guenther     33
bcook        21
matthew      18
jsg          17
reyk         16
logan        14
sthen        11
jim           8
otto          8
mpi           6
lteo          6
jmc           5
afresh1       4
giovanni      4
mcbride       4
tobiasu       2
schwarze      2
chl           2
jca           2
djm           2
kettenis      2
espie         1
millert       1
avsm          1
naddy         1
halex         1
dtype: int64
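
As a small illustration of what `value_counts` does, here is a toy Series (the six entries are made up, only the user names are borrowed from the output above). The result is sorted by count, and `normalize=True` yields shares instead of absolute numbers.

```python
import pandas as pd

# Made-up toy data: one entry per commit, holding the author name.
toy_authors = pd.Series(["jsing", "miod", "jsing", "tedu", "jsing", "miod"])

counts = toy_authors.value_counts()                # absolute counts, largest first
share = toy_authors.value_counts(normalize=True)   # fraction of all commits
```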

Let's visualize the commit counts with Matplotlib. But first import seaborn, which gives us much prettier graphics:


In [14]:
import seaborn as sns

In [15]:
%matplotlib inline

In [16]:
commits_per_author.plot(kind="bar", figsize=(10,6))


Out[16]:
<matplotlib.axes.AxesSubplot at 0x10962ec10>

Has development speed increased or slowed down over time?

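To count commits, we introduce a counter column that is 1 for every commit; its cumulative sum is the number of commits made so far: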


In [17]:
df["c"]=1   # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time


Out[17]:
<matplotlib.axes.AxesSubplot at 0x10979ff10>
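
The cumulative curve shows speed only as its slope. One possible alternative, sketched here on four made-up timestamps, is to sum the counter per week so that the rate can be read off directly. The weekly bucket size is an arbitrary choice for this example.

```python
import pandas as pd

# Made-up commit times: three in one week, one in the following week.
toy_times = pd.to_datetime(["2014-04-14 10:00", "2014-04-14 12:00",
                            "2014-04-15 09:00", "2014-04-22 09:00"])
toy = pd.DataFrame({"c": 1}, index=toy_times)   # counter column, as above

running_total = toy["c"].cumsum()          # what the plot above shows
per_week = toy["c"].resample("W").sum()    # commits per week (weeks end on Sunday)
```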

Has the number of authors increased or decreased over time?


In [18]:
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)

In [19]:
default_palette = sns.color_palette()

In [20]:
top = 10
sns.set_palette("Set1", top)
top_authors=authors[:top]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10), title="Commit activity of the Top%s authors to LibreSSL" % top)


Out[20]:
<matplotlib.axes.AxesSubplot at 0x1098972d0>
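
The `where` trick used for the timelines may be easier to follow on a tiny example. In this sketch (toy data, three made-up commits), `where` keeps the counter only in rows of the chosen author and puts NaN everywhere else; `cumsum` then skips those NaN rows, which is why the plot is drawn with markers rather than connected lines.

```python
import pandas as pd

# Made-up toy data: three commits by two authors.
toy = pd.DataFrame(
    {"author": ["jsing", "miod", "jsing"], "c": [1, 1, 1]},
    index=pd.to_datetime(["2014-04-14", "2014-04-15", "2014-04-16"]),
)

# Counter where the author matches, NaN in all other rows.
timeline = toy["c"].where(toy["author"] == "jsing")
# Running total for that author; rows of other authors stay NaN.
running = timeline.cumsum()
```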

In [21]:
sns.set_palette(default_palette)

Let's see how many authors were active together during a given period. Here we use a period of one day:


In [22]:
per_day=timelines.resample("1D").sum()   # commits per author per day
per_day["nauthors"]=(per_day > 0).sum(axis=1)   # authors with at least one commit that day
per_day["nauthors"].plot(kind="bar", figsize=(20,5))


Out[22]:
<matplotlib.axes.AxesSubplot at 0x10a248350>
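
The number of active authors per day can also be computed straight from the author column, without building one timeline per author. The following is a sketch of that alternative on made-up data; grouping with `pd.Grouper` and `nunique` is swapped in here and is not what the cell above does.

```python
import pandas as pd

# Made-up toy data: three commits on one day, none on the next, one on the third.
toy = pd.DataFrame(
    {"author": ["jsing", "miod", "jsing", "tedu"]},
    index=pd.to_datetime(["2014-04-13 09:00", "2014-04-13 10:00",
                          "2014-04-13 11:00", "2014-04-15 09:00"]),
)

# Distinct authors per calendar day; days without commits count as 0.
active_per_day = toy.groupby(pd.Grouper(freq="1D"))["author"].nunique()
```

Changing `freq` (for example to `"W"`) would give the same statistic for longer periods.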

Seems like the valhalla rampage started on 2014-04-13.

How much has the code base increased over time?


In [23]:
cd openbsd/


/Users/dirk/projekte/repo-libressl/openbsd

For now we just count the number of files:


In [24]:
%%time 
filecounts = []
for commit in df["id"]:
    cfiles =! git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))


CPU times: user 2.01 s, sys: 4.16 s, total: 6.17 s
Wall time: 23.6 s
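
Outside IPython, the `!` shell syntax is not available. A rough equivalent with `subprocess` might look like the sketch below; `count_paths` and `count_files` are hypothetical helper names, and the code assumes `git` is on the PATH.

```python
import subprocess


def count_paths(ls_tree_output):
    """Count the non-empty lines, i.e. the paths, in `git ls-tree` output."""
    return sum(1 for line in ls_tree_output.splitlines() if line)


def count_files(commit, repo="."):
    """Number of files in the tree of the given commit (hypothetical helper)."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", commit],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return count_paths(result.stdout)
```

With these helpers, the loop above would become `filecounts = [count_files(c) for c in df["id"]]`.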

In [25]:
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))


Out[25]:
<matplotlib.axes.AxesSubplot at 0x10b1e4510>

Which files have been changed most often?

The idea for the following git command comes from Gary Bernhardt's gitchurn. We can simplify it though, because we have Python and pandas:


In [26]:
file_changes =! git log --all -M -C --name-only --since "2014-04-01" --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()


Out[26]:
src/lib/libssl/src/ssl/s3_clnt.c          52
src/lib/libssl/src/ssl/ssl_lib.c          51
src/lib/libssl/src/ssl/t1_enc.c           50
src/lib/libssl/src/ssl/s3_lib.c           50
src/lib/libssl/src/apps/apps.c            46
src/lib/libssl/src/ssl/s3_srvr.c          46
src/lib/libssl/src/apps/s_server.c        44
src/lib/libcrypto/crypto/Makefile         44
src/lib/libssl/src/ssl/ssl_ciph.c         43
src/lib/libssl/src/ssl/ssl.h              43
src/lib/libssl/src/apps/s_client.c        43
src/lib/libssl/src/ssl/ssl_locl.h         40
src/lib/libssl/src/crypto/bio/b_sock.c    40
src/lib/libssl/src/ssl/t1_lib.c           38
src/lib/libssl/src/apps/ca.c              38
...
src/lib/libssl/src/crypto/des/des_enc.c       1
src/lib/libssl/src/demos/x509/mkcert.c        1
src/lib/libssl/src/times/sparc2               1
src/lib/libssl/src/crypto/threads/README      1
src/lib/libssl/src/crypto/ripemd/rmd_one.c    1
src/lib/libssl/src/demos/tunala/ip.c          1
src/lib/libssl/src/crypto/bn/asm/mips3.s      1
src/lib/libssl/src/times/486-66.dos           1
src/lib/libssl/src/ms/bcb4.bat                1
src/lib/libssl/src/demos/prime/Makefile       1
src/lib/libssl/src/bugs/MS                    1
src/lib/libssl/src/apps/tsget                 1
src/lib/libssl/src/crypto/des/rpw.c           1
src/lib/libssl/src/crypto/des/DES.xs          1
src/lib/libc/string/bzero.c                   1
Length: 2216, dtype: int64
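
The counting itself does not strictly need pandas. As a minimal sketch, the same idea with `collections.Counter` on a short sample in the shape that `git log --name-only --format='format:'` produces (paths separated by blank lines; the sample is made up from file names seen above):

```python
from collections import Counter

# Made-up sample: one block of changed paths per commit, blank lines in between.
sample_log = """\
src/lib/libssl/src/ssl/s3_clnt.c
src/lib/libssl/src/ssl/ssl_lib.c

src/lib/libssl/src/ssl/s3_clnt.c

src/lib/libssl/src/ssl/ssl.h
"""

# Dropping empty lines plays the role of `grep -v '^$'`.
changed = [line for line in sample_log.splitlines() if line.strip()]
churn = Counter(changed)
```

`churn.most_common(10)` would then list the ten most frequently changed files.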

In [27]:
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes


Out[27]:
src/lib/libssl/src/ssl/s3_clnt.c          52
src/lib/libssl/src/ssl/ssl_lib.c          51
src/lib/libssl/src/ssl/s3_lib.c           50
src/lib/libssl/src/ssl/t1_enc.c           50
src/lib/libssl/src/ssl/s3_srvr.c          46
src/lib/libssl/src/apps/apps.c            46
src/lib/libssl/src/apps/s_server.c        44
src/lib/libssl/src/apps/s_client.c        43
src/lib/libssl/src/ssl/ssl_ciph.c         43
src/lib/libssl/src/crypto/bio/b_sock.c    40
src/lib/libssl/src/ssl/t1_lib.c           38
src/lib/libssl/src/apps/ca.c              38
src/lib/libssl/src/ssl/s3_enc.c           35
src/lib/libssl/src/ssl/s3_pkt.c           30
src/lib/libssl/src/apps/req.c             30
...
src/lib/libssl/src/demos/smime/smenc.c                   1
src/lib/libssl/src/demos/asn1/ocsp.c                     1
src/lib/libssl/src/crypto/bn/exp.c                       1
src/lib/libc/string/strnlen.c                            1
src/lib/libssl/src/apps/winrand.c                        1
src/lib/libssl/src/demos/cms/cms_ver.c                   1
src/lib/libssl/src/crypto/dh/p192.c                      1
src/lib/libssl/src/demos/engines/ibmca/hw_ibmca_err.c    1
src/lib/libssl/src/crypto/ecdsa/ecs_asn1.c               1
src/regress/lib/libcrypto/hmac/hmactest.c                1
src/lib/libssl/src/crypto/idea/i_ofb64.c                 1
src/lib/libssl/src/engines/ccgost/gost_ctl.c             1
src/lib/libssl/src/demos/sign/sign.c                     1
src/lib/libssl/src/engines/e_atalla.c                    1
src/lib/libssl/src/demos/tunala/sm.c                     1
Length: 982, dtype: int64
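
A note on the filter: `where` replaces non-matching entries with NaN, and `value_counts` ignores NaN by default, so the result is the same as with plain boolean indexing. The sketch below checks this on four made-up paths.

```python
import pandas as pd

# Made-up toy data: changed paths, one entry per change.
toy_paths = pd.Series(["ssl/s3_clnt.c", "ssl/ssl.h",
                       "ssl/s3_clnt.c", "crypto/Makefile"])
is_c = toy_paths.str.endswith(".c")

via_where = toy_paths.where(is_c).value_counts()   # non-.c entries become NaN
via_mask = toy_paths[is_c].value_counts()          # non-.c entries are dropped
```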

In [28]:
c_changes.plot()


Out[28]:
<matplotlib.axes.AxesSubplot at 0x1072e5850>

As expected, a few files are changed very often and most files are changed infrequently.
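
One way to put a number on this skew is the cumulative share of changes covered by the most-changed files. In the sketch below only the first three counts are taken from the output above; the tail is made up for illustration, so the resulting share says nothing about the real repository.

```python
import pandas as pd

# First three values: top counts from the output above. The rest is made up.
toy_counts = pd.Series([52, 51, 50, 5, 3, 2, 1, 1, 1, 1])

# Fraction of all changes covered by the n most-changed files.
ranked = toy_counts.sort_values(ascending=False)
cumulative_share = ranked.cumsum() / ranked.sum()
top3_share = cumulative_share.iloc[2]
```

On the real data, `c_changes` could take the place of `toy_counts`.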

What about header files?


In [29]:
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes


Out[29]:
src/lib/libssl/src/ssl/ssl.h                 43
src/lib/libssl/src/ssl/ssl_locl.h            40
src/lib/libssl/src/apps/apps.h               21
src/lib/libssl/src/crypto/crypto.h           19
src/lib/libssl/src/crypto/engine/engine.h    17
src/lib/libssl/src/crypto/evp/evp.h          16
src/lib/libssl/src/ssl/ssl3.h                15
src/lib/libssl/src/crypto/asn1/asn1.h        14
src/lib/libssl/src/crypto/cryptlib.h         14
src/lib/libssl/src/crypto/bio/bio.h          11
src/lib/libssl/src/ssl/dtls1.h               11
src/lib/libssl/src/ssl/tls1.h                11
src/lib/libssl/src/crypto/bn/bn_lcl.h        10
src/lib/libssl/src/e_os.h                    10
src/lib/libssl/src/crypto/bn/bn.h            10
...
src/lib/libssl/src/engines/vendor_defns/aep.h                   1
src/lib/libssl/src/engines/e_aep_err.h                          1
src/lib/libssl/src/engines/e_sureware_err.h                     1
src/lib/libssl/src/demos/engines/zencod/hw_zencod.h             1
src/lib/libssl/src/demos/engines/zencod/hw_zencod_err.h         1
src/lib/libssl/src/crypto/ebcdic.h                              1
src/lib/libssl/src/demos/engines/cluster_labs/cluster_labs.h    1
src/lib/libssl/src/crypto/dsa/dsa_locl.h                        1
src/lib/libssl/src/engines/vendor_defns/atalla.h                1
src/lib/libssl/src/MacOS/_MWERKS_GUSI_prefix.h                  1
src/lib/libssl/src/crypto/modes/modes.h                         1
src/lib/libssl/src/crypto/rand/rand_lcl.h                       1
src/lib/libssl/src/engines/vendor_defns/cswift.h                1
src/lib/libssl/src/engines/vendor_defns/hw_ubsec.h              1
src/lib/libssl/src/engines/ccgost/gost2001_keyx.h               1
Length: 203, dtype: int64

To be continued... ;-)

