OpenSSL Repository Mining

Dirk Loss / @dloss, v1.0, 2014-04-17

The OpenSSL project has a public git repository. Let's clone it and see if we can use it to answer some simple questions.


In [1]:
%time !git clone git://git.openssl.org/openssl.git


Cloning into 'openssl'...
remote: Counting objects: 138452, done.
remote: Compressing objects: 100% (29938/29938), done.
remote: Total 138452 (delta 110293), reused 135887 (delta 108259)
Receiving objects: 100% (138452/138452), 37.76 MiB | 481.00 KiB/s, done.
Resolving deltas: 100% (110293/110293), done.
Checking connectivity... done.
CPU times: user 513 ms, sys: 137 ms, total: 650 ms
Wall time: 1min 6s

Which part of OpenSSL's history is publicly available?


In [2]:
from IPython.display import IFrame

In [3]:
IFrame("http://en.wikipedia.org/wiki/OpenSSL#History_of_the_OpenSSL_project", 800, 400)


Out[3]:

So the official start of the OpenSSL project was on December 23, 1998. Now let's see what we have in our repository:


In [4]:
cd openssl/


/Users/dirk/projekte/openssl-git/openssl

In [5]:
!git log --reverse | head -40


commit 90718ac5274e07cd7b1933f068e9546d12e621f5
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:52:45 1998 +0000

    This commit was generated by cvs2svn to track changes on a CVS vendor
    branch.

commit ec96f926b98721d6b84c7023fde0ecc5fe98e644
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:52:45 1998 +0000

    Import of old SSLeay release: SSLeay 0.8.1b

commit b7896b3cb86d80206af14a14d69b0717786f2729
Merge: 90718ac d02b48c
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:52:47 1998 +0000

    This commit was generated by cvs2svn to track changes on a CVS vendor
    branch.

commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:52:47 1998 +0000

    Import of old SSLeay release: SSLeay 0.8.1b

commit eda1f21f1af8b6f77327e7b37573af9c1ba73726
Merge: b7896b3 c7e9169
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:56:30 1998 +0000

    This commit was generated by cvs2svn to track changes on a CVS vendor
    branch.

commit c7e91699977f0dcf5025c00670d9dde0c2296641
Author: Ralf S. Engelschall <rse@openssl.org>
Date:   Mon Dec 21 10:56:30 1998 +0000

    Import of old SSLeay release: SSLeay 0.9.0b

In [6]:
!git log -1


commit 300b9f0b704048f60776881f1d378c74d9c32fbd
Author: Dr. Stephen Henson <steve@openssl.org>
Date:   Tue Apr 15 18:48:54 2014 +0100

    Extension checking fixes.
    
    When looking for an extension we need to set the last found
    position to -1 to properly search all extensions.
    
    PR#3309.

So we have commits from two days earlier than the official start up to today. More than 15 years of history. Good.
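As a quick sanity check, the two figures can be recomputed from the timestamps printed above. This is only a sketch: the official start is given as a date without a time, so midnight UTC is assumed here.

```python
from datetime import datetime, timezone

# Timestamps copied from the `git log` output above, normalized to UTC.
first_commit = datetime(1998, 12, 21, 10, 52, 45, tzinfo=timezone.utc)
last_commit = datetime(2014, 4, 15, 17, 48, 54, tzinfo=timezone.utc)  # 18:48:54 +0100

# Assumption: only the date is known, so midnight UTC is used.
official_start = datetime(1998, 12, 23, tzinfo=timezone.utc)

lead_days = (official_start.date() - first_commit.date()).days
history_years = (last_commit - first_commit).days / 365.25

print(f"First commit precedes the official start by {lead_days} days")
print(f"History covered: {history_years:.1f} years")
```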


In [7]:
!git log --oneline | wc -l


   11856

About twelve thousand commits.

How large is the code base?

First let's see how much space the current checkout (excluding the .git repo) takes:


In [8]:
!du -hs -I\.git


 28M	.

For a deeper analysis, we use David Wheeler's SLOCCount:


In [9]:
!sloccount .


Have a non-directory at the top, so creating directory top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./ACKNOWLEDGMENTS to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./CHANGES to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./CHANGES.SSLeay to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./Configure to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./FAQ to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./GitConfigure to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./GitMake to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.DJGPP to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.MacOS to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.NW to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.OS2 to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.VMS to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.W32 to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.W64 to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./INSTALL.WCE to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./LICENSE to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.fips to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.org to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./Makefile.shared to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./NEWS to top_dir
Creating filelist for Netware
Adding /Users/dirk/projekte/openssl-git/openssl/./PROBLEMS to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./README to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./README.ASN1 to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./README.ECC to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./README.ENGINE to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./README.FIPS to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./TABLE to top_dir
Creating filelist for VMS
Creating filelist for apps
Creating filelist for bugs
Creating filelist for certs
Adding /Users/dirk/projekte/openssl-git/openssl/./config to top_dir
Creating filelist for crypto
Creating filelist for demos
Creating filelist for doc
Adding /Users/dirk/projekte/openssl-git/openssl/./e_os.h to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./e_os2.h to top_dir
Creating filelist for engines
Creating filelist for fips
Creating filelist for include
Adding /Users/dirk/projekte/openssl-git/openssl/./install.com to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./makevms.com to top_dir
Creating filelist for ms
Adding /Users/dirk/projekte/openssl-git/openssl/./openssl.doxy to top_dir
Adding /Users/dirk/projekte/openssl-git/openssl/./openssl.spec to top_dir
Creating filelist for os2
Creating filelist for perl
Creating filelist for shlib
Creating filelist for ssl
Creating filelist for test
Creating filelist for times
Creating filelist for tools
Creating filelist for util
Categorizing files.
Finding a working MD5 command....
Found a working MD5 command.
Computing results.
pod without closing cut in file /Users/dirk/projekte/openssl-git/openssl/crypto/sha/asm/sha256-c64xplus.pl


SLOC	Directory	SLOC-by-Language (Sorted)
283856  crypto          ansic=184902,perl=88876,asm=9463,cpp=605,sh=10
46606   ssl             ansic=46606
36042   apps            ansic=35535,perl=355,sh=152
20548   fips            ansic=18413,perl=2017,sh=118
17520   engines         ansic=16476,perl=1044
10418   demos           ansic=9638,sh=550,cpp=218,perl=12
7769    util            perl=7207,sh=562
3554    test            perl=1562,sh=1304,ansic=688
1471    top_dir         sh=764,ansic=707
543     ms              ansic=320,perl=223
446     Netware         perl=446
260     shlib           sh=260
241     times           cpp=225,perl=16
177     tools           perl=146,sh=31
166     bugs            ansic=166
31      VMS             perl=31
27      os2             perl=27
24      doc             lisp=24
0       certs           (none)
0       include         (none)
0       perl            (none)


Totals grouped by language (dominant language first):
ansic:       313451 (72.95%)
perl:        101962 (23.73%)
asm:           9463 (2.20%)
sh:            3751 (0.87%)
cpp:           1048 (0.24%)
lisp:            24 (0.01%)




Total Physical Source Lines of Code (SLOC)                = 429,699
Development Effort Estimate, Person-Years (Person-Months) = 116.37 (1,396.48)
 (Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))
Schedule Estimate, Years (Months)                         = 3.26 (39.18)
 (Basic COCOMO model, Months = 2.5 * (person-months**0.38))
Estimated Average Number of Developers (Effort/Schedule)  = 35.64
Total Estimated Cost to Develop                           = $ 15,720,421
 (average salary = $56,286/year, overhead = 2.40).
SLOCCount, Copyright (C) 2001-2004 David A. Wheeler
SLOCCount is Open Source Software/Free Software, licensed under the GNU GPL.
SLOCCount comes with ABSOLUTELY NO WARRANTY, and you are welcome to
redistribute it under certain conditions as specified by the GNU GPL license;
see the documentation for details.
Please credit this data as "generated using David A. Wheeler's 'SLOCCount'."

So we have nearly 430kSLOC -- mostly C as expected, but roughly a quarter is Perl. And we have nearly 10000 lines of assembler code.
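The effort and cost figures follow directly from the Basic COCOMO formulas that SLOCCount prints in its report. The sketch below just redoes that arithmetic with the numbers shown above; the variable names are mine.

```python
# Inputs copied from the SLOCCount report above.
sloc = 429_699
salary_per_year = 56_286
overhead = 2.40

ksloc = sloc / 1000

# Basic COCOMO, as quoted in the report:
#   Person-Months = 2.4 * (KSLOC ** 1.05)
#   Months        = 2.5 * (person-months ** 0.38)
person_months = 2.4 * ksloc ** 1.05
schedule_months = 2.5 * person_months ** 0.38
avg_developers = person_months / schedule_months
cost = (person_months / 12) * salary_per_year * overhead

print(f"Effort:     {person_months:,.2f} person-months")
print(f"Schedule:   {schedule_months:.2f} months")
print(f"Developers: {avg_developers:.2f}")
print(f"Cost:       ${cost:,.0f}")
```

The results should agree with the report up to rounding. Keep in mind that this is a model estimate derived only from line counts, not a measurement of the actual effort.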

Who contributed to OpenSSL over the years?

I'll save the timestamp, author and hash of each commit as a CSV file, which can be imported and analysed using the excellent pandas library:


In [10]:
!git log --format=format:"%ai,%an,%H" > ../commits

In [11]:
cd ..


/Users/dirk/projekte/openssl-git

In [12]:
import pandas as pd

In [13]:
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
df.sort_index(ascending=True, inplace=True)   # sort chronologically by the "time" index
df.head()


Out[13]:
time                 author               id
1998-12-21 10:52:45  Ralf S. Engelschall  90718ac5274e07cd7b1933f068e9546d12e621f5
1998-12-21 10:52:45  Ralf S. Engelschall  ec96f926b98721d6b84c7023fde0ecc5fe98e644
1998-12-21 10:52:47  Ralf S. Engelschall  b7896b3cb86d80206af14a14d69b0717786f2729
1998-12-21 10:52:47  Ralf S. Engelschall  d02b48c63a58ea4367a0e905979f140b7d090f86
1998-12-21 10:56:30  Ralf S. Engelschall  eda1f21f1af8b6f77327e7b37573af9c1ba73726

5 rows × 2 columns
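One caveat of the comma-separated export: an author name that contains a comma would shift the columns. A possible variation, not what this notebook does, is to let git emit tabs via its `%x09` format placeholder (for example `--format=format:"%ai%x09%an%x09%H"`) and parse with a tab delimiter. The sketch below uses only the standard library; the second sample row is made up to show the problem case.

```python
import csv
import io

# First row: real commit from the log above. Second row: hypothetical,
# invented only to show an author name that contains a comma.
sample = (
    "1998-12-21 10:52:45 +0000\tRalf S. Engelschall\t"
    "90718ac5274e07cd7b1933f068e9546d12e621f5\n"
    "2014-01-01 00:00:00 +0000\tDoe, Jane (hypothetical)\t"
    "0000000000000000000000000000000000000000\n"
)

# QUOTE_NONE: treat quote characters in names as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [dict(zip(("time", "author", "id"), fields)) for fields in reader]

for row in rows:
    print(row["author"], row["id"][:7])
```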

Pandas provides a convenience function that shows how often each value occurs in a given column:


In [14]:
commits_per_author=df.author.value_counts()
commits_per_author


Out[14]:
Dr. Stephen Henson            3558
Richard Levitte               2331
Andy Polyakov                 1800
Bodo Möller                   1699
Ulf Möller                     661
Ben Laurie                     590
Geoff Thorpe                   408
Lutz Jänicke                   300
Nils Larsch                    197
Ralf S. Engelschall            189
Mark J. Cox                     18
Paul C. Sutton                  11
Adam Langley                    11
Daniel Kahn Gillmor             10
Rob Stradling                   10
Scott Deboy                      7
stephen                          5
Trevor Perrin                    5
Carlos Alberto Lopez Perez       4
Bodo Moeller                     4
Trevor                           3
Piotr Sikora                     3
Michael Tuexen                   3
Lubomir Rintel                   2
Robin Seggelmann                 2
Kurt Roeckx                      2
Matt Caswell                     2
Jeff Trawick                     2
Scott Schaefer                   2
Kaspar Brand                     2
Nick Mathewson                   2
Krzysztof Kwiatkowski            1
Emilia Kasper                    1
Lutz Jaenicke                    1
Steve Marquess                   1
Eric Young                       1
Ard Biesheuvel                   1
David Woodhouse                  1
Veres Lajos                      1
Klaus-Peter Junghanns            1
Mat                              1
Tim Hudson                       1
Jeff Walton                      1
Nick Alcock                      1
dtype: int64

So we have 10 people with more than 100 commits. Not a lot. But no surprises, either: The top 11 committers are exactly the current development team mentioned on the OpenSSL homepage.
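The list above contains a few names that look like spelling variants of the same person (for example "Bodo Möller" and "Bodo Moeller"). A minimal sketch of how such variants could be merged before counting follows. The alias map is my assumption based on the usual German transliteration; it is not something the repository states, and other short names in the list are deliberately left alone.

```python
from collections import Counter

# Counts copied from the value_counts() output above (subset only).
commits_per_author = Counter({
    "Bodo Möller": 1699,
    "Bodo Moeller": 4,
    "Lutz Jänicke": 300,
    "Lutz Jaenicke": 1,
    "Ben Laurie": 590,
})

# Assumption: these pairs are the same people written with and without umlauts.
ALIASES = {
    "Bodo Moeller": "Bodo Möller",
    "Lutz Jaenicke": "Lutz Jänicke",
}

merged = Counter()
for name, count in commits_per_author.items():
    merged[ALIASES.get(name, name)] += count

for name, count in merged.most_common():
    print(f"{name:<15} {count}")
```

Merging these variants changes the totals only slightly and does not affect the ranking of the top committers.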

Let's visualize the commit counts with Matplotlib. But first import seaborn, which gives us much prettier graphics:


In [15]:
import seaborn as sns

In [16]:
%matplotlib inline

In [17]:
commits_per_author.plot(kind="bar", figsize=(10,6))


Out[17]:
<matplotlib.axes.AxesSubplot at 0x109a5e7d0>

Dr. Stephen Henson clearly dominates.

Has development speed increased or slowed down over time?

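Add a counter column that is 1 for every commit; its cumulative sum gives the total number of commits up to each point in time: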


In [18]:
df["c"]=1   # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time


Out[18]:
<matplotlib.axes.AxesSubplot at 0x10999ded0>

Has the number of authors increased or decreased over time?


In [19]:
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)
timelines.head()


Out[19]:
Dr. Stephen Henson Richard Levitte Andy Polyakov Bodo Möller Ulf Möller Ben Laurie Geoff Thorpe Lutz Jänicke Nils Larsch Ralf S. Engelschall Mark J. Cox Paul C. Sutton Adam Langley Daniel Kahn Gillmor Rob Stradling Scott Deboy stephen Trevor Perrin Carlos Alberto Lopez Perez Bodo Moeller
time
1998-12-21 10:52:45 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
1998-12-21 10:52:45 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
1998-12-21 10:52:47 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
1998-12-21 10:52:47 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
1998-12-21 10:56:30 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...

5 rows × 44 columns


In [20]:
default_palette = sns.color_palette()

In [21]:
sns.set_palette("Set1")
top_authors=authors[:10]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10))


Out[21]:
<matplotlib.axes.AxesSubplot at 0x109f9dd10>

In [22]:
sns.set_palette(default_palette)

Let's see how many authors were active together, e.g. during a 3 month period:


In [23]:
per_months=timelines.resample("3M").sum()   # the how= argument was removed in later pandas releases
per_months["nauthors"]=per_months.applymap(lambda x: min(x, 1)).sum(axis=1)
per_months["nauthors"].plot(kind="bar", figsize=(20,5))


Out[23]:
<matplotlib.axes.AxesSubplot at 0x10a16d550>

So there have been 3 to 13 authors per quarter year.
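An alternative way to get the same kind of figure, sketched here with made-up data, is to group the commits by calendar quarter and count distinct authors. Note the assumption: calendar quarters are not identical to the "3M" bins used above, so the numbers for the real repository could differ slightly.

```python
import pandas as pd

# Hypothetical mini data set: one row per commit, indexed by commit time.
commits = pd.DataFrame(
    {"author": ["A", "B", "A", "C", "A"]},
    index=pd.to_datetime(
        ["2014-01-05", "2014-02-10", "2014-03-01", "2014-04-02", "2014-06-30"]
    ),
)

# Map each timestamp to its calendar quarter, then count distinct authors.
quarters = commits.index.to_period("Q")
authors_per_quarter = commits.groupby(quarters)["author"].nunique()

print(authors_per_quarter)
```

This avoids building one column per author, which keeps the intermediate frame small when there are many contributors.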

How much has the code base increased over time?

For now we just count the number of files:


In [24]:
cd openssl/


/Users/dirk/projekte/openssl-git/openssl

In [25]:
%%time 
filecounts = []
for commit in df["id"]:
    cfiles =! git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))


CPU times: user 16.7 s, sys: 36.6 s, total: 53.3 s
Wall time: 4min 2s
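Outside of IPython, the same loop could be written with `subprocess`. This is a sketch rather than the code used for the timing above; the helper names `count_paths` and `count_files` are mine, and `git -C` assumes a reasonably recent git.

```python
import subprocess


def count_paths(ls_tree_output: str) -> int:
    """Count the non-empty lines in `git ls-tree -r --name-only` output."""
    return sum(1 for line in ls_tree_output.splitlines() if line)


def count_files(commit: str, repo: str = ".") -> int:
    """Return the number of files tracked at `commit` in the given repository."""
    result = subprocess.run(
        ["git", "-C", repo, "ls-tree", "-r", "--name-only", commit],
        check=True,           # raise if git reports an error
        capture_output=True,
        text=True,
    )
    return count_paths(result.stdout)


# Usage, assuming `df` from above and a clone in ./openssl:
# filecounts = [count_files(commit, "openssl") for commit in df["id"]]
```

Most of the four minutes is probably spent starting one git process per commit, so sampling every n-th commit would be a cheap way to speed this up if a coarser curve is acceptable.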

In [26]:
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))


Out[26]:
<matplotlib.axes.AxesSubplot at 0x109d19f10>

As we have seen before, at the beginning code was imported from SSLeay, so the graph starts with more than 1000 files.

Which files have been changed most often?

The idea for the following git command comes from Gary Bernhardt's git-churn script. We can simplify it though, because we have Python and pandas:


In [27]:
file_changes =! git log --all -M -C --name-only --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()


Out[27]:
CHANGES            2993
Configure          1440
Makefile.org        713
ssl/ssl.h           665
TABLE               628
util/libeay.num     567
ssl/s3_srvr.c       539
STATUS              513
ssl/ssl_lib.c       484
apps/s_server.c     439
ssl/s3_clnt.c       435
FAQ                 407
ssl/t1_lib.c        387
config              384
ssl/s3_lib.c        376
...
fips/sha/fips_sha.h                             1
VMS/compaq/cpq-axpvms-ssl-t0100--1.pcsi$text    1
fips/testvectors/des3/sample/TCBCinvperm.sam    1
fips/testvectors/des3/sample/TOFBvarkey.sam     1
fips/testvectors/des2/req/TCFB1permop.req       1
fips/testvectors/dsa/req/SigVer.req             1
cpq-axpvms-ssl-t0100--1.pcsi$desc               1
fips/testvectors/des3/req/TCFB8Monte2.req       1
doc/crypto/d2i_ASN1_OBJECT.pod                  1
fips/testvectors/des2/sample/TCFB64MMT2.sam     1
fips/testvectors/des2/req/TECBinvperm.req       1
demos/vms_examples/ssl$simple_serv.c            1
fips-1.0/fipsalgtest.pl                         1
fips/testvectors/des3/sample/TCBCvarkey.sam     1
fips/testvectors/des3/req/TCBCMonte2.req        1
Length: 4236, dtype: int64
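The counting step itself does not need pandas. As a sketch, the same tally can be produced with `collections.Counter` from the raw `git log --name-only` output; the function name `churn` and the sample text are made up for illustration.

```python
from collections import Counter


def churn(log_output: str, suffix: str = "") -> Counter:
    """Count how often each path appears in `git log --name-only` output.

    Blank lines (the separators between commits) are skipped. If `suffix`
    is given, only paths ending with it are counted.
    """
    paths = (line.strip() for line in log_output.splitlines())
    return Counter(p for p in paths if p and p.endswith(suffix))


# Hypothetical output for three commits, separated by blank lines.
sample = "CHANGES\nssl/ssl.h\n\nCHANGES\nssl/s3_srvr.c\n\nssl/ssl.h\nCHANGES\n"

print(churn(sample).most_common(2))
print(churn(sample, ".c"))
```

The `suffix` argument plays the same role as the `endswith` filters in the cells that follow.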

In [28]:
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes


Out[28]:
ssl/s3_srvr.c             539
ssl/ssl_lib.c             484
apps/s_server.c           439
ssl/s3_clnt.c             435
ssl/t1_lib.c              387
ssl/s3_lib.c              376
apps/s_client.c           375
apps/apps.c               321
apps/ca.c                 296
apps/speed.c              286
ssl/ssltest.c             269
crypto/x509/x509_vfy.c    248
ssl/ssl_err.c             248
ssl/s3_pkt.c              237
ssl/ssl_ciph.c            236
...
fips/ecdsa/fips_ecdsa_lib.c                         1
demos/vms_examples/ssl$serv_sess_reuse.c            1
demos/err/main.c                                    1
crypto/evp/evp_aead.c                               1
fips/sha/fips_sha1dgst.c                            1
demos/vms_examples/ssl$serv_verify_client.c         1
demos/vms_examples/ssl$serv_sess_reuse_cli_ver.c    1
crypto/poly1305/poly1305_arm.c                      1
engines/ccgost/md_gost.c                            1
demos/vms_examples/ssl$cli_sess_renego.c            1
crypto/poly1305/poly1305.c                          1
engines/ccgost/pmeth.c                              1
demos/vms_examples/ssl$simple_cli.c                 1
crypto/ts/ts_resp_sign.c                            1
crypto/poly1305/poly1305test.c                      1
Length: 1288, dtype: int64

In [29]:
c_changes.plot()


Out[29]:
<matplotlib.axes.AxesSubplot at 0x10b498390>

As expected, a few files are changed very often and most files are changed infrequently.

What about header files?


In [30]:
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes


Out[30]:
ssl/ssl.h                   665
crypto/evp/evp.h            366
ssl/ssl_locl.h              343
crypto/opensslv.h           329
crypto/asn1/asn1.h          280
crypto/x509/x509.h          260
crypto/objects/obj_dat.h    234
crypto/bn/bn.h              225
e_os.h                      214
crypto/objects/obj_mac.h    194
crypto/rsa/rsa.h            190
crypto/crypto.h             190
apps/apps.h                 189
crypto/x509v3/x509v3.h      182
ssl/ssl3.h                  176
...
engines/ccgost/keywrap.h                  1
apps/term_sock.h                          1
fips/sha/fips_md32_common.h               1
crypto/poly1305/poly1305.h                1
engines/vendor_defns/cswift.h             1
fips/sha/fips_sha_locl.h                  1
engines/ccgost/paramset.h                 1
fips/sha/fips_sha.h                       1
crypto/o_dir.h                            1
crypto/engine/vendor_defns/keyclient.h    1
demos/err/test_err.h                      1
crypto/chacha/chacha.h                    1
engines/vendor_defns/hw_4758_cca.h        1
engines/vendor_defns/hw_ubsec.h           1
engines/vendor_defns/atalla.h             1
Length: 270, dtype: int64

To be continued... ;-)