LibreSSL Repository Mining

Dirk Loss / @dloss, v1.0, 2014-07-11

The LibreSSL project has a git mirror of their CVS repository. Let's clone it and see if we can use it to answer some simple questions.



In [1]:

    
%time !git clone https://github.com/libressl-portable/openbsd.git









    



Cloning into 'openbsd'...
remote: Counting objects: 46494, done.
remote: Compressing objects: 100% (8988/8988), done.
remote: Total 46494 (delta 33293), reused 46490 (delta 33289)
Receiving objects: 100% (46494/46494), 21.28 MiB | 759.00 KiB/s, done.
Resolving deltas: 100% (33293/33293), done.
Checking connectivity... done.
CPU times: user 259 ms, sys: 80 ms, total: 339 ms
Wall time: 34.3 s



In [2]:

    
cd openbsd/









    



/Users/dirk/projekte/repo-libressl/openbsd



In [3]:

    
!git log --reverse | head -10









    



commit dcac718930cb87c958bb05ea34b0ef6284f5e10b
Author: deraadt <>
Date:   Wed Oct 18 08:42:23 1995 +0000

    initial import of NetBSD tree

commit bf056200690ad2990feba19909f02f57687666fb
Author: deraadt <>
Date:   Wed Oct 18 08:49:34 1995 +0000



In [4]:

    
!git log -1









    



commit ea6dfc7cc887f405dc17af99e705e3082ea177e9
Author: beck <>
Date:   Fri Jul 11 17:18:11 2014 +0000

    formatting
    ok bcook@

So we have commits from 1995 to today.



In [5]:

    
!git log --oneline | wc -l

How large is the code base?

First let's see how much space the current checkout (excluding the .git repo) takes:



In [6]:

    
!du -hs -I\.git

For a deeper analysis we use CLOC, which has nicer output than SLOCCount and is well maintained.



In [7]:

    
!cloc .









    



    1933 text files.
    1865 unique files.                              
     595 files ignored.

http://cloc.sourceforge.net v 1.60  T=15.03 s (88.2 files/s, 28688.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
C                              939          28728          63620         200170
Perl                           116           7275           6966          60669
C/C++ Header                   138           4781          11844          26481
Assembly                        13           1223           1350           9027
Bourne Shell                    25            481            406           2901
make                            90            376            217           2331
m4                               1            514              0           1585
C++                              3             24             48             55
-------------------------------------------------------------------------------
SUM:                          1325          43402          84451         303219
-------------------------------------------------------------------------------

Who contributed?

I'll save the commit timestamps, authors and commit ids as a CSV file, which can be imported and analysed using the excellent pandas library:



In [8]:

    
!git log --format=format:"%ai,%an,%H" > ../commits
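The command above writes one comma-separated record per commit. Here is a minimal sketch of parsing such a record by hand, assuming the "%ai,%an,%H" layout used above. The helper name parse_commit_line is hypothetical. Splitting once from each end keeps an author name intact even if it contains a comma, which does not seem to be the case for the account names in this repository.

```python
def parse_commit_line(line):
    """Split one '%ai,%an,%H' record into (timestamp, author, commit_id)."""
    # The timestamp never contains a comma, so split it off from the left.
    timestamp, rest = line.rstrip("\n").split(",", 1)
    # The hash never contains a comma either, so split it off from the right.
    author, commit_id = rest.rsplit(",", 1)
    return timestamp, author, commit_id


# Sample record built from the last commit shown in the log output above.
record = parse_commit_line(
    "2014-07-11 17:18:11 +0000,beck,ea6dfc7cc887f405dc17af99e705e3082ea177e9"
)
print(record)
```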



In [9]:

    
cd ..









    



/Users/dirk/projekte/repo-libressl



In [10]:

    
import pandas as pd



In [11]:

    
df=pd.read_csv("commits", header=None, names=["time", "author", "id"], index_col="time", parse_dates=True)
df.sort_index(ascending=True, inplace=True)   # sort by timestamp; df.sort() no longer exists in recent pandas versions
df.head()









    Out[11]:






  
    
      
                      author                                        id
time
1995-10-18 08:42:23  deraadt  dcac718930cb87c958bb05ea34b0ef6284f5e10b
1995-10-18 08:49:34  deraadt  bf056200690ad2990feba19909f02f57687666fb
1995-11-01 16:43:27  deraadt  cb274af2d57ba262232e24a6b348a75e4ddf39c3
1995-12-14 02:16:48  deraadt  8eb715a636f0c8677814890f93cd9fbd1dc2d0cb
1995-12-15 01:46:48  deraadt  0e76731307f064af3c9f5201066a17f720f5935d
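To try the loading step without the repository, here is a self-contained sketch on two sample rows taken from the git log output above. It assumes a recent pandas version. Converting the timestamps with utc=True is an assumption of this sketch, not something the notebook does.

```python
import io

import pandas as pd

# Two records in the "%ai,%an,%H" layout, deliberately in newest-first order
# (the order in which git log prints them).
sample = io.StringIO(
    "2014-07-11 17:18:11 +0000,beck,ea6dfc7cc887f405dc17af99e705e3082ea177e9\n"
    "1995-10-18 08:42:23 +0000,deraadt,dcac718930cb87c958bb05ea34b0ef6284f5e10b\n"
)

commits = pd.read_csv(sample, header=None, names=["time", "author", "id"])
# Parse the timestamps explicitly and normalise them to UTC.
commits["time"] = pd.to_datetime(commits["time"], utc=True)
# Index by time and sort so that the oldest commit comes first.
commits = commits.set_index("time").sort_index()
print(commits)
```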

We are only interested in the commits since the OpenSSL valhalla rampage started. That was in April 2014:



In [12]:

    
df = df["2014-04-01":]
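The slice above relies on pandas accepting a date string as a bound on a sorted DatetimeIndex. Here is a small sketch with made-up timestamps and placeholder authors, assuming a recent pandas version where .loc is the explicit way to write a label slice:

```python
import pandas as pd

idx = pd.to_datetime(
    ["2014-03-31 23:59:59", "2014-04-01 00:00:00", "2014-04-13 12:00:00"]
)
demo = pd.DataFrame({"author": ["a", "b", "c"]}, index=idx)

# Keep every row whose timestamp is on or after 2014-04-01.
recent = demo.loc["2014-04-01":]
print(recent)
```

The index has to be sorted for this kind of slice to behave predictably.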

Pandas provides a convenience function that shows how often each value occurs in a given column:



In [13]:

    
commits_per_author=df.author.value_counts()
commits_per_author









    Out[13]:





jsing       384
miod        265
tedu        201
deraadt     141
beck         80
guenther     33
bcook        21
matthew      18
jsg          17
reyk         16
logan        14
sthen        11
jim           8
otto          8
mpi           6
lteo          6
jmc           5
afresh1       4
giovanni      4
mcbride       4
tobiasu       2
schwarze      2
chl           2
jca           2
djm           2
kettenis      2
espie         1
millert       1
avsm          1
naddy         1
halex         1
dtype: int64
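To see what value_counts does, here is a toy example. The list of names is made up, only the function call matches the cell above:

```python
import pandas as pd

sample_authors = pd.Series(["jsing", "miod", "jsing", "tedu", "jsing", "miod"])

# Count the occurrences of each distinct value, most frequent first.
sample_counts = sample_authors.value_counts()
print(sample_counts)
```

Because the result is sorted by count, its index doubles as a ranking of the authors.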

Let's visualize the commit counts with Matplotlib. But first import seaborn, which gives us much prettier graphics:



In [14]:

    
import seaborn as sns



In [15]:

    
%matplotlib inline



In [16]:

    
commits_per_author.plot(kind="bar", figsize=(10,6))









    Out[16]:





<matplotlib.axes.AxesSubplot at 0x10962ec10>

Has development speed increased or slowed down over time?

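To find out, we introduce a counter column that is 1 for every commit. Its cumulative sum is the total number of commits up to each point in time: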



In [17]:

    
df["c"]=1   # counter
commits_over_time=df.c.cumsum().plot()
commits_over_time









    Out[17]:





<matplotlib.axes.AxesSubplot at 0x10979ff10>
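The counter trick in isolation looks like this, as a sketch with made-up commit dates:

```python
import pandas as pd

idx = pd.to_datetime(["2014-04-13", "2014-04-13", "2014-04-14", "2014-04-16"])
demo = pd.DataFrame({"c": 1}, index=idx)   # one row per commit, counter set to 1

# The cumulative sum of the ones is the running number of commits.
running_total = demo["c"].cumsum()
print(running_total.tolist())  # [1, 2, 3, 4]
```

A steep section of the resulting curve means many commits in a short time.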

Has the number of authors increased or decreased over time?



In [18]:

    
authors = commits_per_author.index
timelines=pd.DataFrame(index=df.index)
for author in authors:
    timelines[author]=df.c.where(df.author==author)
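The loop above relies on where() keeping the counter for one author and replacing it with NaN everywhere else. Here is a small sketch of that behaviour with made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({"author": ["jsing", "miod", "jsing"], "c": [1, 1, 1]})

# Keep the counter where the condition holds, NaN elsewhere.
jsing_only = demo["c"].where(demo["author"] == "jsing")
print(jsing_only.tolist())

# cumsum() skips the NaN rows, so each author gets their own running total.
jsing_total = jsing_only.cumsum()
print(jsing_total.tolist())
```

The NaN rows are simply not drawn when such a column is plotted.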



In [19]:

    
default_palette = sns.color_palette()



In [20]:

    
top = 10
sns.set_palette("Set1", top)
top_authors=authors[:top]
timelines[top_authors].cumsum().plot(style="o",figsize=(20,10), title="Commit activity of the top %s authors to LibreSSL" % top)









    Out[20]:





<matplotlib.axes.AxesSubplot at 0x1098972d0>



In [21]:

    
sns.set_palette(default_palette)

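Let's see how many authors were active together within a given period. The code below counts per day ("1D"); a coarser resampling rule would give a longer window, e.g. a 3 month period: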



In [22]:

    
per_day=timelines.resample("1D").sum()   # commits per author and day; older pandas: resample("1D", how="sum")
per_day["nauthors"]=per_day.applymap(lambda x: min(x, 1)).sum(axis=1)   # count each active author once per day
per_day["nauthors"].plot(kind="bar", figsize=(20,5))









    Out[22]:





<matplotlib.axes.AxesSubplot at 0x10a248350>
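The same number can be computed without the per-author timelines. This is a sketch of an alternative, not the method used above. It assumes a recent pandas version, and the commit times are made up:

```python
import pandas as pd

idx = pd.to_datetime(
    ["2014-04-13 10:00", "2014-04-13 11:00", "2014-04-13 12:00", "2014-04-15 09:00"]
)
demo = pd.DataFrame({"author": ["jsing", "miod", "jsing", "tedu"]}, index=idx)

# Group the commits into daily bins and count the distinct authors in each bin.
authors_per_day = demo["author"].resample("1D").nunique()
print(authors_per_day)
```

Days without any commit show up with a count of zero.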

Seems like the valhalla rampage started on 2014-04-13.

How much has the code base increased over time?



In [23]:

    
cd openbsd/









    



/Users/dirk/projekte/repo-libressl/openbsd

For now we just count the number of files:



In [24]:

    
%%time 
filecounts = []
for commit in df["id"]:
    cfiles =! git ls-tree -r --name-only $commit
    filecounts.append(len(cfiles))









    



CPU times: user 2.01 s, sys: 4.16 s, total: 6.17 s
Wall time: 23.6 s
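Outside of IPython the "!" shell syntax is not available. Here is a sketch of the same count in plain Python, using the same git ls-tree invocation as above. The helper names count_paths and count_files_at are hypothetical, and the second one assumes that git is installed and that repo points to a clone.

```python
import subprocess


def count_paths(listing):
    """Count the non-empty lines in a 'git ls-tree -r --name-only' listing."""
    return sum(1 for line in listing.splitlines() if line.strip())


def count_files_at(commit, repo="."):
    """Return the number of files tracked at `commit` in the clone at `repo`."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", commit],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return count_paths(result.stdout)


# The counting step can be tried on a literal listing without a repository:
print(count_paths("src/lib/libssl/src/ssl/ssl.h\nsrc/lib/libssl/src/ssl/s3_lib.c\n"))  # 2
```

With a clone at hand, a list comprehension such as [count_files_at(c, "openbsd") for c in df["id"]] would fill filecounts.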



In [25]:

    
filestats=pd.DataFrame({"filecount": filecounts}, index=df.index)
filestats.plot(figsize=(10,6))









    Out[25]:





<matplotlib.axes.AxesSubplot at 0x10b1e4510>

Which files have been changed most often?

The idea for the following git command comes from Gary Bernhardt's gitchurn. We can simplify it though, because we have Python and pandas:



In [26]:

    
file_changes =! git log --all -M -C --name-only --since "2014-04-01" --format='format:' | grep -v '^$'
dfc = pd.Series(list(file_changes))
dfc.value_counts()









    Out[26]:





src/lib/libssl/src/ssl/s3_clnt.c          52
src/lib/libssl/src/ssl/ssl_lib.c          51
src/lib/libssl/src/ssl/t1_enc.c           50
src/lib/libssl/src/ssl/s3_lib.c           50
src/lib/libssl/src/apps/apps.c            46
src/lib/libssl/src/ssl/s3_srvr.c          46
src/lib/libssl/src/apps/s_server.c        44
src/lib/libcrypto/crypto/Makefile         44
src/lib/libssl/src/ssl/ssl_ciph.c         43
src/lib/libssl/src/ssl/ssl.h              43
src/lib/libssl/src/apps/s_client.c        43
src/lib/libssl/src/ssl/ssl_locl.h         40
src/lib/libssl/src/crypto/bio/b_sock.c    40
src/lib/libssl/src/ssl/t1_lib.c           38
src/lib/libssl/src/apps/ca.c              38
...
src/lib/libssl/src/crypto/des/des_enc.c       1
src/lib/libssl/src/demos/x509/mkcert.c        1
src/lib/libssl/src/times/sparc2               1
src/lib/libssl/src/crypto/threads/README      1
src/lib/libssl/src/crypto/ripemd/rmd_one.c    1
src/lib/libssl/src/demos/tunala/ip.c          1
src/lib/libssl/src/crypto/bn/asm/mips3.s      1
src/lib/libssl/src/times/486-66.dos           1
src/lib/libssl/src/ms/bcb4.bat                1
src/lib/libssl/src/demos/prime/Makefile       1
src/lib/libssl/src/bugs/MS                    1
src/lib/libssl/src/apps/tsget                 1
src/lib/libssl/src/crypto/des/rpw.c           1
src/lib/libssl/src/crypto/des/DES.xs          1
src/lib/libc/string/bzero.c                   1
Length: 2216, dtype: int64
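The counting step does not strictly need pandas. As a sketch, the standard library's Counter gives the same kind of ranking. The list below is a made-up stand-in for the git log output:

```python
from collections import Counter

# One entry per (commit, changed file) pair, as printed by git log --name-only.
changed_paths = [
    "src/lib/libssl/src/ssl/s3_clnt.c",
    "src/lib/libssl/src/ssl/ssl.h",
    "src/lib/libssl/src/ssl/s3_clnt.c",
    "src/lib/libcrypto/crypto/Makefile",
    "src/lib/libssl/src/ssl/s3_clnt.c",
]

churn = Counter(changed_paths)
for path, n_changes in churn.most_common(2):
    print(n_changes, path)
```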



In [27]:

    
c_changes=dfc.where(dfc.str.endswith(".c")).value_counts()
c_changes









    Out[27]:





src/lib/libssl/src/ssl/s3_clnt.c          52
src/lib/libssl/src/ssl/ssl_lib.c          51
src/lib/libssl/src/ssl/s3_lib.c           50
src/lib/libssl/src/ssl/t1_enc.c           50
src/lib/libssl/src/ssl/s3_srvr.c          46
src/lib/libssl/src/apps/apps.c            46
src/lib/libssl/src/apps/s_server.c        44
src/lib/libssl/src/apps/s_client.c        43
src/lib/libssl/src/ssl/ssl_ciph.c         43
src/lib/libssl/src/crypto/bio/b_sock.c    40
src/lib/libssl/src/ssl/t1_lib.c           38
src/lib/libssl/src/apps/ca.c              38
src/lib/libssl/src/ssl/s3_enc.c           35
src/lib/libssl/src/ssl/s3_pkt.c           30
src/lib/libssl/src/apps/req.c             30
...
src/lib/libssl/src/demos/smime/smenc.c                   1
src/lib/libssl/src/demos/asn1/ocsp.c                     1
src/lib/libssl/src/crypto/bn/exp.c                       1
src/lib/libc/string/strnlen.c                            1
src/lib/libssl/src/apps/winrand.c                        1
src/lib/libssl/src/demos/cms/cms_ver.c                   1
src/lib/libssl/src/crypto/dh/p192.c                      1
src/lib/libssl/src/demos/engines/ibmca/hw_ibmca_err.c    1
src/lib/libssl/src/crypto/ecdsa/ecs_asn1.c               1
src/regress/lib/libcrypto/hmac/hmactest.c                1
src/lib/libssl/src/crypto/idea/i_ofb64.c                 1
src/lib/libssl/src/engines/ccgost/gost_ctl.c             1
src/lib/libssl/src/demos/sign/sign.c                     1
src/lib/libssl/src/engines/e_atalla.c                    1
src/lib/libssl/src/demos/tunala/sm.c                     1
Length: 982, dtype: int64
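The filter works because where() turns the non-matching paths into NaN, and value_counts() ignores NaN by default. Here is a sketch with made-up paths, shown next to the more common boolean-indexing form:

```python
import pandas as pd

paths = pd.Series(["ssl/s3_clnt.c", "ssl/ssl.h", "ssl/s3_clnt.c", "crypto/Makefile"])
is_c_file = paths.str.endswith(".c")

# Variant used above: mask the other rows with NaN, then count.
via_where = paths.where(is_c_file).value_counts()
# Alternative: drop the other rows before counting.
via_mask = paths[is_c_file].value_counts()

print(via_where.to_dict())
print(via_mask.to_dict())
```

Both variants should give the same counts. The second one avoids the intermediate NaN values.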



In [28]:

    
c_changes.plot()









    Out[28]:





<matplotlib.axes.AxesSubplot at 0x1072e5850>

As expected, a few files are changed very often and most files are changed infrequently.
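One way to put a number on this impression is the share of all changes that falls on the most frequently changed files. The counts below are made up for illustration and are not taken from the repository:

```python
import pandas as pd

# Hypothetical change counts per file.
toy_changes = pd.Series([52, 51, 4, 2, 1, 1, 1, 1, 1, 1])

top_n = 2
ranked = toy_changes.sort_values(ascending=False)
# Fraction of all changes that touch the top_n most changed files.
share = ranked.iloc[:top_n].sum() / ranked.sum()
print(round(share, 2))  # 0.9
```

Applied to c_changes, the same two lines would show how concentrated the real changes are.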

What about header files?



In [29]:

    
h_changes=dfc.where(dfc.str.endswith(".h")).value_counts()
h_changes









    Out[29]:





src/lib/libssl/src/ssl/ssl.h                 43
src/lib/libssl/src/ssl/ssl_locl.h            40
src/lib/libssl/src/apps/apps.h               21
src/lib/libssl/src/crypto/crypto.h           19
src/lib/libssl/src/crypto/engine/engine.h    17
src/lib/libssl/src/crypto/evp/evp.h          16
src/lib/libssl/src/ssl/ssl3.h                15
src/lib/libssl/src/crypto/asn1/asn1.h        14
src/lib/libssl/src/crypto/cryptlib.h         14
src/lib/libssl/src/crypto/bio/bio.h          11
src/lib/libssl/src/ssl/dtls1.h               11
src/lib/libssl/src/ssl/tls1.h                11
src/lib/libssl/src/crypto/bn/bn_lcl.h        10
src/lib/libssl/src/e_os.h                    10
src/lib/libssl/src/crypto/bn/bn.h            10
...
src/lib/libssl/src/engines/vendor_defns/aep.h                   1
src/lib/libssl/src/engines/e_aep_err.h                          1
src/lib/libssl/src/engines/e_sureware_err.h                     1
src/lib/libssl/src/demos/engines/zencod/hw_zencod.h             1
src/lib/libssl/src/demos/engines/zencod/hw_zencod_err.h         1
src/lib/libssl/src/crypto/ebcdic.h                              1
src/lib/libssl/src/demos/engines/cluster_labs/cluster_labs.h    1
src/lib/libssl/src/crypto/dsa/dsa_locl.h                        1
src/lib/libssl/src/engines/vendor_defns/atalla.h                1
src/lib/libssl/src/MacOS/_MWERKS_GUSI_prefix.h                  1
src/lib/libssl/src/crypto/modes/modes.h                         1
src/lib/libssl/src/crypto/rand/rand_lcl.h                       1
src/lib/libssl/src/engines/vendor_defns/cswift.h                1
src/lib/libssl/src/engines/vendor_defns/hw_ubsec.h              1
src/lib/libssl/src/engines/ccgost/gost2001_keyx.h               1
Length: 203, dtype: int64

To be continued... ;-)


