The idea

In my previous blog post, we got to know the idea of "indentation-based complexity". We took a static view of the Linux kernel to spot its most complex areas.

This time, we want to track the evolution of the indentation-based complexity of a software system over time. We are especially interested in its correlation with the lines of code: if the lines of code of our system develop more or less steadily, but the amount of indentation per source code file keeps increasing, we surely have a complexity problem.

Again, this analysis is highly inspired by Adam Tornhill's book "Software Design X-Ray", which I currently always recommend if you want to take a deep dive into software data analysis.

The data

For the calculation of the evolution of our software system, we can use data from the version control system. In our case, we can get all changes to Java source code files with Git. We just need to say the right magic words, which are

git log -p -- *.java
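
To be able to analyze the output with Pandas later on, we redirect it into a log file (the notebook below reads its data from a file named git_diff.log):

git log -p -- *.java > git_diff.log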

This gives us data like the following:

commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date:   Fri Aug 19 18:54:56 2016 +0200

    Convert Controler's integration test to unit test

diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
index ee83b8a..a83255b 100644
--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
@@ -1,8 +1,5 @@
 package org.springframework.samples.petclinic.web;

-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
 import org.junit.Before;
 import org.junit.Test;
 import org.junit.runner.RunWith;

We have the

  • commit sha
    commit e5254156eca3a8461fa758f17dc5fae27e738ab5
  • author's name
    Author: Antoine Rey <antoine.rey@gmail.com>
  • date of the commit
    Date: Fri Aug 19 18:54:56 2016 +0200
  • commit message
    Convert Controler's integration test to unit test
  • names of the changed file (before and after)
    diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
  • the extended index header
    index ee83b8a..a83255b 100644
    --- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java +++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
  • and the full file diff, where we can see added lines (+) and deleted lines (-); a modified line shows up as a deletion followed by an addition

      package org.springframework.samples.petclinic.web;
    
      -import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
      -import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;     
      -
       import org.junit.Before;

    We "just" have to get this data into our favorite data analysis framework, which is, of course, Pandas :-). We can actually do that! Let's see how!

Advanced data wrangling

Reading in such semi-structured data is a little challenge, but we can do it with some tricks. First, we read in the whole Git diff history by standard means, using read_csv with the separator \n to get one row per line. We make sure to give the column a nice name as well.


In [1]:
import pandas as pd

diff_raw = pd.read_csv(
    "../../buschmais-spring-petclinic_fork/git_diff.log",
    sep="\n",
    names=["raw"])
diff_raw.head(16)



The output is the commit data that I've described above, where each line in the text file represents one row in the DataFrame (read_csv skips the blank lines).

Cleansing

We skip all the data we don't need for sure. Especially the "extended index header" with its two lines that begin with --- and +++ could easily be mixed up with the real diff data, which also begins with a + or a -. Fortunately, we can identify these rows easily: they are the two rows that directly follow a row starting with index. Using the shift operation relative to the rows that start with index, we can get rid of all those lines.


In [ ]:
# rows that start the extended index header
index_row = diff_raw.raw.str.startswith("index ")
# the two rows directly after such a row are the ---/+++ lines
ignored_diff_rows = (index_row.shift(1) | index_row.shift(2))
diff_raw = diff_raw[~(index_row | ignored_diff_rows)]
diff_raw.head(10)
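
To see how the shift trick works, here is a minimal example on a hand-made Series (the diff lines are made up, the mechanics are the same):


In [ ]:
# hypothetical four-line excerpt of a diff
s = pd.Series(["index ee83b8a..a83255b 100644",
               "--- a/A.java",
               "+++ b/A.java",
               "+int x;"])
marker = s.str.startswith("index ")
# flags the two rows directly after the "index " row
marker.shift(1) | marker.shift(2)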

Extracting metadata

Next, we extract some metadata from each commit. We can identify the different entries by using a regular expression that looks for a specific keyword at the beginning of a line. We extract each piece of information into a new Series/column because we need it for each changed line of the software's history.


In [ ]:
# the sha follows the "commit " keyword
diff_raw['commit'] = diff_raw.raw.str.split("^commit ").str[1]
# the "Date: " lines carry the commit timestamp
diff_raw['timestamp'] = pd.to_datetime(diff_raw.raw.str.split("^Date: ").str[1])
# the file's path is the part after " b/" in the diff header
diff_raw['path'] = diff_raw.raw.str.extract("^diff --git.* b/(.*)", expand=True)[0]
diff_raw.head()
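
To illustrate the path extraction, we can apply the regular expression to a single, hypothetical diff header line:


In [ ]:
# made-up diff header; the regex captures everything after the last " b/"
header = pd.Series(["diff --git a/src/main/java/Demo.java b/src/main/java/Demo.java"])
header.str.extract("^diff --git.* b/(.*)", expand=True)[0]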

To assign each commit's metadata to all of its remaining rows, we forward fill the missing values by using the fillna method.


In [ ]:
diff_raw = diff_raw.fillna(method='ffill')
diff_raw.head(8)
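
The effect of the forward fill is easy to see on a toy Series: the last non-missing value is propagated downwards.


In [ ]:
# toy data: the sha is only present in the first row
pd.Series(["e525415", None, None]).fillna(method='ffill')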

Identifying source code lines

We can now focus on the changed source code lines. We can identify added and deleted lines by their first character: a + marks an added line, a - a deleted one. For each of these lines, we also want to measure the indentation. There are (at least) two ways to do that: a regular expression that captures the leading spaces, or subtracting the length of the left-stripped line from the length of the original line. We can compare the runtime of both variants with the %%timeit magic.


In [ ]:
%%timeit
# regex-based variant: capture and count the leading spaces of added lines
diff_raw.raw.str.extract(r"^\+( *).*$", expand=True)[0].str.len()

In [ ]:
%%timeit
# length-based variant: line length minus length of the left-stripped line
diff_raw.raw.str[1:].str.len() - diff_raw.raw.str[1:].str.lstrip().str.len()

In [ ]:
diff_raw["i"] = diff_raw.raw.str[1:].str.len() - diff_raw.raw.str[1:].str.lstrip().str.len()
diff_raw.head()

For our later indentation-based complexity calculation, we have to make sure that each line is indented with spaces only. We therefore replace each tab with four spaces first.


In [ ]:
diff_raw['line'] = diff_raw.raw.str.replace("\t", "    ")
diff_raw.head()

Based on this normalized line column, we store the indentation of each added and each deleted line in the two new columns added and deleted.


In [ ]:
diff_raw['added'] = diff_raw.line.str.extract(r"^\+( *).*$", expand=True)[0].str.len()
diff_raw['deleted'] = diff_raw.line.str.extract(r"^-( *).*$", expand=True)[0].str.len()
diff_raw.head()

We keep only the rows where either added or deleted is set, i.e. the real changes.


In [ ]:
diff = diff_raw[
    (~diff_raw['added'].isnull()) |
    (~diff_raw['deleted'].isnull())].copy()
diff.head()

Because we want to calculate the complexity for real source code only, we flag comment lines and empty lines as well.


In [ ]:
# a line is a comment if it starts with //, /* or * (Javadoc continuation)
diff['is_comment'] = diff.line.str[1:].str.match(r' *(//|/\*|\*)')
# a line is empty if nothing remains after removing all spaces
diff['is_empty'] = diff.line.str[1:].str.replace(" ", "").str.len() == 0
diff['is_source'] = ~(diff['is_empty'] | diff['is_comment'])
diff.head()
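
We can sanity-check this heuristic on a few hand-made diff lines:


In [ ]:
# made-up sample lines; the first character is the diff marker
samples = pd.Series(["+// line comment",
                     "+/* block comment start",
                     "+ * Javadoc continuation",
                     "+int x = 0;"])
samples.str[1:].str.match(r' *(//|/\*|\*)')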

As a plausibility check, we count the diff markers at the beginning of the remaining rows.


In [ ]:
diff.raw.str[0].value_counts()

Besides the indentation, we also want to count the changed lines themselves. We mark each row with a 1 in lines_added or lines_deleted accordingly.


In [ ]:
diff['lines_added'] = (~diff.added.isnull()).astype('int')
diff['lines_deleted'] = (~diff.deleted.isnull()).astype('int')
diff.head()

Finally, we replace the remaining missing values with 0 to be able to sum up the columns later on.


In [ ]:
diff = diff.fillna(0)
diff.head()

Now we can aggregate all changes per day: we set the timestamp as the index, resample by day ("D") and sum up the numeric columns.


In [ ]:
commits_per_day = diff.set_index('timestamp').resample("D").sum()
commits_per_day.head()
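
Resampling is easier to grasp on a toy time series: gaps between days are filled in and values of the same day are summed up.


In [ ]:
# toy time series with two entries on the same day and a gap of one day
ts = pd.Series([1, 2, 4],
               index=pd.to_datetime(["2016-08-19", "2016-08-19", "2016-08-21"]))
ts.resample("D").sum()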

Plotting the cumulative sum shows the evolution of our measures over time.


In [ ]:
%matplotlib inline
commits_per_day.cumsum().plot()

The accumulated difference between the added and the deleted indentation is the evolution of our indentation-based complexity.


In [ ]:
(commits_per_day.added - commits_per_day.deleted).cumsum().plot()

Accordingly, the accumulated difference between the added and the deleted lines gives us the evolution of the lines of code.


In [ ]:
(commits_per_day.lines_added - commits_per_day.lines_deleted).cumsum().plot()

As a cross-check, we sum up all changes: the difference between all added and all deleted lines is the net number of lines in the Java files at the end of the history.


In [ ]:
diff_sum = diff.sum()
diff_sum.lines_added - diff_sum.lines_deleted

Out[ ]:
3913