In this blog post, I want to show you a nice complexity metric that works for most major programming languages that we use for our software systems – the indentation-based complexity metric. Adam Tornhill describes this metric in his fabulous book Software Design X-Rays on page 25 as follows:
With indentation-based complexity we count the leading tabs and whitespaces to convert them into logical indentations. ... This works because indentations in code carry meaning. Indentations are used to increase readability by separating code blocks from each other.
He further says that the trend of the indentation-based complexity in combination with the lines of code metric a good indicator for complexity-growth. If the lines of code don't change but the indentation does, you got a complexity problem in your code.
In this blog post, we are just interested in a static view of the indentation of a code (the evolution of the indentation-based complexity will be discussed in another blog post). I want to show you how you can spot complex areas in your application using the Pandas data analysis framework. As application under analysis, I choose the Linux kernel project because we want to apply our analysis at big scale.
This analysis works as follows:
glob
read_csv()
methodSo let's start!
In [1]:
import glob
file_list = glob.glob("../../linux/**/*.[c|h]", recursive=True)
file_list[:5]
Out[1]:
With the file_list
at hand, we can now read in the content of the source code files. We could have used a standard Python with <path> as f
and read in the content with a f.read()
, but we want to get the data as early into a Pandas' DataFrame (abbreviated "df") to leverage the power of the data analysis framework.
For each path in our file list, we create a single DataFrame content_df
. We use a line break \n
as column separator, ensuring that we create a single row for each source code line. We also specify the encoding
parameter with the file encoding latin-1
to make sure that we can read in weird file contents, too (this is totally normal when working in an international team). We also set skip_blank_lines
to False
to keep blank lines. Keeping the blank lines is necessary for debugging purposes (you can inspect certain source code lines with a text editor more easily) as well as for later analysis.
The process of reading in all the source code could take a while for the thousands of files of the Linux kernel. After looping through and reading in all the data, we concatenate all DataFrames with the concat
method. This gives us a single DataFrame with all source code content for all source code files.
In [2]:
import pandas as pd
content_dfs = []
for path in file_list:
content_df = pd.read_csv(
path,
encoding='latin-1',
sep='\n',
skip_blank_lines=False,
names=['line']
)
content_df.insert(0, 'filepath', path)
content_dfs.append(content_df)
content = pd.concat(content_dfs)
content.head()
Out[2]:
Let's see what we've got here in our DataFrame named content
.
In [3]:
content.info()
We have over 22 million lines of code (including comments and blank lines) in our DataFrame. Let's clean up that data a little bit.
In [4]:
content['filepath'] = pd.Categorical(content['filepath'])
content['filepath'] = content['filepath']\
.str.replace("\\", "/")\
.str.replace("../../linux/", "")
content.head(1)
Out[4]:
We further replace all empty lines with blanks and tabular spaces with a suitable number of whitespaces (we assume here that a tab corresponds to four whitespaces).
In [5]:
content['line'] = content['line'].fillna("")
FOUR_SPACES = " " * 4
content['line'] = content['line'].str.replace("\t", FOUR_SPACES)
content.head(1)
Out[5]:
In [6]:
content['line_number'] = content.index + 1
content = content.reset_index(drop=True)
content.head(1)
Out[6]:
Next, we mark comment lines and blank lines. For the identification of the comments, we use a simple heuristic that should be sufficient in most cases.
In [7]:
content['is_comment'] = content['line'].str.match(r'^ *(//|/\*|\*).*')
content['is_empty'] = content['line'].str.replace(" ","").str.len() == 0
content.head(1)
Out[7]:
Now we come to the key part: We calculate the indentation for each source code line. An indent is the number of whitespaces that come before the first non-blank character in a line. We just subtract the length from a line without the preceding whitespaces from the length of the whole line.
In [8]:
content['indent'] = content['line'].str.len() - content['line'].str.lstrip().str.len()
content.head(1)
Out[8]:
We make sure to only inspect the real source code lines. Because we have the information about the blank lines and comments, we can filter these out very easily. We immediately aggregate the indentations with a count
to get the number of source code lines as well as the summary of all indents for each filepath
aka source code file. We also rename the columns of our new DataFrame accordingly.
In [9]:
source_code_content = content[content['is_comment'] | content['is_empty']]
source_code = source_code_content.groupby('filepath')['indent'].agg(['count', 'sum'])
source_code.columns = ['lines', 'indents']
source_code.head()
Out[9]:
To get a quick overview of the data we have now, we plot the results in a scatter diagram.
In [10]:
%matplotlib inline
source_code.plot.scatter('lines', 'indents', alpha=0.3);
Alright, we have a very few outliners in both dimensions and a slight correlation between the lines of code and the indentation (who guessed it?). But for now, we leave the outliers where the are.
In [11]:
source_code['complexity'] = source_code['indents'] / source_code['lines']
source_code.head(1)
Out[11]:
So how complex is our code in general? Let's get also an overview of the distribution of the complexity. We use a histogram to visualize this.
In [12]:
source_code['complexity'].hist(bins=50)
Out[12]:
Almost 30,000 files are probably okayish regarding our complexity measure.
In [13]:
source_code['component'] = source_code.index\
.str.split("/", n=2)\
.str[:2].str.join(":")
source_code.head(1)
Out[13]:
We then sum up all our measures per component.
In [14]:
measures_per_component = source_code.groupby('component').sum()
measures_per_component.head()
Out[14]:
Last, we print out the TOP 10 most complex parts of the Linux kernel measured by our indents per lines ratio complexity
.
In [15]:
measures_per_component['complexity'].nlargest(10)
Out[15]:
Finally, we plot our the complexity per component with a treemap. We use the Python visualization library pygal
(http://www.pygal.org) which is very easy to use and just fits our use case perfectly. As size for the treemap's rectangles, we use the lines of code of the components. As color, we choose red and visualize with the help of the red color's alpha level (aka normalized complexity comp_norm
between 0 and 1), how red a rectangle of the treemap should be.
This gives us a treemap with the following properties:
We render the treemap as PNG image (to save space) and display it directly in the notebook.
In [16]:
import pygal
from IPython.display import Image
config = pygal.Config(show_legend=False)
treemap = pygal.Treemap(config)
max_comp = measures_per_component['complexity'].max()
for row in measures_per_component.iterrows():
filename = row[0]
entry = row[1]
comp_norm = entry['complexity'] / max_comp
data = {}
data['value'] = entry['lines']
data['color'] = 'rgba(255,0,0,' + str(comp_norm) + ')'
treemap.add(filename, [data])
Image(treemap.render_to_png())
Out[16]:
There is also an interactive treemap where you can interact with the graphic. But beware: It's almost a megabyte big!
In this blog post, I showed you how we can obtain an almost programming language agnostic measure for complexity by measuring the indentation of source code. We also spotted the TOP 10 most complex components and visualized the completed results as a treemap.
All this was relatively easy and straightforward to do by using standard Data Science tooling. I hope you liked it and I would be happy, if you could provide some feedback in the comment section.