Introduction

In his book Software Design X-Rays, Adam Tornhill presents a useful metric for finding out whether parts of your code are coupled through the changes they share: Temporal Coupling.

In this and the next blog posts, I'm playing around with Adam's ideas (and more) to find hidden dependencies of code parts based on version control data.

In this part, we just want to spot co-changing files, that is, files that change within the same commit.

As almost always, we are using Python and pandas for this analysis.

Data

With the help of a small helper library, we extract the relevant log data from a Git repository. In this case, we use a synthetic repository to make it easier to check that everything works as expected.

Here are all files and all the commits from the repository:


In [1]:
from lib.ozapfdis.git_tc import log_numstat
commits = log_numstat("../../synthetic_repo//")
commits


Out[1]:
additions deletions file sha timestamp author
1 1.0 0.0 b a2abe69 2018-07-19 13:53:09 Markus Harrer
2 1.0 0.0 d a2abe69 2018-07-19 13:53:09 Markus Harrer
4 1.0 0.0 a f80a5af 2018-07-19 13:52:48 Markus Harrer
5 1.0 0.0 b f80a5af 2018-07-19 13:52:48 Markus Harrer
7 0.0 0.0 e fcf1498 2018-07-19 13:52:31 Markus Harrer
9 1.0 0.0 b 7e6d738 2018-07-19 13:52:10 Markus Harrer
10 1.0 0.0 d 7e6d738 2018-07-19 13:52:10 Markus Harrer
12 0.0 0.0 d 2b4d97d 2018-07-19 13:51:14 Markus Harrer
14 1.0 0.0 b 732ebbb 2018-07-19 10:51:03 Markus Harrer
15 1.0 0.0 c 732ebbb 2018-07-19 10:51:03 Markus Harrer
17 1.0 0.0 a 72f5268 2018-07-19 10:50:49 Markus Harrer
18 1.0 0.0 b 72f5268 2018-07-19 10:50:49 Markus Harrer
20 2.0 0.0 a f3c99c6 2018-07-19 10:50:38 Markus Harrer
21 1.0 0.0 b f3c99c6 2018-07-19 10:50:38 Markus Harrer
23 0.0 0.0 c 5d5fba5 2018-07-19 10:50:13 Markus Harrer
25 0.0 0.0 a eb668d1 2018-07-19 10:49:16 Markus Harrer
26 0.0 0.0 b eb668d1 2018-07-19 10:49:16 Markus Harrer

We see that some files often change together (like "a" and "b" or "b" and "d") and some files only ever change alone (like "e").

Let's first get rid of all the unneeded columns by keeping just the columns that we really need for this analysis.


In [2]:
commits = commits[['file', 'sha']]
commits.head()


Out[2]:
file sha
1 b a2abe69
2 d a2abe69
4 a f80a5af
5 b f80a5af
7 e fcf1498
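
If you want to follow along without the helper library, the two columns can be rebuilt by hand. This is only a sketch: the (file, sha) pairs are copied from the listing above, the original index values are not reproduced, and the name files_by_sha is ours, not part of the helper library.

```python
import pandas as pd

# Files changed per commit, copied from the log output shown above.
# Each character of the string stands for one single-letter file name.
files_by_sha = {
    "a2abe69": "bd", "f80a5af": "ab", "fcf1498": "e", "7e6d738": "bd",
    "2b4d97d": "d", "732ebbb": "bc", "72f5268": "ab", "f3c99c6": "ab",
    "5d5fba5": "c", "eb668d1": "ab",
}

# One row per changed file and commit, like the reduced DataFrame above.
commits = pd.DataFrame(
    [(file, sha) for sha, files in files_by_sha.items() for file in files],
    columns=["file", "sha"],
)
print(commits.head())
```

The dictionary keeps the commits in the same order as the listing (newest first), so head() shows the same five file and sha values as the output above.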

Idea

In this analysis, we need to create a relationship from each changed file to all changed files within the same commit.

I tried various data transformations here, but in the end, the following dead-simple approach worked best: we assign to each file in a commit all files of the same commit and count the occurrences of these relationships.

This gives us the perspective on co-changing files that we want.
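
Before turning to pandas, the idea can be sketched in plain Python. The commit ids and file lists below are made up for illustration; they are not the synthetic repository.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy input: commit id -> files changed in that commit.
files_per_commit = {
    "c1": ["a", "b"],
    "c2": ["a", "b", "c"],
    "c3": ["e"],
}

pair_counts = Counter()
for files in files_per_commit.values():
    # permutations(..., 2) yields every ordered pair of distinct files,
    # so a commit with a single file contributes no pairs at all.
    pair_counts.update(permutations(sorted(set(files)), 2))

print(pair_counts[("a", "b")])  # 2
print(pair_counts[("a", "c")])  # 1
```

Each pair is counted in both directions ("a" with "b" and "b" with "a"), which is what later lets us relate a pair's count to the changes of either file.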

Analysis

To implement the idea above, we can use the pd.merge function of pandas to combine the commits DataFrame with itself. The key is to join on sha (here with an outer join), which expands each file in a commit to all the files of that same commit.


In [3]:
import pandas as pd

commit_counts = pd.merge(
    commits,
    commits,
    left_on='sha',
    right_on='sha',
    suffixes=['','_other'],
    how='outer')
commit_counts.head()


Out[3]:
file sha file_other
0 b a2abe69 b
1 b a2abe69 d
2 d a2abe69 b
3 d a2abe69 d
4 a f80a5af a

With this, e.g., the last commit (= the first entries in the DataFrame) was expanded from

0   b   a2abe69
1   d   a2abe69

to

0   b   a2abe69     b
1   b   a2abe69     d
2   d   a2abe69     b
3   d   a2abe69     d

with the additional column of all files of the commit.
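
The expansion can be reproduced in isolation with just the two rows of that commit. This is a minimal sketch: one_commit and as_sorted_tuples are names we introduce here. It also compares the default inner join with the outer join; because both sides of a self-merge contain exactly the same sha values, the two should return the same rows.

```python
import pandas as pd

# Two rows standing in for the latest commit of the synthetic repository.
one_commit = pd.DataFrame({"file": ["b", "d"], "sha": ["a2abe69", "a2abe69"]})

# Self-merge on the commit hash (pd.merge defaults to an inner join).
expanded = pd.merge(one_commit, one_commit, on="sha", suffixes=("", "_other"))
print(expanded)

# The same merge requested as an outer join, for comparison.
outer = pd.merge(one_commit, one_commit, on="sha",
                 suffixes=("", "_other"), how="outer")


def as_sorted_tuples(df):
    """Return the rows as a sorted list of tuples, ignoring the index."""
    columns = ["file", "sha", "file_other"]
    return sorted(df[columns].itertuples(index=False, name=None))


print(as_sorted_tuples(expanded) == as_sorted_tuples(outer))
```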

Because we're only interested in co-changing files, we can filter out all entries that pair a file with itself.


In [4]:
commit_counts = commit_counts[commit_counts['file'] != commit_counts['file_other']]
commit_counts.head()


Out[4]:
file sha file_other
1 b a2abe69 d
2 d a2abe69 b
5 a f80a5af b
6 b f80a5af a
10 b 7e6d738 d

We can then count how often each pair of files occurs across all commits with the groupby command.


In [5]:
commit_coupling = commit_counts.groupby(['file', 'file_other']).count()
commit_coupling.head()


Out[5]:
                 sha
file file_other
a    b             4
b    a             4
     c             1
     d             2
c    b             1

We also want to know the number of all changes for each file to get the degree of the overall coupling between co-changing files. For this, we can use the groupby command on the file index level together with the transform method. Note that this sums up the co-change counts per file, so commits in which a file changed alone (such as 2b4d97d for "d") are not included in all_changes.


In [6]:
commit_coupling['all_changes'] = commit_coupling.groupby(['file']).sha.transform('sum')
commit_coupling.head()


Out[6]:
                 sha  all_changes
file file_other
a    b             4            4
b    a             4            7
     c             1            7
     d             2            7
c    b             1            1

We then calculate, for each pair, the ratio of its co-change count to the all_changes value of the first file. A high ratio indicates a pair of files that change together very often.


In [7]:
commit_coupling['ratio'] = commit_coupling['sha'] / commit_coupling['all_changes']
commit_coupling


Out[7]:
                 sha  all_changes     ratio
file file_other
a    b             4            4  1.000000
b    a             4            7  0.571429
     c             1            7  0.142857
     d             2            7  0.285714
c    b             1            1  1.000000
d    b             2            2  1.000000
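
Since all_changes only sums up co-change counts, commits that touch a single file never enter the denominator. As a possible variation (not the approach used in this post), we could divide by the number of distinct commits per file instead. The sketch below rebuilds the data from the log listing; the column names co_changing, commits and ratio_of_commits are our own.

```python
import pandas as pd

# Files changed per commit, copied from the log listing at the top.
files_by_sha = {
    "a2abe69": "bd", "f80a5af": "ab", "fcf1498": "e", "7e6d738": "bd",
    "2b4d97d": "d", "732ebbb": "bc", "72f5268": "ab", "f3c99c6": "ab",
    "5d5fba5": "c", "eb668d1": "ab",
}
commits = pd.DataFrame(
    [(file, sha) for sha, files in files_by_sha.items() for file in files],
    columns=["file", "sha"],
)

# Distinct commits per file, including commits where it changed alone.
commits_per_file = commits.groupby("file")["sha"].nunique()

# Same pairing as before: self-merge on sha, then drop self-pairs.
pairs = pd.merge(commits, commits, on="sha", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]
co_changes = (pairs.groupby(["file", "file_other"]).size()
              .reset_index(name="co_changing"))

# Relate each pair's count to all commits of the first file.
co_changes["commits"] = co_changes["file"].map(commits_per_file)
co_changes["ratio_of_commits"] = co_changes["co_changing"] / co_changes["commits"]
print(co_changes)
```

With this denominator, "d" has three commits and "c" has two, so their ratios towards "b" come out below 1.0, while the rows for "a" and "b" stay the same.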

As a last step, we do some housekeeping of the data to get a nicely sorted list of co-changing files.


In [8]:
# flatten the index and put the strongest couplings first
coupling_list = commit_coupling.reset_index().sort_values(
    by=['ratio', 'file'], ascending=False)
# give the count column a more descriptive name
coupling_list = coupling_list.rename(columns={"sha" : "co-changing"})
coupling_list = coupling_list.reset_index(drop=True)
coupling_list


Out[8]:
file file_other co-changing all_changes ratio
0 d b 2 2 1.000000
1 c b 1 1 1.000000
2 a b 4 4 1.000000
3 b a 4 7 0.571429
4 b d 2 7 0.285714
5 b c 1 7 0.142857

With this result, we can see in the first three rows that whenever the files "d", "c" and "a" change together with another file, that file is always "b".

In detail, you can read and interpret the table like this:

  • Row with index 0: In all co-changes of "d", "b" was the file that changed along with it. This shows a high change dependency from the file "d" to the file "b". In other words: If one changes "d", something probably has to be changed in "b", too. Keep in mind that all_changes counts only co-changes, so the commit in which "d" changed alone is not reflected here.
  • Row with index 3: If "b" was changed, "a" was changed in 4 out of 7 cases (= commits) as well. Together with the row indexed 2, we can see that "a" always changes with "b", but "b" doesn't always change with "a".
  • Row with index 5: If "b" was changed, "c" was changed in one case. This shows a slight (or even negligible) dependency from "b" to "c" (and maybe even from "c" to "b", because a single shared commit could also be a coincidence).

If you are more into graphs, here are all the change relationships between the files with their ratio measure:

Note: The file "e" doesn't occur in the table because it only ever changes on its own.

Visualizations

We can also draw a diagram that suits the tiny amount of data that we have. In our case, we use a D3 chord diagram to explore the coupling of co-changing files interactively. pandas can output the data in the JSON format that the D3 visualization needs.


In [9]:
coupling_list[['file','file_other','co-changing']].to_json(
    "chord_coupling_data.json", orient='values')

The chord diagram gives us a hint about the inherent coupling of files based on co-changes.
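
To see what the JSON export with orient='values' contains, we can let to_json return a string instead of writing a file. This is a small sketch that retypes the three exported columns from the result table by hand; what the D3 page then does with these rows isn't covered here.

```python
import json

import pandas as pd

# The three exported columns, retyped from the result table above.
coupling_list = pd.DataFrame({
    "file": ["d", "c", "a", "b", "b", "b"],
    "file_other": ["b", "b", "b", "a", "d", "c"],
    "co-changing": [2, 1, 4, 4, 2, 1],
})

# Without a path argument, to_json returns the JSON text as a string.
as_text = coupling_list.to_json(orient="values")
print(as_text)

# orient="values" drops index and column names: one plain list per row.
rows = json.loads(as_text)
print(rows[0])
```

Each row becomes a three-element list of source file, target file and co-change count, without any column names.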

You can find the interactive version of this visualization here.

Summary

We've seen that spotting co-changing files with the help of pandas is straightforward.

Doing it step by step also allows us to refine the analysis gradually to fit our own circumstances. For example, we could define co-changing files as files that change not only within the same commit, but on the same day. We could also look for peaks of co-changing activity that could point us to chaotic changes in the code.
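
The same-day variant mentioned above could look roughly like this. It is only a sketch under assumptions: the log entries (files, hashes, timestamps) are made up, and we simply swap the commit hash for the calendar day as the join key.

```python
import pandas as pd

# Hypothetical log entries; files, hashes and timestamps are made up.
log = pd.DataFrame({
    "file": ["a", "b", "c", "a"],
    "sha": ["s1", "s2", "s3", "s4"],
    "timestamp": pd.to_datetime([
        "2018-07-19 10:00", "2018-07-19 15:00",
        "2018-07-20 09:00", "2018-07-20 11:00",
    ]),
})

# Use the calendar day instead of the commit hash as the join key.
log["day"] = log["timestamp"].dt.normalize()

# Count each file at most once per day, however often it was committed.
per_day = log[["file", "day"]].drop_duplicates()

# Same pairing as in the analysis: self-merge, then drop self-pairs.
pairs = pd.merge(per_day, per_day, on="day", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]
same_day = pairs.groupby(["file", "file_other"]).size()
print(same_day)
```

Here "a" and "b" count as co-changing although they were changed in different commits, simply because both commits happened on the same day.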

But for now, we leave it there.