Introduction

In his book Software Design X-Rays, Adam Tornhill presents a nice metric for finding out whether parts of your code are coupled through the changes they undergo together: Temporal Coupling.

In this and the following blog posts, I play around with Adam's ideas (and more) to find hidden dependencies between parts of the code based on version control data.

In this part, we just want to spot co-changing files, i.e., files that change within the same commit.

As almost always, we are using Python and pandas for this analysis.

Data

With the help of a little helper library, we extract the relevant log data from a Git repository. In this case, we are just using a synthetic repository to make it easier to check that everything works as expected.

Here are all files and all the commits from the repository:



In [1]:

    
from lib.ozapfdis.git_tc import log_numstat
commits = log_numstat("../../synthetic_repo//")
commits

    Out[1]:

    additions  deletions  file      sha            timestamp         author
 1        1.0        0.0     b  a2abe69  2018-07-19 13:53:09  Markus Harrer
 2        1.0        0.0     d  a2abe69  2018-07-19 13:53:09  Markus Harrer
 4        1.0        0.0     a  f80a5af  2018-07-19 13:52:48  Markus Harrer
 5        1.0        0.0     b  f80a5af  2018-07-19 13:52:48  Markus Harrer
 7        0.0        0.0     e  fcf1498  2018-07-19 13:52:31  Markus Harrer
 9        1.0        0.0     b  7e6d738  2018-07-19 13:52:10  Markus Harrer
10        1.0        0.0     d  7e6d738  2018-07-19 13:52:10  Markus Harrer
12        0.0        0.0     d  2b4d97d  2018-07-19 13:51:14  Markus Harrer
14        1.0        0.0     b  732ebbb  2018-07-19 10:51:03  Markus Harrer
15        1.0        0.0     c  732ebbb  2018-07-19 10:51:03  Markus Harrer
17        1.0        0.0     a  72f5268  2018-07-19 10:50:49  Markus Harrer
18        1.0        0.0     b  72f5268  2018-07-19 10:50:49  Markus Harrer
20        2.0        0.0     a  f3c99c6  2018-07-19 10:50:38  Markus Harrer
21        1.0        0.0     b  f3c99c6  2018-07-19 10:50:38  Markus Harrer
23        0.0        0.0     c  5d5fba5  2018-07-19 10:50:13  Markus Harrer
25        0.0        0.0     a  eb668d1  2018-07-19 10:49:16  Markus Harrer
26        0.0        0.0     b  eb668d1  2018-07-19 10:49:16  Markus Harrer
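If you don't have the helper library at hand, you can still follow along. The sketch below rebuilds the two columns that matter for this analysis by hand from the table above. The names files_per_commit and commits_by_hand are made up for this illustration and are not part of the helper library.

```python
import pandas as pd

# Commit hash -> files changed in that commit, copied from the table above
# (newest commit first). The name files_per_commit is made up for this sketch.
files_per_commit = {
    "a2abe69": ["b", "d"],
    "f80a5af": ["a", "b"],
    "fcf1498": ["e"],
    "7e6d738": ["b", "d"],
    "2b4d97d": ["d"],
    "732ebbb": ["b", "c"],
    "72f5268": ["a", "b"],
    "f3c99c6": ["a", "b"],
    "5d5fba5": ["c"],
    "eb668d1": ["a", "b"],
}

# One row per (file, commit) combination, like the log data shown above
commits_by_hand = pd.DataFrame(
    [(file, sha) for sha, files in files_per_commit.items() for file in files],
    columns=["file", "sha"],
)

# Number of distinct commits per file as a quick sanity check
print(commits_by_hand.groupby("file")["sha"].nunique())
```

The sanity check at the end shows in how many commits each file was touched, which is handy for cross-checking the results later on.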

We see that some files often change together (like "a" and "b" or "b" and "d") and some files change completely on their own (like "e").

Let's first get rid of all the unneeded columns by keeping just the columns that we really need for this analysis.



In [2]:

    
commits = commits[['file', 'sha']]
commits.head()

Idea

In this analysis, we need to create a relationship from each changed file to all changed files within the same commit.

I tried different things there with various data transformations, but in the end, the following ~~stupid~~ straightforward approach worked best: We just assign to each file in a commit all files of the same commit and count the occurrence of these relationships.

This gives us the view on co-changing files that we want.
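Before turning to pandas, the idea can be sketched in plain Python. This is only an illustration with the commit contents typed in by hand from the table above. The names files_per_commit and pair_counts are made up for this sketch.

```python
from collections import Counter
from itertools import permutations

# Files per commit, copied by hand from the log table above
files_per_commit = {
    "a2abe69": ["b", "d"],
    "f80a5af": ["a", "b"],
    "fcf1498": ["e"],
    "7e6d738": ["b", "d"],
    "2b4d97d": ["d"],
    "732ebbb": ["b", "c"],
    "72f5268": ["a", "b"],
    "f3c99c6": ["a", "b"],
    "5d5fba5": ["c"],
    "eb668d1": ["a", "b"],
}

pair_counts = Counter()
for files in files_per_commit.values():
    # permutations(files, 2) yields every ordered pair of two different
    # entries, so both (a, b) and (b, a) are counted; a commit with a
    # single file yields no pair at all.
    pair_counts.update(permutations(files, 2))

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Counting ordered pairs keeps both directions of a relationship, which matters later on when we relate the counts to the changes of the first file in each pair.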

Analysis

To implement the idea above, we can use the pd.merge function of pandas to combine the commits DataFrame with itself. The key here is to use an outer join to expand each file in a commit (identified by its sha value) to all the files of that commit (again identified by the values in sha).



In [3]:

    
import pandas as pd

commit_counts = pd.merge(
    commits,
    commits,
    left_on='sha',
    right_on='sha',
    suffixes=['','_other'],
    how='outer')
commit_counts.head()

With this, e.g. the last commit (= first entries in the DataFrame) was expanded from

0   b   a2abe69
1   d   a2abe69

to

0   b   a2abe69     b
1   b   a2abe69     d
2   d   a2abe69     b
3   d   a2abe69     d

with the additional column of all files of the commit.
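A side note on the join type, as a small sketch on two of the commits from above: because both sides of the merge are the same DataFrame, every sha exists on both sides, so an inner join should produce the same rows as the outer join. The helper self_join is made up for this illustration.

```python
import pandas as pd

# Two commits from the log above: one with two files, one with a single file
two_commits = pd.DataFrame({
    "file": ["b", "d", "e"],
    "sha": ["a2abe69", "a2abe69", "fcf1498"],
})

def self_join(how):
    """Merge the frame with itself on sha and bring the rows into a fixed order."""
    joined = pd.merge(
        two_commits, two_commits, on="sha", suffixes=("", "_other"), how=how)
    return joined.sort_values(["file", "file_other"]).reset_index(drop=True)

outer = self_join("outer")
inner = self_join("inner")

print(outer)
# Both join types contain the same rows once the row order is normalized
print(outer.equals(inner))
```

The rows are sorted before the comparison because the join types may return them in a different order.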

Because we're only interested in co-changing files, we can filter out all entries that pair a file with itself.



In [4]:

    
commit_counts = commit_counts[commit_counts['file'] != commit_counts['file_other']]
commit_counts.head()

We can then count how often each pair of files occurs with the groupby command.



In [5]:

    
commit_coupling = commit_counts.groupby(['file', 'file_other']).count()
commit_coupling.head()

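We also want to know the number of all changes for each file to get the degree of the overall coupling between co-changing files. For this, we can use the groupby command on the file index level together with the transform method to sum up the co-change counts per file. Note that this sum only covers commits in which a file changed together with another file, because we filtered out the self-pairs in the previous step. In this repository, no commit touches more than two files, so the sum equals the number of those commits.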



In [6]:

    
commit_coupling['all_changes'] = commit_coupling.groupby(['file']).sha.transform('sum')
commit_coupling.head()

    Out[6]:

                 sha  all_changes
file file_other
a    b             4            4
b    a             4            7
     c             1            7
     d             2            7
c    b             1            1
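The all_changes column above is the sum of the co-change counts, so commits in which a file changed alone are not part of it. If you would rather relate the co-changes to all commits of a file, one possible variant (an assumption of this sketch, not the approach used in the rest of this post) is to count the distinct commits per file on the unfiltered data. The column names co_changes, commits_of_file and ratio_all_commits are made up for this illustration.

```python
import pandas as pd

# Files per commit, copied by hand from the log table above
files_per_commit = {
    "a2abe69": ["b", "d"], "f80a5af": ["a", "b"], "fcf1498": ["e"],
    "7e6d738": ["b", "d"], "2b4d97d": ["d"], "732ebbb": ["b", "c"],
    "72f5268": ["a", "b"], "f3c99c6": ["a", "b"], "5d5fba5": ["c"],
    "eb668d1": ["a", "b"],
}
commits = pd.DataFrame(
    [(file, sha) for sha, files in files_per_commit.items() for file in files],
    columns=["file", "sha"],
)

# Same pairing as in the post: self-merge on sha, then drop the self-pairs
pairs = pd.merge(commits, commits, on="sha", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]
coupling = (pairs.groupby(["file", "file_other"])["sha"]
            .count().rename("co_changes").reset_index())

# Assumption of this sketch: the denominator is the number of distinct
# commits that touched the file, including commits where it changed alone.
commits_per_file = commits.groupby("file")["sha"].nunique()
coupling["commits_of_file"] = coupling["file"].map(commits_per_file)
coupling["ratio_all_commits"] = (
    coupling["co_changes"] / coupling["commits_of_file"])
print(coupling)
```

With this denominator, "d" is coupled to "b" in 2 of its 3 commits and "c" in 1 of its 2 commits, because the commits 2b4d97d and 5d5fba5 each touched only a single file.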

We further calculate the ratio of each file pair's co-change count to the number of all changes of the first file. A high ratio is an indicator for pairs of files that change together very often.



In [7]:

    
commit_coupling['ratio'] = commit_coupling['sha'] / commit_coupling['all_changes']
commit_coupling

    Out[7]:

                 sha  all_changes     ratio
file file_other
a    b             4            4  1.000000
b    a             4            7  0.571429
     c             1            7  0.142857
     d             2            7  0.285714
c    b             1            1  1.000000
d    b             2            2  1.000000

As a last step, we do some housekeeping of the data to get a nicely sorted list of co-changing files.



In [8]:

    
# Flatten the index and sort by ratio (then by file name), highest first
coupling_list = commit_coupling.reset_index().sort_values(
    by=['ratio', 'file'], ascending=False)
# Give the count column a descriptive name and renumber the rows
coupling_list = coupling_list.rename(columns={"sha" : "co-changing"})
coupling_list = coupling_list.reset_index(drop=True)
coupling_list

    Out[8]:

   file  file_other  co-changing  all_changes     ratio
0     d           b            2            2  1.000000
1     c           b            1            1  1.000000
2     a           b            4            4  1.000000
3     b           a            4            7  0.571429
4     b           d            2            7  0.285714
5     b           c            1            7  0.142857

With this result, we can e.g. see in the first three rows that whenever the files "d", "c" or "a" changed together with another file, that other file was "b".

In detail, you can read and interpret the table like this:

Row with index 0: In every commit where "d" changed together with another file, that file was "b". This hints at a change dependency from the file "d" to the file "b". In other words: If one changes "d", something often has to be changed in "b", too. Keep in mind that all_changes only counts co-changes, so the commit 2b4d97d, in which "d" changed alone, doesn't lower this ratio.
Row with index 3: If "b" was changed, "a" was changed as well in 4 out of 7 cases (= commits). Together with the row with index 2, we can see that "a" always changes with "b", but "b" doesn't always change with "a".
Row with index 5: If "b" was changed, "c" was changed in one case. This shows a slight (or even negligible) dependency from "b" to "c" (and maybe even from "c" to "b", because a single shared commit could also be a coincidence).

If you are more into graphs, here are all the change relationships between the files with their ratio measure:

Note: The file "e" doesn't occur in the table because it never changes together with another file.

Visualizations

We can also try to draw a diagram that suits the tiny amount of data that we have. In our case, we use a D3 chord diagram to explore the coupling of co-changing files interactively. Pandas can output the data in the JSON format that the D3 visualization needs.



In [9]:

    
coupling_list[['file','file_other','co-changing']].to_json(
    "chord_coupling_data.json", orient='values')
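To get a feeling for what orient='values' produces, here is a small sketch with two rows taken from the coupling list above. It leaves out the file path, so to_json returns the JSON text instead of writing a file. The variable names are made up for this illustration.

```python
import json

import pandas as pd

# Two rows from the coupling list above
sample = pd.DataFrame({
    "file": ["d", "b"],
    "file_other": ["b", "a"],
    "co-changing": [2, 4],
})

# Without a path argument, to_json returns the JSON text
as_text = sample.to_json(orient="values")

# orient='values' drops the index and the column names: one list per row
rows = json.loads(as_text)
print(rows)
```

Each inner list has the form [file, file_other, co-changing], so the consuming script has to rely on the column order.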

The chord diagram gives us a hint about the inherent coupling of files based on co-changing.

You can find the interactive version of this visualization here.

Summary

We've seen that spotting co-changing files with the help of pandas is straightforward.

Doing it step by step also allows us to refine the analysis gradually to fit our own circumstances. For example, we could define co-changing files as files that change not only within the same commit, but also on the same day. We could also look for peaks of co-changing activity that could point us to chaotic changes in the code.
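As a rough sketch of the same-day idea: instead of pairing files via sha, we could pair them via the calendar day of the timestamp. The data below is hypothetical (four single-file commits on two days) and not taken from the synthetic repository, and all names are made up for this illustration.

```python
import pandas as pd

# Hypothetical log: four single-file commits spread over two days
log = pd.DataFrame({
    "file": ["a", "b", "c", "a"],
    "sha": ["c1", "c2", "c3", "c4"],
    "timestamp": pd.to_datetime([
        "2018-07-19 10:00:00", "2018-07-19 15:30:00",
        "2018-07-20 09:00:00", "2018-07-20 11:00:00",
    ]),
})

# Cut off the time of day so that all commits of a day share one key
log["day"] = log["timestamp"].dt.normalize()

# Count each file at most once per day, then pair files within a day
per_day = log[["file", "day"]].drop_duplicates()
pairs = pd.merge(per_day, per_day, on="day", suffixes=("", "_other"))
pairs = pairs[pairs["file"] != pairs["file_other"]]

day_coupling = pairs.groupby(["file", "file_other"])["day"].count()
print(day_coupling)
```

On a per-commit basis, none of these files would be coupled at all. With the day as the key, "a" shows up together with "b" on one day and with "c" on another.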

But for now, we leave it there.
