Introduction

In my previous blog post, we analyzed the similarity within and across modules using only the change data of each source code file.

In this analysis, we use the same data analysis approach, but visualize the result in a 3D scatter plot.

Data Wrangling

We repeat the steps explained in the previous blog post. The only difference is that we reduce the distance matrix to a 3D representation instead of a 2D one.



In [103]:

    
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS
import numpy as np
from matplotlib import cm
from matplotlib.colors import rgb2hex
import ipyvolume as ipv

# read, filter and prepare data
git_log = pd.read_csv("https://git.io/Jez2h")
prod_code = git_log.copy()
prod_code = prod_code[prod_code.file.str.endswith(".java")]
prod_code = prod_code[prod_code.file.str.startswith("backend/src/main")]
prod_code = prod_code[~prod_code.file.str.endswith("package-info.java")]
prod_code['hit'] = 1

# pivot table to get a change vector per file
commit_matrix = prod_code.reset_index().pivot_table(
    index='file',
    columns='sha',
    values='hit',
    fill_value=0)
# peek at a slice of the matrix (only displayed if it is the last expression of a cell)
commit_matrix.iloc[0:5, 50:55]

# calculate distance between files based on changes
dissimilarity_matrix = cosine_distances(commit_matrix)

# break down matrix to 3D representation
model = MDS(dissimilarity='precomputed', random_state=0, n_components=3)
dissimilarity_3d = model.fit_transform(dissimilarity_matrix)

# extract module names
dissimilarity_3d_df = pd.DataFrame(
    dissimilarity_3d,
    index=commit_matrix.index,
    columns=["x", "y", "z"])
dissimilarity_3d_df['module'] = dissimilarity_3d_df.index.str.split("/").str[6]
dissimilarity_3d_df.head()









    Out[103]:

file                                                                                       x          y          z  module
backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java      0.204507  -0.500405   0.328621  comment
backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java   0.420250  -0.287899   0.239667  comment
backend/src/main/java/at/dropover/comment/boundary/CommentData.java                 0.384984  -0.418287   0.185055  comment
backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java      0.305661  -0.373012   0.299084  comment
backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java     0.176830  -0.580814   0.123574  comment
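
To make the distance calculation above more tangible, here is a minimal, dependency-free sketch of the same idea on toy data. The file names and commit IDs are made up for illustration, and the hand-written cosine_distance function only mirrors what scikit-learn's cosine_distances computes for a pair of non-zero vectors. It is not the code used in the notebook.

```python
import math

# Toy change data (made up): which commits touched which files.
changes = {
    "comment/AddComment.java": {"c1", "c2", "c3"},
    "comment/CommentData.java": {"c1", "c2"},
    "todo/TodoList.java": {"c4", "c5"},
}
commits = sorted(set().union(*changes.values()))


def change_vector(file):
    """One 0/1 entry per commit, like one row of commit_matrix."""
    return [1 if sha in changes[file] else 0 for sha in commits]


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


vectors = {f: change_vector(f) for f in changes}
same_module = cosine_distance(
    vectors["comment/AddComment.java"], vectors["comment/CommentData.java"])
other_module = cosine_distance(
    vectors["comment/AddComment.java"], vectors["todo/TodoList.java"])

print(round(same_module, 3), round(other_module, 3))  # prints: 0.184 1.0
```

Files that were mostly changed in the same commits end up with a small distance, while files that never changed together get the maximum distance of 1. MDS then tries to place the files in 3D so that these pairwise distances are preserved as well as possible.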

Visualization

This part is new: we assign a color to each module, taken from Matplotlib's Spectral colormap.



In [104]:

    
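# one row per distinct module (copy to avoid modifying a view of the original frame)
modules = dissimilarity_3d_df[['module']].drop_duplicates().copy()
# pick evenly spaced RGBA colors from the Spectral colormap, one per module
rgb_colors = [x for x in cm.Spectral(np.linspace(0,1,len(modules)))]
modules['color'] = rgb_colors
modules = modules.set_index("module", drop=True)
# look up the color of each file via its module name
dissimilarity_3d_df['color'] = dissimilarity_3d_df['module'].map(modules['color'].to_dict())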
dissimilarity_3d_df.head()









    Out[104]:

file                                                                                       x          y          z  module   color
backend/src/main/java/at/dropover/comment/boundary/AddCommentRequestModel.java      0.204507  -0.500405   0.328621  comment  [0.6196078431372549, 0.00392156862745098, 0.25...
backend/src/main/java/at/dropover/comment/boundary/ChangeCommentRequestModel.java   0.420250  -0.287899   0.239667  comment  [0.6196078431372549, 0.00392156862745098, 0.25...
backend/src/main/java/at/dropover/comment/boundary/CommentData.java                 0.384984  -0.418287   0.185055  comment  [0.6196078431372549, 0.00392156862745098, 0.25...
backend/src/main/java/at/dropover/comment/boundary/GetCommentRequestModel.java      0.305661  -0.373012   0.299084  comment  [0.6196078431372549, 0.00392156862745098, 0.25...
backend/src/main/java/at/dropover/comment/boundary/GetCommentResponseModel.java     0.176830  -0.580814   0.123574  comment  [0.6196078431372549, 0.00392156862745098, 0.25...
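
The color lookup can also be sketched without pandas or Matplotlib. The following is a simplified stand-in, not the notebook's method: it uses evenly spaced HSV hues from the standard library's colorsys module instead of the Spectral colormap, and it emits hex strings. All module names except "comment" are hypothetical, as are the helper name module_colors and the saturation and brightness values.

```python
import colorsys

# Hypothetical module names; in the notebook they come from the 'module' column.
module_names = ["comment", "todo", "user", "mail"]


def module_colors(names):
    """Map each distinct name to a hex color with evenly spaced hues."""
    distinct = sorted(set(names))
    colors = {}
    for i, name in enumerate(distinct):
        # saturation 0.8 and value 0.9 are arbitrary choices for readable colors
        r, g, b = colorsys.hsv_to_rgb(i / len(distinct), 0.8, 0.9)
        colors[name] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255))
    return colors


palette = module_colors(module_names)

# Each file gets the color of its module, so files of one module look alike.
file_modules = ["comment", "comment", "todo"]
file_colors = [palette[m] for m in file_modules]
print(file_colors)
```

The important property is the same as in the notebook: the mapping goes from module to color, so every file of a module shares one color and different modules get different ones.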

And then, we visualize this data with ipyvolume.



In [106]:

    
x = dissimilarity_3d_df['x']
y = dissimilarity_3d_df['y']
z = dissimilarity_3d_df['z']
color = dissimilarity_3d_df['color'].values.tolist()
ipv.quickscatter(x, y, z, color=color, size=7, marker="sphere")

Summary

That's it! A nice 3D representation of our little software program.

We see some spheres with the same color close to each other. These are files of the same module that were changed together in the first place. But there are also some mixed-up areas. The reasons for this are explained here.
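
The visual impression of "same color, close together" could also be checked with numbers. Here is a rough sketch under stated assumptions: the first three points are rows of the coordinate table shown earlier, while the "todo" module and its coordinates are made up purely for illustration. It compares the mean Euclidean distance between files of the same module with the mean distance between files of different modules.

```python
import math
from itertools import combinations

points = [
    # (x, y, z, module); these three rows are taken from the table above
    (0.204507, -0.500405, 0.328621, "comment"),
    (0.420250, -0.287899, 0.239667, "comment"),
    (0.384984, -0.418287, 0.185055, "comment"),
    # made-up coordinates for a hypothetical second module
    (-0.40, 0.30, -0.10, "todo"),
    (-0.35, 0.45, -0.20, "todo"),
    (-0.50, 0.25, 0.05, "todo"),
]

within, across = [], []
for (*a, mod_a), (*b, mod_b) in combinations(points, 2):
    d = math.dist(a, b)  # Euclidean distance in the 3D embedding
    (within if mod_a == mod_b else across).append(d)

mean_within = sum(within) / len(within)
mean_across = sum(across) / len(across)
print(f"within modules: {mean_within:.3f}, across modules: {mean_across:.3f}")
```

If the mean distance within modules is clearly smaller than the mean distance across modules, the modules form separate clusters. Values that are close to each other would hint at the mixed-up areas mentioned above.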
