I recently watched Michael Feathers' talk about Strategic Code Deletion. Michael said (among other very good things) that if we want to delete code, we have to know how our code is actually used.
In this post, I want to show you how you can very easily gather some data and create insights about unused code that can be deleted.
There are many ways to collect data that fits our needs. You could use some kind of logging or performance profiling. But the easiest way is to use a code coverage tool like JaCoCo or Cobertura (for more tools, have a look at this comparison) for measuring the coverage in production. The common use of these tools is measuring the coverage of the execution paths through tests. But they can also measure the execution paths triggered by real users.
In this post, we'll use the coverage tool JaCoCo. It's easy to integrate into test execution as well as production systems. You don't have to instrument your code or create a special artifact; you simply add it as a Java agent to your servlet container or application server.
As an example application, we use a Spring Petclinic fork that runs in a Tomcat servlet container. We simply build the project and put the WAR file into Tomcat's webapps directory. Then we add JaCoCo as a Java agent by setting the JAVA_OPTS variable in conf/catalina.bat (on Windows systems):
set "JAVA_OPTS=%JAVA_OPTS% -javaagent:<path_to_lib>\jacocoagent.jar=destfile=C:\Temp\jacoco.exec"
After this, we fire up Tomcat and click around in the application, simulating "production usage" (in my case I booked a visit for an owner's pet). After shutting down Tomcat, we get a jacoco.exec file with the information needed.
JaCoCo is usually integrated into the build itself via Maven or Ant to create the report. But there is a nice little library that allows us to do it at the command line. You can download it via the all-in-one snapshot ZIP package. There is also a great documentation site available that leaves no question unanswered.
In the end, the command for the report generation is long, but not very special:
java -jar jacococli.jar report <path_to_jacoco.exec> --classfiles <path_to_classfiles_dir> --csv jacoco.csv
This will give you a nice little CSV file with all the coverage information at class level.
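If you prefer a browsable report for a first impression, the same command can also produce HTML output via the CLI's --html option (the target directory name here is just an example):
java -jar jacococli.jar report <path_to_jacoco.exec> --classfiles <path_to_classfiles_dir> --html report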
In [1]:
JACOCO_CSV_FILE = r'input/spring-petclinic/jacoco.csv'
with open(JACOCO_CSV_FILE) as log:
    for line in log.readlines()[:4]:
        print(line, end='')
It contains the package and class name as well as various measures that show us the coverage.
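For orientation: the header line of JaCoCo's CSV export lists these columns (as defined by the JaCoCo CSV report format):
GROUP,PACKAGE,CLASS,INSTRUCTION_MISSED,INSTRUCTION_COVERED,BRANCH_MISSED,BRANCH_COVERED,LINE_MISSED,LINE_COVERED,COMPLEXITY_MISSED,COMPLEXITY_COVERED,METHOD_MISSED,METHOD_COVERED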
Next, we fire up our favorite data analysis framework Pandas and read in the file.
In [2]:
import pandas as pd
coverage = pd.read_csv(JACOCO_CSV_FILE)
coverage.head(3)
Out[2]:
Nice, Pandas recognizes the format of the CSV file automagically!
We don't need all the information in the CSV file for our purpose, so we just get the relevant columns.
In [3]:
coverage = coverage[['PACKAGE', 'CLASS', 'LINE_MISSED', 'LINE_COVERED']]
coverage.head()
Out[3]:
Let's add some custom measures. For the size of a class, we simply add both line counts together. Based on this, we can also calculate the ratio between covered lines and all lines of a class. With this information, we can see which features are used (high values) and which are not in use (low values).
In [4]:
coverage['line_size'] = coverage['LINE_MISSED'] + coverage['LINE_COVERED']
coverage['line_covered_ratio'] = coverage['LINE_COVERED'] / coverage['line_size']
coverage.head()
Out[4]:
That's all we need! Let's get to the interesting stuff!
The first question we are asking is: Are there any unused features in our application?
In our little example program, we can derive some information about the features in the application based on the class names. We create a list of possible features and check the class names against it. All other code that doesn't match gets the keyword "Framework" assigned. Note that the loop order matters: if a class name matched several feature names, the last one in the list would win.
For big software systems, it's not so easy to derive the implemented features from the naming scheme alone, because the naming is usually a mess there. But there are more sophisticated ways to do this, e.g. with jQAssistant, which I will cover in another post.
In [18]:
features = ['Owner', 'Pet', 'Visit', 'Vet', 'Specialty', 'Clinic']
for feature in features:
    coverage.loc[coverage['CLASS'].str.contains(feature), 'feature'] = feature
coverage.loc[coverage['feature'].isnull(), 'feature'] = "Framework"
coverage[['CLASS', 'feature']].head()
Out[18]:
By grouping the classes by their feature and averaging the line covered ratios, we can approximate the usage of each feature.
In [19]:
feature_usage = coverage.groupby('feature')[['line_covered_ratio']].mean().sort_values(by='line_covered_ratio')
feature_usage
Out[19]:
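A small caveat: this average weights a tiny class the same as a huge one. If we wanted the usage weighted by class size, we could aggregate the raw line counts per group instead (a sketch; the variable name is just for illustration):
feature_usage_weighted = coverage.groupby('feature').apply(
    lambda g: g['LINE_COVERED'].sum() / g['line_size'].sum())
feature_usage_weighted.sort_values()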
We can see that the "Vet" feature isn't used very often. If we want to know which classes are affected, we can filter depending on the corresponding feature.
In [7]:
classes_to_delete_by_feature = coverage[coverage['feature'] == feature_usage.index[0]][['PACKAGE', 'CLASS', 'line_covered_ratio', 'line_size']]
classes_to_delete_by_feature
Out[7]:
Based on this list, we can approximate how much code we could save if we deleted that feature with all its classes.
In [22]:
classes_to_delete_by_feature['line_size'].sum() / coverage['line_size'].sum()
Out[22]:
Result: Almost 10% of the code base can be removed if we give up the "Vet" feature.
The second question is: Are there any unused technical components in our application? In our case, we can derive the technology used from the last part of the package name.
In [23]:
coverage['technology'] = coverage['PACKAGE'].str.split(".").str.get(-1)
coverage[['PACKAGE', 'technology']].head()
Out[23]:
Looks good! Then, same game as before, we simply group the classes by the new information.
In [24]:
technology_usage = coverage.groupby('technology')[['line_covered_ratio']].mean().sort_values(by='line_covered_ratio')
technology_usage
Out[24]:
There is one part that isn't used at all: "jdbc". Chances are good that we won't need that technology anymore in the long term. Let's see how many classes we could probably delete here.
In [25]:
classes_to_delete_by_technology = coverage[coverage['technology'] == technology_usage.index[0]][['PACKAGE', 'CLASS', 'line_covered_ratio', 'line_size']]
classes_to_delete_by_technology
Out[25]:
Again, we calculate the share of these classes relative to our whole application.
In [26]:
classes_to_delete_by_technology['line_size'].sum() / coverage['line_size'].sum()
Out[26]:
Result: 32% of the code base could be removed by simply deleting the unused, technology-specific part.
In summary, we can fairly say that we can delete this amount of code:
In [13]:
print("{:.0%}".format(
(classes_to_delete_by_feature['line_size'].sum() +
classes_to_delete_by_technology['line_size'].sum()) /
coverage['line_size'].sum()))
42%! I mean, this is the answer! The answer to (keeping legacy systems a-)live :-D !
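One caveat before celebrating: the two class lists may overlap (think of a JDBC repository implementation belonging to the "Vet" feature), and such classes would be counted twice in the sum above. We can check for double-counted classes with a quick merge over the two DataFrames from above (a sketch):
overlap = pd.merge(
    classes_to_delete_by_feature,
    classes_to_delete_by_technology,
    on=['PACKAGE', 'CLASS'])
overlap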
But there is one more thing!
We can easily visualize the whole code base including the production coverage measures in a nice hierarchical bubble chart with the JavaScript visualization library D3 to support even more exploration. At this point, kudos to Mike Bostock and Adam Thornhill for their excellent examples!
For the visualization, we need something hierarchical to display. We use the information in the PACKAGE column, which contains the Java package name with dots as separators and thus fits perfectly as hierarchical data. We also need a measure for the size of the bubbles; here we can use the line_size information that we calculated before. Additionally, we use the color of the bubbles as an indicator of "how much is something covered". For this, we use a matplotlib colormap to assign colors from blue (=cold) to red (=hot) depending on the line_covered_ratio (including the conversion to hex color codes, because this is what D3 needs).
So let's get a new color column first:
In [14]:
import matplotlib.cm as cm
import matplotlib.colors

def assign_rgb_color(value):
    color_code = cm.coolwarm(value)
    return matplotlib.colors.rgb2hex(color_code)

plot_data = coverage.copy()
plot_data['color'] = plot_data['line_covered_ratio'].apply(assign_rgb_color)
plot_data[['line_covered_ratio', 'color']].head(5)
Out[14]:
OK, let's transform our data frame into the hierarchical JSON format "flare" that is needed by D3 to visualize the bubble chart. We just match our columns to more generic names that are used by D3 later on for displaying the various measures:
In [15]:
import json
def create_flare_json(data,
                      column_name_with_hierarchical_data,
                      separator=".",
                      name_column="name",
                      size_column="size",
                      color_column="color"):

    json_data = {}
    json_data['name'] = 'flare'
    json_data['children'] = []

    for row in data.iterrows():
        series = row[1]
        hierarchical_data = series[column_name_with_hierarchical_data]

        last_children = None
        children = json_data['children']

        for part in hierarchical_data.split(separator):
            entry = None

            # build up the tree: look for an existing node with this name
            for child in children:
                if "name" in child and child["name"] == part:
                    entry = child
            if not entry:
                entry = {}
                children.append(entry)

            # create a new entry section
            entry['name'] = part
            if 'children' not in entry:
                entry['children'] = []
            children = entry['children']
            last_children = children

        # add data to leaf node
        last_children.append({
            'name': series[name_column],
            'size': series[size_column],
            'color': series[color_column]
        })

    return json_data
json_data = create_flare_json(plot_data, "PACKAGE", ".", "CLASS", "line_size")
print(json.dumps(json_data, indent=3)[0:1000])
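Schematically, the beginning of the output looks like this (the concrete package names and values depend on your coverage data):
{
   "name": "flare",
   "children": [
      {
         "name": "org",
         "children": [
            {
               "name": "springframework",
               "children": [ ... ]
            }
         ]
      }
   ]
}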
Finally, we save the produced JSON data into a file for later displaying.
In [16]:
FLARE_JSON_FILE = r'vis/flare.json'
with open(FLARE_JSON_FILE, mode='w', encoding='utf-8') as json_file:
    json_file.write(json.dumps(json_data, indent=3))
Together with some code for the D3 bubble chart visualization, this gives us a nice visualization of the production coverage in our code.
In the overview, you can quickly get to the interesting parts of the software. The heavily used parts are colored in red ("hot spots"), the unused parts are blue ("cold").
We can also deep dive to view all models (where we see the seldom used "Vet" feature)...
...as well as the unused code of the JDBC part.
You can even have a look at the live example.
I don't know what you think (I would love to hear it!), but in my opinion, it's not much work to do this kind of analysis. For the first proof of concept, I needed around an hour from gathering the coverage data to the first bubble chart (including all the coding).
The key is to get enough representative data, so you have to enable JaCoCo directly in your production environments. The analysis part is easy.
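One practical note: in production you probably don't want to shut down the server just to flush the coverage data. The JaCoCo agent can also serve the data over TCP (a sketch, using the agent's tcpserver output mode and its default port 6300):
set "JAVA_OPTS=%JAVA_OPTS% -javaagent:<path_to_lib>\jacocoagent.jar=output=tcpserver,port=6300"
You can then fetch a snapshot at any time without a restart:
java -jar jacococli.jar dump --address localhost --port 6300 --destfile jacoco.exec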