Analyzer is a Python program that tries to gauge the evolvability and maintainability of Java software. To achieve this, it measures the complexity of the software under evaluation.
A. What is software evolvability and maintainability?
We define software evolvability as the ease with which a software system or a component can evolve while preserving its design as much as possible. In the case of OO class libraries, we restrict the preservation of the design to the preservation of the library interface. This is important when we consider that the evolution of a system that uses a library is directly influenced by the evolvability of the library. For instance, a system that uses version i of a library can easily be upgraded with version i+1 of the same library if the new version preserves the interface of the older one.
B. What is software complexity?
As the Wikipedia article on programming complexity (https://en.wikipedia.org/wiki/Programming_complexity) states:
"As the number of entities increases, the number of interactions between them would increase exponentially, and it would get to a point where it would be impossible to know and understand all of them. Similarly, higher levels of complexity in software increase the risk of unintentionally interfering with interactions and so increases the chance of introducing defects when making changes. In more extreme cases, it can make modifying the software virtually impossible."
C. How can we measure software complexity?
To measure software complexity, we have to break it down into metrics. We therefore propose to use the metrics put forward by Sanjay Misra and Ferid Cafer in their paper 'ESTIMATING COMPLEXITY OF PROGRAMS IN PYTHON LANGUAGE'. To quote from this paper:
"Complexity of a system depends on the following factors:
Within the Analyzer program, we try to measure complexity using the following metrics: the number of commits (and distinct committers) per project, the number of times each class is referenced, the inheritance count, the number of lines and methods per class, the Halstead complexity measures, and the cyclomatic complexity.
D. Interpreting the metrics
Now we interpret these measures by clustering (grouping together) the results of analyzing 134 open-source Apache Java projects. To do this, we use the k-means algorithm, a classic machine-learning algorithm originally proposed in 1957. Clustering is an unsupervised learning technique for exploring data: it groups similar software projects together, so that we can examine the trends in each cluster independently.
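As a minimal illustration of what k-means does (separate from the analysis below, and with made-up feature values), it assigns each project to the nearest of k centroids and iteratively recomputes those centroids:

import numpy as np
from sklearn.cluster import KMeans

# hypothetical feature vectors: [lines of code, number of methods, cyclomatic complexity]
toy_features = np.array([[12000, 800, 1500],
                         [13000, 900, 1600],
                         [250000, 9000, 30000],
                         [240000, 8500, 29000]])
km = KMeans(n_clusters=2, random_state=0).fit(toy_features)
print km.labels_           # cluster assignment per project, e.g. [1 1 0 0]
print km.cluster_centers_  # mean feature vector of each cluster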
In [1]:
# Imports and directives
%matplotlib inline
import numpy as np
from math import log
import matplotlib.pyplot as plt
from matplotlib.mlab import PCA as mlabPCA
import javalang
import os, re, requests, zipfile, json, operator
from collections import Counter
import colorsys
import random
from StringIO import StringIO
from subprocess import Popen, PIPE
from sklearn.cluster import KMeans
from tabulate import tabulate
from sklearn import svm
In [2]:
# Variables
USER = 'apache' # github user of the repo that is analysed
REPO = 'tomcat' # repository to investigate
BASE_PATH = '/Users/philippepossemiers/Documents/Dev/Spark/data/analyzer/' # local expansion path
COMMENT_LINES = ['/*', '//', '*/', '* '] # markers of comment lines to strip from the code
KEY_WORDS = ['abstract','continue','for','new','switch','assert','default','goto','synchronized',
'boolean','do','if','private','this','break','double','implements','protected','throw',
'byte','else','public','throws','case','enum','instanceof','return','transient',
'catch','extends','int','short','try','char','final','interface','static','void',
'class','finally','long','strictfp','volatile','const','float','native','super','while',
'true','false','null']
TOP = 25 # number of items to show in graphs
# list of operators to find in source code
OPERATORS = ['\+\+','\-\-','\+=','\-=','\*\*','==','!=','>=','<=','\+','=','\-','\*','/','%','!','&&', \
'\|\|','\?','instanceof','~','<<','>>','>>>','&','\^','<','>']
# list of variable types to find in source code
OPERANDS = ['boolean','byte','char','short','int','long','float','double','String']
GIT_COMMIT_FIELDS = ['author_name', 'committer name', 'date', 'message'] # one entry per field in GIT_LOG_FORMAT
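# %x1f (ASCII unit separator) splits the fields of one commit, %x1e (ASCII record separator) splits commits in the git log output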
GIT_LOG_FORMAT = ['%an', '%cn', '%ad', '%s']
GIT_LOG_FORMAT = '%x1f'.join(GIT_LOG_FORMAT) + '%x1e'
In [3]:
# List of Apache Java projects on github
APACHE_PROJECTS = ['abdera', 'accumulo', 'ace', 'activemq', 'airavata', 'ambari', 'ant', 'ant-antlibs-antunit', \
'any23', 'archiva', 'aries', 'webservices-axiom', 'axis2-java', \
'bigtop', 'bookkeeper', 'bval', 'calcite', 'camel', 'cassandra', 'cayenne', \
'chainsaw', 'chukwa', 'clerezza', 'commons-bcel', \
'commons-beanutils', 'commons-bsf', 'commons-chain', 'commons-cli', 'commons-codec', \
'commons-collections', 'commons-compress', 'commons-configuration', 'commons-daemon', \
'commons-dbcp', 'commons-dbutils', 'commons-digester', 'commons-discovery', \
'commons-email', 'commons-exec', 'commons-fileupload', 'commons-functor', 'httpcomponents-client', \
'commons-io', 'commons-jci', 'commons-jcs', 'commons-jelly', 'commons-jexl', 'commons-jxpath', \
'commons-lang', 'commons-launcher', 'commons-logging', 'commons-math', \
'commons-net', 'commons-ognl', 'commons-pool', 'commons-proxy', 'commons-rng', 'commons-scxml', \
'commons-validator', 'commons-vfs', 'commons-weaver', 'continuum', 'crunch', \
'ctakes', 'curator', 'cxf', 'derby', 'directmemory', \
'directory-server', 'directory-studio', 'drill', 'empire-db', 'falcon', 'felix', 'flink', \
'flume', 'fop', 'directory-fortress-core', 'ftpserver', 'geronimo', 'giraph', 'gora', \
'groovy', 'hadoop', 'hama', 'harmony', 'hbase', 'helix', 'hive', 'httpcomponents-client', \
'httpcomponents-core', 'jackrabbit', 'jena', 'jmeter', 'lens', 'log4j', \
'lucene-solr', 'maven', 'maven-doxia', 'metamodel', 'mina', 'mrunit', 'myfaces', 'nutch', 'oozie', \
'openjpa', 'openmeetings', 'openwebbeans', 'orc', 'phoenix', 'pig', 'poi','rat', 'river', \
'shindig', 'sling', \
'sqoop', 'struts', 'synapse', 'syncope', 'tajo', 'tika', 'tiles', 'tomcat', 'tomee', \
'vxquery', 'vysper', 'whirr', 'wicket', 'wink', 'wookie', 'xmlbeans', 'zeppelin', 'zookeeper']
print len(APACHE_PROJECTS)
In [4]:
# Global dictionaries
joined = [] # list with all source files
commit_dict = {} # commits per class
reference_dict = {} # number of times a class is referenced
lines_dict = {} # number of lines per class
methods_dict = {} # number of functions per class
operators_dict = {} # number of operators per class
operands_dict = {} # number of operands per class
halstead_dict = {} # Halstead complexity measures
cyclomatic_dict = {} # cyclomatic complexity
In [5]:
# Utility functions
# TODO : check if we can use this
def sanitize(contents):
lines = contents.split('\n')
# remove stop lines
for stop_line in COMMENT_LINES:
lines = [line.lower().lstrip().replace(';', '') for line in lines if stop_line not in line and line != '']
return '\n'.join(lines)
def find_whole_word(word):
return re.compile(r'\b({0})\b'.format(word), flags=re.IGNORECASE).search
def all_files(directory):
for path, dirs, files in os.walk(directory):
for f in files:
yield os.path.join(path, f)
def build_joined(repo):
src_list = []
repo_url = 'https://github.com/' + repo[0] + '/' + repo[1]
os.chdir(BASE_PATH)
os.system('git clone {}'.format(repo_url))
# get all java source files
src_files = [f for f in all_files(BASE_PATH + repo[1]) if f.endswith('.java')]
for f in src_files:
try:
# read contents
code = open(f, 'r').read()
# https://github.com/c2nes/javalang
tree = javalang.parse.parse(code)
# create tuple with package + class name and code + tree + file path
src_list.append((tree.package.name + '.' + tree.types[0].name, (code, tree, f)))
except:
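# skip source files that javalang fails to parse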
pass
return src_list
def parse_git_log(repo_dir, src):
# first the dictionary with all classes
# and their commit count
total = 0
p = Popen('git log --name-only --pretty=format:', shell=True, stdout=PIPE, cwd=repo_dir)
(log, _) = p.communicate()
log = log.strip('\n\x1e').split('\x1e')
log = [r.strip().split('\n') for r in log]
log = [r for r in log[0] if '.java' in r]
log2 = []
for f1 in log:
for f2 in src:
if f2[1][2].find(f1) > -1:
log2.append(f2[0])
cnt_dict = Counter(log2)
for key, value in cnt_dict.items():
total += value
cnt_dict['total'] = total
# and then the list of commits as dictionaries
p = Popen('git log --format="%s"' % GIT_LOG_FORMAT, shell=True, stdout=PIPE, cwd=repo_dir)
(log, _) = p.communicate()
log = log.strip('\n\x1e').split("\x1e")
log = [row.strip().split("\x1f") for row in log]
log = [dict(zip(GIT_COMMIT_FIELDS, row)) for row in log]
# now get list of distinct committers
committers = len(set([x['committer name'] for x in log]))
cnt_dict['committers'] = committers
return cnt_dict
def count_inheritance(src):
count = 0
for name, tup in src:
if find_whole_word('extends')(tup[0]):
count += 1
return count
def count_references(src):
names, tups = zip(*src)
dict = {e : 0 for i, e in enumerate(names)}
total = 0
for name in names:
c_name = name[name.rfind('.') + 1:] # simple class name, without the package prefix
for tup in tups:
if find_whole_word(c_name)(tup[0]):
dict[name] += 1
total += 1
dict['total'] = total
# keep only classes that are referenced more than once
return {k: v for k, v in dict.iteritems() if v > 1}
def count_lines(src):
dict = {name: 0 for name, tup in src}
total = 0
for name, tup in src:
dict[name] = 0
lines = tup[0].split('\n')
for line in lines:
if line != '\n':
dict[name] += 1
total += 1
dict['total'] = total
# return the per-class line counts
return {k: v for k, v in dict.iteritems()}
# constructors not counted
def count_methods(src):
dict = {name: 0 for name, tup in src}
total = 0
for name, tup in src:
dict[name] = len(tup[1].types[0].methods)
total += dict[name]
dict['total'] = total
# return the per-class method counts
return {k: v for k, v in dict.iteritems()}
def count_operators(src):
dict = {key: 0 for key in OPERATORS}
for name, tup in src:
for op in OPERATORS:
# if operator is in list, match it without anything preceding or following it
# eg +, but not ++ or +=
if op in ['\+','\-','!','=']:
# regex excludes followed_by (?!) and preceded_by (?<!)
dict[op] += len(re.findall('(?!\-|\*|&|>|<|>>)(?<!\-|\+|=|\*|&|>|<)' + op, tup[0]))
else:
dict[op] += len(re.findall(op, tup[0]))
# TODO : correct bug with regex for the '++'
dict['\+'] -= dict['\+\+']
total = 0
distinct = 0
for key in dict:
if dict[key] > 0:
total += dict[key]
distinct += 1
dict['total'] = total
dict['distinct'] = distinct
return dict
def count_operands(src):
dict = {key: 0 for key in OPERANDS}
for name, tup in src:
lines = tup[0].split('\n')
for line in lines:
for op in OPERANDS:
if op in line:
dict[op] += 1 + line.count(',')
total = 0
distinct = 0
for key in dict:
if dict[key] > 0:
total += dict[key]
distinct += 1
dict['total'] = total
dict['distinct'] = distinct
return dict
def calc_cyclomatic_complexity(src):
dict = {}
total = 0
for name, tup in src:
dict[name] = 1
dict[name] += len(re.findall(r'\b(if|else|for|switch|while)\b', tup[0])) # whole words only, so e.g. 'format' is not counted as 'for'
total += dict[name]
dict['total'] = total
# return the per-class complexity counts
return {k: v for k, v in dict.iteritems()}
def make_hbar_plot(dictionary, title, x_label, top=TOP):
# show top classes
vals = sorted(dictionary.values(), reverse=True)[:top]
lbls = sorted(dictionary, key=dictionary.get, reverse=True)[:top]
# make plot
fig = plt.figure(figsize=(10, 7))
fig.suptitle(title, fontsize=15)
ax = fig.add_subplot(111)
# set ticks
y_pos = np.arange(len(lbls)) + 0.5
ax.barh(y_pos, vals, align='center', alpha=0.4, color='lightblue')
ax.set_yticks(y_pos)
ax.set_yticklabels(lbls)
ax.set_xlabel(x_label)
plt.show()
pass
# Clustering
def random_centroid_selector(total_clusters , clusters_plotted):
random_list = []
for i in range(0, clusters_plotted):
random_list.append(random.randint(0, total_clusters - 1))
return random_list
def plot_cluster(kmeansdata, centroid_list, names, num_cluster, title):
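# use PCA to project the feature vectors onto their leading principal components so the clusters can be plotted in 2D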
mlab_pca = mlabPCA(kmeansdata)
cutoff = mlab_pca.fracs[1]
users_2d = mlab_pca.project(kmeansdata, minfrac=cutoff)
centroids_2d = mlab_pca.project(centroid_list, minfrac=cutoff)
# make plot
fig = plt.figure(figsize=(20, 15))
fig.suptitle(title, fontsize=15)
ax = fig.add_subplot(111)
plt.xlim([users_2d[:, 0].min() - 3, users_2d[:, 0].max() + 3])
plt.ylim([users_2d[:, 1].min() - 3, users_2d[:, 1].max() + 3])
random_list = random_centroid_selector(num_cluster, 50)
for i, position in enumerate(centroids_2d):
if i in random_list:
plt.scatter(centroids_2d[i, 0], centroids_2d[i, 1], marker='o', c='red', s=100)
for i, position in enumerate(users_2d):
plt.scatter(users_2d[i, 0], users_2d[i, 1], marker='o', c='lightgreen')
for label, x, y in zip(names, users_2d[:, 0], users_2d[:, 1]):
ax.annotate(
label,
xy = (x, y), xytext=(-15, 15),
textcoords = 'offset points', ha='right', va='bottom',
bbox = dict(boxstyle='round,pad=0.5', fc='white', alpha=0.5),
arrowprops = dict(arrowstyle = '->', connectionstyle='arc3,rad=0'))
pass
In [6]:
# first build list of source files
joined = build_joined((USER, REPO))
In [16]:
commit_dict = parse_git_log(BASE_PATH + REPO, joined)
make_hbar_plot(commit_dict, 'Commit frequency', 'Commits', TOP)
In [ ]:
print 'Distinct committers : ' + str(commit_dict['committers'])
In [20]:
reference_dict = count_references(joined)
make_hbar_plot(reference_dict, 'Top 25 referenced classes', 'References', TOP)
In [21]:
inheritance_count = count_inheritance(joined)
print 'Inheritance count : ' + str(inheritance_count)
In [22]:
lines_dict = count_lines(joined)
make_hbar_plot(lines_dict, 'Largest 25 classes', 'Lines of code', TOP)
In [23]:
methods_dict = count_methods(joined)
make_hbar_plot(methods_dict, 'Top 25 classes in nr of methods', 'Number of methods', TOP)
To compute the Halstead complexity measures, the following counts are taken into account: the number of distinct and total operators, and the number of distinct and total operands in the source code.
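For reference, the Halstead measures are derived from four counts: the number of distinct operators (n1), distinct operands (n2), total operators (N1) and total operands (N2). The sketch below mirrors the formulas applied in the cell further down; the example counts are made up:

from math import log

def halstead(n1, n2, N1, N2):
    vocabulary = n1 + n2                        # distinct operators + distinct operands
    length = N1 + N2                            # total operators + total operands
    volume = length * log(vocabulary, 2)
    difficulty = (n1 / 2.0) * (float(N2) / n2)
    effort = volume * difficulty
    time = effort / 18                          # estimated seconds to implement (Stroud number 18)
    bugs = volume / 3000                        # estimated number of delivered bugs
    return vocabulary, length, volume, difficulty, effort, time, bugs

print halstead(20, 9, 5000, 3000)               # illustrative counts only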
In [24]:
operators_dict = count_operators(joined)
make_hbar_plot(operators_dict, 'Top 25 operators', 'Number of operators', TOP)
In [25]:
operands_dict = count_operands(joined)
make_hbar_plot(operands_dict, 'Top 25 operand types', 'Number of operands', TOP)
In [27]:
halstead_dict['PROGRAM_VOCABULARY'] = operators_dict['distinct'] + operands_dict['distinct']
halstead_dict['PROGRAM_LENGTH'] = round(operators_dict['total'] + operands_dict['total'], 0)
halstead_dict['VOLUME'] = round(halstead_dict['PROGRAM_LENGTH'] * log(halstead_dict['PROGRAM_VOCABULARY'], 2), 0)
halstead_dict['DIFFICULTY'] = (operators_dict['distinct'] / 2.0) * (float(operands_dict['total']) / operands_dict['distinct']) # real division to avoid Python 2 integer truncation
halstead_dict['EFFORT'] = round(halstead_dict['VOLUME'] * halstead_dict['DIFFICULTY'], 0)
halstead_dict['TIME'] = round(halstead_dict['EFFORT'] / 18, 0)
halstead_dict['BUGS'] = round(halstead_dict['VOLUME'] / 3000, 0)
print halstead_dict
In [28]:
cyclomatic_dict = calc_cyclomatic_complexity(joined)
make_hbar_plot(cyclomatic_dict, 'Top 25 classes with cyclomatic complexity', 'Level of complexity', TOP)
In [35]:
# featurize all metrics in a fixed order so every row lines up with the table headers used later
# (add 'committers' here as well once it is uncommented in make_rows)
FEATURE_KEYS = ['commits', 'references', 'inheritance', 'lines', 'methods', 'program_vocabulary', \
'program_length', 'volume', 'difficulty', 'effort', 'time', 'bugs', 'cyclomatic']
def make_features(repo, dict):
return [int(dict[key]) for key in FEATURE_KEYS]
# iterate all repos and build
# dictionary with all metrics
def make_rows(repos):
rows = []
try:
for repo in repos:
dict = {}
joined = build_joined(repo)
github_dict = parse_git_log(BASE_PATH + repo[1], joined)
dict['commits'] = github_dict['total']
#dict['committers'] = github_dict['committers'] # added at the last minute; uncomment for the next run
dict['references'] = count_references(joined)['total']
dict['inheritance'] = count_inheritance(joined)
dict['lines'] = count_lines(joined)['total']
dict['methods'] = count_methods(joined)['total']
operators_dict = count_operators(joined)
operands_dict = count_operands(joined)
dict['program_vocabulary'] = operators_dict['distinct'] + operands_dict['distinct']
dict['program_length'] = round(operators_dict['total'] + operands_dict['total'], 0)
dict['volume'] = round(dict['program_length'] * log(dict['program_vocabulary'], 2), 0)
dict['difficulty'] = (operators_dict['distinct'] / 2.0) * (float(operands_dict['total']) / operands_dict['distinct'])
dict['effort'] = round(dict['volume'] * dict['difficulty'], 0)
dict['time'] = round(dict['effort'] / 18, 0)
dict['bugs'] = round(dict['volume'] / 3000, 0)
dict['cyclomatic'] = calc_cyclomatic_complexity(joined)['total']
rows.append(make_features(repo, dict))
except:
pass
return rows
def cluster_repos(arr, nr_clusters):
kmeans = KMeans(n_clusters=nr_clusters)
kmeans.fit(arr)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
return (centroids, labels)
In [7]:
repositories = [('apache', x) for x in APACHE_PROJECTS]
We break the projects down into batches of five to keep the analysis manageable.
In [42]:
rows = make_rows(repositories[:5])
In [43]:
rows.extend(make_rows(repositories[5:10]))
In [44]:
rows.extend(make_rows(repositories[10:15]))
In [45]:
rows.extend(make_rows(repositories[15:20]))
In [46]:
rows.extend(make_rows(repositories[20:25]))
In [48]:
rows.extend(make_rows(repositories[25:30]))
In [49]:
rows.extend(make_rows(repositories[30:35]))
In [50]:
rows.extend(make_rows(repositories[35:40]))
In [51]:
rows.extend(make_rows(repositories[40:45]))
In [52]:
rows.extend(make_rows(repositories[45:50]))
In [57]:
rows.extend(make_rows(repositories[50:55]))
In [78]:
rows.extend(make_rows(repositories[55:60]))
In [12]:
rows.extend(make_rows(repositories[60:65]))
In [10]:
rows.extend(make_rows(repositories[65:70]))
In [12]:
rows.extend(make_rows(repositories[70:75]))
In [10]:
rows.extend(make_rows(repositories[75:80]))
In [10]:
rows.extend(make_rows(repositories[80:85]))
In [12]:
rows.extend(make_rows(repositories[85:90]))
In [10]:
rows.extend(make_rows(repositories[90:95]))
In [12]:
rows.extend(make_rows(repositories[95:100]))
In [14]:
rows.extend(make_rows(repositories[100:105]))
In [9]:
rows.extend(make_rows(repositories[105:110]))
In [12]:
rows.extend(make_rows(repositories[110:115]))
In [14]:
rows.extend(make_rows(repositories[115:120]))
In [10]:
rows.extend(make_rows(repositories[120:125]))
In [12]:
rows.extend(make_rows(repositories[125:130]))
In [15]:
rows.extend(make_rows(repositories[130:133]))
In [14]:
rows.extend(make_rows(repositories[133:134]))
print rows
In [10]:
# TWO clusters
NR_CLUSTERS = 2
arr = np.array(rows)
tup = cluster_repos(arr, NR_CLUSTERS)
centroids = tup[0]
plot_cluster(arr, centroids, APACHE_PROJECTS, NR_CLUSTERS, str(NR_CLUSTERS) + ' Clusters')
In [11]:
# THREE clusters
NR_CLUSTERS = 3
arr = np.array(rows)
tup = cluster_repos(arr, NR_CLUSTERS)
centroids = tup[0]
plot_cluster(arr, centroids, APACHE_PROJECTS, NR_CLUSTERS, str(NR_CLUSTERS) + ' Clusters')
In [12]:
# FOUR clusters
NR_CLUSTERS = 4
arr = np.array(rows)
tup = cluster_repos(arr, NR_CLUSTERS)
centroids = tup[0]
plot_cluster(arr, centroids, APACHE_PROJECTS, NR_CLUSTERS, str(NR_CLUSTERS) + ' Clusters')
The clustering shows that four clusters cover the whole graph, giving us four clearly defined areas into which all projects can be mapped. The task is now to discover which features carry the most weight in this clustering. We can do this by examining the features of the four projects closest to the centroids and comparing them.
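Rather than reading the candidate projects off the plot, the project nearest to each centroid can also be found programmatically. A minimal sketch, assuming arr, centroids and repositories are still defined as in the cells above:

import numpy as np

names = [x[1] for x in repositories]
for i, centroid in enumerate(centroids):
    # Euclidean distance from every project's feature vector to this centroid
    distances = np.linalg.norm(arr - centroid, axis=1)
    print 'cluster %d -> %s' % (i, names[np.argmin(distances)])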
In [29]:
names = [x[1] for x in repositories]
print names.index('synapse')
print names.index('tomcat')
print names.index('groovy')
print names.index('hama')
In [30]:
headers = ['Repo', 'Com', 'Ref', 'Inh', 'Line', 'Meth', 'Voc', \
'Len', 'Vol', 'Diff', 'Eff', 'Time', 'Bug','Cycl']
print tabulate([[names[118]] + [x for x in rows[118]], [names[123]] + [x for x in rows[123]], \
[names[82]] + [x for x in rows[82]], [names[84]] + [x for x in rows[84]]], headers=headers)
In [31]:
# FOUR clusters, re-run to obtain the cluster labels
NR_CLUSTERS = 4
arr = np.array(rows)
tup = cluster_repos(arr, NR_CLUSTERS)
labels = tup[1]
In [32]:
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(rows, labels)
Out[32]:
In [33]:
print labels
print clf.predict([rows[3]])
print clf.predict([rows[34]])
In [36]:
#repositories = [('qos-ch', 'slf4j'), ('mockito', 'mockito'), ('elastic', 'elasticsearch')]
repositories = [('JetBrains', 'kotlin')]
rows = make_rows(repositories)
print clf.predict([rows[0]])
In [39]:
print tabulate([['Kotlin'] + [x for x in rows[0]]], headers=headers)
In [ ]: