This notebook contains some code to process and normalize the lexical information appearing in CodeMethod comments and implementations 
(i.e., CodeMethod.comment and CodeMethod.code, respectively).
The overall processing encompasses several normalization steps (relying on the LINSEN normalizer and nltk). Once those steps are completed, the Jaccard coefficient is computed between the code and the comment of each method, and all the analysis information is then stored in a CodeLexiconInfo model instance.
This notebook requires Python 3.
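For reference, the Jaccard coefficient of two token sets A and B is |A ∩ B| / |A ∪ B|. A minimal illustrative sketch of the measure follows; this is not the actual LexicalAnalyzer implementation, just the formula it computes.
In [ ]:
    
# Illustrative sketch only: Jaccard coefficient between two token sets.
def jaccard_coefficient(comment_tokens, code_tokens):
    comment_set, code_set = set(comment_tokens), set(code_tokens)
    if not (comment_set or code_set):
        return 0.0  # both sets empty: define the coefficient as 0
    return len(comment_set & code_set) / len(comment_set | code_set)
    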
In [2]:
    
%load preamble_directives.py
    
In [3]:
    
from source_code_analysis.models import CodeLexiconInfo
    
In [ ]:
    
from lexical_analysis import LINSENnormalizer
    
In [5]:
    
from lexical_analysis import LexicalAnalyzer
    
In [5]:
    
from source_code_analysis.models import SoftwareProject
target_sw_project = SoftwareProject.objects.get(name__iexact='CoffeeMaker')
    
In [6]:
    
# Use the RelatedManager to get all the code methods associated with the target project
code_methods = target_sw_project.code_methods.all()
    
In [10]:
    
total_methods = code_methods.count()
coefficients = list()
for i, method in enumerate(code_methods):
    print('Analyzing Method {0} out of {1}: {2}'.format(i+1, total_methods, method.method_name))
    analyzer = LexicalAnalyzer(method)
    analyzer.analyse_textual_information()
    coefficients.append(analyzer.code_lexical_info.jaccard_coeff)
    
    
In [4]:
    
# NOTE: older SciPy releases re-exported these NumPy functions (scipy.median etc.);
# importing them directly from numpy works on both old and current environments.
from numpy import median, mean, var, std
import numpy as np
    
In [5]:
    
from source_code_analysis.models import SoftwareProject
projects = list()
projects.append(SoftwareProject.objects.get(name__iexact='CoffeeMaker', version__exact='1.0'))
projects.append(SoftwareProject.objects.get(name__iexact='Jfreechart', version__exact='0.6.0'))
projects.append(SoftwareProject.objects.get(name__iexact='Jfreechart', version__exact='0.7.1'))
projects.append(SoftwareProject.objects.get(name__iexact='JHotDraw', version__exact='7.4.1'))
print(projects)
    
    
In [8]:
    
for project in projects:
    code_methods = project.code_methods.all()
    coefficients = list()
    for method in code_methods:
        # Check that this method has no "wrong_association"
        n_evaluations = method.agreement_evaluations.count()
        n_eval_wrong_association = method.agreement_evaluations.filter(wrong_association=True).count()
        if n_evaluations == n_eval_wrong_association:
            # if **all** the evaluations for the current method mark it as a wrong_association
            # exclude it from the statistics
            continue
        clexicon_info = method.lexical_info
        coefficients.append(clexicon_info.jaccard_coeff)
    coeff = np.array(coefficients)
    print('{proj} ({ver}) & {total} & {min:.3} & {max:.3} & {median:.3} & {mean:.3} & {variance:.3} & {devstd:.3} \\\\'.format(
        proj=project.name.title(), ver=project.version,
        total=coeff.size, min=coeff.min(), max=coeff.max(),
        median=median(coeff), mean=coeff.mean(),
        variance=var(coeff), devstd=std(coeff)))
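    
Since the same wrong-association filter recurs in several cells below, a small helper could encapsulate it. This is an illustrative refactoring only; the cells below keep the inline version.
In [ ]:
    
# Illustrative helper (not used in the cells below): True when **all** the
# agreement evaluations flag the method as a wrong comment/code association.
def is_wrong_association(method):
    # Matches the inline check: also True when there are no evaluations at all.
    n_evaluations = method.agreement_evaluations.count()
    n_wrong = method.agreement_evaluations.filter(wrong_association=True).count()
    return n_evaluations == n_wrong
    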
    
    
In [21]:
    
# Import Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
for project in projects:
    
    # Populate the Doc Collection
    document_collection = list()
    
    # Get Methods
    code_methods = project.code_methods.all()
    for method in code_methods:
        # Check that this method has no "wrong_association"
        n_evaluations = method.agreement_evaluations.count()
        n_eval_wrong_association = method.agreement_evaluations.filter(wrong_association=True).count()
        if n_evaluations == n_eval_wrong_association:
            # if **all** the evaluations for the current method mark it as a wrong_association
            # exclude it from the statistics
            continue
        
        clexicon_info = method.lexical_info
        document_collection.append(clexicon_info.normalized_comment)
        document_collection.append(clexicon_info.normalized_code)
    
    vectorizer = TfidfVectorizer(input='content', sublinear_tf=True, lowercase=False)
    tfidf_values = vectorizer.fit_transform(document_collection)
    
    #cosine_sim_vals = list()
    #rows, cols = tfidf_values.shape
    #for i in range(0, rows, 2):
    #    cosine_sim_vals.append(tfidf_values[i].dot(tfidf_values[i+1].T)[0,0])
    #cosine_sim_vals = np.array(cosine_sim_vals)
    comments, code = tfidf_values[::2], tfidf_values[1::2]
    kernel_matrix = linear_kernel(comments, code)  # arrays are still L2 (length) normalized
    cosine_sim_vals = np.diag(kernel_matrix)
    
    print('{proj} ({ver}) & {tot} & {min:.3} & {max:.3} & {med:.3} & {mu:.3} & {var:.3} & {sigma:.3} \\\\'.format(
            proj=project.name.title(), ver=project.version, tot=cosine_sim_vals.size, min=cosine_sim_vals.min(), 
            max=cosine_sim_vals.max(), med=median(cosine_sim_vals), mu=cosine_sim_vals.mean(), 
            var=var(cosine_sim_vals), sigma=std(cosine_sim_vals)))
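    
TfidfVectorizer L2-normalizes its rows by default (norm='l2'), so the linear kernel between a comment vector and a code vector equals their cosine similarity, and the diagonal of the kernel matrix holds exactly the paired comment/code similarities. A quick self-contained check on toy data (illustrative only):
In [ ]:
    
# Toy check: with L2-normalized tf-idf rows, linear kernel == cosine similarity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
demo_tfidf = TfidfVectorizer().fit_transform(['open the file', 'close the file'])
np.testing.assert_allclose(linear_kernel(demo_tfidf, demo_tfidf),
                           cosine_similarity(demo_tfidf, demo_tfidf))
    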
    
    
In [6]:
    
coff_maker = projects[0]
methods = coff_maker.code_methods.all()
methods = methods[0:2]
docs = list()
for method in methods:
    lex_info = method.lexical_info
    docs.append(lex_info.normalized_comment)
    docs.append(lex_info.normalized_code)
print('Methods: ', len(methods))
print('Docs: ', len(docs))
    
    
In [7]:
    
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(input='content', sublinear_tf=True, lowercase=False)
X = vectorizer.fit_transform(docs)
    
    
In [14]:
    
vectorizer.get_feature_names()
    
    Out[14]:
In [21]:
    
x = X[0].toarray()
from scipy.sparse import issparse
print(issparse(x))
    
    
In [30]:
    
x = x.ravel()
    
In [31]:
    
np.where(x>0)
    
    Out[31]:
In [33]:
    
np.take(x, np.where(x>0))
    
    Out[33]:
In [34]:
    
x[np.where(x>0)]
    
    Out[34]:
In [35]:
    
print(vectorizer.get_feature_names())
    
    
In [36]:
    
docs[0]
    
    Out[36]:
In [40]:
    
jhotdraw = projects[-1]
methods = jhotdraw.code_methods.all()
methods = methods[0:2]
docs = list()
for method in methods:
    lex_info = method.lexical_info
    docs.append(lex_info.normalized_comment)
    docs.append(lex_info.normalized_code)
print('Methods: ', len(methods))
print('Docs: ', len(docs))
    
    
In [42]:
    
docs[0], docs[1]
    
    Out[42]:
In [44]:
    
methods[0].lexical_info.normalized_comment
    
    Out[44]:
In [45]:
    
methods[0].lexical_info.normalized_code
    
    Out[45]:
In [46]:
    
methods[0].example.target
    
    Out[46]:
In [19]:
    
# Import Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer
## TODO: see the "Optimization" subsection below for the corresponding tests
from sklearn.metrics.pairwise import linear_kernel  # arrays are still L2 normalized
for project in projects:
    
    # Get Methods
    code_methods = project.code_methods.all()
    
    # Populate the Doc Collection
    document_collection = list()
    for method in code_methods:
        
        # Check that this method has no "wrong_association"
        n_evaluations = method.agreement_evaluations.count()
        n_eval_wrong_association = method.agreement_evaluations.filter(wrong_association=True).count()
        if n_evaluations == n_eval_wrong_association:
            # if **all** the evaluations for the current method mark it as a wrong_association
            # exclude it from the statistics
            continue
        
        clexicon_info = method.lexical_info
        document_collection.append(clexicon_info.normalized_comment)
        document_collection.append(clexicon_info.normalized_code)
    
    vectorizer = TfidfVectorizer(input='content', sublinear_tf=False, lowercase=False, use_idf=False)
    tf_values = vectorizer.fit_transform(document_collection)
    
    #cosine_sim_vals = list()
    #rows, cols = tf_values.shape
    #for i in range(0, rows, 2):
    #    cosine_sim_vals.append(tf_values[i].dot(tf_values[i+1].T)[0,0])
    #cosine_sim_vals = np.array(cosine_sim_vals)
    
    comments, code = tf_values[::2], tf_values[1::2]
    kernel_matrix = linear_kernel(comments, code)
    cosine_sim_vals = np.diag(kernel_matrix)
    
    print('{proj} ({ver}) & {total} & {min:.3} & {max:.3} & {median:.3} & {mean:.3} & {variance:.3} & {devstd:.3} \\\\'.format(
        proj=project.name.title(), ver=project.version,
        total=cosine_sim_vals.size,
        min=cosine_sim_vals.min(),
        max=cosine_sim_vals.max(),
        median=median(cosine_sim_vals),
        mean=cosine_sim_vals.mean(),
        variance=var(cosine_sim_vals),
        devstd=std(cosine_sim_vals)))
    
    
Trying to optimize the cosine similarity computation by replacing the per-pair cosine_sim_vals loop with vectorized alternatives (first np.vstack, then np.einsum and linear_kernel); a toy comparison of the equivalent formulations follows.
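As a toy illustration (hypothetical data, unrelated to the tf-idf matrices above), the three formulations below compute the same dot product between each even row and the following odd row, which is exactly the paired comment/code pattern used in this notebook:
In [ ]:
    
# Hypothetical toy data: rows (0,1), (2,3), (4,5) play the role of comment/code pairs.
import numpy as np
X = np.arange(12, dtype=float).reshape(6, 2)
loop = np.array([X[i].dot(X[i+1]) for i in range(0, 6, 2)])  # explicit loop
einsum = np.einsum('ij,ij->i', X[::2], X[1::2])              # row-wise dot products
diag = np.diag(X[::2] @ X[1::2].T)                           # diagonal of the kernel matrix
np.testing.assert_allclose(loop, einsum)
np.testing.assert_allclose(loop, diag)
    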
In [6]:
    
from sklearn.feature_extraction.text import TfidfVectorizer
# Target Project (as this is just an example)
project = projects[0]
    
# Get Methods
code_methods = project.code_methods.all()
# Populate the Doc Collection
document_collection = list()
for method in code_methods:
    clexicon_info = method.lexical_info
    document_collection.append(clexicon_info.normalized_comment)
    document_collection.append(clexicon_info.normalized_code)
vectorizer = TfidfVectorizer(input='content', sublinear_tf=True, lowercase=False)
tfidf_values = vectorizer.fit_transform(document_collection)
rows, cols = tfidf_values.shape
cosine_sim_vals = tfidf_values[0].dot(tfidf_values[1].T)[0,0]
for i in range(2, rows, 2):
    cosine_sim_vals = np.vstack((cosine_sim_vals, tfidf_values[i].dot(tfidf_values[i+1].T)[0,0]))
cosine_sim_vals.ravel()
    
    
    Out[6]:
In [7]:
    
alt_method = np.einsum('ij,ij->i', tfidf_values[::2].toarray(), tfidf_values[1::2].toarray())
alt_method
    
    Out[7]:
In [8]:
    
alt_method.shape
    
    Out[8]:
In [9]:
    
cosine_sim_vals.ravel().shape
    
    Out[9]:
In [10]:
    
np.testing.assert_allclose(cosine_sim_vals.ravel(), alt_method)
    
In [11]:
    
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel
    
In [12]:
    
comments, code = tfidf_values[::2], tfidf_values[1::2]
print(comments.shape, code.shape)
    
    
In [13]:
    
kernel = linear_kernel(comments, code)
np.diag(kernel)
    
    Out[13]:
In [14]:
    
from numpy.testing import assert_array_almost_equal
assert_array_almost_equal(alt_method, np.diag(kernel))
    
In [15]:
    
alt_method
    
    Out[15]:
In [16]:
    
cossim = cosine_similarity(comments, code)
np.diag(cossim)
    
    Out[16]:
In [17]:
    
assert_array_almost_equal(alt_method, np.diag(cossim))
assert_array_almost_equal(np.diag(cossim), np.diag(kernel))
    
In [12]:
    
from sklearn.feature_extraction.text import TfidfVectorizer
from evaluations import Judge
judges_combinations = (('leonardo.nole', 'rossella.linsalata'),
                       ('leonardo.nole', 'antonio.petrone'),
                       ('leonardo.nole', 'antonio.petrone'),
                       ('leonardo.nole', 'rossella.linsalata'),)
CODES_Labels = ('NC', 'DK', 'CO')
from collections import defaultdict
stats_results = defaultdict(list)
for pno, project in enumerate(projects):
    # Get Methods
    code_methods = project.code_methods.all()
    # Populate the Doc Collection
    document_collection = list()
    method_ids_map = dict()  # Map (dict) to store the association method.pk --> Row index in Tfidf Matrix
    for mno, method in enumerate(code_methods):
        clexicon_info = method.lexical_info
        document_collection.append(clexicon_info.normalized_comment)
        document_collection.append(clexicon_info.normalized_code)
        method_ids_map[method.id] = mno*2
    vectorizer = TfidfVectorizer(input='content', sublinear_tf=True, lowercase=False)
    tfidf_values = vectorizer.fit_transform(document_collection)
    j1_usrname, j2_usrname = judges_combinations[pno]
    j1 = Judge(j1_usrname, project.name, project.version)
    j2 = Judge(j2_usrname, project.name, project.version)
    
    j1_evals = j1.three_codes_evaluations
    j2_evals = j2.three_codes_evaluations
    
    project_stats = list()
    for code in range(3):
        j1_evals_code = j1_evals[code]
        j2_evals_code = j2_evals[code]
        
        method_ids = j1_evals_code.intersection(j2_evals_code)
        cosine_sim_vals = list()
        for mid in method_ids:
            i = method_ids_map[mid]
            cosine_sim_vals.append(tfidf_values[i].dot(tfidf_values[i+1].T)[0,0])
        cosine_sim_vals = np.array(cosine_sim_vals)
        project_stats.append(cosine_sim_vals)
    
    for code in range(3):
        vals = project_stats[code]
        label = CODES_Labels[code]
        if vals.size > 0:
            stats_results[label].append('{proj} ({ver}) & {total} & {min:.3} & {max:.3} & {median:.3} & {mean:.3} & {variance:.3} & {devstd:.3} \\\\'.format(
                proj=project.name.title(), ver=project.version,
                total=vals.size,
                min=vals.min(),
                max=vals.max(),
                median=median(vals),
                mean=vals.mean(),
                variance=var(vals),
                devstd=std(vals)))
        else:
            # raw string so that \multicolumn is not parsed as an escape sequence
            stats_results[label].append(r'{proj} ({ver}) & \multicolumn{{7}}{{c|}}{{N.A.}} \\'.format(
                proj=project.name.title(), ver=project.version))
            
for label in stats_results:
    print('\n{0}\n'.format(label))
    for value in stats_results[label]:
        print(value)
    
    
In [13]:
    
judges_combinations = (('leonardo.nole', 'rossella.linsalata'),
                       ('leonardo.nole', 'antonio.petrone'),
                       ('leonardo.nole', 'antonio.petrone'),
                       ('leonardo.nole', 'rossella.linsalata'),)
CODES_Labels = ('NC', 'DK', 'CO')
from collections import defaultdict
import os  # needed below for the output folder handling
stats_results_paths = defaultdict(list)
pwd_out = !pwd
current_dir = pwd_out[0]
folder_path = os.path.join(current_dir, 'distributions_per_rate_tfidf')
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
for pno, project in enumerate(projects):
    # Get Methods
    code_methods = project.code_methods.all()
    # Populate the Doc Collection
    document_collection = list()
    method_ids_map = dict()  # Map (dict) to store the association method.pk --> Row index in Tfidf Matrix
    for mno, method in enumerate(code_methods):
        clexicon_info = method.lexical_info
        document_collection.append(clexicon_info.normalized_comment)
        document_collection.append(clexicon_info.normalized_code)
        method_ids_map[method.id] = mno*2
    vectorizer = TfidfVectorizer(input='content', sublinear_tf=True, lowercase=False)
    tfidf_values = vectorizer.fit_transform(document_collection)
    j1_usrname, j2_usrname = judges_combinations[pno]
    j1 = Judge(j1_usrname, project.name, project.version)
    j2 = Judge(j2_usrname, project.name, project.version)
    
    j1_evals = j1.three_codes_evaluations
    j2_evals = j2.three_codes_evaluations
    
    project_stats = list()
    for code in range(3):
        j1_evals_code = j1_evals[code]
        j2_evals_code = j2_evals[code]
        
        method_ids = j1_evals_code.intersection(j2_evals_code)
        cosine_sim_vals = list()
        for mid in method_ids:
            i = method_ids_map[mid]
            cosine_sim_vals.append(tfidf_values[i].dot(tfidf_values[i+1].T)[0,0])
        cosine_sim_vals = np.array(cosine_sim_vals)
        project_stats.append(cosine_sim_vals)
    
    for code in range(3):
        vals = project_stats[code]
        label = CODES_Labels[code]
        if vals.size > 0:
            filename = '{label}_{proj}_({ver})_{total}.txt'.format(label=label, 
                                                                   proj=project.name.title(), 
                                                                   ver=project.version,
                                                                   total=vals.size)
            filepath = os.path.join(folder_path, filename)
            np.savetxt(filepath, vals)
            stats_results_paths[label].append(filepath)
            
for label in stats_results_paths:
    print('\n{0}\n'.format(label))
    for path in stats_results_paths[label]:
        print('Saved Filepath:', path)