For the course ADM I have created a library of tools to calculate and estimate the similarity between 2 pieces of python code. I named the module "Symilar" to emphesize it Python origin as well as its primairy field of study.
This Notebook will show some of its possibilities and explain the origin of the algorithems involved. The most resent codebase of the project Symilar is freely available from "http://admserver.frii.nl" in the "Public projects" section. You can store it in your python path or at any other place and include this place in your python path like this:
In [1]:
import sys
sys.path.insert(0,'/home/user/workspace/symilar/src/nl/boose/symilar/')
In [2]:
def differ(expected, actual):
import difflib
expected=expected.splitlines(1)
actual=actual.splitlines(1)
diff=difflib.unified_diff(expected, actual)
return ''.join(diff)
Now I will read CSV files with solutions provided to the autograder. The file "x.csv" is provided in the "example" directory of the symilar source. You will have to change the path in the next cell to get it to work.
Al student e-mail adresses are replaced by "Student_101" and upwards. "x" will be a matrix containing the following fields:
In [3]:
import csv
reader=csv.reader(open("/home/user/adm/x.csv","rb"),delimiter=',')
x = list(reader)
import numpy as np
import os
import subprocess
Lets inspect the important characteristics of the matrix x.
In [4]:
first_summision = x[1]
#number_of_submissions
print "# submissions: ", len(x)
print "Assignment of submission 1: ", x[1][0]
print "Task of submission 1: ", x[1][1]
#Code_of_first_submission
print "first code: \n=========================================\n", x[1][4]
print "========================================="
We will trace one submission, nr 96 from its original sourcecode into the winnowing string that we can use for comparing. This process is based on the paper that comes with the "Moss" system. The "Measure of source code similarity" system is provided by standford university and anno 2014 still active and supported.
Let first look at the provided sourcecode:
In [5]:
print x[96][4]
Step 1: generalise names, remove comments, generalise spacing and indenting. This is called a "Working Copy" of the code because it will still compile and run. Al aliases are renamed back to their original names (anti aliasing) and all subnames of imported modules will not be translated.
In [6]:
from Code import Code
import Scope
Scope.clearScopes()
myProgram = Code(x[96][4])
print myProgram.getWorkingCopy()
Using a basematrix image of the code, it will get converted to a version that will contain hash values for all names, symbols, indentatationdeltas and literals. You can provide salt to influence the hash function, but remember to use the same salt if you want to compare 2 pieces of code with each other :-)
Inspecting the following code you will notice that each line starts with the original line number, followed by a hash value indicating the delta of indentation. After that you will find 1 hashcode per name, constant or symbol (operators, brackets and quotes included). If we compare line number 1 to its original source we can assume that "2cd5" is the hash of the name "numpy".
In [7]:
print myProgram.getHashCopy(salt='salt')
In [8]:
myProgram.getWinnow(guarantee = 12, noise = 4, salt='salt')
Out[8]:
If we decrease the guarantee or increase the noise the window becomes smaller, this means that more hashvalues will be the smalles within a window so the size of the winnow will increase. Increasing the noise however will also demp the number of hashvalues in the winnow for the number of windows that will be inspected decreases. By increaasing the guarantee, and keeping the noise level, the algoritm gets less sensitive for order of statements and within statements for instance "a * b" becomes indifferend from "b * a". However this is also true for "a / b" and "b / a" although they are functionaly not similar.
In [9]:
print myProgram.getWinnow(guarantee = 12, noise = 6, salt='salt')
print '=== g = 12, n = 6 ======================================='
print myProgram.getWinnow(guarantee = 8, noise = 2, salt='salt')
print '=== g = 8, n = 2 ======================================='
print myProgram.getWinnow(guarantee = 12, noise = 10, salt='salt')
print '=== g = 12, n = 10 ======================================='
print myProgram.getWinnow(guarantee = 4, noise = 1, salt='salt')
print '=== g = 4, n = 1 ======================================='
From hereon we will be looking at only the hash codes in the winnow. This part can be retrieved with the getWinnowStr method.
In [10]:
print myProgram.winnow2str(guarantee = 12, noise = 6, salt='salt')
Programs define scopes. These scopes, and the names that are declared within them can be inspected by the Scope module. The method "printScopes" will only print the dictionaries of names that will be translated to go from source to working copy. Keywords and names of modules and names within modules will not get translated. For every method and class a seperate scope is introduced. For every function call, list or tuple operator (, [ or { a seperate subscope is introduced. Subscopes share the namespace of their parent scope but can have additional names if these names are defined within the parentheses.
In [12]:
Scope.printScopes()
In [13]:
def jsim(S1,S2):
S1 = set(S1)
S2 = set(S2)
try:
return len(S1.intersection(S2)) * 1.0 / len(S1.union(S2))
except:
return 0
def shingle(text, lenght=4, chunksize=5):
toreturn = []
for i in range((len(text) / chunksize) - lenght):
toreturn.append(text[i*chunksize:(i+lenght)*chunksize])
return toreturn
def longest(strfrom, strin, chunksize=5):
bigchunk=''
chunknum = len(strfrom) / chunksize
for i in range(chunknum):
for j in range(i+1,chunknum + 1):
if strin.find(strfrom[chunksize*i:chunksize*j]) >=0 and (j-i) * chunksize > len(bigchunk):
bigchunk = strfrom[chunksize*i:chunksize*j]
return bigchunk
Example of the "longest" method call
In [14]:
longest('1234 hyst gdys oiuy hsts kwnw','gdys oiuy hsts iyhg ')
Out[14]:
Here is some "magic". Given the 147 submissions this code will create and fill 3 matrixes from submission i to submission j. Within the matrixes, where j > i it will store:
In [18]:
'''This steps takes about 10 seconds with 147 code snipplets.'''
from time import time
matrixorg = np.zeros((len(x),len(x)))
matrixstripped = np.zeros((len(x),len(x)))
matrixwinnow = np.zeros((len(x),len(x)))
unstripped = {}
stripped = {}
winnowed = {}
salt = str(np.random.randint(100,999))
t = time()
for i in range(len(x)):
myProgram = Code(x[i][4])
newCode = myProgram.getWorkingCopy()
winnow = myProgram.winnow2str(12, noise = 4, salt = salt)
stripped[i] = newCode
winnowed[i] = winnow
unstripped[i] = x[i][4]
Scope.clearScopes()
print 'For converting en winnowing: ', time() - t
t = time()
for i in range(len(x)):
for j in range(i+1,len(x)):
matrixstripped[i][j] = jsim(shingle(stripped[i], chunksize=5), shingle(stripped[j], chunksize=5))
matrixorg[i][j] = jsim(shingle(unstripped[i], chunksize=1),shingle(unstripped[j], chunksize=1))
longnum = len(longest(winnowed[j], winnowed[i]))
if len(winnowed[i]) != 0:
matrixwinnow[i][j] = float(longnum) / float(len(winnowed[i]))
print 'For jsim and calculating longest winnow: ', time() - t
In [19]:
def ltab(value,lenght,char=' '):
return char + (char * lenght + str(value))[-lenght:]
def rtab(value,lenght,char=' '):
return char + ( str(value) + char * lenght)[0:lenght]
lbound = 0.9
space = '-'
headers = ltab('from',5,space) + space + ltab('to',5,space) + space + ltab('tf',4,space) + space
headers += ltab('tt',3,space) + space + rtab('translated',10,space) + space
headers += rtab('winnow',10,space) + space + rtab('original',10,space) + space
print
print 'solutions with a jackard simularity before translation over ' + str(lbound)
print headers
print
a = np.where( matrixorg > lbound)
for i in range(len(a[0])):
if x[a[0][i]][2] != x[a[1][i]][2]:
if int(x[a[0][i]][1]) in [1,2,5,6]:
print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10)
print
print 'solutions with a jackard simularity after translation over ' + str(lbound)
print headers
print
a = np.where( matrixstripped > lbound)
for i in range(len(a[0])):
#if this is the same task
if x[a[0][i]][2] != x[a[1][i]][2]:
#Don't show tasks without real question
if int(x[a[0][i]][1]) in [1,2,5,6]:
print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10)
print
print 'solutions with a winnow simularity after translation over ' + str(lbound)
print headers
print
a = np.where( matrixwinnow > lbound)
for i in range(len(a[0])):
#if this is the same task
if x[a[0][i]][2] != x[a[1][i]][2]:
#Don't show tasks without real question
if int(x[a[0][i]][1]) in [1,2,5,6] and x[a[0][i]][1] == x[a[1][i]][1]:
print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10)
In [21]:
from difflib import Differ
from pprint import pprint
def inspect(nb1, nb2):
print matrixstripped[nb1, nb2], matrixwinnow[nb1,nb2], matrixorg[nb1,nb2]
pprint (x[nb1][4].split('\n'))
print '== program 1=====================================\n'
pprint (x[nb2][4].split('\n'))
print '== program 2 ====================================\n'
myProgram1 = Code(x[nb1][4])
myProgram2 = Code(x[nb2][4])
print myProgram1.getWorkingCopy()
print '== Working copy 1 ===============================\n'
print myProgram2.getWorkingCopy()
print '== Working copy 2 ===============================\n'
print myProgram1.getHashCopy()
print '== Hash copy 1 ===============================\n'
print myProgram2.getHashCopy()
print '== Hash copy 2 ===============================\n'
print myProgram1.winnow2str(5)
print '== Winnow string 1 ==============================\n'
print myProgram2.winnow2str(5)
print '== Winnow string 2 ==============================\n'
print differ(myProgram1.getWorkingCopy(),myProgram2.getWorkingCopy())
inspect (51,54)
#inspect(120,121)
#inspect(13,25)
#inspect(114,130)
#inspect(114,115)
#inspect(125,132)
#inspect(51,54)
#inspect(112,132)
#inspect(5,11)
#inspect(5,11)
#inspect(48,58)
#inspect(99,106)
#inspect(114,117)
#inspect(109,131)
In [22]:
myProgram.getWinnow(10,noise=1)
Out[22]:
In [23]:
import numpy as numpy
def meth00000(name00000,name00001):
name00002=len(name00000)
name00003=len(name00000[0])
name00004=numpy.random.randn(len(name00000),name00001)
name00005=numpy.array([])
for name00006 in range(name00001):
name00005=numpy.concatenate((name00005,numpy.sign(numpy.sum(numpy.transpose(numpy.tile(name00004[:,name00006],(name00003,1)))*name00000,axis=0))))
return numpy.reshape(name00005,(name00001,name00003))
docs = np.array([[1,-2,3],[-3,2,-1],[-2,3,1],[3,-3,-3]])
sketches = meth00000(docs,10000)
In [24]:
sketches
Out[24]:
In [25]:
from Code import Code
import Scope
Scope.clearScopes()
myProgram1 = Code(x[96][4])
myProgram2 = Code(x[106][4])
print myProgram2.getWorkingCopy()
print myProgram.getWorkingCopy()
In [28]:
len(Scope.Scope.scopes)
Out[28]:
In [27]:
myProgram1 = ''
In [ ]: