Assignment B: Python code similarity

For the course ADM I have created a library of tools to calculate and estimate the similarity between two pieces of Python code. I named the module "Symilar" to emphasize its Python origin as well as its primary field of study.

This notebook shows some of its possibilities and explains the origin of the algorithms involved. The most recent codebase of the Symilar project is freely available from "http://admserver.frii.nl" in the "Public projects" section. You can store it on your Python path, or store it anywhere else and add that location to your Python path like this:


In [1]:
import sys
sys.path.insert(0,'/home/user/workspace/symilar/src/nl/boose/symilar/')

Differ: the human-readable diff

I will be using difflib's unified_diff to show the difference between two pieces of source code.


In [2]:
def differ(expected, actual):
    import difflib
    expected=expected.splitlines(1)
    actual=actual.splitlines(1)
    diff=difflib.unified_diff(expected, actual)
    return ''.join(diff)
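
As a small illustration of what differ returns, here is a sketch that runs the same function on two made-up snippets. The strings "old" and "new" are invented for this example; the function body is the one from the cell above, written so that it also runs on Python 3.

```python
import difflib


def differ(expected, actual):
    # Keep the line endings so the joined diff prints one change per line.
    expected = expected.splitlines(True)
    actual = actual.splitlines(True)
    return ''.join(difflib.unified_diff(expected, actual))


# Two made-up snippets that differ by one added import.
old = 'import sets\ndef f(a):\n    return a\n'
new = 'import numpy\nimport sets\ndef f(a):\n    return a\n'

diff_text = differ(old, new)
same_text = differ(old, old)  # identical inputs give an empty diff
```

Lines that were added start with "+", removed lines start with "-", and unchanged context lines start with a single space.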

Now I will read CSV files with solutions provided to the autograder. The file "x.csv" is provided in the "example" directory of the symilar source. You will have to change the path in the next cell to get it to work.

All student e-mail addresses are replaced by "Student_101" and upwards. "x" will be a matrix containing the following fields:

  • Assignment title
  • task number
  • student name
  • submission date
  • submitted code
  • feedback and some other fields that are not important for this task.

In [3]:
import csv
reader=csv.reader(open("/home/user/adm/x.csv","rb"),delimiter=',')
x = list(reader)

import numpy as np
import os
import subprocess
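
To make the field list above concrete, here is a minimal sketch that reads a made-up, in-memory CSV with the same column order. The constants TITLE, TASK, STUDENT, DATE and CODE are names I introduce only for this example, and the sample row (including its date) is invented. It is written for Python 3, where csv.reader expects text rather than a file opened with "rb".

```python
import csv
import io

# Hypothetical column positions, following the field list above.
TITLE, TASK, STUDENT, DATE, CODE = range(5)

# A made-up row; the quoted last field contains a newline, just like
# submitted code does.
sample = io.StringIO(
    'Assignment 0:: Hashing,1,Student_101,2014-01-01,'
    '"def f():\n    return 1"\n'
)

rows = list(csv.reader(sample, delimiter=','))
first = rows[0]
code_lines = first[CODE].splitlines()
```

Using named positions instead of bare numbers such as x[i][4] makes the later cells easier to read, at the cost of a few extra constants.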

Let's inspect the important characteristics of the matrix x.


In [4]:
first_submission = x[1]
#number_of_submissions 
print "# submissions: ", len(x)
print "Assignment of submission 1: ", x[1][0]
print "Task of submission 1: ", x[1][1]
#Code_of_first_submission 
print "first code: \n=========================================\n", x[1][4]
print "========================================="


# submissions:  147
Assignment of submission 1:  Assignment 0:: Hashing
Task of submission 1:  1
first code: 
=========================================
import numpy
def analyse_collisions(K, N, L):
    arr = [0]*N
    for x in xrange(K):
        random_number = numpy.random.random_integers(N)
        arr[random_number-1] = arr[random_number-1] + 1
        
    counts = [0] * (L + 1)
    
    for x in xrange(L+1):
        counts[x] = arr.count(x)
        
    return counts
=========================================

Trace one submission

We will trace one submission, number 96, from its original source code to the winnowing string that we can use for comparison. This process is based on the paper that accompanies the "Moss" system. Moss ("Measure Of Software Similarity") is provided by Stanford University and, as of 2014, is still active and supported.

Let us first look at the submitted source code:


In [5]:
print x[96][4]


import numpy as np

def sketch(M, k):
    numrows = len(M)
    numcols = len(M[0])
    dirs = np.random.randn(len(M),k)
    result=np.array([])
    for i in range(k):
        result = np.concatenate((result, np.sign(np.sum(np.transpose(np.tile (dirs[:,i],(numcols,1))) * M, axis=0))))
    return np.reshape(result,(k,numcols))

Step 1: generalise names, remove comments, and generalise spacing and indentation. The result is called a "Working Copy" of the code because it will still compile and run. All aliases are renamed back to their original names (anti-aliasing), and the names of imported modules and the names within them are not translated.


In [6]:
from Code import Code
import Scope
Scope.clearScopes()
myProgram = Code(x[96][4])
print myProgram.getWorkingCopy()


import numpy as numpy
def meth00000(name00000,name00001):
    name00002=len(name00000)
    name00003=len(name00000[0])
    name00004=numpy.random.randn(len(name00000),name00001)
    name00005=numpy.array([])
    for name00006 in range(name00001):
        name00005=numpy.concatenate((name00005,numpy.sign(numpy.sum(numpy.transpose(numpy.tile(name00004[:,name00006],(name00003,1)))*name00000,axis=0))))
    return numpy.reshape(name00005,(name00001,name00003))
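
The idea behind the working copy can be sketched with the standard ast module. This is not the Symilar implementation: the class NameNormaliser and the function working_copy are names I made up for this example, it uses a single counter for functions and variables, it does not resolve aliases such as "np" back to "numpy", and it does not rename keyword arguments at call sites. It needs Python 3.9 or later for ast.unparse.

```python
import ast
import builtins


class NameNormaliser(ast.NodeTransformer):
    """Rename user-defined identifiers; keep builtins and imported names."""

    def __init__(self):
        self.mapping = {}
        self.keep = set(dir(builtins))

    def _rename(self, old):
        if old in self.keep:
            return old
        if old not in self.mapping:
            self.mapping[old] = 'name%05d' % len(self.mapping)
        return self.mapping[old]

    def visit_Import(self, node):
        # Names bound by an import statement are left untouched.
        for alias in node.names:
            self.keep.add(alias.asname or alias.name.split('.')[0])
        return node

    visit_ImportFrom = visit_Import

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name)
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node


def working_copy(source):
    # Parsing drops comments; unparsing normalises spacing and indentation.
    tree = NameNormaliser().visit(ast.parse(source))
    return ast.unparse(tree)


src = ('import numpy as np\n'
       'def sketch(M, k):\n'
       '    # a comment that disappears\n'
       '    rows = len(M)\n'
       '    return np.zeros((k, rows))\n')
wc = working_copy(src)
```

Attribute names such as "zeros" in "np.zeros" are stored as plain strings in the syntax tree, so they are never visited as names and stay unchanged, which matches the rule that names within imported modules are not translated.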

Using a basematrix image of the code, the working copy is converted to a version that contains hash values for all names, symbols, indentation deltas and literals. You can provide a salt to influence the hash function, but remember to use the same salt if you want to compare two pieces of code with each other :-)

Inspecting the following code you will notice that each line starts with the original line number, followed by a hash value indicating the delta of indentation. After that you will find 1 hashcode per name, constant or symbol (operators, brackets and quotes included). If we compare line number 1 to its original source we can assume that "2cd5" is the hash of the name "numpy".


In [7]:
print myProgram.getHashCopy(salt='salt')


l0000001 9e59 eae8 2cd5 6e26 2cd5
l0000003 9e59 67e3 ed5b 44d2 1d5b 9628 911f 6d5a 8b50
l0000004 3bb9 339f 03d5 0e6f 44d2 1d5b 6d5a
l0000005 9e59 6473 03d5 0e6f 44d2 1d5b f3d4 46a4 5171 6d5a
l0000006 9e59 04b3 03d5 2cd5 ff11 98e8 ff11 1820 44d2 0e6f 44d2 1d5b 6d5a 9628 911f 6d5a
l0000007 9e59 3cc1 03d5 2cd5 ff11 ea88 44d2 f3d4 5171 6d5a
l0000008 9e59 5719 87bd 1fda 1ef7 44d2 911f 6d5a 8b50
l0000009 3bb9 3cc1 03d5 2cd5 ff11 60f9 44d2 44d2 3cc1 9628 2cd5 ff11 cecd 44d2 2cd5 ff11 506e 44d2 2cd5 ff11 76f5 44d2 2cd5 ff11 ee88 44d2 04b3 f3d4 8b50 9628 87bd 5171 9628 44d2 6473 9628 a3d2 6d5a 6d5a 6d5a 2264 1d5b 9628 6a5c 03d5 46a4 6d5a 6d5a 6d5a 6d5a
l0000010 665e 32ac 2cd5 ff11 613c 44d2 3cc1 9628 44d2 911f 9628 6473 6d5a 6d5a
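
A salted hash per token could look like the sketch below. The hash function that Symilar really uses is not shown in this notebook, so the choice of MD5 truncated to 4 hex digits is purely an assumption, and hash_token and hash_line are hypothetical helper names. The sketch also leaves out the indentation-delta hash at the start of each line. It is written for Python 3.

```python
import hashlib
import io
import tokenize

# Token types that carry no content of their own.
SKIP = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.COMMENT, tokenize.INDENT, tokenize.DEDENT)


def hash_token(token, salt=''):
    # Assumption: MD5 of salt + token, shortened to 4 hex digits.
    digest = hashlib.md5((salt + token).encode('utf-8')).hexdigest()
    return digest[:4]


def hash_line(line, salt=''):
    tokens = tokenize.generate_tokens(io.StringIO(line).readline)
    return [hash_token(t.string, salt) for t in tokens if t.type not in SKIP]


hashes = hash_line('import numpy as numpy\n', salt='salt')
```

Because the same token always gets the same hash for a given salt, the second and fourth hash of this line are equal, just as "2cd5" appears twice on line 1 of the hash copy above.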

Winnow algorithm:

The getWinnow method looks at all the hashes as one sequence, without the line numbers. A window of size ("guarantee" - "noise" + 1) moves with step size "noise" over the hash codes. In each window it finds the lowest hash code (the rightmost one in case of a tie) and saves it together with the corresponding line number.


In [8]:
myProgram.getWinnow(guarantee = 12, noise = 4, salt='salt')


Out[8]:
[['2cd5', 1],
 ['1d5b', 3],
 ['03d5', 4],
 ['03d5', 5],
 ['04b3', 6],
 ['03d5', 6],
 ['0e6f', 6],
 ['03d5', 7],
 ['1fda', 8],
 ['1ef7', 8],
 ['03d5', 9],
 ['2cd5', 9],
 ['04b3', 9],
 ['44d2', 9],
 ['1d5b', 9],
 ['03d5', 9],
 ['2cd5', 10],
 ['3cc1', 10]]
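
The selection rule can be sketched as follows. This is the textbook form of winnowing from the Moss paper, not the getWinnow code itself: the function name winnow, the step parameter with default 1, and the use of positions instead of line numbers are my assumptions. The test values are the example sequence from that paper, with a window of 4.

```python
def winnow(hashes, guarantee=12, noise=4, step=1):
    """Pick the lowest, rightmost hash in every window (sketch)."""
    window = guarantee - noise + 1
    picked = []
    last = -1
    for start in range(0, max(len(hashes) - window + 1, 1), step):
        chunk = hashes[start:start + window]
        lowest = min(chunk)
        # Position of the rightmost occurrence of the minimum.
        pos = start + len(chunk) - 1 - chunk[::-1].index(lowest)
        if pos != last:  # record each selected position only once
            picked.append((lowest, pos))
            last = pos
    return picked


sequence = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
fingerprints = winnow(sequence, guarantee=8, noise=5)  # window size 4
```

The same function works on the 4-digit hex strings used in this notebook, because strings of equal length in lowercase hex compare in the same order as their numeric values.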

If we decrease the guarantee or increase the noise, the window becomes smaller. This means that more hash values will be the smallest within some window, so the size of the winnow will increase. Increasing the noise, however, also dampens the number of hash values in the winnow, because the number of windows that are inspected decreases. By increasing the guarantee while keeping the noise level fixed, the algorithm becomes less sensitive to the order of statements and to the order within statements; for instance, "a * b" becomes indistinguishable from "b * a". However, this is also true for "a / b" and "b / a", although they are not functionally equivalent.


In [9]:
print myProgram.getWinnow(guarantee = 12, noise = 6, salt='salt')
print '=== g = 12, n = 6 ======================================='
print myProgram.getWinnow(guarantee = 8, noise = 2, salt='salt')
print '=== g = 8, n = 2 ======================================='
print myProgram.getWinnow(guarantee = 12, noise = 10, salt='salt')
print '=== g = 12, n = 10 ======================================='
print myProgram.getWinnow(guarantee = 4, noise = 1, salt='salt')
print '=== g = 4, n = 1 ======================================='


[['2cd5', 1], ['1d5b', 3], ['03d5', 4], ['03d5', 5], ['0e6f', 5], ['03d5', 6], ['0e6f', 6], ['1d5b', 6], ['03d5', 7], ['1fda', 8], ['1ef7', 8], ['03d5', 9], ['2cd5', 9], ['04b3', 9], ['44d2', 9], ['1d5b', 9], ['03d5', 9], ['2cd5', 10], ['3cc1', 10]]
=== g = 12, n = 6 =======================================
[['2cd5', 1], ['1d5b', 3], ['03d5', 4], ['03d5', 5], ['0e6f', 5], ['04b3', 6], ['03d5', 6], ['0e6f', 6], ['1d5b', 6], ['03d5', 7], ['2cd5', 7], ['44d2', 7], ['1fda', 8], ['1ef7', 8], ['03d5', 9], ['2cd5', 9], ['04b3', 9], ['44d2', 9], ['2264', 9], ['1d5b', 9], ['03d5', 9], ['2cd5', 10], ['3cc1', 10]]
=== g = 8, n = 2 =======================================
[['2cd5', 1], ['6d5a', 3], ['6473', 5], ['04b3', 6], ['0e6f', 6], ['2cd5', 7], ['1ef7', 8], ['44d2', 9], ['2cd5', 9], ['04b3', 9], ['6473', 9], ['03d5', 9], ['3cc1', 10]]
=== g = 12, n = 10 =======================================
[['2cd5', 1], ['44d2', 3], ['1d5b', 3], ['6d5a', 3], ['3bb9', 4], ['339f', 4], ['03d5', 4], ['0e6f', 4], ['1d5b', 4], ['03d5', 5], ['0e6f', 5], ['1d5b', 5], ['46a4', 5], ['04b3', 6], ['03d5', 6], ['2cd5', 6], ['1820', 6], ['0e6f', 6], ['1d5b', 6], ['6d5a', 6], ['3cc1', 7], ['03d5', 7], ['2cd5', 7], ['44d2', 7], ['5171', 7], ['5719', 8], ['1fda', 8], ['1ef7', 8], ['44d2', 8], ['3bb9', 9], ['03d5', 9], ['2cd5', 9], ['44d2', 9], ['3cc1', 9], ['2cd5', 9], ['04b3', 9], ['87bd', 9], ['5171', 9], ['44d2', 9], ['6473', 9], ['6d5a', 9], ['2264', 9], ['1d5b', 9], ['03d5', 9], ['46a4', 9], ['6d5a', 9], ['665e', 10], ['32ac', 10], ['2cd5', 10], ['3cc1', 10], ['44d2', 10], ['6473', 10]]
=== g = 4, n = 1 =======================================

From here on we will look only at the hash codes in the winnow. This part can be retrieved with the winnow2str method.


In [10]:
print myProgram.winnow2str(guarantee = 12, noise = 6, salt='salt')


2cd5 1d5b 03d5 03d5 0e6f 03d5 0e6f 1d5b 03d5 1fda 1ef7 03d5 2cd5 04b3 44d2 1d5b 03d5 2cd5 3cc1 
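
Going from the list of [hash, line number] pairs to this string is a simple join. The helper below is a hypothetical stand-in for winnow2str, assuming it only drops the line numbers and puts a space after every hash, as the output above suggests.

```python
def to_winnow_string(winnow):
    # Drop the line numbers and follow every hash with one space.
    return ''.join(code + ' ' for code, _line in winnow)


pairs = [['2cd5', 1], ['1d5b', 3], ['03d5', 4]]
winnow_string = to_winnow_string(pairs)
```

Every hash then occupies exactly 5 characters (4 hex digits plus a space), which presumably is why the comparison functions later in this notebook use a chunk size of 5.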

Scopes:

Programs define scopes. These scopes, and the names declared within them, can be inspected with the Scope module. The method "printScopes" prints only the dictionaries of names that are translated when going from source to working copy. Keywords, names of modules and names within modules are not translated. For every method and class a separate scope is introduced. For every function call, list or tuple operator ("(", "[" or "{") a separate subscope is introduced. Subscopes share the namespace of their parent scope but can have additional names if these names are defined within the parentheses.


In [12]:
Scope.printScopes()


{'np': 'numpy', 'sketch': 'meth00000'}
{'dirs': 'name00004', 'i': 'name00006', 'k': 'name00001', 'M': 'name00000', 'numrows': 'name00002', 'result': 'name00005', 'numcols': 'name00003'}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
{}
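
For comparison, Python's own view of scopes can be listed with the standard symtable module. This is not how the Scope module works (it does not create subscopes for parentheses, for example); it is only a sketch to show which names the compiler assigns to which scope. The function list_scopes is a name I made up for this example.

```python
import symtable


def list_scopes(source):
    """Return (scope name, sorted identifiers) for every scope in source."""
    top = symtable.symtable(source, '<submission>', 'exec')
    result = []
    stack = [top]
    while stack:
        table = stack.pop()
        result.append((table.get_name(), sorted(table.get_identifiers())))
        stack.extend(table.get_children())
    return result


src = ('import numpy as np\n'
       'def sketch(M, k):\n'
       '    rows = len(M)\n'
       '    return np.zeros((k, rows))\n')
scopes = list_scopes(src)
names_in_sketch = dict(scopes)['sketch']
```

The module scope holds "np" and "sketch", while the function scope holds its parameters and local variables, which corresponds to the first two dictionaries printed above.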

Jsim:

Our friend Jsim from assignment 1. The shingle method is altered to support shingling with bigger chunk sizes, for instance chunks of 5 characters instead of 1. The "longest" method returns the longest part of string 1 that can be found in string 2.


In [13]:
def jsim(S1,S2):
    S1 = set(S1)
    S2 = set(S2)
    try:
        return len(S1.intersection(S2)) * 1.0 / len(S1.union(S2))
    except ZeroDivisionError:
        # Both sets are empty, so there is nothing to compare.
        return 0

def shingle(text, lenght=4, chunksize=5):
    toreturn = []
    for i in range((len(text) / chunksize) - lenght):
        toreturn.append(text[i*chunksize:(i+lenght)*chunksize])
    return toreturn

def longest(strfrom, strin, chunksize=5):
    bigchunk=''
    chunknum = len(strfrom) / chunksize
    for i in range(chunknum):
        for j in range(i+1,chunknum + 1):
            if strin.find(strfrom[chunksize*i:chunksize*j]) >=0 and (j-i) * chunksize > len(bigchunk):
                bigchunk = strfrom[chunksize*i:chunksize*j]
    return bigchunk

Example of the "longest" method call


In [14]:
longest('1234 hyst gdys oiuy hsts kwnw','gdys oiuy hsts iyhg ')


Out[14]:
'gdys oiuy hsts '
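
The jsim and shingle functions can be exercised in the same way. The sketch below restates them so that they also run on Python 3 (integer division with "//"), and spells the parameter "length" instead of "lenght"; the two input strings are made up. With shingles of 2 chunks of 5 characters each, the strings share 2 of their 6 distinct shingles.

```python
def jsim(s1, s2):
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / float(len(union))


def shingle(text, length=4, chunksize=5):
    # Same loop bound as the cell above, with explicit integer division.
    count = len(text) // chunksize - length
    return [text[i * chunksize:(i + length) * chunksize] for i in range(count)]


a = 'aaaa bbbb cccc dddd eeee ffff '
b = 'aaaa bbbb cccc xxxx eeee ffff '
shingles_a = shingle(a, length=2)
shingles_b = shingle(b, length=2)
similarity = jsim(shingles_a, shingles_b)
```

A single changed chunk ("dddd" versus "xxxx") affects every shingle that overlaps it, so longer shingles make the measure more sensitive to small edits.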

Comparison matrix:

Here is some "magic". Given the 147 submissions, this code creates and fills 3 matrices indexed by submission i and submission j. In each matrix, for j > i, it stores:

  • The jsim similarity of the original source codes.
  • The jsim similarity after converting the sourcecodes to working copies.
  • The length of the longest part of the winnow string of j that occurs in the winnow string of i, divided by the length of the winnow string of i.

In [18]:
'''This step takes about 10 seconds with 147 code snippets.'''
from time import time

matrixorg = np.zeros((len(x),len(x)))
matrixstripped = np.zeros((len(x),len(x)))
matrixwinnow = np.zeros((len(x),len(x)))

                

unstripped = {}
stripped = {}
winnowed = {}

salt = str(np.random.randint(100,999))


t = time()

for i in range(len(x)):
    myProgram = Code(x[i][4])
    newCode = myProgram.getWorkingCopy()
    winnow = myProgram.winnow2str(12, noise = 4, salt = salt)
    
    stripped[i] = newCode
    winnowed[i] = winnow
    unstripped[i] = x[i][4]
    Scope.clearScopes()

print 'For converting en winnowing: ', time() - t
t = time()
    
for i in range(len(x)):
    for j in range(i+1,len(x)):
        matrixstripped[i][j] = jsim(shingle(stripped[i], chunksize=5), shingle(stripped[j], chunksize=5))        
        matrixorg[i][j] = jsim(shingle(unstripped[i], chunksize=1),shingle(unstripped[j], chunksize=1))  

        longnum = len(longest(winnowed[j], winnowed[i]))
        if len(winnowed[i]) != 0:
            matrixwinnow[i][j] = float(longnum) / float(len(winnowed[i]))
        
print 'For jsim and calculating longest winnow: ', time() - t


For converting en winnowing:  1.65841913223
For jsim and calculating longest winnow:  3.57578396797
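
The winnow score for a single pair can be sketched in isolation. The function winnow_score is a name I introduce for this example; it wraps the same calculation as the inner loop above, with "longest" restated for Python 3. The two winnow strings are made up.

```python
def longest(strfrom, strin, chunksize=5):
    """Longest run of whole chunks of strfrom that occurs in strin."""
    best = ''
    chunknum = len(strfrom) // chunksize
    for i in range(chunknum):
        for j in range(i + 1, chunknum + 1):
            piece = strfrom[chunksize * i:chunksize * j]
            if piece in strin and len(piece) > len(best):
                best = piece
    return best


def winnow_score(winnow_i, winnow_j):
    # Share of winnow_i that is covered by the longest common run.
    if not winnow_i:
        return 0.0
    return len(longest(winnow_j, winnow_i)) / float(len(winnow_i))


short = '1111 2222 3333 '
padded = '0000 1111 2222 3333 4444 '
score_short_first = winnow_score(short, padded)
score_padded_first = winnow_score(padded, short)
```

The score is not symmetric: a short submission that is fully contained in a longer one scores 1.0 when it is submission i, but only 0.6 in this example when the longer one is submission i. Since the matrix is only filled for j > i, the order of the submissions matters.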

Analyse the differences:

The next cell creates 3 lists of candidates to inspect, one for each of the 3 methods mentioned. After this cell you will find cells that help you analyse the corresponding source code.


In [19]:
def ltab(value,lenght,char=' '):
    return char + (char * lenght + str(value))[-lenght:]

def rtab(value,lenght,char=' '):
    return char + ( str(value) + char * lenght)[0:lenght]

lbound = 0.9
space = '-'
headers = ltab('from',5,space) + space + ltab('to',5,space) + space + ltab('tf',4,space) + space
headers += ltab('tt',3,space) + space + rtab('translated',10,space) + space
headers += rtab('winnow',10,space) + space + rtab('original',10,space) + space

print
print 'solutions with a jackard simularity before translation over ' + str(lbound) 
print headers
print

a = np.where( matrixorg > lbound)
for i in range(len(a[0])):
    if x[a[0][i]][2] != x[a[1][i]][2]:
        if int(x[a[0][i]][1]) in [1,2,5,6]:
            print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10) 

print
print 'solutions with a jackard simularity after translation over ' + str(lbound) 
print headers
print

a = np.where( matrixstripped > lbound)
for i in range(len(a[0])):
    #only compare submissions from different students
    if x[a[0][i]][2] != x[a[1][i]][2]:
        #Don't show tasks without real question
        if int(x[a[0][i]][1]) in [1,2,5,6]:
            print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10) 

print
print 'solutions with a winnow simularity after translation over ' + str(lbound) 
print headers
print

a = np.where( matrixwinnow > lbound)
for i in range(len(a[0])):
    #only compare submissions from different students
    if x[a[0][i]][2] != x[a[1][i]][2]:
        #Don't show tasks without real question
        if int(x[a[0][i]][1]) in [1,2,5,6] and x[a[0][i]][1] == x[a[1][i]][1]:
            print ltab(a[0][i],5), ltab(a[1][i],5), ltab(x[a[0][i]][1],4), ltab(x[a[1][i]][1],3), rtab(matrixstripped[a[0][i]][a[1][i]],10), rtab(matrixwinnow[a[0][i]][a[1][i]],10), rtab(matrixorg[a[0][i]][a[1][i]],10)


solutions with a jackard simularity before translation over 0.9
--from-----to----tf---tt--translated--winnow------original---

    47     55     1    1  1.0         1.0         1.0       
    60     68     2    2  1.0         1.0         0.99127906
    62     69     2    2  1.0         1.0         0.99178082
    64     67     2    2  0.85906040  0.30434782  0.95402298
    97    105     5    5  1.0         1.0         1.0       
    98    102     5    5  1.0         1.0         0.95454545
   101    104     5    5  1.0         1.0         1.0       
   109    118     1    1  1.0         1.0         0.97452229
   122    131     2    2  1.0         1.0         1.0       
   127    130     2    2  1.0         1.0         1.0       

solutions with a jackard simularity after translation over 0.9
--from-----to----tf---tt--translated--winnow------original---

    47     55     1    1  1.0         1.0         1.0       
    60     68     2    2  1.0         1.0         0.99127906
    62     69     2    2  1.0         1.0         0.99178082
    97    105     5    5  1.0         1.0         1.0       
    98    102     5    5  1.0         1.0         0.95454545
   101    104     5    5  1.0         1.0         1.0       
   109    118     1    1  1.0         1.0         0.97452229
   114    117     1    1  1.0         1.0         0.86363636
   122    131     2    2  1.0         1.0         1.0       
   127    130     2    2  1.0         1.0         1.0       

solutions with a winnow simularity after translation over 0.9
--from-----to----tf---tt--translated--winnow------original---

    47     55     1    1  1.0         1.0         1.0       
    51     54     1    1  0.00970873  1.0         0.80232558
    60     68     2    2  1.0         1.0         0.99127906
    62     69     2    2  1.0         1.0         0.99178082
    97    105     5    5  1.0         1.0         1.0       
    98    102     5    5  1.0         1.0         0.95454545
    99    106     5    5  0.4125      1.0         0.33236994
   101    104     5    5  1.0         1.0         1.0       
   109    118     1    1  1.0         1.0         0.97452229
   112    119     1    1  0.9         1.0         0.85123966
   114    117     1    1  1.0         1.0         0.86363636
   122    131     2    2  1.0         1.0         1.0       
   127    130     2    2  1.0         1.0         1.0       

In [21]:
from difflib import Differ
from pprint import pprint

def inspect(nb1, nb2):
    print matrixstripped[nb1, nb2], matrixwinnow[nb1,nb2], matrixorg[nb1,nb2]
    pprint (x[nb1][4].split('\n'))
    print '== program 1=====================================\n'
    pprint (x[nb2][4].split('\n'))
    print '== program 2 ====================================\n'
    myProgram1 = Code(x[nb1][4])
    myProgram2 = Code(x[nb2][4])
    
    print myProgram1.getWorkingCopy()
    print '== Working copy 1 ===============================\n' 
    print myProgram2.getWorkingCopy()
    print '== Working copy 2 ===============================\n'

    print myProgram1.getHashCopy()
    print '== Hash copy 1 ===============================\n' 
    print myProgram2.getHashCopy()
    print '== Hash copy 2 ===============================\n'
    
    print myProgram1.winnow2str(5)
    print '== Winnow string 1 ==============================\n'
    print myProgram2.winnow2str(5)
    print '== Winnow string 2 ==============================\n'
    
    print differ(myProgram1.getWorkingCopy(),myProgram2.getWorkingCopy())
    
    
inspect (51,54)
#inspect(120,121)
#inspect(13,25)

#inspect(114,130)
#inspect(114,115)
#inspect(125,132)

#inspect(51,54)
#inspect(112,132)
#inspect(5,11)
#inspect(5,11)
#inspect(48,58)
#inspect(99,106)
#inspect(114,117)
#inspect(109,131)


0.00970873786408 1.0 0.802325581395
['import sets',
 '',
 'def jsim(s1, s2):',
 '    set1 = sets.ImmutableSet(s1)',
 '    set2 = sets.ImmutableSet(s2)',
 '    ',
 '    intersection_len = len(set1.intersection(set2))',
 '    union_len = len(set1.union(set2))',
 '    ',
 '    return float(intersection_len) / float(union_len)']
== program 1=====================================

['import scipy',
 'import numpy',
 'import sets',
 'import random',
 'import math',
 '',
 'def jsim(s1, s2):',
 '    set1 = sets.ImmutableSet(s1)',
 '    set2 = sets.ImmutableSet(s2)',
 '    ',
 '    intersection_len = len(set1.intersection(set2))',
 '    union_len = len(set1.union(set2))',
 '    ',
 '    return float(intersection_len) / float(union_len)']
== program 2 ====================================

import sets
def meth00000(name00000,name00001):
    name00002=sets.ImmutableSet(name00000)
    name00003=sets.ImmutableSet(name00001)
    name00004=len(name00002.name00000(name00003))
    name00005=len(name00002.name00000(name00003))
    return float(name00004)/float(name00005)
== Working copy 1 ===============================

import scipy
import numpy
import sets
import random
import math
def meth00000(name00000,name00001):
    name00002=sets.ImmutableSet(name00000)
    name00003=sets.ImmutableSet(name00001)
    name00004=len(name00002.name00000(name00003))
    name00005=len(name00002.name00000(name00003))
    return float(name00004)/float(name00005)
== Working copy 2 ===============================

l0000001 a9e2 9347 178d
l0000003 a9e2 4ed9 813c 84c4 2dfc c0cb 139a 9371 853a
l0000004 9ce8 5152 43ec 178d 5058 898b 84c4 2dfc 9371
l0000005 a9e2 c44d 43ec 178d 5058 898b 84c4 139a 9371
l0000007 a9e2 1efa 43ec f5a8 84c4 5152 5058 2dfc 84c4 c44d 9371 9371
l0000008 a9e2 d636 43ec f5a8 84c4 5152 5058 2dfc 84c4 c44d 9371 9371
l0000010 a9e2 e70c 546a 84c4 1efa 9371 6666 546a 84c4 d636 9371

== Hash copy 1 ===============================

l0000001 a9e2 9347 10ea
l0000002 a9e2 9347 2ea9
l0000003 a9e2 9347 178d
l0000004 a9e2 9347 7ddf
l0000005 a9e2 9347 7e67
l0000007 a9e2 4ed9 813c 84c4 2dfc c0cb 139a 9371 853a
l0000008 9ce8 5152 43ec 178d 5058 898b 84c4 2dfc 9371
l0000009 a9e2 c44d 43ec 178d 5058 898b 84c4 139a 9371
l0000011 a9e2 1efa 43ec f5a8 84c4 5152 5058 2dfc 84c4 c44d 9371 9371
l0000012 a9e2 d636 43ec f5a8 84c4 5152 5058 2dfc 84c4 c44d 9371 9371
l0000014 a9e2 e70c 546a 84c4 1efa 9371 6666 546a 84c4 d636 9371

== Hash copy 2 ===============================

178d 2dfc 139a 43ec 178d 2dfc 178d 139a 1efa 43ec 2dfc 84c4 9371 43ec 2dfc 84c4 9371 546a 1efa 546a 
== Winnow string 1 ==============================

10ea 2ea9 178d 7ddf 4ed9 2dfc 139a 43ec 178d 2dfc 178d 139a 1efa 43ec 2dfc 84c4 9371 43ec 2dfc 84c4 9371 546a 1efa 546a 
== Winnow string 2 ==============================

--- 
+++ 
@@ -1,4 +1,8 @@
+import scipy
+import numpy
 import sets
+import random
+import math
 def meth00000(name00000,name00001):
     name00002=sets.ImmutableSet(name00000)
     name00003=sets.ImmutableSet(name00001)


In [22]:
myProgram.getWinnow(10,noise=1)


Out[22]:
[['2dfc', 4], ['2ea9', 6], ['2dfc', 6], ['2dfc', 8], ['0fbd', 8]]

In [23]:
import numpy as numpy
def meth00000(name00000,name00001):
    name00002=len(name00000)
    name00003=len(name00000[0])
    name00004=numpy.random.randn(len(name00000),name00001)
    name00005=numpy.array([])
    for name00006 in range(name00001):
        name00005=numpy.concatenate((name00005,numpy.sign(numpy.sum(numpy.transpose(numpy.tile(name00004[:,name00006],(name00003,1)))*name00000,axis=0))))
    return numpy.reshape(name00005,(name00001,name00003))

docs = np.array([[1,-2,3],[-3,2,-1],[-2,3,1],[3,-3,-3]])
sketches = meth00000(docs,10000)

In [24]:
sketches


Out[24]:
array([[ 1., -1.,  1.],
       [ 1., -1.,  1.],
       [-1.,  1.,  1.],
       ..., 
       [ 1., -1.,  1.],
       [ 1., -1., -1.],
       [ 1., -1., -1.]])

In [25]:
from Code import Code
import Scope
Scope.clearScopes()
myProgram1 = Code(x[96][4])
myProgram2 = Code(x[106][4])
print myProgram2.getWorkingCopy()
print myProgram.getWorkingCopy()


def meth00000(name00000,name00001):
    name00002=0
    for name00003 in range(0,len(name00000)):
        if(name00000[name00003]==name00001[name00003]):
            name00002+=1
    return float(name00002)/float(len(name00000))
import numpy as numpy
def meth00000(name00000):
    index=numpy.random.randint(len(name00000))
    return name00000[index]

In [28]:
len(Scope.Scope.scopes)


Out[28]:
0

In [27]:
myProgram1 = ''
