Description of experiments on homework corpus

We have five homeworks, each with at least 80 notebooks. Given a notebook, we want to know whether we can predict which homework it came from. We test three different feature sets, use a Random Forest with 400 estimators and maximum depth 2 as our classifier, and evaluate with 10-fold cross-validation.
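For concreteness, the evaluation setup corresponds roughly to the sketch below (an illustration, not a cell from the original notebook). The data here is a random placeholder standing in for whichever feature matrix and homework labels are being tested, and the Random Forest settings are read as n_estimators=400 and max_depth=2.


In [ ]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the real features and labels:
# 400 notebooks x 50 features, labels 'hw1'..'hw5' (80 notebooks per homework)
rng = np.random.default_rng(0)
X = rng.random((400, 50))
y = np.repeat(['hw1', 'hw2', 'hw3', 'hw4', 'hw5'], 80)

clf = RandomForestClassifier(n_estimators=400, max_depth=2)
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())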

Generating the corpus

The following is the code used to generate the corpus, along with comments describing it. Feature extraction is then done using a pipeline object inspired by sklearn.


In [ ]:
import numpy as np
# NotebookMiner and Features are classes defined in this project's own modules

# Load the filenames of the homework notebooks
hw_filenames = np.load('homework_file_names.npy')
# Parse each notebook, keeping the first 80 notebooks of each homework so the classes are balanced
hw_notebooks = [[NotebookMiner(filename) for filename in temp[:80]] for temp in hw_filenames]

# For each homework, load all notebooks into the corpus. The second argument serves as a tag
# for each notebook added.
corpus = Features(hw_notebooks[0], 'hw1')
corpus.add_notebooks(hw_notebooks[1], 'hw2')
corpus.add_notebooks(hw_notebooks[2], 'hw3')
corpus.add_notebooks(hw_notebooks[3], 'hw4')
corpus.add_notebooks(hw_notebooks[4], 'hw5')

Baseline


In [ ]:
'''
Step 1: GetASTFeatures: This class is responsible for computing additional features for each cell
of our notebook. The original information is the source code, so this class adds features
derived from the AST, including the AST itself.

'''
gastf = GetASTFeatures()
'''
Step 2: ResampleByNode: This class resamples the notebooks using the AST feature created in the
previous step. Each cell parses into n distinct trees (one per top-level statement), so this
class splits each cell into n parts, which lets us perform more fine-grained operations.
'''
rbn = ResampleByNode()
'''
Step 3: GetImports: This class works on the ASTs to normalize variable names and to gather
information about the notebook's imports and the functions called in each line of code.
'''
gi = GetImports()
'''
Pipeline: The pipeline collects the above classes and runs our corpus through each one
sequentially.
'''
pipe = Pipeline([gastf, rbn, gi])
corpus = pipe.transform(corpus)
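Conceptually, the Pipeline object behaves like the minimal sketch below; this is an illustration of the interface, not the project's actual implementation.


In [ ]:
class Pipeline:
    # Conceptual sketch only: run the corpus through each step's transform in order
    def __init__(self, steps):
        self.steps = steps

    def transform(self, corpus):
        for step in self.steps:
            corpus = step.transform(corpus)
        return corpus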

RESULTS

For the baseline, we used the function names gathered by 'GetImports' as our features (1687 distinct functions called across the corpus), vectorized them with sklearn's CountVectorizer, and obtained a mean accuracy of 0.32 with 10-fold cross-validation (chance for five balanced classes is 0.20).
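The baseline feature extraction looks roughly like the sketch below. In the real experiment each document is the list of function names called by one notebook; the toy documents and labels here are stand-ins for illustration only.


In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Toy stand-ins: one space-separated string of called function names per notebook,
# plus a homework label per notebook (the real corpus has 1687 distinct function names)
docs = ['read_csv head plot', 'read_csv groupby plot', 'print len range', 'print open read']
labels = ['hw1', 'hw1', 'hw2', 'hw2']

vec = CountVectorizer()
X = vec.fit_transform(docs)  # notebooks x distinct-function-name counts

clf = RandomForestClassifier(n_estimators=400, max_depth=2)
scores = cross_val_score(clf, X, labels, cv=2)  # cv=10 in the real experiment
print(scores.mean())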

Simple Templates

Now, our intuition is that it matters not only which functions are called, but also which functions are called near each other, so we created an encoding scheme that maps each line of code to a higher-level feature, a 'template'. The method iteratively collapses the leaves of each AST, based on commonly occurring leaf patterns, until only one node is left in the tree; a toy sketch of the idea follows.
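The sketch below shows the collapsing idea on a single line of code. It folds every leaf into its parent on each round, whereas the real ASTGraphReducer collapses based on commonly occurring leaf patterns across the corpus, so this is only an approximation of the general mechanism.


In [ ]:
import ast

def to_tree(node):
    # Nested (node-type, children) view of a Python AST node
    return (type(node).__name__, [to_tree(c) for c in ast.iter_child_nodes(node)])

def collapse_leaves(tree):
    # One round of collapsing: fold every leaf child into its parent's label
    label, children = tree
    kept, folded = [], []
    for child in children:
        child = collapse_leaves(child)
        (kept if child[1] else folded).append(child)
    if folded:
        label = label + '(' + ','.join(c[0] for c in folded) + ')'
    return (label, kept)

def toy_template(line_of_code):
    # Keep collapsing until a single node remains; its label plays the role of the template
    tree = to_tree(ast.parse(line_of_code).body[0])
    while tree[1]:
        tree = collapse_leaves(tree)
    return tree[0]

print(toy_template("df = pd.read_csv('data.csv')"))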


In [ ]:
gastf = GetASTFeatures()
rbn = ResampleByNode()
gi = GetImports()
'''
Step 4: ASTGraphReducer: This is the class responsible for generating the templates from the
ASTs. For each AST, it creates a feature corresponding to the template that 'covers' that
line of code (or None if the AST could not be reduced).
'''
# `a` is assumed to be defined elsewhere in the notebook (not shown in this excerpt)
agr = ASTGraphReducer(a, threshold=8, split_call=False)
pipe = Pipeline([gastf, rbn, gi, agr])
corpus = pipe.transform(corpus)

RESULTS

For this classifier, we used the templates from 'ASTGraphReducer' as our features (1188 distinct templates), vectorized them with sklearn's CountVectorizer, and obtained a mean accuracy of 0.35 with 10-fold cross-validation.

Higher Order Templates

While the templates improved on our baseline, we wanted to take further advantage of locality by creating higher-order templates corresponding to commonly co-occurring simple templates. To do this, we exploited the natural split of notebooks into cells and mined frequent itemsets, treating the cells as buckets and the templates as items. Then, for each notebook, we determined which frequent itemsets appeared in any cell of that notebook and used these distinct frequent itemsets as our features; the sketch below illustrates the itemset step.
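As an illustration of the itemset step alone, this sketch uses the third-party mlxtend package as a stand-in for the project's FrequentItemsets class, with made-up template names; the per-notebook feature is a binary indicator of whether any cell of the notebook contains the whole itemset.


In [ ]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Toy data: each cell is the list of templates appearing in it (template names are made up)
cells = [
    ['T_read_csv', 'T_head'],
    ['T_read_csv', 'T_groupby', 'T_plot'],
    ['T_groupby', 'T_plot'],
]

te = TransactionEncoder()
onehot = te.fit(cells).transform(cells)
freq = apriori(pd.DataFrame(onehot, columns=te.columns_),
               min_support=0.5, use_colnames=True)

# For a single notebook: one binary feature per frequent itemset,
# set to 1 if any cell of that notebook contains the entire itemset
notebook_cells = cells  # in practice, the cells of one notebook
features = [int(any(itemset.issubset(cell) for cell in notebook_cells))
            for itemset in freq['itemsets']]
print(features)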


In [ ]:
gastf = GetASTFeatures()
rbn = ResampleByNode()
gi = GetImports()
agr = ASTGraphReducer(a, threshold=8, split_call=False)
'''
Step 5: FrequentItemsets: This class is responsible for computing the frequent itemsets.
'''
fi = FrequentItemsets()
pipe = Pipeline([gastf, rbn, gi, agr, fi])
corpus = pipe.transform(corpus)

RESULTS

For this classifier, we used the frequent itemsets from 'FrequentItemsets' as our features (9030 distinct frequent itemsets), vectorized them with sklearn's CountVectorizer, and obtained a mean accuracy of 0.82 with 10-fold cross-validation.

