We have five homeworks, with a minimum of 80 notebooks per homework. Given a notebook, we want to know whether we can predict which homework it came from. We test three different feature sets, use a Random Forest (400 trees, maximum depth 2) as our classifier, and evaluate with 10-fold cross-validation.
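As a rough sketch of this evaluation setup (with placeholder features and labels standing in for the real feature sets built below), the scikit-learn equivalent looks like this:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix and labels: 400 notebooks (80 per homework), 20 features each.
# In the real experiment, the features come from one of the three pipelines below.
X = np.random.rand(400, 20)
y = np.repeat(['hw1', 'hw2', 'hw3', 'hw4', 'hw5'], 80)

clf = RandomForestClassifier(n_estimators=400, max_depth=2, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())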
In [ ]:
import numpy as np

# Load the filenames of the collected homework notebooks
hw_filenames = np.load('homework_file_names.npy')
# Parse the first 80 notebooks of each homework (NotebookMiner, Features, and the
# pipeline classes below come from our notebook-analysis library)
hw_notebooks = [[NotebookMiner(filename) for filename in temp[:80]] for temp in hw_filenames]
# For each homework, load all notebooks into the corpus. The second argument serves as a tag
# for each notebook added.
corpus = Features(hw_notebooks[0], 'hw1')
corpus.add_notebooks(hw_notebooks[1], 'hw2')
corpus.add_notebooks(hw_notebooks[2], 'hw3')
corpus.add_notebooks(hw_notebooks[3], 'hw4')
corpus.add_notebooks(hw_notebooks[4], 'hw5')
In [ ]:
'''
Step 1: GetASTFeatures: This class derives additional features for each cell of our
notebooks. The original information is the source code, so this step adds features
related to the AST, including the AST itself.
'''
gastf = GetASTFeatures()
'''
Step 2: ResampleByNode: This class resamples the notebooks using the AST feature created in
Step 1. Each cell has n distinct trees in its AST (one per top-level line of code), so this
class splits each cell into n parts, allowing more fine-grained operations (see the short
ast example after this cell).
'''
rbn = ResampleByNode()
'''
Step 3: GetImports: This class works on the ASTs to normalize variable names and to gather
information about the notebook's imports and the functions called in each line of code.
'''
gi = GetImports()
'''
Pipeline: The pipeline collects the above classes and runs our corpus through each one
sequentially.
'''
pipe = Pipeline([gastf, rbn, gi])
corpus = pipe.transform(corpus)
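To make Step 2 concrete, the "n distinct trees per cell" granularity can be seen with the standard library's ast module (an illustrative example only; the pipeline's own AST features may be richer):

import ast

# A cell with three top-level statements.
cell_source = "import pandas as pd\ndf = pd.read_csv('data.csv')\ndf.head()"

# ast.parse produces one top-level node per statement, which is the granularity
# the resampling step splits each cell into.
module = ast.parse(cell_source)
for node in module.body:
    print(type(node).__name__, ast.dump(node)[:40])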
Now, we have the intuition that it matters not only which functions are called, but also which functions are called near each other, so we created an encoding system for lines of code that produces a higher-level feature. This method iteratively collapses the leaves of each AST, merging commonly occurring leaves, until only one node is left in the tree.
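A toy sketch of the collapsing idea is below. The collapse_leaves helper is hypothetical, not the library's implementation: it folds every leaf into its parent, whereas the real reducer only folds leaf patterns that occur frequently across the corpus (controlled by a threshold).

def collapse_leaves(tree):
    """One pass: absorb every leaf child into its parent's label."""
    label, children = tree
    kept = []
    for child in children:
        child_label, grandchildren = child
        if grandchildren:
            kept.append(collapse_leaves(child))
        else:
            label = f"{label}({child_label})"   # fold the leaf into the parent
    return (label, kept)

# Toy AST for `df = pd.read_csv(path)`, encoded as (label, children) pairs.
tree = ("Assign", [("Name", []), ("Call", [("Attribute", []), ("Name", [])])])
while tree[1]:                       # keep collapsing until a single node remains
    tree = collapse_leaves(tree)
print(tree[0])                       # Assign(Name)(Call(Attribute)(Name)) -- the template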
In [ ]:
gastf = GetASTFeatures()
rbn = ResampleByNode()
gi = GetImports()
'''
Step 4: ASTGraphReducer: This class generates the templates from the ASTs. For each AST,
it creates a feature that corresponds to the template that 'covers' that line of code
(or None if the AST could not be reduced).
'''
agr = ASTGraphReducer(a, threshold=8, split_call=False)
pipe = Pipeline([gastf, rbn, gi, agr])
corpus = pipe.transform(corpus)
While the templates improved on our baseline, we wanted to take further advantage of locality by creating higher-order templates corresponding to commonly co-occurring simple templates. To do this, we took advantage of the natural split of the notebooks into cells, and found frequent itemsets using the cells as the baskets and the templates as the items. Then, for each notebook, we determined which frequent itemsets appeared in any cell of that notebook, and used the list of distinct frequent itemsets as our features.
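A sketch of the same idea using mlxtend's apriori on hypothetical per-cell template sets (the FrequentItemsets class may work differently internally, and the template ids here are made up):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Hypothetical data: each notebook is a list of cells, each cell a set of template ids.
notebooks = [
    [{"T1", "T2"}, {"T2", "T3"}],          # notebook A
    [{"T1", "T2", "T3"}, {"T1", "T2"}],    # notebook B
]

# Cells are the baskets, templates are the items.
cells = [sorted(cell) for nb in notebooks for cell in nb]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(cells).transform(cells), columns=te.columns_)
itemsets = apriori(onehot, min_support=0.5, use_colnames=True)["itemsets"]

# A notebook's feature vector marks which frequent itemsets appear in at least one of its cells.
features = [
    [any(items <= cell for cell in nb) for items in itemsets]
    for nb in notebooks
]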
In [ ]:
gastf = GetASTFeatures()
rbn = ResampleByNode()
gi = GetImports()
agr = ASTGraphReducer(a, threshold=8, split_call=False)
'''
Step 5: FrequentItemsets: This class computes the frequent itemsets of templates, using cells as the baskets.
'''
fi = FrequentItemsets()
pipe = Pipeline([gastf, rbn, gi, agr, fi])
corpus = pipe.transform(corpus)
In [ ]: