Here we explore some off-the-shelf learning strategies for identifying datasets to be flagged based on the results of the AutoQC suite of tests.
See learning-0.0.1 pip_freeze.dat for package versions. The full dataset was processed by AutoQC, and the results logged as JSON serializations.
In [5]:
import json

def reloadData():
    ## read raw data
    with open('../../../AutoQC_raw/true.dat') as true_data:
        truth = json.load(true_data)
    with open('../../../AutoQC_raw/results.dat') as results_data:
        rawResults = json.load(results_data)
    return truth, rawResults

truth, rawResults = reloadData()
datasetSize = len(truth)
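For reference, the snippet below sketches the data layout these objects are assumed to have, inferred from how they are used later in this notebook; the example values are hypothetical. rawResults holds one list of boolean outcomes per AutoQC test, indexed by profile, and truth holds one expert judgement per profile.
exampleRawResults = [
    [False, True, False],   # test 0 evaluated on profiles 0, 1, 2
    [False, False, True]    # test 1 evaluated on profiles 0, 1, 2
]
exampleTruth = [False, True, True]  # expert judgement for profiles 0, 1, 2
# so rawResults[i][j] is the verdict of test i on profile j, and datasetSize == len(truth) == number of profiles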
In [6]:
import random
import numpy as np

def shuffleLists(a, b):
    '''
    given two lists a, b, shuffle them maintaining pairwise correspondence.
    thanks http://stackoverflow.com/questions/13343347/randomizing-two-lists-and-maintaining-order-in-python
    '''
    combined = zip(a, b)
    random.seed(2154)
    random.shuffle(combined)
    a[:], b[:] = zip(*combined)

def transpose(lists):
    '''
    return the transpose of lists, a list of lists.
    all the inner lists had better be the same length!
    '''
    T = []
    for i in range(len(lists[0])):
        T.append([None]*len(lists))
    for i in range(len(lists)):
        for j in range(len(lists[0])):
            T[j][i] = lists[i][j]
    return T

def runClassifier(classifier, trainingSize):
    '''
    given a scikit-learn classifier, train it on the first trainingSize points of data and truth,
    and return the counts of correct flags (TT), false positives (TF), false negatives (FT)
    and correct passes (FF) for predictions on the remainder of the data.
    '''
    #load and arrange data
    truth, rawResults = reloadData()
    data = transpose(rawResults) #arrange data into rows by profile for consumption by scikit-learn
    shuffleLists(data, truth)    #randomize order of profiles
    #train classifier
    classifier.fit(data[0:trainingSize], truth[0:trainingSize])
    #predict values for remainder of profiles
    TT = 0.
    TF = 0.
    FT = 0.
    FF = 0.
    for i in range(trainingSize, len(truth)):
        assessment = classifier.predict(data[i])
        if assessment and truth[i]:
            TT += 1
        elif assessment and not truth[i]:
            TF += 1
        elif not assessment and truth[i]:
            FT += 1
        elif not assessment and not truth[i]:
            FF += 1
    return TT, TF, FT, FF

def printSummary(title, TT, TF, FT, FF):
    print title
    print '\t Correct flags:', TT
    print '\t False positive:', TF
    print '\t False negative:', FT
    print '\t Correct pass:', FF

trainingSize = 5000
In [7]:
for i in range(len(rawResults)):
    TT = 0.
    TF = 0.
    FT = 0.
    FF = 0.
    for j in range(len(rawResults[i])):
        if rawResults[i][j] and truth[j]:
            TT += 1
        elif rawResults[i][j] and not truth[j]:
            TF += 1
        elif not rawResults[i][j] and truth[j]:
            FT += 1
        elif not rawResults[i][j] and not truth[j]:
            FF += 1
    printSummary(i, TT/len(truth), TF/len(truth), FT/len(truth), FF/len(truth))
Row 7, corresponding to the EN_background_check test, gives the best standalone performance: around 5.7% of the entire dataset is correctly flagged, another 5.9% is missed as false negatives, and about 4.3% is incorrectly flagged as false positives.
Next, we consider the performance of raising a flag on a profile if any of the underlying tests do so:
In [8]:
truth, rawResults = reloadData()
data = transpose(rawResults)

TT = 0.
TF = 0.
FT = 0.
FF = 0.
for i in range(len(truth)):
    anyFlag = sum(data[i]) > 0
    if anyFlag and truth[i]:
        TT += 1
    elif anyFlag and not truth[i]:
        TF += 1
    elif not anyFlag and truth[i]:
        FT += 1
    elif not anyFlag and not truth[i]:
        FF += 1

printSummary('any flag', TT/len(truth), TF/len(truth), FT/len(truth), FF/len(truth))
This gives an improved flag rate, at the expense of about twice as many false positives.
In this section, we explore several of the individual classifiers presented by scikit-learn. We attempt to remain as parameter-agnostic as possible at this stage, using defaults wherever possible.
None of the classifiers investigated outperforms flagging a profile that is flagged by any of the underlying tests. The classifiers that perform comparably are the linear-kernel SVM, the quadratic discriminant, kernel ridge, SGD, nearest centroid and the decision tree.
These classifiers will form the basis of further inquiry in the next section, on ensemble methods.
First we examine the performance of scikit-learn's SVM with each of its built-in kernels, keeping default parameters otherwise. Notably, randomizing the order of the data before training was necessary to avoid systematics from sorting; this classifier performed substantially worse when trained on non-randomized data.
In [9]:
from sklearn import svm
#linear kernel
TT, TF, FT, FF = runClassifier(svm.SVC(kernel='linear'), trainingSize)
printSummary('SVM with linear kernel', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [10]:
#polynomial kernel
TT, TF, FT, FF = runClassifier(svm.SVC(kernel='poly'), trainingSize)
printSummary('SVM with polynomial kernel', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [11]:
#rbf kernel
TT, TF, FT, FF = runClassifier(svm.SVC(kernel='rbf'), trainingSize)
printSummary('SVM with rbf kernel', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [12]:
#sigmoid kernel
TT, TF, FT, FF = runClassifier(svm.SVC(kernel='sigmoid'), trainingSize)
printSummary('SVM with sigmoid kernel', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
So the out-of-the-box linear SVM performs comparably to EN_background, but with a lower false positive rate. A naive interpretation is that the SVM learns EN_background is the best single predictor, and uses the other tests to veto some of its false positives.
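One way to check that interpretation, not done in the original run, would be to inspect the fitted linear SVM's weights; the sketch below retrains a linear SVM in the same way runClassifier does, and the variable name fittedSVM is our own.
import numpy as np
from sklearn import svm

truth, rawResults = reloadData()
data = transpose(rawResults)     #rows by profile, as elsewhere
shuffleLists(data, truth)
fittedSVM = svm.SVC(kernel='linear')
fittedSVM.fit(data[0:trainingSize], truth[0:trainingSize])
weights = fittedSVM.coef_[0]     # one weight per underlying AutoQC test
for idx in np.argsort(np.abs(weights))[::-1]:
    print idx, weights[idx]      # largest-magnitude weights mark the dominant tests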
Next we explore the discriminant analysis techniques presented by scikit-learn.
In [13]:
from sklearn.lda import LDA
TT, TF, FT, FF = runClassifier(LDA(solver="svd"), trainingSize)
printSummary('Linear discriminant', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [14]:
from sklearn.qda import QDA
TT, TF, FT, FF = runClassifier(QDA(), trainingSize)
printSummary('Quadratic discriminant', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
So we see the linear discriminant behaves very comparably to the linear SVM, while the quadratic discriminant gives the best flagging performance so far, albeit at the cost of a substantially higher false positive rate.
Given the efficacy of the SVM, other kernel-trick based algorithms are worth exploring; here we try the kernel ridge algorithm.
In [15]:
from sklearn.kernel_ridge import KernelRidge
TT, TF, FT, FF = runClassifier(KernelRidge(kernel='linear'), trainingSize)
printSummary('Kernel Ridge with linear kernel', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
Next we consider the SGD classifier, exploring a few combinations of loss function and penalty.
In [16]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="hinge", penalty="l2"), trainingSize)
printSummary('SGD with hinge loss & L2 penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [17]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="hinge", penalty="elasticnet"), trainingSize)
printSummary('SGD with hinge loss & elasticnet penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [38]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="modified_huber", penalty="l2"), trainingSize)
printSummary('SGD with modified huber loss & L2 penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [19]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="modified_huber", penalty="elasticnet"), trainingSize)
printSummary('SGD with modified huber loss & elasticnet penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [20]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="log", penalty="l2"), trainingSize)
printSummary('SGD with logistic loss & L2 penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [21]:
from sklearn.linear_model import SGDClassifier
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="log", penalty="elasticnet"), trainingSize)
printSummary('SGD with logistic loss & elasticnet penalty', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
The best of these are comparable to the linear-kernel SVM or the linear discriminant analysis. Note that these classifiers produce very different results on re-execution; more investigation is required.
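The run-to-run variation is plausibly down to SGD's reliance on random shuffling during optimization; one way to make repeated runs comparable, offered here only as a suggestion and not done above, is to pin the classifier's random_state:
from sklearn.linear_model import SGDClassifier
# same call as above, but with the internal shuffling seeded so repeated runs agree
TT, TF, FT, FF = runClassifier(SGDClassifier(loss="hinge", penalty="l2", random_state=2154), trainingSize)
printSummary('SGD with hinge loss & L2 penalty, fixed random_state', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))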
Next we explore kNN classification; we restrict ourselves to k-nearest-neighbour techniques, as the dimensionality of the input space is large (and growing). We explore a logarithmic range of k values to get a gross sense of the effect of this choice.
In [22]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(10, weights='uniform'), trainingSize)
printSummary('kNN, k=10', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [23]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(100, weights='uniform'), trainingSize)
printSummary('kNN, k=100', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [24]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(1000, weights='uniform'), trainingSize)
printSummary('kNN, k=1000', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
Perhaps unsurprisingly, increasing k pushes the algorithm to never raise a flag; for large k, kNN essentially takes the majority result of the dataset, which is mostly no-flag.
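This is easy to sanity-check against the overall flag rate in the truth labels; a quick check along these lines (not part of the original run):
# fraction of profiles flagged in the expert judgements; a small value would explain
# why large-k kNN converges on never flagging anything
print 'flagged fraction of dataset:', float(sum(truth)) / len(truth)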
In [25]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(10, weights='distance'), trainingSize)
printSummary('kNN, k=10, distance weighted', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [26]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(100, weights='distance'), trainingSize)
printSummary('kNN, k=100, distance weighted', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In [27]:
from sklearn import neighbors
TT, TF, FT, FF = runClassifier(neighbors.KNeighborsClassifier(1000, weights='distance'), trainingSize)
printSummary('kNN, k=1000, distance weighted', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
Distance weighting improves performance, suggesting some clustering of flagged data in the input space.
Nearest centroid is a variant of the nearest-neighbour family of algorithms. Scikit-learn advertises it as a good baseline classifier, owing to its lack of parameterization.
In [28]:
from sklearn.neighbors.nearest_centroid import NearestCentroid
TT, TF, FT, FF = runClassifier(NearestCentroid(), trainingSize)
printSummary('Nearest Centroid', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
Performance is comparable to kernel ridge and QDA, but with a slightly lower false positive rate. Shrunken-threshold variants do not yield substantially different results.
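For completeness, the shrunken-threshold variant referred to above can be run through the same harness; a minimal sketch, with an arbitrary choice of threshold:
from sklearn.neighbors.nearest_centroid import NearestCentroid
TT, TF, FT, FF = runClassifier(NearestCentroid(shrink_threshold=0.1), trainingSize)
printSummary('Nearest Centroid, shrink_threshold=0.1', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))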
A major class of classifiers is the decision tree, which we examine here.
In [29]:
from sklearn import tree
TT, TF, FT, FF = runClassifier(tree.DecisionTreeClassifier(), trainingSize)
printSummary('Decision Tree', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
In line with some of the best classifiers examined so far.
Next we explore the collection of naive Bayes models provided by scikit-learn.
In [30]:
from sklearn.naive_bayes import GaussianNB
TT, TF, FT, FF = runClassifier(GaussianNB(), trainingSize)
printSummary('Gaussian naive Bayes', TT/(len(truth)-trainingSize), TF/(len(truth)-trainingSize), FT/(len(truth)-trainingSize), FF/(len(truth)-trainingSize))
In [31]:
from sklearn.naive_bayes import MultinomialNB
TT, TF, FT, FF = runClassifier(MultinomialNB(), trainingSize)
printSummary('Multinomial naive Bayes', TT/(len(truth)-trainingSize), TF/(len(truth)-trainingSize), FT/(len(truth)-trainingSize), FF/(len(truth)-trainingSize))
In [32]:
from sklearn.naive_bayes import BernoulliNB
TT, TF, FT, FF = runClassifier(BernoulliNB(), trainingSize)
printSummary('Bernoulli naive Bayes', TT/(len(truth)-trainingSize), TF/(len(truth)-trainingSize), FT/(len(truth)-trainingSize), FF/(len(truth)-trainingSize))
All the naive Bayes models produce comparable results, all poorer than the SVM. The fundamental assumption of feature independence made by these models is probably a poor one for this data.
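One quick way to probe that independence assumption, not done in the original analysis, is to look at the correlations between the underlying test results themselves:
import numpy as np
# rows are AutoQC tests, columns are profiles; large off-diagonal correlations would
# indicate the tests are far from independent (tests with constant output give nan rows)
testMatrix = np.array(rawResults, dtype=float)
print np.corrcoef(testMatrix)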
In Part 1, we examined individual scikit-learn classifiers, and found that many of them provide similar performance, flagging about half of the datasets that ought to be flagged. In this section, we explore ideas for combining the results of several of these classifiers into a final decision.
In [33]:
from sklearn import svm
from sklearn.qda import QDA
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors.nearest_centroid import NearestCentroid
from sklearn import tree
import matplotlib.pyplot as plt
%matplotlib inline

#load and arrange data
truth, rawResults = reloadData()
data = transpose(rawResults) #arrange data into rows by profile for consumption by scikit-learn
shuffleLists(data, truth)    #randomize order of profiles

clf_SVM = svm.SVC(kernel='linear')
clf_QDA = QDA()
clf_KernelRidge = KernelRidge(kernel='linear')
clf_SGD = SGDClassifier(loss="hinge", penalty="l2")
clf_NearestCentroid = NearestCentroid()
clf_DecisionTree = tree.DecisionTreeClassifier()
clfs = [clf_SVM, clf_QDA, clf_KernelRidge, clf_SGD, clf_NearestCentroid, clf_DecisionTree]
histEntries = []

# train the classifiers
for clf in clfs:
    clf.fit(data[0:trainingSize], truth[0:trainingSize])

# poll classifiers and report
TT = 0.
TF = 0.
FT = 0.
FF = 0.
for i in range(trainingSize, len(truth)):
    flagsRaised = 0
    for clf in clfs:
        if clf.predict(data[i]):
            flagsRaised += 1
    histEntries.append(flagsRaised)
    assessment = flagsRaised >= len(clfs)/2
    if assessment and truth[i]:
        TT += 1
    elif assessment and not truth[i]:
        TF += 1
    elif not assessment and truth[i]:
        FT += 1
    elif not assessment and not truth[i]:
        FF += 1

printSummary('Majority Poll', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
plt.hist(histEntries, bins=[0,1,2,3,4,5,6,7])
plt.show()
The plot shows the number of datasets flagged by exactly n classifiers. Another simple approach is to flag a dataset if any of the classifiers flag it:
In [34]:
from sklearn import svm
from sklearn.qda import QDA
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors.nearest_centroid import NearestCentroid
from sklearn import tree

#load and arrange data
truth, rawResults = reloadData()
data = transpose(rawResults) #arrange data into rows by profile for consumption by scikit-learn
shuffleLists(data, truth)    #randomize order of profiles

clf_SVM = svm.SVC(kernel='linear')
clf_QDA = QDA()
clf_KernelRidge = KernelRidge(kernel='linear')
clf_SGD = SGDClassifier(loss="hinge", penalty="l2")
clf_NearestCentroid = NearestCentroid()
clf_DecisionTree = tree.DecisionTreeClassifier()
clfs = [clf_SVM, clf_QDA, clf_KernelRidge, clf_SGD, clf_NearestCentroid, clf_DecisionTree]

# train the classifiers
for clf in clfs:
    clf.fit(data[0:trainingSize], truth[0:trainingSize])

# poll classifiers and report
TT = 0.
TF = 0.
FT = 0.
FF = 0.
for i in range(trainingSize, len(truth)):
    assessment = False
    for clf in clfs:
        assessment = assessment or clf.predict(data[i])
    if assessment and truth[i]:
        TT += 1
    elif assessment and not truth[i]:
        TF += 1
    elif not assessment and truth[i]:
        FT += 1
    elif not assessment and not truth[i]:
        FF += 1

printSummary('Any Flag', TT/(datasetSize-trainingSize), TF/(datasetSize-trainingSize), FT/(datasetSize-trainingSize), FF/(datasetSize-trainingSize))
So, flagging any profile that is flagged by an individual classifier performs better than the majority poll, but this performance is no better than simply flagging any profile flagged by a base test.