We assume here that the notebook 01.train_test_segmentation.ipynb has been run and its outputs generated.
The idea of this approach is to train multiple levels of classifiers, each built on the results of the classifier below it.
With regard to images alone, this means we first train a solely image-based classifier that attempts to classify single images as either HT or non-HT (HT = human-trafficking related). We then create a histogram per ad from the real-valued results (probabilities) of classifying that ad's child images. Each histogram has a number of bins based on an arbitrary partitioning of the classification probability space (the [0,1] range; we will pick the bin count later). Once ads have been assigned histograms, we train a second classifier using those histograms as descriptors. Above ads we have clusters, and we repeat the process one level up: ad histograms are classified, and the resulting per-ad probabilities are aggregated into cluster-level histograms, on which a third, cluster-level classifier is trained.
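As a concrete illustration of the binning idea, here is a minimal sketch assuming 10 equal-width bins and made-up child-image probabilities (the actual bin count is chosen later):
In [ ]:
import numpy

# Hypothetical positive-class probabilities for one ad's child images.
child_image_probs = [0.12, 0.95, 0.88, 0.50, 0.91]

# Partition the [0, 1] probability space into 10 equal-width bins and count how
# many child images fall into each bin.
counts, _bin_edges = numpy.histogram(child_image_probs, bins=10, range=[0, 1])

# Normalize to relative frequencies so ads with different numbers of child
# images remain comparable.
ad_histogram = counts.astype(float) / counts.sum()
print ad_histogram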
The previous notebook in this series notes how we must split train/test/evaluation sets based on the highest organizational level; in our case, that is the cluster level. In addition, at each level we must split the descriptor data in half, using one half for classifier training and the other for classifier application, in order to eliminate train/test bias.
For our example, where the final goal is a classifier for cluster-level histograms, we must split the training images (positive and negative) in half based on cluster separation (the highest-order organization). The first half is used to train the image classifier, and that classifier is applied to the second half to create histograms for ads. The resulting ad histograms must then be split in half again: we train the ad-histogram classifier on one half, then apply it to the other half in order to construct the descriptors for the cluster-level histogram classifier. If the final goal were just an ad-histogram classifier (as before we received cluster-separated negative training data), splitting the ads in half would not be necessary, and training of the ad classifier could use all available ad histograms from the training data.
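To make the top-level split concrete, here is a minimal sketch of halving by cluster, assuming the positive cluster-to-ads mapping (pos_cluster2ads, loaded later in this notebook) is available; the variable names here are illustrative only:
In [ ]:
import random

# Shuffle cluster IDs deterministically, then split the clusters in half. Ads
# (and their child images) follow their parent cluster, so no cluster's data
# straddles the two halves.
all_pos_clusters = sorted(pos_cluster2ads)
random.seed(0)
random.shuffle(all_pos_clusters)
half = len(all_pos_clusters) // 2
train1_clusters = set(all_pos_clusters[:half])
train2_clusters = set(all_pos_clusters[half:])
train1_ads = set(ad for c in train1_clusters for ad in pos_cluster2ads[c])
train2_ads = set(ad for c in train2_clusters for ad in pos_cluster2ads[c])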
Later on, we will also note where it is possible to fuse in results from different classifiers to create a potentially more robust higher-level result; this is the advantage of this approach over others.
In [ ]:
# Let's import things and enable logging
import collections
import cPickle
import csv
import logging
import os

from smqtk.utils.bin_utils import initialize_logging
initialize_logging(logging.getLogger('smqtk'), logging.DEBUG)
initialize_logging(logging.getLogger(__name__), logging.DEBUG)
In [ ]:
print "Loading positive ads/shas"
train1_pos_shas = cPickle.load(open('train1_pos_shas.pickle'))
print "Loading negative ads/shas"
train1_neg_shas = cPickle.load(open('train1_neg_shas.pickle'))
print "Done"
# Creating training CSV file for use with ``classifier_model_validation.py`` for classifier training.
from smqtk.representation.descriptor_index.postgres import PostgresDescriptorIndex
image_descr_index = PostgresDescriptorIndex(table_name='descriptor_index_alexnet_fc7',
db_host='/dev/shm',
db_port=5434,
db_user='purg',
multiquery_batch_size=10000,
read_only=True)
print "Descriptors stored:", image_descr_index.count()
Since there is a lot of training data (relatively, for SVMs), we will need to sub-sample the positive and negative pools into smaller sets so the SVM can converge.
In [ ]:
import random

from smqtk.algorithms.classifier.libsvm import LibSvmClassifier
from smqtk.representation import ClassificationElementFactory
from smqtk.representation.classification_element.file import FileClassificationElement

image_classifier = LibSvmClassifier('image_classifier.train1.classifier.model',
                                    'image_classifier.train1.classifier.label',
                                    normalize=2)
c_file_factory = ClassificationElementFactory(FileClassificationElement, {
    "save_dir": "image_classifier.classifications",
    "subdir_split": 10,
})

# Increase these numbers for final model training?
# n_pos = 5000
# n_neg = 10000
n_pos = 3000
n_neg = 6000

# Deterministically sub-sample the positive and negative SHA1 pools
random.seed(0)
train1_pos_shas_rand = sorted(train1_pos_shas)
train1_neg_shas_rand = sorted(train1_neg_shas)
random.shuffle(train1_pos_shas_rand)
random.shuffle(train1_neg_shas_rand)
train1_pos_shas_rand = train1_pos_shas_rand[:n_pos]
train1_neg_shas_rand = train1_neg_shas_rand[:n_neg]

# Writing out the selected training examples as a CSV file, for posterity and for
# use with ``classifier_model_validation.py`` for classifier training.
with open('image_classifier.train1.csv', 'w') as f:
    writer = csv.writer(f)
    for sha in train1_pos_shas_rand:
        writer.writerow([sha, 'positive'])
    for sha in train1_neg_shas_rand:
        writer.writerow([sha, 'negative'])

# Get the descriptors from the image index for the selected SHA1 values
train1_pos_descrs = set(image_descr_index.get_many_descriptors(train1_pos_shas_rand))
train1_neg_descrs = set(image_descr_index.get_many_descriptors(train1_neg_shas_rand))
In [ ]:
# Do training if we haven't yet
if not image_classifier.has_model():
    image_classifier.train(positive=train1_pos_descrs, negative=train1_neg_descrs)
In [ ]:
# Creating CSV file of train2 sha/truth values for image classifier validation
IMG_MODEL_VALIDATION_TRUTH_CSV = "image_classifier.train2_validation.csv"

print "Loading train2 positive shas"
train2_pos_shas = cPickle.load(open('train2_pos_shas.pickle'))
print "Loading train2 negative shas"
train2_neg_shas = cPickle.load(open('train2_neg_shas.pickle'))

print "Writing CSV file"
with open(IMG_MODEL_VALIDATION_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for sha in train2_pos_shas:
        writer.writerow([sha, 'positive'])
    for sha in train2_neg_shas:
        writer.writerow([sha, 'negative'])
print "Done"
Now we can optionally run the following for model validation using the Train2 data:
classifier_model_validation.py -vc image_classifier.train2_validation.json 2>&1 | tee image_classifier.train2_validation.log
Now that we have an image classifier, we need to apply it to the second half of the training images (this may take a while).
In [ ]:
TRAIN2_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.train2.classifications_cache.pickle'

train2_pos_ads = cPickle.load(open('train2_pos_ads.pickle'))
train2_neg_ads = cPickle.load(open('train2_neg_ads.pickle'))

if os.path.isfile(TRAIN2_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing train2 image classification results"
    train2_img_descr2classifications = cPickle.load(open(TRAIN2_IMG_CLASSIFICATIONS_CACHE))
else:
    # Apply classifier to image descriptors, collecting positive confidence
    shas_to_compute = train2_pos_shas | train2_neg_shas
    print "Classifying positive/negative train2 image descriptors (count: %d)" % len(shas_to_compute)
    train2_img_descr2classifications = \
        image_classifier.classify_async(image_descr_index.get_many_descriptors(shas_to_compute),
                                        c_file_factory,
                                        use_multiprocessing=True,
                                        ri=1.0)
    # Saving out results
    with open(TRAIN2_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train2_img_descr2classifications, f, -1)
In [ ]:
# Load the master ad <-> SHA1 mappings produced by the previous notebook
pos_ad2shas = cPickle.load(open('positive.ad2shas.pickle'))
pos_sha2ads = cPickle.load(open('positive.sha2ads.pickle'))
neg_ad2shas = cPickle.load(open('negative.ad2shas.pickle'))
neg_sha2ads = cPickle.load(open('negative.sha2ads.pickle'))
In [ ]:
import numpy

def filter_sha2ads(sha_set, ad_set):
    """
    Create a SHA1-to-ads mapping from the master sets, filtered on the given
    SHA1 and ad subsets.

    :param sha_set: Sub-set of SHA1 values to consider.
    :param ad_set: Sub-set of ad IDs to consider.

    :return: New mapping of SHA1 values to parent ads using only the subsets given.
    """
    new_sha2ads = collections.defaultdict(set)
    for sha in sha_set:
        # Get all ads associated with the SHA1, then filter with the given ad set
        new_sha2ads[sha] = (pos_sha2ads[sha] | neg_sha2ads[sha]) & ad_set
    return new_sha2ads

def make_histogram(probs, n_bins):
    """ Make a histogram of the given positive probability values. """
    return numpy.histogram(probs, bins=n_bins, range=[0, 1])[0]

def generate_ad_histograms(img_classifications, sha2ads, hist_size=20, target_class='positive'):
    """
    Generate ad histograms based on an input iterable of ClassificationElements.

    This first aggregates image classification ``target_class`` scores under
    parent ads, and then creates a histogram of ``hist_size`` bins for each ad
    based on the classifications of the images associated with that ad.
    Histograms are normalized to relative frequencies (L1 normalization).

    :param img_classifications: Iterable of ClassificationElement instances for
        all images classified.
    :param sha2ads: Mapping of image descriptor UUID values (image SHA1) to the
        one or more ads that image is a child of.
    :param hist_size: The number of percentile partitions to make for the ad
        histograms. This is literally the size of the histograms output.
    :param target_class: The class probability to extract from input
        ClassificationElement instances.

    :return: Dict mapping of ad ID to its L1-normalized histogram of child image
        classification scores.
    """
    ad2img_probs = collections.defaultdict(dict)
    for c in img_classifications:
        sha = c.uuid
        for ad in sha2ads[sha]:
            ad2img_probs[ad][sha] = c[target_class]

    ad2histogram = {}
    for ad, img_prob_map in ad2img_probs.iteritems():
        h = make_histogram(img_prob_map.values(), hist_size).astype(float)
        if not h.sum():
            raise RuntimeError("Zero-sum histogram for ad: %s (probs: %s)"
                               % (ad, img_prob_map.values()))
        h /= h.sum()
        ad2histogram[ad] = h
    return ad2histogram
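As a quick sanity check of the functions above, the following toy run uses a minimal stand-in for ClassificationElement, providing only the .uuid attribute and [] access that generate_ad_histograms relies on (the mock class and values are illustrative, not part of the pipeline):
In [ ]:
class MockClassification (object):
    """ Minimal stand-in for ClassificationElement, for demonstration only. """
    def __init__(self, uuid, pos_prob):
        self.uuid = uuid
        self._probs = {'positive': pos_prob, 'negative': 1.0 - pos_prob}
    def __getitem__(self, label):
        return self._probs[label]

mock_classifications = [MockClassification('sha_a', 0.9),
                        MockClassification('sha_b', 0.2),
                        MockClassification('sha_c', 0.55)]
# Image 'sha_b' is a child of both ads 1 and 2.
mock_sha2ads = {'sha_a': {1}, 'sha_b': {1, 2}, 'sha_c': {2}}
for ad, h in generate_ad_histograms(mock_classifications, mock_sha2ads,
                                    hist_size=4).iteritems():
    print ad, h  # each histogram is L1-normalized (sums to 1)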
In [ ]:
# Reverse mapping within the scope of train2 only
train2_sha2ads = filter_sha2ads(train2_pos_shas | train2_neg_shas,
                                train2_pos_ads | train2_neg_ads)
assert len(train2_img_descr2classifications) == len(train2_sha2ads), \
    "The number of SHA1 values in the reverse map should equal the number of classifications computed"

train2_ad2histogram = generate_ad_histograms(train2_img_descr2classifications.itervalues(),
                                             train2_sha2ads)
with open('ad_classifier.train2_ad2histogram.pickle', 'w') as f:
    cPickle.dump(train2_ad2histogram, f, -1)
In [ ]:
from smqtk.representation import DescriptorElementFactory
from smqtk.representation.descriptor_element.local_elements import DescriptorMemoryElement
from smqtk.representation.descriptor_index.memory import MemoryDescriptorIndex

ad_histogram_descriptor_index = \
    MemoryDescriptorIndex(file_cache='ad_classifier.ad_histogram_descriptor_index.pickle')
d_mem_factory = DescriptorElementFactory(DescriptorMemoryElement, {})

to_add = set()
for ad in train2_ad2histogram:
    d = d_mem_factory('ad-histogram', ad)
    d.set_vector(train2_ad2histogram[ad])
    if not ad_histogram_descriptor_index.has_descriptor(d.uuid()):
        to_add.add(d)

print "Adding %d missing ad histograms" % len(to_add)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
train2_pos_ad_descriptors = set()
train2_neg_ad_descriptors = set()
for ad in train2_pos_ads:
    train2_pos_ad_descriptors.add(ad_histogram_descriptor_index[ad])
for ad in train2_neg_ads:
    train2_neg_ad_descriptors.add(ad_histogram_descriptor_index[ad])

assert len(train2_pos_ad_descriptors) + len(train2_neg_ad_descriptors) == len(train2_ad2histogram), \
    "Descriptors we send to the classifier for training should add up to the number of histograms " \
    "we computed."
assert len(train2_pos_ad_descriptors) + len(train2_neg_ad_descriptors) == ad_histogram_descriptor_index.count(), \
    "Descriptors we send to the classifier for training should add up to the number of descriptors " \
    "in the index."
In [ ]:
len(train2_pos_ad_descriptors), len(train2_neg_ad_descriptors)
In [ ]:
# Not setting a normalization value because we're already giving it L1-normalized histograms
ad_classifier = LibSvmClassifier('ad_classifier.train2_classifier.model',
                                 'ad_classifier.train2_classifier.labels')

# Randomly sub-sampling positive/negative examples
# TODO: Increase for final model
n_pos = 24365
n_neg = 8300

pos_train_d = sorted(train2_pos_ad_descriptors, key=lambda d: d.uuid())
neg_train_d = sorted(train2_neg_ad_descriptors, key=lambda d: d.uuid())
numpy.random.seed(0)
numpy.random.shuffle(pos_train_d)
numpy.random.shuffle(neg_train_d)
pos_train_d = pos_train_d[:n_pos]
neg_train_d = neg_train_d[:n_neg]

if not ad_classifier.has_model():
    ad_classifier.train(positive=pos_train_d, negative=neg_train_d)
else:
    print "Ad classifier already trained"
In [ ]:
# Apply image classifier to train3 image descriptors
train3_pos_shas = cPickle.load(open('train3_pos_shas.pickle'))
train3_neg_shas = cPickle.load(open('train3_neg_shas.pickle'))

TRAIN3_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.train3.classifications_cache.pickle'
if os.path.isfile(TRAIN3_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing train3 image classification results"
    with open(TRAIN3_IMG_CLASSIFICATIONS_CACHE) as f:
        train3_img_descr2classifications = cPickle.load(f)
else:
    train3_img_descriptors = image_descr_index.get_many_descriptors(train3_pos_shas | train3_neg_shas)
    train3_img_descr2classifications = image_classifier.classify_async(train3_img_descriptors,
                                                                       c_file_factory,
                                                                       use_multiprocessing=True,
                                                                       ri=1.)
    with open(TRAIN3_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train3_img_descr2classifications, f, -1)

# We should have a classification for every input descriptor
assert len(train3_img_descr2classifications) == len(train3_pos_shas | train3_neg_shas), \
    "There should be a classification for every input descriptor, but the counts do not match"
In [ ]:
# Create index of ad histogram descriptors
train3_pos_ads = cPickle.load(open('train3_pos_ads.pickle'))
train3_neg_ads = cPickle.load(open('train3_neg_ads.pickle'))

train3_sha2ads = filter_sha2ads(train3_pos_shas | train3_neg_shas,
                                train3_pos_ads | train3_neg_ads)
assert len(train3_sha2ads) == len(train3_img_descr2classifications)

train3_ad2histogram = generate_ad_histograms(train3_img_descr2classifications.itervalues(),
                                             train3_sha2ads)
with open('ad_classifier.train3_ad2histogram.pickle', 'w') as f:
    cPickle.dump(train3_ad2histogram, f, -1)

to_add = set()
for ad in train3_ad2histogram:
    if not ad_histogram_descriptor_index.has_descriptor(ad):
        d = DescriptorMemoryElement('ad-histogram', ad)
        d.set_vector(train3_ad2histogram[ad])
        to_add.add(d)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
TRAIN3_VALIDATION_AD_TRUTH_CSV = 'ad_classifier.train3_validation.csv'
with open(TRAIN3_VALIDATION_AD_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for ad in train3_pos_ads:
        writer.writerow([ad, 'positive'])
    for ad in train3_neg_ads:
        writer.writerow([ad, 'negative'])
Now the classifier validation script can be run to validate the train2-data trained ad classifier on train3 data:
classifier_model_validation.py -vc ad_classifier.train3_validation.json 2>&1 | tee ad_classifier.train3_validation.log
In [ ]:
def generate_cluster_histograms(ad_classifications, ad2cluster, hist_size=20, target_class='positive'):
    """
    Generate cluster histograms based on an input iterable of ClassificationElements.

    :param ad_classifications: Iterable of ClassificationElement instances for
        all ads classified.
    :param ad2cluster: Mapping of ad descriptor UUID values (doc ID) to the
        cluster it is a child of.
    :param hist_size: The number of percentile partitions to make for the cluster
        histograms. This is literally the size of the histograms output.
    :param target_class: The class probability to extract from input
        ClassificationElement instances.

    :return: Dict mapping of cluster ID to its L1-normalized histogram of child
        ad classification scores.
    """
    cluster2ad_probs = collections.defaultdict(dict)
    for c in ad_classifications:
        ad = c.uuid
        cluster = ad2cluster[ad]
        cluster2ad_probs[cluster][ad] = c[target_class]

    cluster2histogram = {}
    for cluster, ad_prob_map in cluster2ad_probs.iteritems():
        h = make_histogram(ad_prob_map.values(), hist_size).astype(float)
        if not h.sum():
            raise RuntimeError("Zero-sum histogram for cluster: %s (probs: %s)"
                               % (cluster, ad_prob_map.values()))
        h /= h.sum()
        cluster2histogram[cluster] = h
    return cluster2histogram
In [ ]:
# Reverse mapping of ads to their parent cluster; each ad belongs to exactly one cluster.
pos_cluster2ads = cPickle.load(open('positive.cluster2ads.pickle'))
neg_cluster2ads = cPickle.load(open('negative.cluster2ads.pickle'))

ad2cluster = {}
for c in pos_cluster2ads:
    for ad in pos_cluster2ads[c]:
        ad2cluster[ad] = c
for c in neg_cluster2ads:
    for ad in neg_cluster2ads[c]:
        ad2cluster[ad] = c
In [ ]:
# Generate classifications for train3 ad histograms
TRAIN3_AD_CLASSIFICATIONS_CACHE = 'ad_classifier.train3.classifications_cache.pickle'

ad_c_file_factory = ClassificationElementFactory(FileClassificationElement, {
    "save_dir": "ad_classifier.classifications",
    "subdir_split": 10,
})

if os.path.isfile(TRAIN3_AD_CLASSIFICATIONS_CACHE):
    print "Loading existing train3 ad classification results"
    train3_ad_descr2classifications = cPickle.load(open(TRAIN3_AD_CLASSIFICATIONS_CACHE))
else:
    ads_to_compute = train3_pos_ads | train3_neg_ads
    train3_ad_descr2classifications = \
        ad_classifier.classify_async(ad_histogram_descriptor_index.get_many_descriptors(ads_to_compute),
                                     ad_c_file_factory,
                                     use_multiprocessing=True,
                                     ri=1.)
    with open(TRAIN3_AD_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train3_ad_descr2classifications, f, -1)
In [ ]:
train3_cluster2histogram = generate_cluster_histograms(train3_ad_descr2classifications.itervalues(),
                                                       ad2cluster)
In [ ]:
cluster_histogram_descriptor_index = \
    MemoryDescriptorIndex('cluster_classifier.cluster_histogram_descriptor_index.pickle')

to_add = set()
for c in train3_cluster2histogram:
    if not cluster_histogram_descriptor_index.has_descriptor(c):
        d = d_mem_factory('cluster-histogram', c)
        d.set_vector(train3_cluster2histogram[c])
        to_add.add(d)

print "Adding %d missing cluster histograms" % len(to_add)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Train cluster histogram classifier
train3_pos_clusters = cPickle.load(open('train3_pos_clusters.pickle'))
train3_neg_clusters = cPickle.load(open('train3_neg_clusters.pickle'))

train3_pos_c_descriptors = set()
train3_neg_c_descriptors = set()
for c in train3_pos_clusters:
    train3_pos_c_descriptors.add(cluster_histogram_descriptor_index[c])
for c in train3_neg_clusters:
    train3_neg_c_descriptors.add(cluster_histogram_descriptor_index[c])
In [ ]:
len(train3_pos_c_descriptors), len(train3_neg_c_descriptors)
In [ ]:
cluster_classifier = LibSvmClassifier('cluster_classifier.train3.classifier.model',
                                      'cluster_classifier.train3.classifier.label')
if not cluster_classifier.has_model():
    cluster_classifier.train(positive=train3_pos_c_descriptors, negative=train3_neg_c_descriptors)
else:
    print "Cluster classifier already trained"
In [ ]:
# Apply image classifier to test image SHA1s
test_pos_shas = cPickle.load(open('test_pos_shas.pickle'))
test_neg_shas = cPickle.load(open('test_neg_shas.pickle'))

TEST_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.test.classifications_cache.pickle'
if os.path.isfile(TEST_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing test image classification results"
    with open(TEST_IMG_CLASSIFICATIONS_CACHE) as f:
        test_img_descr2classifications = cPickle.load(f)
else:
    test_img_descriptors = image_descr_index.get_many_descriptors(test_pos_shas | test_neg_shas)
    test_img_descr2classifications = \
        image_classifier.classify_async(test_img_descriptors,
                                        c_file_factory,
                                        use_multiprocessing=True,
                                        ri=1.)
    with open(TEST_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(test_img_descr2classifications, f, -1)

assert len(test_img_descr2classifications) == len(test_pos_shas | test_neg_shas)
In [ ]:
# Create ad histograms for the test set
test_pos_ads = cPickle.load(open('test_pos_ads.pickle'))
test_neg_ads = cPickle.load(open('test_neg_ads.pickle'))

test_sha2ads = filter_sha2ads(test_pos_shas | test_neg_shas,
                              test_pos_ads | test_neg_ads)
assert len(test_sha2ads) == len(test_img_descr2classifications)

test_ad2histogram = generate_ad_histograms(test_img_descr2classifications.itervalues(),
                                           test_sha2ads)
with open('ad_classifier.test_ad2histogram.pickle', 'w') as f:
    cPickle.dump(test_ad2histogram, f, -1)

# Add histograms as descriptors to the index
to_add = set()
for ad in test_ad2histogram:
    if not ad_histogram_descriptor_index.has_descriptor(ad):
        d = DescriptorMemoryElement('ad-histogram', ad)
        d.set_vector(test_ad2histogram[ad])
        to_add.add(d)

print "Adding %d test ad histograms to index" % len(to_add)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Classifying test ad histograms
TEST_AD_CLASSIFICATIONS_CACHE = 'ad_classifier.test.classifications_cache.pickle'
if os.path.isfile(TEST_AD_CLASSIFICATIONS_CACHE):
    with open(TEST_AD_CLASSIFICATIONS_CACHE) as f:
        test_ad_descr2classifications = cPickle.load(f)
else:
    test_ad_descriptors = ad_histogram_descriptor_index.get_many_descriptors(test_pos_ads | test_neg_ads)
    test_ad_descr2classifications = \
        ad_classifier.classify_async(test_ad_descriptors,
                                     ad_c_file_factory,
                                     use_multiprocessing=True,
                                     ri=1.0)
    with open(TEST_AD_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(test_ad_descr2classifications, f, -1)
In [ ]:
# (Re)load the cluster histogram descriptor index from its file cache, in case
# we are resuming the notebook from this point.
cluster_histogram_descriptor_index = \
    MemoryDescriptorIndex('cluster_classifier.cluster_histogram_descriptor_index.pickle')
In [ ]:
# Create cluster histograms for the test set
test_cluster2histogram = generate_cluster_histograms(test_ad_descr2classifications.itervalues(),
                                                     ad2cluster)

test_pos_clusters = cPickle.load(open('test_pos_clusters.pickle'))
test_neg_clusters = cPickle.load(open('test_neg_clusters.pickle'))
assert len(test_cluster2histogram) == len(test_pos_clusters | test_neg_clusters)

to_add = set()
for c in test_cluster2histogram:
    if not cluster_histogram_descriptor_index.has_descriptor(c):
        d = DescriptorMemoryElement('cluster-histogram', c)
        d.set_vector(test_cluster2histogram[c])
        to_add.add(d)

print "Adding %d test cluster histograms to index" % len(to_add)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Checking that the descriptor index has the histograms for the test set clusters
for c in test_pos_clusters:
    assert cluster_histogram_descriptor_index.has_descriptor(c)
for c in test_neg_clusters:
    assert cluster_histogram_descriptor_index.has_descriptor(c)
In [ ]:
# Create a CSV for cluster classifier validation against the test set
TEST_VALIDATION_CLUSTER_TRUTH_CSV = 'cluster_classifier.test_validation.csv'
with open(TEST_VALIDATION_CLUSTER_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for c in test_pos_clusters:
        writer.writerow([c, 'positive'])
    for c in test_neg_clusters:
        writer.writerow([c, 'negative'])
Now the classifier validation script can be run to validate the train3-data trained cluster classifier on test data:
classifier_model_validation.py -vc cluster_classifier.test_validation.json 2>&1 | tee cluster_classifier.test_validation.log
Normally, the classifier_model_validation.py script interprets UUID values from the CSV file as strings, but since we (incorrectly) stored everything with the cluster ID as an integer, I temporarily changed the script to interpret the UUID field as an integer.
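An alternative that would avoid patching the script (a sketch only, not what was actually done here) is to mirror each integer-keyed cluster histogram in the index under its string form, so the string-typed CSV values resolve as read:
In [ ]:
# Sketch: re-key cluster histograms under string UUIDs so that the validation
# script's string-typed CSV values match descriptors in the index.
to_add = set()
for c in test_cluster2histogram:
    str_uuid = str(c)
    if not cluster_histogram_descriptor_index.has_descriptor(str_uuid):
        d = DescriptorMemoryElement('cluster-histogram', str_uuid)
        d.set_vector(test_cluster2histogram[c])
        to_add.add(d)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add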