Training Image Classifier

We will use part of the training data provided to us, separated by high-level [entity] clusters, to train the image classifier. Due to the scale of the full dataset, a random subsample is taken. See this notebook block for the image classifier training itself; a sketch of that flow follows.
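
The following is a minimal sketch of that training flow, assuming SMQTK's supervised classifier interface; the descriptor collections (pos_descriptors, neg_descriptors) and the subsample size are hypothetical stand-ins, not values from the actual training notebook.

In [ ]:
# Hypothetical sketch: train a LibSvmClassifier on a random subsample of
# positive/negative training descriptors (variable names are illustrative).
import random
from smqtk.algorithms.classifier.libsvm import LibSvmClassifier

SUBSAMPLE_SIZE = 50000  # arbitrary cap for tractability

# pos_descriptors / neg_descriptors: lists of DescriptorElement instances
# drawn from the training descriptor index (hypothetical names).
pos_sample = random.sample(pos_descriptors,
                           min(SUBSAMPLE_SIZE, len(pos_descriptors)))
neg_sample = random.sample(neg_descriptors,
                           min(SUBSAMPLE_SIZE, len(neg_descriptors)))

classifier = LibSvmClassifier('image_classifier.train1.classifier.model',
                              'image_classifier.train1.classifier.label',
                              normalize=2)
# train() takes a mapping of class label to an iterable of
# DescriptorElement instances, saving the model to the paths above.
classifier.train({'positive': pos_sample, 'negative': neg_sample})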

Compute evaluation image descriptors

When we get the evaluation image data, we must compute descriptors for that data. We will use the eval index (see the common.descriptor_index.eval.json config) and the common descriptor store (common.descriptor_factory.json). The compute_many_descriptors.py script should be run with the common.cmd.eval.config.json configuration, which is already set to these locations.
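
A plausible invocation is sketched below; the flag names follow SMQTK's compute_many_descriptors.py usage, but the file-list path is a hypothetical placeholder (check the script's --help if in doubt):

compute_many_descriptors.py -v \
    -c common.cmd.eval.config.json \
    -f eval_image_filepaths.txt \
    -p eval.cmd.processed.csv

Here -f names a newline-separated list of image file paths and -p names the output CSV of successfully processed files (used below as CMD_PROCESSED_CSV).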

After descriptors are computed, we can proceed to scoring via the image classifier.

Using the image classifier for scoring

Here we will use the trained image classifier to score clustered ad images, pooling the maximum and average HT-positive scores first per ad and then per cluster, resulting in two score sets that we will "submit" for evaluation. For example, an ad whose images score [0.2, 0.9, 0.4] contributes 0.9 under max pooling and 0.5 under average pooling, and cluster scores pool their child ad scores the same way.

Output must be in the form of an ordered json-lines file with each line having the structure:

{"cluster_id": "...", "score": <float>}

Thus, we need the evaluation truth file in order to get the cluster ID ordering, which is also json-lines and of the form:

{"cluster_id": "...", "class": <int>}
...

The evaluation script (for plotting the ROC curve) can be found here; a stand-in sketch is shown below.
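
As a stand-in for that script, the following is a minimal sketch of ROC plotting using scikit-learn and matplotlib (both assumptions; the linked evaluation script may differ in detail):

In [ ]:
# Hypothetical ROC sketch: compare a score json-lines file against the
# ground-truth json-lines file. Not the official evaluation script.
import json
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def load_jl(path, key):
    # Map cluster_id -> value of the given key for each json-lines record.
    with open(path) as f:
        return dict((d['cluster_id'], d[key])
                    for d in (json.loads(line) for line in f))

truth = load_jl('eval.test.gt.jl', 'class')
scores = load_jl('eval.cluster_scores.max_pool.jl', 'score')

ids = sorted(truth)
fpr, tpr, _ = roc_curve([truth[i] for i in ids],
                        [scores[i] for i in ids])
plt.plot(fpr, tpr, label='AUC = %.3f' % auc(fpr, tpr))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()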

The steps that need to be performed (the bracketed step numbers in the code cells below refer to this list):

  1. Get images + cluster/ad/sha CSV
  2. Compute descriptors for the provided imagery
  3. Load cluster/ad/sha maps after knowing which images were successfully described
  4. Run the classifier over the computed descriptors
  5. Determine ad/cluster scores via max/avg pooling
  6. Output json-lines files for scoring with the evaluation script (linked above)

In [ ]:
# Initialize logging
import logging
from smqtk.utils.bin_utils import initialize_logging
initialize_logging(logging.getLogger('smqtk'), logging.DEBUG)
initialize_logging(logging.getLogger(__name__), logging.DEBUG)

In [ ]:
# File path parameters

CMD_PROCESSED_CSV = 'eval.cmd.processed.csv'
CLUSTER_ADS_IMAGES_CSV = 'eval.clusters_ads_images.csv'

EVAL_IMAGE_CLASSIFICATIONS_CACHE = 'eval.image_classifications_cache.pickle'

OUTPUT_MAX_SCORE_JL = 'eval.cluster_scores.max_pool.jl'
OUTPUT_AVG_SCORE_JL = 'eval.cluster_scores.avg_pool.jl'

In [ ]:
import json

from smqtk.algorithms.classifier.libsvm import LibSvmClassifier
from smqtk.representation import ClassificationElementFactory
from smqtk.representation.classification_element.file import FileClassificationElement

# Classifier trained above; normalize=2 L2-normalizes descriptor vectors
# before classification, matching how the model was trained.
image_classifier = LibSvmClassifier('image_classifier.train1.classifier.model',
                                    'image_classifier.train1.classifier.label',
                                    normalize=2)

# Factory for file-backed classification elements so results persist on disk.
c_file_factory = ClassificationElementFactory(FileClassificationElement, {
    "save_dir": "image_classifier.classifications",
    "subdir_split": 10,
})

# Load the evaluation descriptor index from the descriptor-computation
# config's 'descriptor_index' section.
from smqtk.representation import get_descriptor_index_impls
from smqtk.utils.plugin import from_plugin_config
with open('eval.test.cmd.json') as f:
    descr_index = from_plugin_config(json.load(f)['descriptor_index'],
                                     get_descriptor_index_impls())

In [ ]:
descr_index.count()  # should equal the line count of eval.cmd.processed.csv

In [ ]:
# TESTING
# Make up a ground-truth file from the test-set clusters/ads/shas
import cPickle
import csv
import json

test_pos_clusters = cPickle.load(open('test_pos_clusters.pickle', 'rb'))
test_neg_clusters = cPickle.load(open('test_neg_clusters.pickle', 'rb'))
pos_cluster2ads = cPickle.load(open('positive.cluster2ads.pickle', 'rb'))
neg_cluster2ads = cPickle.load(open('negative.cluster2ads.pickle', 'rb'))
pos_ad2shas = cPickle.load(open('positive.ad2shas.pickle', 'rb'))
neg_ad2shas = cPickle.load(open('negative.ad2shas.pickle', 'rb'))

# Write the cluster/ad/sha relationship CSV for the test set.
with open('eval.test.clusters_ads_images.csv', 'w') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerow(['cluster', 'ad', 'sha1'])
    for c in test_pos_clusters:
        for ad in pos_cluster2ads[c]:
            for sha in pos_ad2shas[ad]:
                writer.writerow([c, ad, sha])
    for c in test_neg_clusters:
        for ad in neg_cluster2ads[c]:
            for sha in neg_ad2shas[ad]:
                writer.writerow([c, ad, sha])

# Write the ground-truth json-lines file, ordered by cluster ID string.
with open('eval.test.gt.jl', 'w') as f:
    for c in sorted(test_pos_clusters | test_neg_clusters, key=lambda k: str(k)):
        if c in test_pos_clusters:
            f.write(json.dumps({'cluster_id': str(c), 'class': 1}) + '\n')
        elif c in test_neg_clusters:
            f.write(json.dumps({'cluster_id': str(c), 'class': 0}) + '\n')
        else:
            raise ValueError("Cluster %s not positive or negative?" % c)

In [ ]:
# Step [3]
import csv

# Load the set of successfully processed image SHA1s.
# This is a result file from descriptor computation; the second column of
# each row is the image SHA1.
with open(CMD_PROCESSED_CSV) as f:
    computed_shas = {r[1] for r in csv.reader(f)}

# Load cluster/ad/sha relationship maps, filtered by what was actually processed
import collections
cluster2ads = collections.defaultdict(set)
cluster2shas = collections.defaultdict(set)
ad2shas = collections.defaultdict(set)
sha2ads = collections.defaultdict(set)
with open(CLUSTER_ADS_IMAGES_CSV) as f:
    reader = csv.reader(f)
    for i, r in enumerate(reader):
        if i == 0:
            # skip header line
            continue
        c, ad, sha = r
        if sha in computed_shas:
            cluster2ads[c].add(ad)
            cluster2shas[c].add(sha)
            ad2shas[ad].add(sha)
            sha2ads[sha].add(ad)
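
A quick, illustrative sanity check of the resulting maps (an addition, not part of the original steps):

In [ ]:
# Illustrative sanity check on the relationship maps.
print "clusters with imagery:", len(cluster2shas)
print "ads with imagery:     ", len(ad2shas)
print "distinct images:      ", len(sha2ads)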

In [ ]:
# Step [4]
# Classify eval-set images, caching results to disk so re-runs are cheap.
import os
import cPickle

if os.path.isfile(EVAL_IMAGE_CLASSIFICATIONS_CACHE):
    with open(EVAL_IMAGE_CLASSIFICATIONS_CACHE, 'rb') as f:
        image_descr2classifications = cPickle.load(f)
else:
    # Only classify descriptors for images that belong to some ad.
    img_descriptors = descr_index.get_many_descriptors(set(sha2ads))
    image_descr2classifications = image_classifier.classify_async(
        img_descriptors, c_file_factory,
        use_multiprocessing=True,
        ri=1.0)  # progress report interval in seconds
    with open(EVAL_IMAGE_CLASSIFICATIONS_CACHE, 'wb') as f:
        cPickle.dump(image_descr2classifications, f, -1)
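
To spot-check the results, one can peek at a single classification element; this cell is an illustrative addition (classify_async maps descriptor elements to classification elements):

In [ ]:
# Peek at one ClassificationElement's label -> confidence map.
some_c = next(image_descr2classifications.itervalues())
print some_c.uuid, some_c.get_classification()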

In [ ]:
# Step [5]
# Pool per-image scores up to ads, then up to clusters.
print "Collecting scores for SHA1s"
sha2score = {}
for c in image_descr2classifications.itervalues():
    sha2score[c.uuid] = c['positive']

# Select ad scores from the max and average of child image scores.
print "Collecting scores for ads (MAX and AVG)"
import numpy
ad2score_max = {}
ad2score_avg = {}
for ad, child_shas in ad2shas.iteritems():
    scores = [sha2score[sha] for sha in child_shas]
    ad2score_max[ad] = numpy.max(scores)
    ad2score_avg[ad] = numpy.average(scores)

# Select cluster scores from the max and average of child ad scores.
print "Collecting scores for clusters (MAX and AVG)"
cluster2score_max = {}
cluster2score_avg = {}
for c, child_ads in cluster2ads.iteritems():
    cluster2score_max[c] = numpy.max([ad2score_max[ad] for ad in child_ads])
    cluster2score_avg[c] = numpy.average([ad2score_avg[ad] for ad in child_ads])

In [ ]:
len(cluster2score_max)  # number of clusters that received a score

In [ ]:
# Step [6]
# Write out json-lines files in the same order as the ground-truth file.
import json

# Take the cluster ID ordering from the ground-truth json-lines file so the
# output lines up one-to-one with it; clusters with no described imagery get
# a neutral 0.5 below. 'eval.test.gt.jl' is the test-set truth file written
# above; substitute the real evaluation truth file as appropriate.
with open('eval.test.gt.jl') as f:
    cluster_id_order = [json.loads(line)['cluster_id'] for line in f]

with open(OUTPUT_MAX_SCORE_JL, 'w') as f:
    for c in cluster_id_order:
        if c in cluster2score_max:
            f.write(json.dumps({"cluster_id": c, "score": cluster2score_max[c]}) + '\n')
        else:
            # Cluster has no child ads with described imagery; emit a neutral score.
            f.write(json.dumps({"cluster_id": c, "score": 0.5}) + '\n')

with open(OUTPUT_AVG_SCORE_JL, 'w') as f:
    for c in cluster_id_order:
        if c in cluster2score_avg:
            f.write(json.dumps({"cluster_id": c, "score": cluster2score_avg[c]}) + '\n')
        else:
            # Cluster has no child ads with described imagery; emit a neutral score.
            f.write(json.dumps({"cluster_id": c, "score": 0.5}) + '\n')

In [ ]:
# Sanity check: average / min / max of the per-image positive scores.
import numpy
numpy.average(sha2score.values()), numpy.min(sha2score.values()), numpy.max(sha2score.values())