We assume here that the notebook 01.train_test_segmentation.ipynb has been run and its outputs generated.
The idea of this approach is to train multiple levels of classifiers, each built on the results of the classifier below it.
With regard to images alone, this means we first train a solely image-based classifier that attempts to classify single images as either HT or non-HT (HT = human-trafficking related). We then create a histogram per ad from the real-valued results (probabilities) of classifying that ad's child images. Each histogram has a number of bins based on an arbitrary partitioning of the classification probability space (the [0,1] range; we will pick the bin count later). Once ads have been assigned histograms, we train a second classifier using those histograms as descriptors. Above ads we have clusters, and we repeat the process one level up: ad histograms are classified, and the resulting per-ad probabilities are aggregated into cluster-level histograms, on which a third, cluster-level classifier is trained.
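As a concrete illustration of the binning idea, here is a minimal sketch assuming 10 equal-width bins and made-up child-image probabilities (the actual bin count is chosen later):
In [ ]:
import numpy

# Hypothetical positive-class probabilities for one ad's child images.
child_image_probs = [0.12, 0.95, 0.88, 0.50, 0.91]

# Partition the [0, 1] probability space into 10 equal-width bins and count how
# many child images fall into each bin.
counts, _bin_edges = numpy.histogram(child_image_probs, bins=10, range=[0, 1])

# Normalize to relative frequencies so ads with different numbers of child
# images remain comparable.
ad_histogram = counts.astype(float) / counts.sum()
print ad_histogram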
The previous notebook in this series notes how we must split train/test/evaluation sets based on the highest organizational level; in our case, that is the cluster level. In addition, at each level we must split the descriptor data in half, using one half for classifier training and the other for classifier application, in order to eliminate train/test bias.
For our example, where the final goal is a classifier for cluster-level histograms, we must split the training images (positive and negative) in half based on cluster separation (the highest-order organization). The first half is used to train the image classifier, and that classifier is applied to the second half to create histograms for ads. The resulting ad histograms must then be split in half again: we train the ad-histogram classifier on one half, then apply it to the other half in order to construct the descriptors for the cluster-level histogram classifier. If the final goal were just an ad-histogram classifier (as before we received cluster-separated negative training data), splitting the ads in half would not be necessary, and training of the ad classifier could use all available ad histograms from the training data.
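To make the top-level split concrete, here is a minimal sketch of halving by cluster, assuming the positive cluster-to-ads mapping (pos_cluster2ads, loaded later in this notebook) is available; the variable names here are illustrative only:
In [ ]:
import random

# Shuffle cluster IDs deterministically, then split the clusters in half. Ads
# (and their child images) follow their parent cluster, so no cluster's data
# straddles the two halves.
all_pos_clusters = sorted(pos_cluster2ads)
random.seed(0)
random.shuffle(all_pos_clusters)
half = len(all_pos_clusters) // 2
train1_clusters = set(all_pos_clusters[:half])
train2_clusters = set(all_pos_clusters[half:])
train1_ads = set(ad for c in train1_clusters for ad in pos_cluster2ads[c])
train2_ads = set(ad for c in train2_clusters for ad in pos_cluster2ads[c])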
Later on, we will also note where it is possible to fuse in results from different classifiers to create a potentially more robust higher-level result; this is the advantage of this approach over others.
In [ ]:
# Let's import things and enable logging
import collections
import cPickle
import csv
import logging
import os

from smqtk.utils.bin_utils import initialize_logging
initialize_logging(logging.getLogger('smqtk'), logging.DEBUG)
initialize_logging(logging.getLogger(__name__), logging.DEBUG)
In [ ]:
print "Loading positive ads/shas"
train1_pos_shas = cPickle.load(open('train1_pos_shas.pickle'))
print "Loading negative ads/shas"
train1_neg_shas = cPickle.load(open('train1_neg_shas.pickle'))
print "Done"
# Creating training CSV file for use with ``classifier_model_validation.py`` for classifier training.
from smqtk.representation.descriptor_index.postgres import PostgresDescriptorIndex
image_descr_index = PostgresDescriptorIndex(table_name='descriptor_index_alexnet_fc7',
db_host='/dev/shm',
db_port=5434,
db_user='purg',
multiquery_batch_size=10000,
read_only=True)
print "Descriptors stored:", image_descr_index.count()
Since there is a lot of training data (relatively, for SVMs), we will need to sub-sample the positive and negative pools into smaller sets so the SVM can converge.
In [ ]:
import random

from smqtk.algorithms.classifier.libsvm import LibSvmClassifier
from smqtk.representation import ClassificationElementFactory
from smqtk.representation.classification_element.file import FileClassificationElement

image_classifier = LibSvmClassifier('image_classifier.train1.classifier.model',
                                    'image_classifier.train1.classifier.label',
                                    normalize=2)
c_file_factory = ClassificationElementFactory(FileClassificationElement, {
    "save_dir": "image_classifier.classifications",
    "subdir_split": 10,
})

# Increase these numbers for final model training?
# n_pos = 5000
# n_neg = 10000
n_pos = 3000
n_neg = 6000

# Deterministically sub-sample the positive and negative SHA1 pools
random.seed(0)
train1_pos_shas_rand = sorted(train1_pos_shas)
train1_neg_shas_rand = sorted(train1_neg_shas)
random.shuffle(train1_pos_shas_rand)
random.shuffle(train1_neg_shas_rand)
train1_pos_shas_rand = train1_pos_shas_rand[:n_pos]
train1_neg_shas_rand = train1_neg_shas_rand[:n_neg]

# Writing out the selected training examples as a CSV file, for posterity and for
# use with ``classifier_model_validation.py`` for classifier training.
with open('image_classifier.train1.csv', 'w') as f:
    writer = csv.writer(f)
    for sha in train1_pos_shas_rand:
        writer.writerow([sha, 'positive'])
    for sha in train1_neg_shas_rand:
        writer.writerow([sha, 'negative'])

# Get the descriptors from the image index for the selected SHA1 values
train1_pos_descrs = set(image_descr_index.get_many_descriptors(train1_pos_shas_rand))
train1_neg_descrs = set(image_descr_index.get_many_descriptors(train1_neg_shas_rand))
In [ ]:
# Do training if we haven't yet
if not image_classifier.has_model():
    image_classifier.train(positive=train1_pos_descrs, negative=train1_neg_descrs)
In [ ]:
# Creating CSV file of train2 sha/truth values for image classifier validation
IMG_MODEL_VALIDATION_TRUTH_CSV = "image_classifier.train2_validation.csv"

print "Loading train2 positive shas"
train2_pos_shas = cPickle.load(open('train2_pos_shas.pickle'))
print "Loading train2 negative shas"
train2_neg_shas = cPickle.load(open('train2_neg_shas.pickle'))

print "Writing CSV file"
with open(IMG_MODEL_VALIDATION_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for sha in train2_pos_shas:
        writer.writerow([sha, 'positive'])
    for sha in train2_neg_shas:
        writer.writerow([sha, 'negative'])
print "Done"
Now we can optionally run the following for model validation using the Train2 data:
classifier_model_validation.py -vc image_classifier.train2_validation.json 2>&1 | tee image_classifier.train2_validation.log
Now that we have an image classifier, we need to apply it to the second half of the training images (this may take a while).
In [ ]:
TRAIN2_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.train2.classifications_cache.pickle'

train2_pos_ads = cPickle.load(open('train2_pos_ads.pickle'))
train2_neg_ads = cPickle.load(open('train2_neg_ads.pickle'))

if os.path.isfile(TRAIN2_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing train2 image classification results"
    train2_img_descr2classifications = cPickle.load(open(TRAIN2_IMG_CLASSIFICATIONS_CACHE))
else:
    # Apply classifier to image descriptors, collecting positive confidence
    shas_to_compute = train2_pos_shas | train2_neg_shas
    print "Classifying positive/negative train2 image descriptors (count: %d)" % len(shas_to_compute)
    train2_img_descr2classifications = \
        image_classifier.classify_async(image_descr_index.get_many_descriptors(shas_to_compute),
                                        c_file_factory,
                                        use_multiprocessing=True,
                                        ri=1.0)
    # Saving out results
    with open(TRAIN2_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train2_img_descr2classifications, f, -1)
In [ ]:
# Load the master ad <-> SHA1 mappings produced by the previous notebook
pos_ad2shas = cPickle.load(open('positive.ad2shas.pickle'))
pos_sha2ads = cPickle.load(open('positive.sha2ads.pickle'))
neg_ad2shas = cPickle.load(open('negative.ad2shas.pickle'))
neg_sha2ads = cPickle.load(open('negative.sha2ads.pickle'))
In [ ]:
import numpy

def filter_sha2ads(sha_set, ad_set):
    """
    Create a SHA1-to-ads mapping from the master sets, filtered on the given
    SHA1 and ad subsets.

    :param sha_set: Sub-set of SHA1 values to consider.
    :param ad_set: Sub-set of ad IDs to consider.

    :return: New mapping of SHA1 values to parent ads using only the subsets given.
    """
    new_sha2ads = collections.defaultdict(set)
    for sha in sha_set:
        # Get all ads associated with the SHA1, then filter with the given ad set
        new_sha2ads[sha] = (pos_sha2ads[sha] | neg_sha2ads[sha]) & ad_set
    return new_sha2ads

def make_histogram(probs, n_bins):
    """ Make a histogram of the given positive probability values. """
    return numpy.histogram(probs, bins=n_bins, range=[0, 1])[0]

def generate_ad_histograms(img_classifications, sha2ads, hist_size=20, target_class='positive'):
    """
    Generate ad histograms based on an input iterable of ClassificationElements.

    This first aggregates image classification ``target_class`` scores under
    parent ads, and then creates a histogram of ``hist_size`` bins for each ad
    based on the classifications of the images associated with that ad.
    Histograms are normalized to relative frequencies (L1 normalization).

    :param img_classifications: Iterable of ClassificationElement instances for
        all images classified.
    :param sha2ads: Mapping of image descriptor UUID values (image SHA1) to the
        one or more ads that image is a child of.
    :param hist_size: The number of percentile partitions to make for the ad
        histograms. This is literally the size of the histograms output.
    :param target_class: The class probability to extract from input
        ClassificationElement instances.

    :return: Dict mapping of ad ID to its L1-normalized histogram of child image
        classification scores.
    """
    ad2img_probs = collections.defaultdict(dict)
    for c in img_classifications:
        sha = c.uuid
        for ad in sha2ads[sha]:
            ad2img_probs[ad][sha] = c[target_class]

    ad2histogram = {}
    for ad, img_prob_map in ad2img_probs.iteritems():
        h = make_histogram(img_prob_map.values(), hist_size).astype(float)
        if not h.sum():
            raise RuntimeError("Zero-sum histogram for ad: %s (probs: %s)"
                               % (ad, img_prob_map.values()))
        h /= h.sum()
        ad2histogram[ad] = h
    return ad2histogram
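As a quick sanity check of the functions above, the following toy run uses a minimal stand-in for ClassificationElement, providing only the .uuid attribute and [] access that generate_ad_histograms relies on (the mock class and values are illustrative, not part of the pipeline):
In [ ]:
class MockClassification (object):
    """ Minimal stand-in for ClassificationElement, for demonstration only. """
    def __init__(self, uuid, pos_prob):
        self.uuid = uuid
        self._probs = {'positive': pos_prob, 'negative': 1.0 - pos_prob}
    def __getitem__(self, label):
        return self._probs[label]

mock_classifications = [MockClassification('sha_a', 0.9),
                        MockClassification('sha_b', 0.2),
                        MockClassification('sha_c', 0.55)]
# Image 'sha_b' is a child of both ads 1 and 2.
mock_sha2ads = {'sha_a': {1}, 'sha_b': {1, 2}, 'sha_c': {2}}
for ad, h in generate_ad_histograms(mock_classifications, mock_sha2ads,
                                    hist_size=4).iteritems():
    print ad, h  # each histogram is L1-normalized (sums to 1)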
In [ ]:
# Reverse mapping within the scope of train2 only
train2_sha2ads = filter_sha2ads(train2_pos_shas | train2_neg_shas,
                                train2_pos_ads | train2_neg_ads)
assert len(train2_img_descr2classifications) == len(train2_sha2ads), \
    "The number of SHA1 values in the reverse map should equal the number of classifications computed"

train2_ad2histogram = generate_ad_histograms(train2_img_descr2classifications.itervalues(),
                                             train2_sha2ads)
with open('ad_classifier.train2_ad2histogram.pickle', 'w') as f:
    cPickle.dump(train2_ad2histogram, f, -1)
In [ ]:
from smqtk.representation import DescriptorElementFactory
from smqtk.representation.descriptor_element.local_elements import DescriptorMemoryElement
from smqtk.representation.descriptor_index.memory import MemoryDescriptorIndex

ad_histogram_descriptor_index = \
    MemoryDescriptorIndex(file_cache='ad_classifier.ad_histogram_descriptor_index.pickle')
d_mem_factory = DescriptorElementFactory(DescriptorMemoryElement, {})

to_add = set()
for ad in train2_ad2histogram:
    d = d_mem_factory('ad-histogram', ad)
    d.set_vector(train2_ad2histogram[ad])
    if not ad_histogram_descriptor_index.has_descriptor(d.uuid()):
        to_add.add(d)

print "Adding %d missing ad histograms" % len(to_add)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
train2_pos_ad_descriptors = set()
train2_neg_ad_descriptors = set()
for ad in train2_pos_ads:
    train2_pos_ad_descriptors.add(ad_histogram_descriptor_index[ad])
for ad in train2_neg_ads:
    train2_neg_ad_descriptors.add(ad_histogram_descriptor_index[ad])

assert len(train2_pos_ad_descriptors) + len(train2_neg_ad_descriptors) == len(train2_ad2histogram), \
    "Descriptors we send to the classifier for training should add up to the number of histograms " \
    "we computed."
assert len(train2_pos_ad_descriptors) + len(train2_neg_ad_descriptors) == ad_histogram_descriptor_index.count(), \
    "Descriptors we send to the classifier for training should add up to the number of descriptors " \
    "in the index."
In [ ]:
len(train2_pos_ad_descriptors), len(train2_neg_ad_descriptors)
In [ ]:
# Not setting a normalization value because we're already giving it L1-normalized histograms
ad_classifier = LibSvmClassifier('ad_classifier.train2_classifier.model',
                                 'ad_classifier.train2_classifier.labels')

# Randomly sub-sampling positive/negative examples
# TODO: Increase for final model
n_pos = 24365
n_neg = 8300

pos_train_d = sorted(train2_pos_ad_descriptors, key=lambda d: d.uuid())
neg_train_d = sorted(train2_neg_ad_descriptors, key=lambda d: d.uuid())
numpy.random.seed(0)
numpy.random.shuffle(pos_train_d)
numpy.random.shuffle(neg_train_d)
pos_train_d = pos_train_d[:n_pos]
neg_train_d = neg_train_d[:n_neg]

if not ad_classifier.has_model():
    ad_classifier.train(positive=pos_train_d, negative=neg_train_d)
else:
    print "Ad classifier already trained"
In [ ]:
# Apply image classifier to train3 image descriptors
train3_pos_shas = cPickle.load(open('train3_pos_shas.pickle'))
train3_neg_shas = cPickle.load(open('train3_neg_shas.pickle'))

TRAIN3_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.train3.classifications_cache.pickle'
if os.path.isfile(TRAIN3_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing train3 image classification results"
    with open(TRAIN3_IMG_CLASSIFICATIONS_CACHE) as f:
        train3_img_descr2classifications = cPickle.load(f)
else:
    train3_img_descriptors = image_descr_index.get_many_descriptors(train3_pos_shas | train3_neg_shas)
    train3_img_descr2classifications = image_classifier.classify_async(train3_img_descriptors,
                                                                       c_file_factory,
                                                                       use_multiprocessing=True,
                                                                       ri=1.)
    with open(TRAIN3_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train3_img_descr2classifications, f, -1)

# We should have a classification for every input descriptor
assert len(train3_img_descr2classifications) == len(train3_pos_shas | train3_neg_shas), \
    "There should be a classification for every input descriptor, but the counts do not match"
In [ ]:
# Create index of ad histogram descriptors
train3_pos_ads = cPickle.load(open('train3_pos_ads.pickle'))
train3_neg_ads = cPickle.load(open('train3_neg_ads.pickle'))

train3_sha2ads = filter_sha2ads(train3_pos_shas | train3_neg_shas,
                                train3_pos_ads | train3_neg_ads)
assert len(train3_sha2ads) == len(train3_img_descr2classifications)

train3_ad2histogram = generate_ad_histograms(train3_img_descr2classifications.itervalues(),
                                             train3_sha2ads)
with open('ad_classifier.train3_ad2histogram.pickle', 'w') as f:
    cPickle.dump(train3_ad2histogram, f, -1)

to_add = set()
for ad in train3_ad2histogram:
    if not ad_histogram_descriptor_index.has_descriptor(ad):
        d = DescriptorMemoryElement('ad-histogram', ad)
        d.set_vector(train3_ad2histogram[ad])
        to_add.add(d)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
TRAIN3_VALIDATION_AD_TRUTH_CSV = 'ad_classifier.train3_validation.csv'
with open(TRAIN3_VALIDATION_AD_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for ad in train3_pos_ads:
        writer.writerow([ad, 'positive'])
    for ad in train3_neg_ads:
        writer.writerow([ad, 'negative'])
Now the classifier validation script can be run to validate the train2-data trained ad classifier on train3 data:
classifier_model_validation.py -vc ad_classifier.train3_validation.json 2>&1 | tee ad_classifier.train3_validation.log
In [ ]:
def generate_cluster_histograms(ad_classifications, ad2cluster, hist_size=20, target_class='positive'):
    """
    Generate cluster histograms based on an input iterable of ClassificationElements.

    :param ad_classifications: Iterable of ClassificationElement instances for
        all ads classified.
    :param ad2cluster: Mapping of ad descriptor UUID values (doc ID) to the
        cluster it is a child of.
    :param hist_size: The number of percentile partitions to make for the cluster
        histograms. This is literally the size of the histograms output.
    :param target_class: The class probability to extract from input
        ClassificationElement instances.

    :return: Dict mapping of cluster ID to its L1-normalized histogram of child
        ad classification scores.
    """
    cluster2ad_probs = collections.defaultdict(dict)
    for c in ad_classifications:
        ad = c.uuid
        cluster = ad2cluster[ad]
        cluster2ad_probs[cluster][ad] = c[target_class]

    cluster2histogram = {}
    for cluster, ad_prob_map in cluster2ad_probs.iteritems():
        h = make_histogram(ad_prob_map.values(), hist_size).astype(float)
        if not h.sum():
            raise RuntimeError("Zero-sum histogram for cluster: %s (probs: %s)"
                               % (cluster, ad_prob_map.values()))
        h /= h.sum()
        cluster2histogram[cluster] = h
    return cluster2histogram
In [ ]:
# Reverse mapping of ads to their parent cluster; each ad belongs to exactly one cluster.
pos_cluster2ads = cPickle.load(open('positive.cluster2ads.pickle'))
neg_cluster2ads = cPickle.load(open('negative.cluster2ads.pickle'))

ad2cluster = {}
for c in pos_cluster2ads:
    for ad in pos_cluster2ads[c]:
        ad2cluster[ad] = c
for c in neg_cluster2ads:
    for ad in neg_cluster2ads[c]:
        ad2cluster[ad] = c
In [ ]:
# Generate classifications for train3 ad histograms
TRAIN3_AD_CLASSIFICATIONS_CACHE = 'ad_classifier.train3.classifications_cache.pickle'

ad_c_file_factory = ClassificationElementFactory(FileClassificationElement, {
    "save_dir": "ad_classifier.classifications",
    "subdir_split": 10,
})

if os.path.isfile(TRAIN3_AD_CLASSIFICATIONS_CACHE):
    print "Loading existing train3 ad classification results"
    train3_ad_descr2classifications = cPickle.load(open(TRAIN3_AD_CLASSIFICATIONS_CACHE))
else:
    ads_to_compute = train3_pos_ads | train3_neg_ads
    train3_ad_descr2classifications = \
        ad_classifier.classify_async(ad_histogram_descriptor_index.get_many_descriptors(ads_to_compute),
                                     ad_c_file_factory,
                                     use_multiprocessing=True,
                                     ri=1.)
    with open(TRAIN3_AD_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(train3_ad_descr2classifications, f, -1)
In [ ]:
train3_cluster2histogram = generate_cluster_histograms(train3_ad_descr2classifications.itervalues(),
                                                       ad2cluster)
In [ ]:
cluster_histogram_descriptor_index = \
    MemoryDescriptorIndex('cluster_classifier.cluster_histogram_descriptor_index.pickle')

to_add = set()
for c in train3_cluster2histogram:
    if not cluster_histogram_descriptor_index.has_descriptor(c):
        d = d_mem_factory('cluster-histogram', c)
        d.set_vector(train3_cluster2histogram[c])
        to_add.add(d)

print "Adding %d missing cluster histograms" % len(to_add)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Train cluster histogram classifier
train3_pos_clusters = cPickle.load(open('train3_pos_clusters.pickle'))
train3_neg_clusters = cPickle.load(open('train3_neg_clusters.pickle'))

train3_pos_c_descriptors = set()
train3_neg_c_descriptors = set()
for c in train3_pos_clusters:
    train3_pos_c_descriptors.add(cluster_histogram_descriptor_index[c])
for c in train3_neg_clusters:
    train3_neg_c_descriptors.add(cluster_histogram_descriptor_index[c])
In [ ]:
len(train3_pos_c_descriptors), len(train3_neg_c_descriptors)
In [ ]:
cluster_classifier = LibSvmClassifier('cluster_classifier.train3.classifier.model',
                                      'cluster_classifier.train3.classifier.label')
if not cluster_classifier.has_model():
    cluster_classifier.train(positive=train3_pos_c_descriptors, negative=train3_neg_c_descriptors)
else:
    print "Cluster classifier already trained"
In [ ]:
# Apply image classifier to test image SHA1s
test_pos_shas = cPickle.load(open('test_pos_shas.pickle'))
test_neg_shas = cPickle.load(open('test_neg_shas.pickle'))

TEST_IMG_CLASSIFICATIONS_CACHE = 'image_classifier.test.classifications_cache.pickle'
if os.path.isfile(TEST_IMG_CLASSIFICATIONS_CACHE):
    print "Loading existing test image classification results"
    with open(TEST_IMG_CLASSIFICATIONS_CACHE) as f:
        test_img_descr2classifications = cPickle.load(f)
else:
    test_img_descriptors = image_descr_index.get_many_descriptors(test_pos_shas | test_neg_shas)
    test_img_descr2classifications = \
        image_classifier.classify_async(test_img_descriptors,
                                        c_file_factory,
                                        use_multiprocessing=True,
                                        ri=1.)
    with open(TEST_IMG_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(test_img_descr2classifications, f, -1)

assert len(test_img_descr2classifications) == len(test_pos_shas | test_neg_shas)
In [ ]:
# Create ad histograms for the test set
test_pos_ads = cPickle.load(open('test_pos_ads.pickle'))
test_neg_ads = cPickle.load(open('test_neg_ads.pickle'))

test_sha2ads = filter_sha2ads(test_pos_shas | test_neg_shas,
                              test_pos_ads | test_neg_ads)
assert len(test_sha2ads) == len(test_img_descr2classifications)

test_ad2histogram = generate_ad_histograms(test_img_descr2classifications.itervalues(),
                                           test_sha2ads)
with open('ad_classifier.test_ad2histogram.pickle', 'w') as f:
    cPickle.dump(test_ad2histogram, f, -1)

# Add histograms as descriptors to the index
to_add = set()
for ad in test_ad2histogram:
    if not ad_histogram_descriptor_index.has_descriptor(ad):
        d = DescriptorMemoryElement('ad-histogram', ad)
        d.set_vector(test_ad2histogram[ad])
        to_add.add(d)

print "Adding %d test ad histograms to index" % len(to_add)
ad_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Classifying test ad histograms
TEST_AD_CLASSIFICATIONS_CACHE = 'ad_classifier.test.classifications_cache.pickle'
if os.path.isfile(TEST_AD_CLASSIFICATIONS_CACHE):
    with open(TEST_AD_CLASSIFICATIONS_CACHE) as f:
        test_ad_descr2classifications = cPickle.load(f)
else:
    test_ad_descriptors = ad_histogram_descriptor_index.get_many_descriptors(test_pos_ads | test_neg_ads)
    test_ad_descr2classifications = \
        ad_classifier.classify_async(test_ad_descriptors,
                                     ad_c_file_factory,
                                     use_multiprocessing=True,
                                     ri=1.0)
    with open(TEST_AD_CLASSIFICATIONS_CACHE, 'w') as f:
        cPickle.dump(test_ad_descr2classifications, f, -1)
In [ ]:
# (Re)load the cluster histogram descriptor index from its file cache, in case
# we are resuming the notebook from this point.
cluster_histogram_descriptor_index = \
    MemoryDescriptorIndex('cluster_classifier.cluster_histogram_descriptor_index.pickle')
In [ ]:
# Create cluster histograms for the test set
test_cluster2histogram = generate_cluster_histograms(test_ad_descr2classifications.itervalues(),
                                                     ad2cluster)

test_pos_clusters = cPickle.load(open('test_pos_clusters.pickle'))
test_neg_clusters = cPickle.load(open('test_neg_clusters.pickle'))
assert len(test_cluster2histogram) == len(test_pos_clusters | test_neg_clusters)

to_add = set()
for c in test_cluster2histogram:
    if not cluster_histogram_descriptor_index.has_descriptor(c):
        d = DescriptorMemoryElement('cluster-histogram', c)
        d.set_vector(test_cluster2histogram[c])
        to_add.add(d)

print "Adding %d test cluster histograms to index" % len(to_add)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add
In [ ]:
# Checking that the descriptor index has the histograms for the test set clusters
for c in test_pos_clusters:
    assert cluster_histogram_descriptor_index.has_descriptor(c)
for c in test_neg_clusters:
    assert cluster_histogram_descriptor_index.has_descriptor(c)
In [ ]:
# Create a CSV for cluster classifier validation against the test set
TEST_VALIDATION_CLUSTER_TRUTH_CSV = 'cluster_classifier.test_validation.csv'
with open(TEST_VALIDATION_CLUSTER_TRUTH_CSV, 'w') as f:
    writer = csv.writer(f)
    for c in test_pos_clusters:
        writer.writerow([c, 'positive'])
    for c in test_neg_clusters:
        writer.writerow([c, 'negative'])
Now the classifier validation script can be run to validate the train3-data trained cluster classifier on test data:
classifier_model_validation.py -vc cluster_classifier.test_validation.json 2>&1 | tee cluster_classifier.test_validation.log
Normally, the classifier_model_validation.py script interprets UUID values from the CSV file as strings, but since we (incorrectly) stored everything with the cluster ID as an integer, I temporarily changed the script to interpret the UUID field as an integer.
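An alternative that would avoid patching the script (a sketch only, not what was actually done here) is to mirror each integer-keyed cluster histogram in the index under its string form, so the string-typed CSV values resolve as read:
In [ ]:
# Sketch: re-key cluster histograms under string UUIDs so that the validation
# script's string-typed CSV values match descriptors in the index.
to_add = set()
for c in test_cluster2histogram:
    str_uuid = str(c)
    if not cluster_histogram_descriptor_index.has_descriptor(str_uuid):
        d = DescriptorMemoryElement('cluster-histogram', str_uuid)
        d.set_vector(test_cluster2histogram[c])
        to_add.add(d)
cluster_histogram_descriptor_index.add_many_descriptors(to_add)
del to_add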