Loading ad clusters and image SHA1s

Load the cluster, ad, and image SHA1 relationships from the base file.

Filter based on computed descriptors.

Some images were not valid (e.g. they were GIFs or HTML pages) and should not be considered. As a result, some ads may no longer have any child images and should likewise be dropped, as should any clusters left with no child ads.

Maps and files saved from this point on have been filtered by what has been actually computed.
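The cascading filter described above can be sketched on toy data (all names and values here are hypothetical; the real maps are loaded below):

```python
# Hypothetical toy data illustrating the cascading filter:
# drop invalid images, then ads with no remaining images,
# then clusters with no remaining ads.
ad2shas = {"ad1": {"s1", "s2"}, "ad2": {"s3"}}
cluster2ads = {1: {"ad1", "ad2"}, 2: {"ad2"}}
valid_shas = {"s1"}  # e.g. SHA1s whose descriptors were actually computed

# Keep only valid images per ad, and only ads with >= 1 valid image.
ad2shas = {ad: shas & valid_shas
           for ad, shas in ad2shas.items()
           if shas & valid_shas}
# Keep only surviving ads per cluster, and only non-empty clusters.
cluster2ads = {c: ads & set(ad2shas)
               for c, ads in cluster2ads.items()
               if ads & set(ad2shas)}

print(ad2shas)      # {'ad1': {'s1'}}
print(cluster2ads)  # {1: {'ad1'}}
```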


In [ ]:
import collections
import cPickle
import csv

# SHA1 values of images actually computed for descriptors
pos_computed_shas = {r[1] for r in csv.reader(open('positive.cmd.processed.csv'))}
neg_computed_shas = {r[1] for r in csv.reader(open('negative.cmd.processed.csv'))}

pos_cluster2ads = collections.defaultdict(set)
pos_cluster2shas = collections.defaultdict(set)
pos_ad2shas = collections.defaultdict(set)
pos_sha2ads = collections.defaultdict(set)

neg_cluster2ads = collections.defaultdict(set)
neg_cluster2shas = collections.defaultdict(set)
neg_ad2shas = collections.defaultdict(set)
neg_sha2ads = collections.defaultdict(set)

print "Loading positive CSV file"
with open('./positive.CP1_clusters_ads_images.csv') as f:
    reader = csv.reader(f)
    for i, r in enumerate(reader):
        if i == 0:
            # skip header line
            continue
        c, ad, sha = r
        c = int(c)
        if sha in pos_computed_shas:
            pos_cluster2ads[c].add(ad)
            pos_cluster2shas[c].add(sha)
            pos_ad2shas[ad].add(sha)
            pos_sha2ads[sha].add(ad)
        
print "Loading negative CSV file"
with open('./negative.CP1_clusters_ads_images.csv') as f:
    reader = csv.reader(f)
    for i, r in enumerate(reader):
        if i == 0:
            # skip header line
            continue
        c, ad, sha = r
        c = int(c)
        if sha in neg_computed_shas:
            neg_cluster2ads[c].add(ad)
            neg_cluster2shas[c].add(sha)
            neg_ad2shas[ad].add(sha)
            neg_sha2ads[sha].add(ad)
        
print "Done"

In [ ]:
# If negative cluster IDs intersect positive cluster IDs,
# re-assign negative cluster IDs by increasing by max(pos_cluster_ids)
pos_cluster_ids = set(pos_cluster2ads)
neg_cluster_ids = set(neg_cluster2ads)
if pos_cluster_ids & neg_cluster_ids:
    print "Reassigning cluster IDs"
    offset = max(pos_cluster_ids)
    new_neg_cluster2ads  = collections.defaultdict(set)
    new_neg_cluster2shas = collections.defaultdict(set)
    
    neg_cluster_id_old2new = {}
    
    for cid in sorted(neg_cluster_ids, reverse=True):
        print "- %d -> %d" % (cid, cid+offset)
        neg_cluster_id_old2new[cid] = cid+offset
        
        new_neg_cluster2ads[cid+offset] = neg_cluster2ads[cid]
        new_neg_cluster2shas[cid+offset] = neg_cluster2shas[cid]
    
    neg_cluster_ids = set(new_neg_cluster2ads)
    neg_cluster2ads = new_neg_cluster2ads
    neg_cluster2shas = new_neg_cluster2shas
    del new_neg_cluster2ads, new_neg_cluster2shas
    
    with open('negative.cluster_id_reassignment.old2new.pickle', 'w') as f:
        print "Saving reassignment mapping"
        cPickle.dump(neg_cluster_id_old2new, f, -1)
    print "Done"

In [ ]:
# The SHA1s collected should now be a subset of the SHA1s computed
print len( {s for c, shas in pos_cluster2shas.iteritems() for s in shas}.difference(pos_computed_shas) )
print len( {s for c, shas in neg_cluster2shas.iteritems() for s in shas}.difference(neg_computed_shas) )

# Number of intersecting SHA1s between the positive and negative sets
print len(set(pos_sha2ads) & set(neg_sha2ads))

In [ ]:
# Saving clusters
import json

def convert_dict(a):
    return dict( (k, list(v)) for k, v in a.iteritems() )

json_params = {"indent": 2, "separators": (',', ': '), "sort_keys": True}


# Saving positive info
print "Saving POS cluster->ads"
with open('positive.cluster2ads.pickle', 'w') as f:
    cPickle.dump( pos_cluster2ads, f, -1 )

print "Saving POS cluster->image shas"
with open('positive.cluster2shas.pickle', 'w') as f:
    cPickle.dump( pos_cluster2shas, f, -1 )

print "Saving POS ad->image shas"
with open('positive.ad2shas.pickle', 'w') as f:
    cPickle.dump( pos_ad2shas, f, -1 )
    
print "Saving POS SHA1->ads"
with open('positive.sha2ads.pickle', 'w') as f:
    cPickle.dump( pos_sha2ads, f, -1 )


# Saving negative info
print "Saving NEG cluster->ads"
with open('negative.cluster2ads.pickle', 'w') as f:
    cPickle.dump( neg_cluster2ads, f, -1 )

print "Saving NEG cluster->image shas"
with open('negative.cluster2shas.pickle', 'w') as f:
    cPickle.dump( neg_cluster2shas, f, -1 )

print "Saving NEG ad->image shas"
with open('negative.ad2shas.pickle', 'w') as f:
    cPickle.dump( neg_ad2shas, f, -1 )

print "Saving NEG SHA1->ads"
with open('negative.sha2ads.pickle', 'w') as f:
    cPickle.dump( neg_sha2ads, f, -1 )


print "Done"

Creating Train and Test sets

Separating data based on Clusters

Clusters are ordered by their total number of child images and then dealt out round-robin, giving train/test sets of approximately the same relative sizes.

Train 1 has so many images because one cluster (cluster 417) has ~30k child images.

We create 3 "train" sets:

  • train1: one for training the image classifier
  • train2: one for applying the image classifier and training the ad classifier
  • train3: one for applying the image+ad classifier and training the cluster classifier
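The size-ordered round-robin split used below can be illustrated on toy clusters (the cluster IDs and image counts here are hypothetical):

```python
# Toy cluster -> image-count map (hypothetical values); the real code
# orders pos_cluster2shas / neg_cluster2shas the same way.
cluster_sizes = {10: 5, 11: 30000, 12: 7, 13: 2, 14: 9, 15: 1}

# Order clusters largest-first (ties broken by cluster ID) and deal
# them out round-robin: indices 0,1,2 -> train1/2/3, index 3 -> test.
ordered = sorted(cluster_sizes, key=lambda c: (cluster_sizes[c], c), reverse=True)
train1 = {c for i, c in enumerate(ordered) if i % 4 == 0}
train2 = {c for i, c in enumerate(ordered) if i % 4 == 1}
train3 = {c for i, c in enumerate(ordered) if i % 4 == 2}
test   = {c for i, c in enumerate(ordered) if i % 4 == 3}

print(ordered)  # [11, 14, 12, 10, 13, 15]
print(train1, train2, train3, test)
```

Because consecutive clusters in the size ordering land in different sets, each set receives a comparable mix of large and small clusters.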

In [ ]:
pos_clusters_ordered = sorted( pos_cluster2shas, 
                               key=lambda c: ( len(pos_cluster2shas[c]), c ),
                               reverse=1 )
neg_clusters_ordered = sorted( neg_cluster2shas,
                               key=lambda c: ( len(neg_cluster2shas[c]), c ),
                               reverse=1)

# Image classifier training clusters/ads/shas
train1_pos_clusters = { c   for i, c in enumerate(pos_clusters_ordered) if i % 4 == 0 }
train1_neg_clusters = { c   for i, c in enumerate(neg_clusters_ordered) if i % 4 == 0 }
train1_pos_ads      = { ad  for c in train1_pos_clusters for ad  in pos_cluster2ads[c] }
train1_neg_ads      = { ad  for c in train1_neg_clusters for ad  in neg_cluster2ads[c] }
train1_pos_shas     = { sha for c in train1_pos_clusters for sha in pos_cluster2shas[c] }
train1_neg_shas     = { sha for c in train1_neg_clusters for sha in neg_cluster2shas[c] }

# Ad histogram classifier training clusters/ads/shas
train2_pos_clusters = { c for i, c in enumerate(pos_clusters_ordered) if i % 4 == 1 }
train2_neg_clusters = { c for i, c in enumerate(neg_clusters_ordered) if i % 4 == 1 }
train2_pos_ads      = { ad  for c in train2_pos_clusters for ad  in pos_cluster2ads[c] }
train2_neg_ads      = { ad  for c in train2_neg_clusters for ad  in neg_cluster2ads[c] }
train2_pos_shas     = { sha for c in train2_pos_clusters for sha in pos_cluster2shas[c] }
train2_neg_shas     = { sha for c in train2_neg_clusters for sha in neg_cluster2shas[c] }

# Cluster histogram classifier training clusters/ads/shas
train3_pos_clusters = { c for i, c in enumerate(pos_clusters_ordered) if i % 4 == 2 }
train3_neg_clusters = { c for i, c in enumerate(neg_clusters_ordered) if i % 4 == 2 }
train3_pos_ads      = { ad  for c in train3_pos_clusters for ad  in pos_cluster2ads[c] }
train3_neg_ads      = { ad  for c in train3_neg_clusters for ad  in neg_cluster2ads[c] }
train3_pos_shas     = { sha for c in train3_pos_clusters for sha in pos_cluster2shas[c] }
train3_neg_shas     = { sha for c in train3_neg_clusters for sha in neg_cluster2shas[c] }

# Test/Validation clusters/ads/shas
test_pos_clusters   = { c for i, c in enumerate(pos_clusters_ordered) if i % 4 == 3 }
test_neg_clusters   = { c for i, c in enumerate(neg_clusters_ordered) if i % 4 == 3 }
test_pos_ads        = { ad  for c in test_pos_clusters for ad  in pos_cluster2ads[c] }
test_neg_ads        = { ad  for c in test_neg_clusters for ad  in neg_cluster2ads[c] }
test_pos_shas       = { sha for c in test_pos_clusters for sha in pos_cluster2shas[c] }
test_neg_shas       = { sha for c in test_neg_clusters for sha in neg_cluster2shas[c] }

In [ ]:
print "Train 1 (image)"
print "  (pos)| clusters:", len(train1_pos_clusters)
print "       | ads:",      len(train1_pos_ads)
print "       | images:",   len(train1_pos_shas)
print
print "  (neg)| clusters:", len(train1_neg_clusters)
print "       | ads:",      len(train1_neg_ads)
print "       | images:",   len(train1_neg_shas)
print
print "Train 2 (ads)"
print "  (pos)| clusters:", len(train2_pos_clusters)
print "       | ads:",      len(train2_pos_ads)
print "       | images:",   len(train2_pos_shas)
print
print "  (neg)| clusters:", len(train2_neg_clusters)
print "       | ads:",      len(train2_neg_ads)
print "       | images:",   len(train2_neg_shas)
print
print "Train 3 (cluster)"
print "  (pos)| clusters:", len(train3_pos_clusters)
print "       | ads:",      len(train3_pos_ads)
print "       | images:",   len(train3_pos_shas)
print
print "  (neg)| clusters:", len(train3_neg_clusters)
print "       | ads:",      len(train3_neg_ads)
print "       | images:",   len(train3_neg_shas)
print
print "Test"
print "  (pos)| clusters:", len(test_pos_clusters)
print "       | ads:",      len(test_pos_ads)
print "       | images:",   len(test_pos_shas)
print
print "  (neg)| clusters:", len(test_neg_clusters)
print "       | ads:",      len(test_neg_ads)
print "       | images:",   len(test_neg_shas)

In [ ]:
# Train1 - for image classifier
cPickle.dump(train1_pos_clusters, open('train1_pos_clusters.pickle', 'w'), -1)
cPickle.dump(train1_pos_ads,      open('train1_pos_ads.pickle', 'w'), -1)
cPickle.dump(train1_pos_shas,     open('train1_pos_shas.pickle', 'w'), -1)

cPickle.dump(train1_neg_clusters, open('train1_neg_clusters.pickle', 'w'), -1)
cPickle.dump(train1_neg_ads,      open('train1_neg_ads.pickle', 'w'), -1)
cPickle.dump(train1_neg_shas,     open('train1_neg_shas.pickle', 'w'), -1)

# Train2 - for ad histogram classifier
cPickle.dump(train2_pos_clusters, open('train2_pos_clusters.pickle', 'w'), -1)
cPickle.dump(train2_pos_ads,      open('train2_pos_ads.pickle', 'w'), -1)
cPickle.dump(train2_pos_shas,     open('train2_pos_shas.pickle', 'w'), -1)

cPickle.dump(train2_neg_clusters, open('train2_neg_clusters.pickle', 'w'), -1)
cPickle.dump(train2_neg_ads,      open('train2_neg_ads.pickle', 'w'), -1)
cPickle.dump(train2_neg_shas,     open('train2_neg_shas.pickle', 'w'), -1)

# Train3 - for cluster histogram classifier
cPickle.dump(train3_pos_clusters, open('train3_pos_clusters.pickle', 'w'), -1)
cPickle.dump(train3_pos_ads,      open('train3_pos_ads.pickle', 'w'), -1)
cPickle.dump(train3_pos_shas,     open('train3_pos_shas.pickle', 'w'), -1)

cPickle.dump(train3_neg_clusters, open('train3_neg_clusters.pickle', 'w'), -1)
cPickle.dump(train3_neg_ads,      open('train3_neg_ads.pickle', 'w'), -1)
cPickle.dump(train3_neg_shas,     open('train3_neg_shas.pickle', 'w'), -1)

# Test - for image/ad/cluster classifier validation
cPickle.dump(test_pos_clusters, open('test_pos_clusters.pickle', 'w'), -1)
cPickle.dump(test_pos_ads,      open('test_pos_ads.pickle', 'w'), -1)
cPickle.dump(test_pos_shas,     open('test_pos_shas.pickle', 'w'), -1)

cPickle.dump(test_neg_clusters, open('test_neg_clusters.pickle', 'w'), -1)
cPickle.dump(test_neg_ads,      open('test_neg_ads.pickle', 'w'), -1)
cPickle.dump(test_neg_shas,     open('test_neg_shas.pickle', 'w'), -1)

In [ ]:
# Creating ground-truth JSON-lines file for the test set, for use with the MEMEX-provided evaluation script
# format: {"cluster_id": "<number>", "class": <int>}
# Class value should be:
# - 1 for positive
# - 0 for negative
with open('test_eval_gt.jl', 'w') as f:
    for c in sorted(pos_cluster_ids):
        f.write( json.dumps({"cluster_id": str(c), "class": 1}) + "\n" )
    for c in sorted(neg_cluster_ids):
        f.write( json.dumps({"cluster_id": str(c), "class": 0}) + "\n" )

SHA1 Intersection Investigation

Some images have been found to be shared across ads in different clusters. Since clusters are supposed to represent distinctly separate entities or relationships, this suggests that the clusters either are not linkable via multimedia alone, were incorrectly clustered, or were deliberately split up. For our approach (image-based classification), this means the same images can potentially show up in several of the train/test/evaluation data sets.

Traditionally, the presence of the same or similar images in both train and test sets leads to faulty evaluation: the classifier has an easier time handling data it was trained on, and thus earns artificially higher scores. That applies here too, in that train/test scores will probably be higher with such images present on both sides. However, if shared imagery is a strong positive indicator of a new HT ad, then its repeated positive recognition is a boon. On the other hand, we may still want to measure how the classifier performs on abstract features alone, excluding repeat imagery.
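One way to measure performance without repeat imagery, as suggested above, is simply to drop shared SHA1s from the test side. A sketch on toy sets (the names mirror the notebook's variables, but the values here are hypothetical):

```python
# Hypothetical toy sets; in the notebook these would be the
# train{1,2,3}_pos_shas and test_pos_shas sets built above.
train1_shas = {"a", "b"}
train2_shas = {"b", "c"}
train3_shas = {"d"}
test_shas   = {"a", "d", "e"}

all_train_shas = train1_shas | train2_shas | train3_shas
leaked = all_train_shas & test_shas           # images also seen in training
test_shas_clean = test_shas - all_train_shas  # leakage-free test set

print(sorted(leaked))           # ['a', 'd']
print(sorted(test_shas_clean))  # ['e']
```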


In [ ]:
train_1_2_intersection = train1_pos_shas & train2_pos_shas
train_2_3_intersection = train2_pos_shas & train3_pos_shas
train_test_intersection = (train1_pos_shas | train2_pos_shas | train3_pos_shas) & test_pos_shas

print len(train_1_2_intersection)
print len(train_2_3_intersection)
print len(train_test_intersection)

Cluster image intersection observations

Found non-trivial images (i.e. not website logos or similar) in separate clusters:

train_1_shas & train_2_shas & test_shas

For example, the image with hash 372a91ac487a27554d0017c48f66facbaa9a8f19 is shared between positive clusters 348, 690, 673, and 148. Visually verifying the images in the ads containing that hash shows what is definitely the same person. Other ads also contain images that are near duplicates of images in different ads.
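Cross-cluster images like the one above can be found mechanically by inverting the cluster-to-SHA1 map. A toy sketch (the map here is hypothetical; `pos_cluster2shas` would be the real input):

```python
import collections

# Toy cluster -> SHA1 map (hypothetical); real input is pos_cluster2shas.
cluster2shas = {348: {"x", "y"}, 690: {"x"}, 673: {"x", "z"}, 148: {"x"}}

# Invert to SHA1 -> clusters, then keep SHA1s seen in more than one cluster.
sha2clusters = collections.defaultdict(set)
for c, shas in cluster2shas.items():
    for sha in shas:
        sha2clusters[sha].add(c)

shared = {sha: sorted(cs) for sha, cs in sha2clusters.items() if len(cs) > 1}
print(shared)  # {'x': [148, 348, 673, 690]}
```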