Creating Train/Test split files for Caffe

We need to create two files, train.txt and test.txt, that list the image files in each set together with their 0/1 labels, one "path label" pair per line (0 == negative, 1 == positive). In our case, positive means "relevant for HT investigation".

In practice, we also create train and test listings that contain equal numbers of positive and negative examples. This avoids the class bias that can otherwise arise during training. For example, if ~70% of the examples were negative, the CNN could learn to call everything negative and still reach ~70% accuracy, which is clearly not desired.
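
The ~70% figure above is simply the accuracy of a classifier that always predicts the more common class. A minimal sketch of that arithmetic follows; the function name and the counts are hypothetical and used for illustration only.

```python
def majority_baseline_accuracy(num_pos, num_neg):
    """Accuracy of a classifier that always predicts the more common class."""
    total = num_pos + num_neg
    if total == 0:
        raise ValueError("need at least one example")
    return max(num_pos, num_neg) / float(total)


# Hypothetical counts, for illustration only
print(majority_baseline_accuracy(300, 700))  # imbalanced set -> 0.7
print(majority_baseline_accuracy(500, 500))  # balanced set   -> 0.5
```

On a balanced set the trivial baseline drops to 50%, so any accuracy above that reflects something the network actually learned.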


In [ ]:
__depends__ = [
    'image_path_to_sha1.csv',
    'train_pos_shas.pickle',
    'test_pos_shas.pickle',
    'train_neg_shas.pickle',
    'test_neg_shas.pickle',
]
__dest__ = [
    'alexnet_adam/caffe.train.txt',
    'alexnet_adam/caffe.test.txt',
    'alexnet_adam/caffe.even.train.txt',
    'alexnet_adam/caffe.even.test.txt',
]

In [ ]:
# CSV of image path to its SHA1 checksum
# - should only list valid image files that Caffe can load
IMAGE_PATH_TO_SHA1 = "image_path_to_sha1.csv"
# Input sets of SHAs for train/test split
TRAIN_POS_SHA1S = 'train_pos_shas.pickle'
TRAIN_NEG_SHA1S = 'train_neg_shas.pickle'
TEST_POS_SHA1S = 'test_pos_shas.pickle'
TEST_NEG_SHA1S = 'test_neg_shas.pickle'
# Output files for training with Caffe
RAW_TRAIN_TXT = 'alexnet_adam/caffe.train.txt'
RAW_TEST_TXT = 'alexnet_adam/caffe.test.txt'
EVEN_TRAIN_TXT = 'alexnet_adam/caffe.even.train.txt'
EVEN_TEST_TXT = 'alexnet_adam/caffe.even.test.txt'

In [ ]:
import cPickle
import csv
import random

In [ ]:
# Mapping of SHA1 value to the path of the original image file (CSV rows: path, SHA1)
with open(IMAGE_PATH_TO_SHA1) as f:
    sha2path = dict((r[1], r[0]) for r in csv.reader(f))

# Load the SHA1 collections that define the train/test split
def load_pickle(path):
    with open(path, 'rb') as f:
        return cPickle.load(f)

train_pos_shas = load_pickle(TRAIN_POS_SHA1S)
train_neg_shas = load_pickle(TRAIN_NEG_SHA1S)
test_pos_shas  = load_pickle(TEST_POS_SHA1S)
test_neg_shas  = load_pickle(TEST_NEG_SHA1S)
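
Before writing any listing, it may be worth checking the loaded data, since a SHA1 with no entry in the path mapping would raise a KeyError in the cells that follow, and a SHA1 shared between groups would leak examples across the split. The helper below is a sketch under those assumptions; check_split is a hypothetical name and not part of the original notebook.

```python
def check_split(sha2path, train_pos, train_neg, test_pos, test_neg):
    """Return a list of human-readable problems found in the split (empty if none)."""
    problems = []
    groups = {
        'train_pos': set(train_pos),
        'train_neg': set(train_neg),
        'test_pos': set(test_pos),
        'test_neg': set(test_neg),
    }
    known = set(sha2path)
    names = sorted(groups)

    # Every SHA1 must map to an image path, or the listing cannot be written
    for name in names:
        missing = groups[name] - known
        if missing:
            problems.append("%s: %d SHA1s have no image path" % (name, len(missing)))

    # No SHA1 should appear in more than one group
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = groups[first] & groups[second]
            if shared:
                problems.append("%s and %s share %d SHA1s" % (first, second, len(shared)))

    return problems
```

In this notebook it could be called as check_split(sha2path, train_pos_shas, train_neg_shas, test_pos_shas, test_neg_shas), with an empty list meaning that no problems were found.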

In [ ]:
# Write raw train/test files
#
# Remember:
#   0 == negative
#   1 == positive
with open(RAW_TRAIN_TXT, 'w') as f:
    for sha in train_pos_shas:
        fp = sha2path[sha]
        f.write(fp + ' 1\n')
    for sha in train_neg_shas:
        fp = sha2path[sha]
        f.write(fp + ' 0\n')

with open(RAW_TEST_TXT, 'w') as f:
    for sha in test_pos_shas:
        fp = sha2path[sha]
        f.write(fp + ' 1\n')
    for sha in test_neg_shas:
        fp = sha2path[sha]
        f.write(fp + ' 0\n')
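
The same four-loop pattern is repeated for every output file, so it could be factored into one helper. This is only a sketch: write_listing is a hypothetical name, and it assumes the same "path label" line format used above.

```python
def write_listing(out_path, sha2path, pos_shas, neg_shas):
    """Write one '<image path> <label>' line per SHA1: positives (1), then negatives (0)."""
    with open(out_path, 'w') as f:
        for label, shas in ((1, pos_shas), (0, neg_shas)):
            for sha in shas:
                f.write('%s %d\n' % (sha2path[sha], label))
```

With it, each file becomes a single call such as write_listing(RAW_TRAIN_TXT, sha2path, train_pos_shas, train_neg_shas). Note that the listing stays grouped by label, so any shuffling is left to whatever consumes the file (for example, the shuffle option of Caffe's ImageData layer).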

In [ ]:
# Output test and train sets with equal balance, randomly sub-sampling where needed
even_train_size = min([len(train_pos_shas), len(train_neg_shas)])
even_test_size = min([len(test_pos_shas), len(test_neg_shas)])

print "Even train size:", even_train_size
print "Even test size :", even_test_size

random.seed(0)
even_train_pos = random.sample(train_pos_shas, even_train_size)
even_train_neg = random.sample(train_neg_shas, even_train_size)
even_test_pos = random.sample(test_pos_shas, even_test_size)
even_test_neg = random.sample(test_neg_shas, even_test_size)

with open(EVEN_TRAIN_TXT, 'w') as f:
    for sha in even_train_pos:
        fp = sha2path[sha]
        f.write(fp + ' 1\n')
    for sha in even_train_neg:
        fp = sha2path[sha]
        f.write(fp + ' 0\n')

with open(EVEN_TEST_TXT, 'w') as f:
    for sha in even_test_pos:
        fp = sha2path[sha]
        f.write(fp + ' 1\n')
    for sha in even_test_neg:
        fp = sha2path[sha]
        f.write(fp + ' 0\n')
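
To confirm that a written listing really is balanced, the labels can be counted back from the file. The helper below is a sketch that assumes the "path label" line format used above; count_labels is a hypothetical name.

```python
from collections import Counter


def count_labels(listing_path):
    """Count how many lines of a '<path> <label>' listing carry each label."""
    counts = Counter()
    with open(listing_path) as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # Split on the last space so paths containing spaces still parse
            _, label = line.rsplit(' ', 1)
            counts[int(label)] += 1
    return counts
```

For the even listings, count_labels(EVEN_TRAIN_TXT) should report the same number of 0 and 1 labels, both equal to even_train_size.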

Running Caffe training

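To start model fine-tuning from a pretrained base model (with -sigint_effect snapshot, sending SIGINT, e.g. Ctrl-C, writes a snapshot):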

caffe/build/tools/caffe train -sigint_effect snapshot -solver solver.prototxt -weights <base_model>

To resume a training run from a previously saved snapshot (a .solverstate file) instead:

caffe/build/tools/caffe train -sigint_effect snapshot -solver solver.prototxt -snapshot <snapshot_file>
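
When resuming, the most recent snapshot is usually the one wanted. The sketch below picks it by iteration number, assuming snapshots follow Caffe's default "<snapshot_prefix>_iter_<N>.solverstate" naming (HDF5 snapshots use a different suffix and are not handled). latest_solverstate and its arguments are hypothetical names.

```python
import os
import re


def latest_solverstate(snapshot_dir, prefix):
    """Return the path of '<prefix>_iter_<N>.solverstate' with the largest N, or None."""
    pattern = re.compile(re.escape(prefix) + r'_iter_(\d+)\.solverstate$')
    best_iter, best_path = -1, None
    for name in os.listdir(snapshot_dir):
        match = pattern.match(name)
        # Compare iteration counts numerically, not as strings
        if match and int(match.group(1)) > best_iter:
            best_iter = int(match.group(1))
            best_path = os.path.join(snapshot_dir, name)
    return best_path
```

The returned path can then be passed as the -snapshot argument in the command above.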