The goal of this project is to discover aesthetically interesting digital video feedback processes by incorporating learned features into a hand-constructed feedback process.
Consider a video feedback process defined by the mapping from images to images $x_t = \Delta_\phi(x_{t-1})$, where $\Delta$ is a transition function, $\phi$ is a parameterization which may be spatially varying or interactively controlled, and $x_t$ is the image at time step $t$.
Additionally suppose we have a deep autoencoder $\gamma$ for images: $$h^{\ell+1} = \gamma_\ell(h^\ell)$$ $$h^{\ell} \approx \gamma_\ell^{-1}(h^{\ell+1})$$ $$h^0 = x$$
Combining these two concepts, we can define a new feedback process where position in the feature hierarchy acts like another spatial dimension: $$h_t^\ell = \Delta_\phi( h_{t-1}^\ell, \gamma_{\ell-1}(h_{t-1}^{\ell-1}), \gamma_\ell^{-1}(h_{t-1}^{\ell+1}) )$$
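As a concrete sketch of one time step of this process, with `encode`, `decode`, and `delta` as hypothetical stand-ins for the $\gamma_\ell$, $\gamma_\ell^{-1}$, and $\Delta_\phi$ above:
In [ ]:
#one time step of the combined feedback process (hypothetical sketch);
#h is a list of activations h[0..L], encode[l] computes gamma_l,
#decode[l] computes gamma_l^{-1}, and delta computes Delta_phi
def step(h, encode, decode, delta):
    h_new = []
    for l in range(len(h)):
        same = h[l]                                          #this level, last time step
        below = encode[l-1](h[l-1]) if l > 0 else None       #rising from the layer below
        above = decode[l](h[l+1]) if l < len(h)-1 else None  #descending from the layer above
        h_new.append(delta(same, below, above))
    return h_new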
The goal then is to learn a deep autoencoder which represents abstract image features and admits layer-wise encoding and decoding as above. I propose a convolutional pooling autoencoder based on the convolutional autoencoders of Masci et al. and the upsampling layers of Long et al.
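To make the shapes concrete, here is a minimal NumPy/SciPy sketch of the forward pass of one such layer. This is a hypothetical stand-in for the caffe layers trained below, with nearest-neighbor upsampling standing in for the learned upsampling of Long et al.:
In [ ]:
#minimal sketch of one pooled convolutional autoencoder layer (not the caffe model below)
import numpy as np
from scipy.signal import correlate2d

def encode_layer(x, W, b):
    #x: (H, W, C_in) image; W: (C_out, C_in, k, k) filters; b: (C_out,) biases
    maps = np.stack([np.tanh(sum(correlate2d(x[:,:,c], W[f,c], mode='same')
                                 for c in range(x.shape[2])) + b[f])
                     for f in range(W.shape[0])], axis=-1)
    #2x2 max pooling
    H, Wd, F = maps.shape
    return maps.reshape(H//2, 2, Wd//2, 2, F).max(axis=(1, 3))

def decode_layer(h, W, b):
    #undo pooling by nearest-neighbor upsampling, then convolve again
    up = h.repeat(2, axis=0).repeat(2, axis=1)
    return np.stack([np.tanh(sum(correlate2d(up[:,:,c], W[f,c], mode='same')
                                 for c in range(up.shape[2])) + b[f])
                     for f in range(W.shape[0])], axis=-1)
With `W` of shape (12, 3, 3, 3), a 32x32x3 input encodes to a 16x16x12 representation, so no total dimensionality is lost to the 2x2 pooling.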
Below I have trained a single-layer pooled convolutional autoencoder on the CIFAR-10 dataset using caffe. The code is available on my GitHub. I use a filter size of 3x3x3 and 2x2 max pooling. For this experiment, the data dimensionality is preserved in the intermediate representation by using 12 filters (3 input colors x the factor of 4 lost to pooling). I trained on the L2 reconstruction error with momentum but no other regularization. Test error decreased consistently from about 100 at random initialization to about 1.3.
In [4]:
#get caffe and pycaffe set up
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
%matplotlib inline
#assuming feature-feedback repo and caffe root are in the same directory
caffe_root = '../../caffe/'
import sys
sys.path.insert(0, caffe_root+'python')
import caffe
from caffe.proto import caffe_pb2
#I have compiled caffe for CPU only (GPU mode requires an NVIDIA GPU)
caffe.set_mode_cpu()
In [ ]:
# L2 reconstruction error for images may not be a fantastic idea in RGB colorspace;
# we may want to preprocess the data by converting to CIELUV or something
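If we pursued that, a minimal sketch of the conversion using scikit-image (assuming its `rgb2luv`; any colorspace library would do):
In [ ]:
#hypothetical preprocessing sketch: convert to CIELUV before computing L2 error,
#so equal distances better match equal perceived color differences
from skimage import color
rgb = np.random.rand(32, 32, 3)  #stand-in for a CIFAR image, floats in [0,1]
luv = color.rgb2luv(rgb)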
In [ ]:
#run this cell to solve the model defined in the solver_file
solver_file = 'autoencoder-0-solver.prototxt'
solver = caffe.get_solver(solver_file);
solver.solve();
In [24]:
#load the model trained by the previous cell
#(and saved elsewhere in the repo) and set it up on test data
model_def_file = 'autoencoder-0.prototxt'
model_file = '../bin/cifar-tanh-20epoch-unregularized.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [9]:
#load the cifar mean into numpy array
blob = caffe_pb2.BlobProto()
data = open('../../caffe/examples/cifar10/mean.binaryproto', 'rb').read()
blob.ParseFromString(data)
mean = caffe.io.blobproto_to_array(blob)[0].transpose([1,2,0])/256
In [7]:
def get_reconstructions(net, mean, n, compare=0):
    inputs = np.hstack([np.copy(net.blobs['data'].data[i]).transpose([1,2,0]) + mean for i in range(n)])
    outputs = np.hstack([np.copy(net.blobs['decode1neuron'].data[i]).transpose([1,2,0]) + mean for i in range(n)])
    #clamp the reconstruction to [0,1];
    #even with tanh activations, outputs can be out of bounds once the mean is added back
    np.clip(outputs, 0, 1, outputs)
    #compare to cubic resampling through the intermediate spatial resolution;
    #this is a good baseline for how well spatial information is stored and
    #recovered by the convolutional layers
    if compare > 0:
        comparisons = np.dsplit(np.copy(inputs), inputs.shape[2])
        comparisons = [scipy.ndimage.zoom(np.squeeze(c), 1./compare, order=3) for c in comparisons]
        comparisons = [scipy.ndimage.zoom(c, compare, order=3) for c in comparisons]
        comparisons = np.dstack(comparisons)
        np.clip(comparisons, 0, 1, comparisons)
        return (inputs, outputs, comparisons)
    return (inputs, outputs)

def vis_reconstructions(rec):
    disp = np.vstack(rec)
    plt.imshow(disp, interpolation='none')
In [60]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)
CIFAR-10 test inputs on top, reconstructions in the middle, cubic interpolation comparison on the bottom. Looks good!
In [26]:
def get_filters(net, layer='encode1'):
    #move the input-channel axis innermost for display
    filters = np.copy(net.params[layer][0].data).transpose([0,2,3,1])
    biases = np.copy(net.params[layer][1].data)
    print(biases)
    return filters

def vis_filters(filters, rows):
    #normalize, preserving 0 = 50% gray
    filters /= 2*abs(filters).max()
    filters += .5
    disp = np.hstack([np.pad(f, [(1,1),(1,1),(0,0)], 'constant', constant_values=[.5]) for f in filters])
    disp = np.vstack(np.hsplit(disp, rows))
    return disp
In [154]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')
It looks like the network mostly learned localized primary- and secondary-color detectors. Weird! These aren't the usual edge filters, but they do at least seem to have some plausible structure.
In [5]:
def get_responses(net, layer, filts, n):
    reps = np.hstack([net.blobs[layer].data[i].transpose([1,2,0]) for i in range(n)])
    #normalize, preserving 0 = 50% gray
    reps /= 2*abs(reps).max()
    reps += .5
    reps = np.vstack(np.dsplit(reps, filts))
    return reps.squeeze()

def vis_responses(reps):
    plt.figure(figsize=(10,10))
    plt.imshow(reps, interpolation='none', cmap='coolwarm')
In [156]:
reps = get_responses(net, 'pool1', 12, 8)
vis_responses(reps)
Pooled activations for each of 12 filters. Red is positive response, blue negative.
In [3]:
solver_file = 'autoencoder-1-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve()
In [66]:
model_def_file = 'autoencoder-1.prototxt'
model_file = '../bin/cifar-tanh-20epoch-squeezing.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [67]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)
This time there's some clear loss of detail. The filters are doing something, though; it looks better than cubic interpolation.
In [24]:
filters = get_filters(net)
disp = vis_filters(filters, 2)
plt.imshow(disp, interpolation='none')
These filters appear to be learning color gradients in a subtractive color space.
In [31]:
reps = get_responses(net, 'pool1', 6, 8)
vis_responses(reps)
In [ ]:
solver_file = 'autoencoder-2-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve('autoencoder-2_iter_20000.solverstate')
In [3]:
model_def_file = 'autoencoder-2.prototxt'
#model_file = '../bin/cifar-tanh-20epoch-squeezing-pool3.caffemodel'
model_file = 'autoencoder-2_iter_20000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [10]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)
This looks even worse than the dimensionality-reducing version. By 40 epochs, training had slowed to a crawl.
In [12]:
filters = get_filters(net)
disp = vis_filters(filters, 6)
plt.imshow(disp, interpolation='none')
These filters look like noisy edge detectors. Something prevented the training from finding a good minimum.
In [14]:
reps = get_responses(net, 'pool1', 18, 8)
vis_responses(reps)
In [ ]:
solver_file = 'autoencoder-6-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve('autoencoder-6_iter_10000.solverstate')
In [10]:
model_def_file = 'autoencoder-6.prototxt'
model_file = 'autoencoder-6_iter_20000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [11]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)
In [12]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')
Interesting: these look like 3x3 filters with a random fringe. Curiously, this network learned better than the first architecture above, even though the extra pixels appear to be wasted. Perhaps it got a better random initialization, or the filter noisiness acts as a kind of regularization. It may have learned small filters because the reconstruction filter size was too small. Let's bump that up too:
In [9]:
solver_file = 'autoencoder-7-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve('autoencoder-7_iter_10000.solverstate')
In [5]:
model_def_file = 'autoencoder-7.prototxt'
model_file = 'autoencoder-7_iter_40000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [10]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)
In [27]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')
The more expressive decoder did reduce error, and visual fidelity is now very close to perfect. It did not change the noisy-fringed character of the learned filters. The center filters mostly come in pairs which appear to be mirrors, rotations, and/or color inverses. Neat!
In [16]:
reps = get_responses(net, 'pool1', 12, 8)
vis_responses(reps)
We could keep going to 7x7 encoders and 8x8 decoders, but at some point I expect larger filters to have trouble with CIFAR since the images are so tiny. With 7x7 filters, about a third of all convolution windows on a 32x32 image will include some padding.
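A quick check of that fraction, assuming 'same'-style convolution so every output position corresponds to a 7x7 window centered on an input pixel:
In [ ]:
#fraction of 7x7 convolution windows on a 32x32 image that overlap the zero padding
k, s = 7, 32
interior = (s - (k-1))**2        #windows that fit entirely inside: 26*26 = 676
print(1 - interior/float(s*s))   #0.33984375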
In [37]:
def dump_to_img(net, nlayers):
    for l in range(1, nlayers+1):
        encode_name = 'encode'+str(l)
        decode_name = 'decode'+str(l)
        #move the source channel to the innermost dimension
        filters = np.copy(net.params[encode_name][0].data).transpose([0,2,3,1])
        biases = np.copy(net.params[encode_name][1].data)
        np.save(encode_name+'-filters', filters)
        np.save(encode_name+'-biases', biases)
        filters = np.copy(net.params[decode_name][0].data).transpose([1,2,3,0])
        biases = np.copy(net.params[decode_name][1].data)
        np.save(decode_name+'-filters', filters)
        np.save(decode_name+'-biases', biases)
In [38]:
dump_to_img(net, 1)
In [39]:
#get_filters(net)
np.copy(net.params['decode1'][0].data).transpose([1,2,3,0])
In [ ]:
solver_file = 'autoencoder-8-solver.prototxt'
solver = caffe.get_solver(solver_file)
#initialize the first layer with previously trained weights
#first let's try stacking with the lower weights frozen
pre_net = caffe.Net('autoencoder-7.prototxt', 'autoencoder-7_iter_40000.caffemodel', caffe.TEST)
for layer in ['encode1', 'decode1']:
    solver.net.params[layer][0].data[:] = pre_net.params[layer][0].data
    solver.net.params[layer][1].data[:] = pre_net.params[layer][1].data
solver.solve()
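The freezing itself isn't visible here; it presumably lives in the net prototxt as per-layer learning-rate multipliers. As a hypothetical sketch of that mechanism using the protobuf API (file names assumed):
In [ ]:
#hypothetical sketch: freeze encode1/decode1 by zeroing their learning-rate
#multipliers in the net definition (normally done by editing the prototxt directly)
from google.protobuf import text_format
net_param = caffe_pb2.NetParameter()
text_format.Merge(open('autoencoder-8.prototxt').read(), net_param)
for layer in net_param.layer:
    if layer.name in ('encode1', 'decode1'):
        del layer.param[:]          #drop any existing param specs
        layer.param.add(lr_mult=0)  #freeze the weights
        layer.param.add(lr_mult=0)  #freeze the biases
with open('autoencoder-8-frozen.prototxt', 'w') as f:
    f.write(text_format.MessageToString(net_param))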
In [7]:
model_def_file = 'autoencoder-8.prototxt'
#model_file = '../bin/cifar-tanh-20epoch-2layer.caffemodel'
model_file = 'autoencoder-8_iter_40000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [8]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)
The loss for the new layer went pretty low, but the overall reconstruction error is high. Let's try fine-tuning all of the weights with the original L2 reconstruction error:
In [2]:
solver_file = 'autoencoder-9-solver.prototxt'
solver = caffe.get_solver(solver_file)
#initialize the first layer with previously trained weights
#this time bring over all the parameters
pre_net = caffe.Net('autoencoder-8.prototxt', 'autoencoder-8_iter_40000.caffemodel', caffe.TEST)
for layer in ['encode1', 'decode1', 'encode2', 'decode2']:
    solver.net.params[layer][0].data[:] = pre_net.params[layer][0].data
    solver.net.params[layer][1].data[:] = pre_net.params[layer][1].data
solver.solve()
In [6]:
model_def_file = 'autoencoder-9.prototxt'
model_file = '../bin/cifar-tanh-60epoch-2layer-finetuned-dualobjective.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [8]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)
Fine-tuning reduced both parts of the loss, but the result still looks much worse than the single-layer model.
In [7]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')
Fine-tuning the first layer appears to have corrupted the nice filters we had before.
In [9]:
#map triples of filters to colors
reps = get_responses(net, 'pool2', 16, 8)
vis_responses(reps)
In [17]:
solver_file = 'autoencoder-4-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve()
In [18]:
model_def_file = 'autoencoder-4.prototxt'
model_file = '../bin/cifar-tanh-40epoch-3layer.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [19]:
rec = get_reconstructions(net, mean, 8, compare=8)
vis_reconstructions(rec)
Again, we are recovering a lot of spatial detail. Deeper is still worse; it will be interesting to see whether heavier training improves the situation.
In [20]:
#map triples of filters to colors
reps = get_responses(net, 'pool3', 64, 8)
vis_responses(reps)
In [21]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')
These first-layer filters are hard to interpret, but they do seem to have some internal color coordination and symmetry. Most deep convolutional architectures start with a large number of filters; maybe having just 4x the number of colors asks each filter to do too many things. Then again, maybe that isn't a problem for anything besides filter visualization.
Using AdaGrad gave similar error but much more random-looking filters, as seen above.
ReLUs have helped to train very deep networks. For a classifier, it's not a problem to have zero-mean inputs but nonnegative hidden and output layers. For this application, though, we rely on the hidden layers having the same image-like properties as the input. Can we get rid of the mean subtraction and use a nonnegative image representation with ReLU instead of tanh units? Let's start back at the 1-layer, dimensionality-preserving autoencoder:
In [13]:
solver_file = 'autoencoder-5-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve()
In [14]:
model_def_file = 'autoencoder-5.prototxt'
model_file = '../bin/cifar-relu-20epoch.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
In [15]:
rec = get_reconstructions(net, np.zeros(mean.shape), 8, compare=2)
vis_reconstructions(rec)
The ReLU units work with either a reduced learning rate, or an increased learning rate combined with the AdaGrad solver, though still not quite as well as the tanh units.
In [16]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')