Feature Learning for Video Feedback

Victor Shepardson

Dartmouth College Digital Musics

The goal of this project is to discover aesthetically interesting digital video feedback processes by incorporating learned features into a hand-constructed feedback process.

Consider a video feedback process defined by the mapping from images to images $x_t = \Delta_\phi(x_{t-1})$, where $\Delta$ is a transition function, $\phi$ is a parameterization which may be spatially varying or interactively controlled, and $x_t$ is the image at time step $t$.
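For concreteness, here is a toy instance of such a process (my own illustration, not the system described below): $\Delta$ rotates, blurs, and slightly amplifies the previous frame, and $\phi$ is just the (angle, gain) pair.

In [ ]:
import numpy as np
import scipy.ndimage

def delta(x, angle=1.0, gain=1.02):
    #rotate, blur, and amplify the previous frame, clamping to [0,1]
    x = scipy.ndimage.rotate(x, angle, reshape=False, mode='wrap')
    x = scipy.ndimage.gaussian_filter(x, sigma=0.5)
    return np.clip(gain*x, 0, 1)

x = np.random.rand(64, 64)  #seed frame x_0
for t in range(100):        #iterate x_t = delta(x_{t-1})
    x = delta(x)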

Additionally, suppose we have a deep autoencoder $\gamma$ for images: $$h^{\ell+1} = \gamma_\ell(h^\ell)$$ $$h^{\ell} \approx \gamma_\ell^{-1}(h^{\ell+1})$$ $$h^0 = x$$

Combining these two concepts, we can define a new feedback process where position in the feature hierarchy acts like another spatial dimension: $$h_t^\ell = \Delta_\phi\left( h_{t-1}^\ell,\ \gamma_{\ell-1}(h_{t-1}^{\ell-1}),\ \gamma_\ell^{-1}(h_{t-1}^{\ell+1}) \right)$$
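One time step of this process might look like the following sketch, where the hypothetical encode[l], decode[l], and delta callables stand in for $\gamma_\ell$, $\gamma_\ell^{-1}$, and $\Delta_\phi$, and the boundary levels reuse their own state where a neighbor is missing:

In [ ]:
def step(h, encode, decode, delta):
    #h: list of states h[0..L], with h[0] the image
    #encode[l] maps level l up to level l+1; decode[l] maps level l+1 down to l
    new = []
    for l in range(len(h)):
        below = encode[l-1](h[l-1]) if l > 0 else h[l]
        above = decode[l](h[l+1]) if l < len(h)-1 else h[l]
        new.append(delta(h[l], below, above))
    return new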

The goal then is to learn a deep autoencoder which represents abstract image features and admits layer-wise encoding and decoding as above. I propose a convolutional pooling autoencoder based on the convolutional autoencoders of Masci et al. and the upsampling layers of Long et al.
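As a rough picture of what one such layer computes, here is a single-channel numpy sketch (assumed shapes and a toy upsampling scheme; not the caffe model trained below):

In [ ]:
import numpy as np
from scipy.signal import convolve2d

def encode(x, W, b):
    #convolve with each 2D filter in the bank W, apply tanh, then 2x2 max pool
    h = np.dstack([np.tanh(convolve2d(x, W[:,:,i], mode='same') + b[i])
                   for i in range(W.shape[2])])
    r, c, n = h.shape
    return h[:r//2*2, :c//2*2].reshape(r//2, 2, c//2, 2, n).max(axis=(1,3))

def decode(p, V, c):
    #upsample 2x then convolve back to one channel: a crude stand-in for
    #the learned upsampling (deconvolution) layers of Long et al.
    up = p.repeat(2, axis=0).repeat(2, axis=1)
    return sum(convolve2d(up[:,:,i], V[:,:,i], mode='same')
               for i in range(up.shape[2])) + c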

Below are a number of experiments training pooled convolutional autoencoders on the CIFAR-10 dataset using caffe. The caffe model definitions are available on my GitHub.


In [3]:
#get caffe and pycaffe set up

import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
%matplotlib inline

#assuming feature-feedback repo and caffe root are in the same directory
caffe_root = '../../caffe/'
import sys
sys.path.insert(0, caffe_root+'python')

import caffe
from caffe.proto import caffe_pb2
#I compiled caffe for CPU only (GPU mode requires an NVIDIA card)
caffe.set_mode_cpu()

In [4]:
#load the CIFAR-10 mean image into a numpy array
blob = caffe_pb2.BlobProto()
data = open(caffe_root + 'examples/cifar10/mean.binaryproto', 'rb').read()
blob.ParseFromString(data)
#convert to (height, width, channel) order and scale to [0,1]
mean = caffe.io.blobproto_to_array(blob)[0].transpose([1,2,0])/256

In [37]:
def get_reconstructions(net, mean, n, compare=0):
    inputs = np.hstack([ np.copy(net.blobs['data'].data[i]).transpose([1,2,0])+mean for i in range(n)])
    outputs = np.hstack([ np.copy(net.blobs['decode1neuron'].data[i]).transpose([1,2,0])+mean for i in range(n)])
    #clamp the reconstruction to [0,1]
    #even with tanh activation outputs can be out of bounds once mean is added back
    np.clip(outputs, 0, 1, outputs)
    #compare to linear interpolation through the intermediate spatial resolution
    #this is a good baseline for how well spatial information is stored and 
    #recovered by the convolutional layers
    if compare>0:
        comparisons = np.dsplit(np.copy(inputs), inputs.shape[2])
        comparisons = [scipy.ndimage.zoom(np.squeeze(c), 1./compare, order=3) for c in comparisons]
        comparisons = [scipy.ndimage.zoom(c, compare, order=3) for c in comparisons]
        comparisons = np.dstack(comparisons)
        np.clip(comparisons, 0, 1, comparisons)
        return (inputs, outputs, comparisons)
    return (inputs, outputs)
def vis_reconstructions(rec):
    disp = np.vstack(rec)
    plt.figure(figsize=(10,10))
    plt.imshow(disp, interpolation='none')

In [64]:
def get_filters(net, layer = 'encode1'):
    filters = np.copy(net.params[layer][0].data).transpose([0,2,3,1])
    biases = np.copy(net.params[layer][1].data)
    #print the learned biases for reference (they aren't shown in the filter images)
    print(biases)
    return filters
def vis_filters(filters, rows):
    #normalize preserving 0 = 50% gray
    filters/=2*abs(filters).max()
    filters+=.5
    disp = np.hstack([np.pad(f,[(1,1),(1,1),(0,0)],'constant', constant_values=[.5]) for f in filters])
    disp = np.vstack(np.hsplit(disp,rows))
    return disp

In [35]:
def get_responses(net, layer, filts, n):
    reps = np.hstack([ net.blobs[layer].data[i].transpose([1,2,0]) for i in range(n)])
    # normalize preserving 0 = 50% gray
    reps/=2*abs(reps).max()
    reps+=.5
    reps = np.vstack(np.dsplit(reps, filts))
    return reps.squeeze()    
def vis_responses(reps):
    plt.figure(figsize=(10,10))
    plt.imshow(reps, interpolation='none', cmap='coolwarm')

In [ ]:
#run this cell to train the model defined by solver_file
solver_file = 'autoencoder-0-solver.prototxt'
solver = caffe.get_solver(solver_file);
solver.solve();

In [25]:
#load the model trained by the previous cell
#(and saved elsewhere in the repo) and set it up on test data
model_def_file = 'autoencoder-0.prototxt'
model_file = '../bin/cifar-tanh-20epoch-unregularized.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[25]:
{'l2_error': array(1.4730193614959717, dtype=float32)}

1 autoencoding layer:

12 3x3 filters, 2x2 max pool, 4x4 reconstruction

tanh activation

base_lr: 0.001

momentum: 0.9
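A quick check that these layer dimensions compose back to a 32x32 output (the pads and strides here are my assumptions about the architecture, not quoted from the prototxt):

In [ ]:
def conv_out(n, k, pad, stride): return (n + 2*pad - k)//stride + 1
def deconv_out(n, k, pad, stride): return stride*(n - 1) + k - 2*pad

n = conv_out(32, 3, 1, 1)   #encode1: 3x3 conv, pad 1        -> 32
n = conv_out(n, 2, 0, 2)    #pool1:   2x2 max pool, stride 2 -> 16
n = deconv_out(n, 4, 1, 2)  #decode1: 4x4 deconv, stride 2   -> 32
print(n)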


In [38]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)


Top: CIFAR test inputs

Middle: Reconstructions

Bottom: Cubic interpolation for comparison


In [154]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')


[-0.36862749 -0.1923824  -0.22088723 -0.12164118 -0.10508166 -0.49331141
 -0.53125817 -0.48567948 -0.44001433 -0.39040077 -0.32423615 -0.1737113 ]
Out[154]:
<matplotlib.image.AxesImage at 0x7d4510d2dc50>

Localized color detectors


In [156]:
reps = get_responses(net, 'pool1', 12, 8)
vis_responses(reps)


Out[156]:
<matplotlib.image.AxesImage at 0x7d4510c0da90>

Max pooled activations for each of 12 features. Red is positive response, blue negative.


In [3]:
solver_file = 'autoencoder-1-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve()

In [41]:
model_def_file = 'autoencoder-1.prototxt'
model_file = '../bin/cifar-tanh-20epoch-squeezing.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[41]:
{'l2_error': array(2.5404462814331055, dtype=float32)}

Dimensionality Reduction

1 autoencoding layer:

6 3x3 filters, 2x2 max pool, 4x4 reconstruction

tanh activation

base_lr: 0.001

momentum: 0.9


In [43]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)


Top: CIFAR test inputs; Middle: Reconstructions; Bottom: Cubic interpolation for comparison


In [42]:
filters = get_filters(net)
disp = vis_filters(filters, 2)
plt.imshow(disp, interpolation='none')


[-0.09201549 -0.12697266 -0.11692226 -0.10681173 -0.09015708 -0.11839788]
Out[42]:
<matplotlib.image.AxesImage at 0x7e0858248190>

These filters appear to be learning color gradients in a subtractive color space.



In [44]:
reps = get_responses(net, 'pool1', 6, 8)
vis_responses(reps)


Another Architecture

1 autoencoding layer:

48 5x5 filters, 4x4 max pool, 8x8 reconstruction

tanh activation

base_lr: 0.001

momentum: 0.9


In [ ]:
solver_file = 'autoencoder-2-solver.prototxt'
solver = caffe.get_solver(solver_file)
#resume training from the saved solver state
solver.solve('autoencoder-2_iter_20000.solverstate')

In [45]:
model_def_file = 'autoencoder-2.prototxt'
#model_file = '../bin/cifar-tanh-20epoch-squeezing-pool3.caffemodel'
model_file = 'autoencoder-2_iter_20000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[45]:
{'l2_error': array(5.954926013946533, dtype=float32)}

In [46]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)


This architecture was harder to train; it may be worth revisiting.


In [47]:
filters = get_filters(net)
disp = vis_filters(filters, 6)
plt.imshow(disp, interpolation='none')


[-0.5458473  -0.85954159 -0.10250413 -0.11447078 -0.01339798 -0.31032813
 -0.02660202 -0.77891278  0.04860483  0.17118894 -0.18338218  0.03620156
 -0.21440832 -0.74913269 -0.677212   -0.11030301 -0.01261417 -0.31087366
 -0.56522584 -0.5849033  -0.3219969  -0.14499842  0.04703386 -0.71903616
 -0.64787728 -0.12493432 -0.73442465 -0.80459327 -0.23484692 -0.03186548
  0.02129112 -0.08047031 -0.23331846 -0.149628   -0.72716242 -0.1825075
  0.12050974 -0.21958451 -0.36699957 -0.15040933  0.08417189 -0.1206205
 -0.59321886  0.01488551 -0.28670374 -0.62786371  0.13887869 -0.20581132]
Out[47]:
<matplotlib.image.AxesImage at 0x7e085ae84510>

The filters look like an incomplete set of edge detectors.


In [48]:
reps = get_responses(net, 'pool1', 16, 8)
vis_responses(reps)


Better Reconstruction

1 autoencoding layer:

12 5x5 filters, 2x2 max pool, 4x4 reconstruction

tanh activation

base_lr: 0.001

momentum: 0.9


In [ ]:
solver_file = 'autoencoder-6-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve('autoencoder-6_iter_10000.solverstate')

In [49]:
model_def_file = 'autoencoder-6.prototxt'
model_file = 'autoencoder-6_iter_20000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[49]:
{'l2_error': array(1.175546407699585, dtype=float32)}

In [54]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)



In [51]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')


[-0.12973469 -0.5200218  -0.13307698 -0.24353078 -0.35262129 -0.31913307
 -0.11899948 -0.48395655 -0.23186304 -0.30745643 -0.35858005 -0.30927593]
Out[51]:
<matplotlib.image.AxesImage at 0x7e08585fb510>

Interesting: these look like 3x3 filters with a random fringe. Curiously, this model learned better than the first architecture above, even though the extra pixels appear to be wasted. Perhaps it got a better random initialization, or the noisy fringe acts as a kind of regularization. It may also have learned small filters because the reconstruction filter size was too small. Let's bump that up too:


In [9]:
solver_file = 'autoencoder-7-solver.prototxt'
solver = caffe.get_solver(solver_file)
solver.solve('autoencoder-7_iter_10000.solverstate')

In [52]:
model_def_file = 'autoencoder-7.prototxt'
model_file = 'autoencoder-7_iter_40000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[52]:
{'l2_error': array(0.9563984870910645, dtype=float32)}

Same as the previous architecture, but with 6x6 reconstruction filters


In [55]:
rec = get_reconstructions(net, mean, 8, compare=2)
vis_reconstructions(rec)



In [12]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')


[-0.12125222 -0.08644268 -0.33110973 -0.0643307  -0.09994981 -0.36133596
 -0.30555385 -0.06790387 -0.26083776 -0.45829996 -0.09892169 -0.32911256]
Out[12]:
<matplotlib.image.AxesImage at 0x7ec2880568d0>

The more expressive decoder did reduce error, and visual fidelity is now very close to perfect. It did not change the noisy-fringed character of the learned filters. The center filters mostly come in pairs which appear to be mirrors, rotations and/or color inverses. Neat!


In [66]:
filters = get_filters(net, 'decode1')
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')


[ 0.078804    0.04081469  0.10308732]
Out[66]:
<matplotlib.image.AxesImage at 0x7e085ad99050>

In [58]:
reps = get_responses(net, 'pool1', 12, 8)
vis_responses(reps)


We could keep going to 7x7 encoders and 8x8 decoders, but at some point I expect larger filters to have trouble with CIFAR since the images are so tiny: with 7x7 filters, any window centered within 3 pixels of an edge overlaps the zero padding, which on $32 \times 32$ images is $1 - (26/32)^2 \approx 34\%$ of positions.

Deeper Network

Let's try stacking another layer on the last architecture above. We'll freeze the first encoder/decoder pair and treat the first pooling layer as the input to a new autoencoder.


In [ ]:
solver_file = 'autoencoder-8-solver.prototxt'
solver = caffe.get_solver(solver_file)
#initialize the first layer with previously trained weights
#first let's try stacking with the lower weights frozen
pre_net = caffe.Net('autoencoder-7.prototxt', 'autoencoder-7_iter_40000.caffemodel', caffe.TEST)
for layer in ['encode1', 'decode1']:
    solver.net.params[layer][0].data[:] = pre_net.params[layer][0].data
    solver.net.params[layer][1].data[:] = pre_net.params[layer][1].data
solver.solve()

In [59]:
model_def_file = 'autoencoder-8.prototxt'
#model_file = '../bin/cifar-tanh-20epoch-2layer.caffemodel'
model_file = 'autoencoder-8_iter_40000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[59]:
{'l2_error1': array(5.685671329498291, dtype=float32),
 'l2_error2': array(2.0855603218078613, dtype=float32)}

In [60]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)


The loss for the new layer went pretty low, but the overall reconstruction error is high. Let's try fine-tuning all the weights against both L2 errors:

Fine Tuning with a Dual Objective
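Concretely (my notation; the two terms correspond to l2_error1 and l2_error2 above, presumably with equal weights), the solver now minimizes the sum of both reconstruction errors: $$\mathcal{L} = \lVert x - \hat{x} \rVert^2 + \lVert h^1 - \hat{h}^1 \rVert^2$$ where $h^1$ is the first pooling layer's output and $\hat{x}$, $\hat{h}^1$ are the decoded reconstructions of the image and of $h^1$ respectively.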


In [2]:
solver_file = 'autoencoder-9-solver.prototxt'
solver = caffe.get_solver(solver_file)
#initialize the first layer with previously trained weights
#this time bring over all the parameters
pre_net = caffe.Net('autoencoder-8.prototxt', 'autoencoder-8_iter_40000.caffemodel', caffe.TEST)
for layer in ['encode1', 'decode1', 'encode2', 'decode2']:
    solver.net.params[layer][0].data[:] = pre_net.params[layer][0].data
    solver.net.params[layer][1].data[:] = pre_net.params[layer][1].data
solver.solve()

In [61]:
model_def_file = 'autoencoder-9.prototxt'
model_file = '../bin/cifar-tanh-60epoch-2layer-finetuned-dualobjective.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()


Out[61]:
{'l2_error1': array(3.2483773231506348, dtype=float32),
 'l2_error2': array(0.21920974552631378, dtype=float32)}

In [62]:
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)


Fine tuning reduced both parts of the loss, but the reconstructions still look much worse than the single-layer model's.


In [63]:
filters = get_filters(net)
disp = vis_filters(filters, 3)
plt.imshow(disp, interpolation='none')


[ 0.05174831 -0.02071226 -0.15018784  0.01387358  0.00702072 -0.02757885
 -0.11727612  0.00808964  0.01447153 -0.16562356  0.06779021 -0.09179834]
Out[63]:
<matplotlib.image.AxesImage at 0x7e0858367390>

Fine tuning on the first layer appears to have corrupted the nice filters we had before.



In [9]:
#map triples of filters to colors
reps = get_responses(net, 'pool2', 16, 8)
vis_responses(reps)


