Feature Learning for Video Feedback

Victor Shepardson, Dartmouth College Digital Musics

Introduction

The goal of this project is to discover aesthetically interesting digital video feedback processes by incorporating learned features into a hand-constructed feedback process.

Consider a video feedback process defined by the mapping from images to images $x_t = \Delta_\phi(x_{t-1})$, where $\Delta$ is a transition function, $\phi$ is a parameterization which may be spatially varying or interactively controlled, and $x_t$ is the image at time step $t$.

Additionally suppose we have a deep autoencoder $\gamma$ for images: $$h^{\ell+1} = \gamma_\ell(h^\ell)$$ $$h^{\ell} \approx \gamma_\ell^{-1}(h^{\ell+1})$$ $$h^0 = x$$

Combining these two concepts, we can define a new feedback process where position in the feature hierarchy acts like another spatial dimension: $$h_t^\ell = \Delta_\phi( h_{t-1}^\ell, \gamma_{\ell-1}(h_{t-1}^{\ell-1}), \gamma_\ell^{-1}(h_{t-1}^{\ell+1}) )$$
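
As a minimal sketch of one update step (in Python, with placeholder lists encode and decode standing in for the learned maps $\gamma_\ell$ and $\gamma_\ell^{-1}$, and delta for the hand-constructed transition $\Delta_\phi$; the boundary layers simply receive None for the missing input):

def step(h, encode, decode, delta):
    #h is a list of per-layer states h[0..depth]; returns the states for the next time step
    depth = len(h) - 1
    h_next = []
    for l in range(depth + 1):
        lateral = h[l]                                          #h_{t-1}^l
        bottom_up = encode[l - 1](h[l - 1]) if l > 0 else None  #gamma_{l-1}(h_{t-1}^{l-1})
        top_down = decode[l](h[l + 1]) if l < depth else None   #gamma_l^{-1}(h_{t-1}^{l+1})
        h_next.append(delta(lateral, bottom_up, top_down))
    return h_next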

The goal, then, is to learn a deep autoencoder which represents abstract image features and admits layer-wise encoding and decoding as above. Constraints on my model are that inference run in real time on commodity hardware and that it interoperate easily with other real-time systems.

Prior Work

Many authors have used autoencoders for unsupervised feature learning from images. Masci et al. defined the convolutional autoencoder (CAE). My model closely resembles their CAE, except that they use a kind of pseudo-pooling layer which zeros non-maximal filter responses but does not destroy spatial information. They stack their autoencoders with the goal of producing a classifier, but do not employ the reconstruction as a generative model. Le used a sparse deep autoencoder for feature extraction. Le's model used local receptive fields but was not convolutional; it was also not a deep autoencoder in the end-to-end sense, but rather stacked sparse autoencoders together with pooling and LCN layers.

The Model

The idea is to move information out of the spatial dimensions and into the feature dimension with each layer. This process should be restricted enough to prevent learning a trivial encoder/decoder. I also imposed the requirement that each hidden representation fit into an RGB texture of the same dimensions as the input layer. Therefore, the number of channels in each feature map increases by the same factor as the spatial resolution decreases due to pooling. Reconstruction must interpolate the low-resolution feature map; intuition says that a learned filter bank can exploit context to do better than naive interpolation. To accomplish this, I propose a convolutional pooling autoencoder based on the convolutional autoencoders of Masci et al. and the upsampling layers of Long et al.
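
To make the bookkeeping concrete (the numbers below are illustrative, assuming 2x2 pooling and 32x32 RGB inputs as in CIFAR-10): each unit halves the spatial resolution and quadruples the channel count, so every hidden representation holds exactly as many values as the input image and can be packed into a texture of the original size.

#illustrative shape bookkeeping: channels grow by the same factor (4x for 2x2 pooling)
#that the spatial resolution shrinks, so each hidden layer has 3*32*32 = 3072 values
channels, height, width = 3, 32, 32
for layer in range(1, 4):
    channels, height, width = channels * 4, height // 2, width // 2
    print('layer %d: %d x %d x %d = %d values' % (layer, channels, height, width, channels * height * width))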

My model is formed by stacking autoencoding units which consist of encoding, pooling, and decoding layers. Encoding layers are convolutional filter banks with a tanh activation function. Pooling layers are either max or mean pooling over non-overlapping regions. Decoding layers are reverse convolution layers: the filter size corresponds to a footprint on the output, and a stride of 2 means upsampling by a factor of 2. They also use a tanh activation function. Images are treated as bipolar signals by subtracting the mean over the training set.
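
A single autoencoding unit can be sketched with pycaffe's NetSpec layer helpers roughly as follows (the actual prototxt definitions are in the GitHub repo; the layer names and sizes here are only illustrative):

from caffe import layers as L, params as P

def autoencoding_unit(bottom, channels_in, pool_method=P.Pooling.MAX):
    #encoding: convolutional filter bank with tanh activation
    encode = L.TanH(L.Convolution(bottom, num_output=channels_in * 4, kernel_size=3, pad=1))
    #pooling: non-overlapping 2x2 regions (max or mean)
    pool = L.Pooling(encode, kernel_size=2, stride=2, pool=pool_method)
    #decoding: "reverse convolution" (deconvolution) with stride 2,
    #upsampling the pooled feature map back to the input resolution
    decode = L.TanH(L.Deconvolution(pool, convolution_param=dict(num_output=channels_in, kernel_size=2, stride=2)))
    return encode, pool, decode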

Experiments

I performed a number of experiments training pooled convolutional autoencoders on the CIFAR-10 dataset using caffe. The IPython notebook and caffe model definitions are available at my GitHub. All experiments used a momentum term of 0.9 and a batch size of 100. I found weight decay to be unnecessary. Learning rates were mostly set to 0.001, sometimes with a policy to decrease them every 10 or 20 epochs. Models were trained for 80 epochs (40,000 iterations at batch size 100) or until they appeared to converge.
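
For reference, these settings correspond roughly to the following SolverParameter (constructed here through the protobuf interface; the real experiments used solver .prototxt files, and the step schedule and decay factor shown are illustrative):

from caffe.proto import caffe_pb2

solver = caffe_pb2.SolverParameter()
solver.net = 'autoencoder-15.prototxt'  #model definition
solver.base_lr = 0.001                  #learning rate used in most runs
solver.momentum = 0.9
solver.weight_decay = 0.0               #weight decay proved unnecessary
solver.lr_policy = 'step'               #optionally decay the rate every 10-20 epochs
solver.stepsize = 10000                 #20 epochs at batch size 100 on 50k images
solver.gamma = 0.5                      #illustrative decay factor
solver.max_iter = 40000                 #80 epochs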

In experiments with single autoencoding units, I found reconstruction quality to increase with filter size. Increasing the reconstruction filter size while holding the encoding filter size fixed also improved reconstruction. Mean pooling improved reconstruction over max pooling, but somewhat decreased filter plausibility; the encodings may be less meaningful.

It was virtually impossible to overfit the single-layer models. This is unsurprising, since CIFAR-10 is large compared to both the size of its images and the number of parameters in the first layer.

I was able to train multiple layers with some success by stacking and training from scratch. More experiments are needed to draw conclusions about how effective or necessary stacking is. I tried using a double objective to fine-tune a stacked two-layer model, and found it difficult to train. When fine-tuning end-to-end, it seems that my autoencoders naturally become "asymmetrical": corresponding encoded and decoded hidden layers have a high L2 difference, except for the input and output layers.

Graphics

I reimplemented inference for my network as a set of fragment shaders using OpenGL and the openFrameworks creative coding framework. The motivation was to make the process more portable and easier to integrate with sound and graphics in a real-time improvisation or performance environment, and to avoid a dependence on caffe. In retrospect, OpenCL may have been a better choice, but I also wanted to avoid the learning curve and potential lack of integration with openFrameworks.

The number of parameters is exponential in the number of layers, and the cost of inference is linear in the number of parameters. To scale up real-time inference for deep models, it's possible to subsample the deeper layers in time. I have implemented a simple scheme of running the feedback, inference, and decoding shaders for layer $\ell$ every $2^\ell$ frames, staggered so as not to run on the same frame. This is only a partial solution; it still leads to stutter on expensive frames. It would be better to distribute the computation of higher layers across frames and temporally smooth the results at the primary frame rate.
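
The scheduling logic is simple enough to sketch (in Python for clarity; the real version lives in the openFrameworks update loop, and the offsets shown are just one possible way to stagger the layers):

#layer l runs every 2**l frames, offset by 2**(l-1) so the higher layers
#tend not to land on the same frame as each other
def layers_to_update(frame, num_layers):
    return [l for l in range(num_layers) if (frame - (2 ** l) // 2) % (2 ** l) == 0]

#e.g. frame 4 updates layers [0, 3], frame 6 updates layers [0, 2]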

Another useful optimization is to drop max pooling at inference time: using a strided convolution instead saves a factor of the pooling area. With these optimizations, the system runs at about 22 fps at 400x400 resolution with 3x3 encoding, 2x2 pooling, 2x2 decoding, and 3 hidden layers on a mid-range GPU (Radeon HD 7790). This is a bit disappointing, but there is probably room for further optimization in the shaders. It might also be worth limiting receptive fields in the feature dimension.

To use learned parameters in OpenGL, I used pycaffe to save caffe's memory blobs as numpy files and used Carl Rogers's cnpy to read them from C++. I found that random filters make for a variety of interesting feedback processes. It is harder to find a good feedback process for learned filters; it's easy to reach an equilibrium or to highlight reconstruction artifacts.
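
The export step amounts to a few lines of pycaffe (the filename suffixes are illustrative; cnpy then loads the .npy files on the C++ side):

import numpy as np
import caffe

net = caffe.Net('autoencoder-15.prototxt', 'autoencoder-15-finetune_iter_30000.caffemodel', caffe.TEST)
#dump every layer's filters and biases as .npy files for the shaders to use
for name, params in net.params.items():
    np.save(name + '_filters.npy', params[0].data)  #weights, shape (out, in, h, w)
    np.save(name + '_biases.npy', params[1].data)   #biases, shape (out,)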

Future Work

An important experiment I didn't get to was investigating how meaningful the top-layer representations are. A good test would be to train a linear classifier for CIFAR-10 or CIFAR-100 on the top-layer features. I suspect that mean pooling would perform worse than max pooling despite its better reconstruction error.
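
That experiment would only take a few lines once the top-layer encodings are extracted; a sketch with scikit-learn (the feature and label arrays are assumed to come from running the encoder over CIFAR):

from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    #train a linear classifier on flattened top-layer encodings and report test accuracy
    clf = LogisticRegression()
    clf.fit(train_feats.reshape(len(train_feats), -1), train_labels)
    return clf.score(test_feats.reshape(len(test_feats), -1), test_labels)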

I consider the convnet-inspired random convolutional feedback process a success. The potential to use learned filters in an interesting way is still unrealized. It might be better to use an inherently generative model like that of Lee et al., or to use a discriminative model together with a more indirect feedback process. Another direction could be to define an aesthetic reward function and explicitly define the feedback process as an RNN.


In [4]:
#don't try to run this notebook yourself; see experiments.ipynb instead
model_def_file = 'autoencoder-15.prototxt'
model_file = 'autoencoder-15-finetune_iter_30000.caffemodel'
net = caffe.Net(model_def_file, model_file, caffe.TEST)
#run a batch
net.forward()
rec = get_reconstructions(net, mean, 8, compare=4)
vis_reconstructions(rec)

In [3]:
#setup code for the above experiments

#get caffe and pycaffe set up
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
%matplotlib inline

#assuming feature-feedback repo and caffe root are in the same directory
caffe_root = '../../caffe/'
import sys
sys.path.insert(0, caffe_root+'python')

import caffe
from caffe.proto import caffe_pb2
#caffe is compiled for CPU only here (its GPU mode requires an NVIDIA card)
caffe.set_mode_cpu()

#load the cifar mean into numpy array
blob = caffe_pb2.BlobProto()
data = open('../../caffe/examples/cifar10/mean.binaryproto', 'rb').read()
blob.ParseFromString(data)
mean = caffe.io.blobproto_to_array(blob)[0].transpose([1,2,0])/256

def get_reconstructions(net, mean, n, compare=0):
    inputs = np.hstack([ np.copy(net.blobs['data'].data[i]).transpose([1,2,0])+mean for i in range(n)])
    outputs = np.hstack([ np.copy(net.blobs['decode1neuron'].data[i]).transpose([1,2,0])+mean for i in range(n)])
    #clamp the reconstruction to [0,1]
    #even with tanh activation outputs can be out of bounds once mean is added back
    np.clip(outputs, 0, 1, outputs)
    #compare to cubic resampling through the intermediate spatial resolution
    #this is a good baseline for how well spatial information is stored and 
    #recovered by the convolutional layers
    if compare>0:
        comparisons = np.dsplit(np.copy(inputs), inputs.shape[2])
        comparisons = [scipy.ndimage.zoom(np.squeeze(c), 1./compare, order=3) for c in comparisons]
        comparisons = [scipy.ndimage.zoom(c, compare, order=3) for c in comparisons]
        comparisons = np.dstack(comparisons)
        np.clip(comparisons, 0, 1, comparisons)
        return (inputs, outputs, comparisons)
    return (inputs, outputs)
def vis_reconstructions(rec):
    disp = np.vstack(rec)
    plt.imshow(disp, interpolation='None')
    
def get_filters(net, layer = 'encode1'):
    filters = np.copy(net.params[layer][0].data).transpose([0,2,3,1])
    biases = np.copy(net.params[layer][1].data)
    print(biases)
    return filters
def vis_filters(filters, rows):
    #normalize preserving 0 = 50% gray
    filters/=2*abs(filters).max()
    filters+=.5
    disp = np.hstack([np.pad(f,[(1,1),(1,1),(0,0)],'constant', constant_values=[.5]) for f in filters])
    disp = np.vstack(np.hsplit(disp,rows))
    return disp

def get_responses(net, layer, filts, n):
    reps = np.hstack([ net.blobs[layer].data[i].transpose([1,2,0]) for i in range(n)])
    # normalize preserving 0 = 50% gray
    reps/=2*abs(reps).max()
    reps+=.5
    reps = np.vstack(np.dsplit(reps, filts))
    return reps.squeeze()    
def vis_responses(reps):
    plt.figure(figsize=(10,10))
    plt.imshow(reps, interpolation='none', cmap='coolwarm')
