Deep learning as a mixed convex-combinatorial optimization problem

https://arxiv.org/abs/1710.11573

  • Hard-threshold activations have zero derivative almost everywhere, so gradient descent / backpropagation can't be used to train them directly
  • Network quantization -> reduce time and energy requirements
  • Creating large integrated systems that may have non-differentiable components and must avoid vanishing/exploding gradients
  • Setting targets for hard-threshold hidden units so as to minimize the loss is a discrete optimization problem
  • Once targets are set, each unit has its own problem to solve (linearly separable when the targets are feasible) -> the network decomposes into individual perceptrons
  • The straight-through estimator (STE) is a special case of this framework (see the STE sketch after this list)
  • Feasible target propagation (FTPROP): a recursive algorithm for setting targets and updating weights
  • STE can lead to gradient mismatch error: the surrogate derivative used in the backward pass does not correspond to the hard-threshold function actually used in the forward pass
  • A perceptron can be learned even on a non-linearly-separable dataset by minimizing the hinge loss, a convex loss on the perceptron's pre-activation z and target t that, combined with L2 regularization, maximizes the margin (see the sketch after this list)
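
A minimal NumPy sketch of the per-unit convex subproblem described above: once a unit's targets t in {-1, +1} are fixed, its weights can be fit by subgradient descent on an L2-regularized hinge loss over the pre-activation z = Xw. Function and variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def fit_unit_hinge(X, t, lam=1e-2, lr=0.1, epochs=100):
    """Fit one hard-threshold unit by subgradient descent on
    L(w) = mean(max(0, 1 - t * (X @ w))) + lam/2 * ||w||^2.
    X: (n, d) inputs, t: (n,) targets in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        z = X @ w                          # pre-activations
        inside = t * z < 1.0               # examples violating the margin
        grad = -(X[inside] * t[inside, None]).sum(axis=0) / n + lam * w
        w -= lr * grad
    return w

def hard_threshold(X, w):
    """The unit's actual output: the sign of its pre-activation."""
    return np.sign(X @ w)
```

When the targets are feasible (linearly separable), this convex problem recovers a maximum-margin separator; when they are not, it still yields a sensible weight vector, which is why a hinge-style loss is usable per unit.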
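
For comparison, a sketch of the straight-through estimator mentioned above, in its usual generic form (not code from the paper): the forward pass applies the hard sign, while the backward pass substitutes a surrogate derivative of 1 (or 1 on |z| <= 1 in the saturating variant). The gap between this surrogate and the true, almost-everywhere-zero derivative is the gradient mismatch.

```python
import numpy as np

def ste_forward(z):
    """Hard-threshold forward pass."""
    return np.sign(z)

def ste_backward(z, grad_out, saturating=True):
    """Straight-through surrogate gradient w.r.t. the pre-activation z.
    sign(z) has derivative 0 almost everywhere; STE pretends it is 1,
    and the saturating variant zeroes it outside |z| <= 1."""
    if saturating:
        return grad_out * (np.abs(z) <= 1.0)
    return grad_out
```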

FTPROP

  • A convolutional layer imposes structure on the weight matrix -> the units' problems are less likely to be linearly separable, so the feasibility constraint is relaxed
  • FTPROP-MB (the mini-batch variant) closely resembles backpropagation-based methods (see the sketch at the end of this section)
  • Derivatives cannot be propagated through layers but can be computed within a layer
  • Hinge loss per layer -> learning stalled and was erratic due to (1) sensitivity to noise (2)
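
A rough end-to-end sketch of the FTPROP-MB idea, under my own simplifying assumptions (one hidden hard-threshold layer, hinge losses at every layer, hidden targets set to the negative sign of the downstream loss derivative). It is meant only to show how target setting substitutes for propagating gradients through the hard threshold, not to reproduce the paper's exact algorithm or target-setting heuristics.

```python
import numpy as np

def hinge_grad(z, t):
    """Subgradient of the mean hinge loss max(0, 1 - t*z) w.r.t. z."""
    return np.where(t * z < 1.0, -t, 0.0) / z.shape[0]

def ftprop_mb_step(X, y, W1, w2, lr=0.1):
    """One mini-batch update of a one-hidden-layer hard-threshold network.
    X: (n, d) inputs, y: (n,) labels in {-1, +1},
    W1: (d, k) hidden weights, w2: (k,) output weights."""
    # forward pass
    z1 = X @ W1                 # hidden pre-activations, (n, k)
    h1 = np.sign(z1)            # hard-threshold activations
    z2 = h1 @ w2                # output pre-activation, (n,)

    # output layer: ordinary hinge-loss subgradient on its own weights
    g2 = hinge_grad(z2, y)      # dL/dz2, (n,)
    grad_w2 = h1.T @ g2

    # target setting: rather than propagating a gradient through sign(),
    # aim each hidden unit at the value that would decrease the output loss
    dL_dh1 = np.outer(g2, w2)            # (n, k)
    t1 = -np.sign(dL_dh1)
    t1[t1 == 0] = h1[t1 == 0]            # no preference -> keep current output

    # hidden layer: update against its own hinge loss on those targets
    g1 = hinge_grad(z1, t1)
    grad_W1 = X.T @ g1

    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
    return W1, w2
```

Structurally this mirrors backpropagation (one backward sweep and one update per layer per mini-batch), which is the resemblance noted above; swapping a saturating hinge loss into the hidden layer makes its update look like the saturating STE, which is roughly the sense in which STE appears as a special case.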