Deep learning as a mixed convex-combinatorial optimization problem
https://arxiv.org/abs/1710.11573
- Hard-threshold activations have zero derivative almost everywhere, so the network can't be trained by gradient descent alone
- Motivation: network quantization -> reduced inference time and energy requirements
- Also enables large integrated systems that may contain non-differentiable components and must avoid vanishing/exploding gradients
- Setting targets for the hard-threshold hidden units so as to minimize the loss is a discrete optimization problem
- Given feasible targets, each unit has a linearly separable problem to solve -> the network decomposes into individual perceptrons
- The straight-through estimator (STE) is recovered as a special case of this framework (see the sketch after this list)
- Feasible target propagation (FTPROP): a recursive algorithm for setting per-layer targets
- Plain STE can lead to gradient mismatch error: the backward pass treats the hard threshold as (roughly) the identity, which does not match the true forward function
- A perceptron can be learned even for a non-linearly-separable dataset by minimizing the hinge loss, a convex loss on the perceptron's pre-activation z and its target t; combined with L2 regularization this maximizes the margin (sketch below)
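A minimal sketch of that per-unit subproblem: a single hard-threshold unit trained by stochastic subgradient descent on the L2-regularized hinge loss max(0, 1 - t*z) with z = w.x + b. Function name, learning rate, and regularization strength are illustrative, not taken from the paper.

```python
import numpy as np

def train_unit_hinge(X, t, lr=0.1, lam=1e-3, epochs=100, seed=0):
    """Train one hard-threshold unit on targets t in {-1, +1} by minimizing
    the L2-regularized hinge loss max(0, 1 - t*z), where z = w.x + b.
    All names and hyperparameters here are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            z = X[i] @ w + b
            if t[i] * z < 1.0:
                # Margin violated: hinge subgradient is -t*x, plus L2 shrinkage.
                w += lr * (t[i] * X[i] - lam * w)
                b += lr * t[i]
            else:
                # Margin satisfied: only the L2 term contributes.
                w -= lr * lam * w
    return w, b
```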
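And a sketch of the straight-through estimator mentioned above, written as a PyTorch autograd function: the forward pass is sign(x), while the backward pass passes the gradient through as if the unit were (roughly) the identity, which is exactly the forward/backward mismatch noted above. The saturation mask gives the "saturating" STE variant; treat the details as illustrative rather than the paper's exact formulation.

```python
import torch

class SignSTE(torch.autograd.Function):
    """Hard-threshold forward pass with a straight-through backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # The true derivative of sign(x) is zero almost everywhere; STE ignores
        # this and passes the gradient straight through. The mask below gives
        # the saturating variant, zeroing the gradient where |x| > 1.
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

hard_sign = SignSTE.apply  # usage: h = hard_sign(pre_activations)
```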
FTPROP
- A convolutional layer imposes structure on the weight matrix -> its targets are less likely to be linearly separable, so the feasibility constraint is relaxed
- FTPROP-MB (the minibatch version) closely resembles backpropagation-based methods (see the sketch at the end of these notes)
- Derivatives cannot be propagated through the hard-threshold layers, but they can be computed within each layer (each layer gets its own loss and targets)
- Hinge loss as the per-layer loss -> learning stalled and erratic due to (1) its sensitivity to noise, (2)
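A rough sketch of the FTPROP-MB idea for a two-layer network. Both the sign-of-gradient target heuristic and the helper function below are illustrative assumptions, not the paper's exact algorithm: targets for the hidden layer are set from the output loss, and each layer then minimizes its own per-layer loss, so no derivative ever passes through the hard threshold.

```python
import torch
import torch.nn.functional as F

def ftprop_mb_step(x, y, W1, W2, lr=0.01):
    """One minibatch step for a two-layer net with a hard-threshold hidden layer.
    W1, W2 are weight tensors created with requires_grad=True.
    Targets and losses are handled per layer; nothing backpropagates through sign()."""
    # Forward pass; the hidden code is detached so the two layer graphs stay separate.
    z1 = x @ W1                                    # hidden pre-activations (graph: W1 -> z1)
    h = torch.sign(z1).detach().requires_grad_()   # hard-threshold activations

    # Output layer: ordinary loss on top of the (fixed) hidden codes.
    out = h @ W2
    loss_out = F.cross_entropy(out, y)
    grad_h, grad_W2 = torch.autograd.grad(loss_out, (h, W2))

    # Target heuristic (an assumption in this sketch): flip each hidden unit
    # in the direction that would decrease the output loss.
    t = -torch.sign(grad_h)
    t[t == 0] = 1.0

    # Hidden layer: its own hinge loss between pre-activation and target,
    # computed entirely within the layer.
    loss_hidden = F.relu(1.0 - t * z1).mean()
    (grad_W1,) = torch.autograd.grad(loss_hidden, (W1,))

    with torch.no_grad():
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return loss_out.item()
```

This per-layer view is what makes the STE bullet above a special case: per the paper, choosing a saturating hinge loss and a sign-based target heuristic makes the resulting update coincide with the saturating STE.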