Deep learning as a mixed convex-combinatorial optimization problem

https://arxiv.org/abs/1710.11573

  • Hard-threshold activations have zero derivative almost everywhere, so gradient descent / backpropagation can't be used to train them directly
  • Network quantization -> reduce time and energy requirements
  • Creating large integrated systems that may have non-differentiable components and must avoid vanishing/exploding gradients
  • Setting targets for hard-threshold hidden units so as to minimize the loss is a discrete optimization problem
  • Once targets are set, each unit has its own problem to solve (linearly separable when the targets are feasible) -> the network decomposes into individual perceptrons
  • The straight-through estimator (STE) is a special case of this framework (see the STE sketch after this list)
  • Feasible target propagation (FTPROP): a recursive algorithm for setting targets and updating weights
  • STE can lead to gradient mismatch error: the surrogate derivative used in the backward pass does not correspond to the hard-threshold function actually used in the forward pass
  • A perceptron can be learned even on a non-linearly-separable dataset by minimizing the hinge loss, a convex loss on the perceptron's pre-activation z and target t that, combined with L2 regularization, maximizes the margin (see the sketch after this list)
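
A minimal NumPy sketch of the per-unit convex subproblem described above: once a unit's targets t in {-1, +1} are fixed, its weights can be fit by subgradient descent on an L2-regularized hinge loss over the pre-activation z = Xw. Function and variable names here are illustrative, not from the paper's code.

```python
import numpy as np

def fit_unit_hinge(X, t, lam=1e-2, lr=0.1, epochs=100):
    """Fit one hard-threshold unit by subgradient descent on
    L(w) = mean(max(0, 1 - t * (X @ w))) + lam/2 * ||w||^2.
    X: (n, d) inputs, t: (n,) targets in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        z = X @ w                          # pre-activations
        inside = t * z < 1.0               # examples violating the margin
        grad = -(X[inside] * t[inside, None]).sum(axis=0) / n + lam * w
        w -= lr * grad
    return w

def hard_threshold(X, w):
    """The unit's actual output: the sign of its pre-activation."""
    return np.sign(X @ w)
```

When the targets are feasible (linearly separable), this convex problem recovers a maximum-margin separator; when they are not, it still yields a sensible weight vector, which is why a hinge-style loss is usable per unit.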
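
For comparison, a sketch of the straight-through estimator mentioned above, in its usual generic form (not code from the paper): the forward pass applies the hard sign, while the backward pass substitutes a surrogate derivative of 1 (or 1 on |z| <= 1 in the saturating variant). The gap between this surrogate and the true, almost-everywhere-zero derivative is the gradient mismatch.

```python
import numpy as np

def ste_forward(z):
    """Hard-threshold forward pass."""
    return np.sign(z)

def ste_backward(z, grad_out, saturating=True):
    """Straight-through surrogate gradient w.r.t. the pre-activation z.
    sign(z) has derivative 0 almost everywhere; STE pretends it is 1,
    and the saturating variant zeroes it outside |z| <= 1."""
    if saturating:
        return grad_out * (np.abs(z) <= 1.0)
    return grad_out
```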

FTPROP

  • A convolutional layer imposes structure on the weight matrix -> the units' problems are less likely to be linearly separable, so the feasibility constraint is relaxed
  • FTPROP-MB (the mini-batch variant) closely resembles backpropagation-based methods (see the sketch at the end of this section)
  • Derivatives cannot be propagated through layers but can be computed within a layer
  • Hinge loss per layer -> learning stalled and was erratic due to (1) sensitivity to noise (2)
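
A rough end-to-end sketch of the FTPROP-MB idea, under my own simplifying assumptions (one hidden hard-threshold layer, hinge losses at every layer, hidden targets set to the negative sign of the downstream loss derivative). It is meant only to show how target setting substitutes for propagating gradients through the hard threshold, not to reproduce the paper's exact algorithm or target-setting heuristics.

```python
import numpy as np

def hinge_grad(z, t):
    """Subgradient of the mean hinge loss max(0, 1 - t*z) w.r.t. z."""
    return np.where(t * z < 1.0, -t, 0.0) / z.shape[0]

def ftprop_mb_step(X, y, W1, w2, lr=0.1):
    """One mini-batch update of a one-hidden-layer hard-threshold network.
    X: (n, d) inputs, y: (n,) labels in {-1, +1},
    W1: (d, k) hidden weights, w2: (k,) output weights."""
    # forward pass
    z1 = X @ W1                 # hidden pre-activations, (n, k)
    h1 = np.sign(z1)            # hard-threshold activations
    z2 = h1 @ w2                # output pre-activation, (n,)

    # output layer: ordinary hinge-loss subgradient on its own weights
    g2 = hinge_grad(z2, y)      # dL/dz2, (n,)
    grad_w2 = h1.T @ g2

    # target setting: rather than propagating a gradient through sign(),
    # aim each hidden unit at the value that would decrease the output loss
    dL_dh1 = np.outer(g2, w2)            # (n, k)
    t1 = -np.sign(dL_dh1)
    t1[t1 == 0] = h1[t1 == 0]            # no preference -> keep current output

    # hidden layer: update against its own hinge loss on those targets
    g1 = hinge_grad(z1, t1)
    grad_W1 = X.T @ g1

    W1 -= lr * grad_W1
    w2 -= lr * grad_w2
    return W1, w2
```

Structurally this mirrors backpropagation (one backward sweep and one update per layer per mini-batch), which is the resemblance noted above; swapping a saturating hinge loss into the hidden layer makes its update look like the saturating STE, which is roughly the sense in which STE appears as a special case.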