Reasoning about relations between things

A simple neural network module for relational reasoning

  • We describe a Relation Network (RN) and show that it can perform at superhuman levels on a challenging visual question answering task.

Visual Interaction Networks

  • We describe a general purpose model that can predict the future state of a physical object based purely on visual observations.

Interaction Networks for Learning about Objects, Relations and Physics (2016)

  • The VIN predicts how the underlying relative states of the objects evolve
    • cf. generative models, which instead visually “imagine” the next few frames of a video
  • Takes graphs as input (see the sketch after this list)
  • Performs object- and relation-centric reasoning
  • Evaluated on its ability to reason about several challenging physical domains
    • n-body problems, rigid-body collision, and non-rigid dynamics.
  • Three powerful approaches
    • Structured models
    • Simulation
    • Deep learning
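
A minimal sketch of what "takes graphs as input" means in practice: an n-body scene becomes a set of per-object states plus directed relations between every ordered pair of objects. The 5-feature state layout and the variable names below are illustrative assumptions, not the paper's exact format.

```python
# Illustrative sketch: an n-body scene as a graph of objects and relations.
import itertools
import numpy as np

n = 3                                               # three bodies
O = np.random.randn(n, 5)                           # state: mass, x, y, vx, vy
pairs = list(itertools.permutations(range(n), 2))   # every ordered pair interacts
senders = np.array([i for i, _ in pairs])           # source object of each relation
receivers = np.array([j for _, j in pairs])         # target object of each relation
R_attr = np.ones((len(pairs), 1))                   # per-relation attribute
X = np.zeros((n, 1))                                # external effect on each object
```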

Interaction Network

  • Object-centric reasoning: the object function is applied to each object separately
  • Relation-centric reasoning
    • $o_i$: object state, $t$: time step, $e$: effect
    • $f_O$: object function
    • $f_R$: relation function
    • $r_i$: relation attribute (constant for the relationship), e.g. a spring constant
    • $x_i$: external effects

  • $a$: aggregation of the effects incoming to each object (a minimal sketch of the full update follows)
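
Putting the symbols together, a minimal PyTorch sketch of one Interaction Network step: $f_R$ maps each (sender, receiver, relation attribute) triple to an effect, the effects are summed per receiving object (the aggregation $a$), and $f_O$ maps each object, its external effects, and its aggregated effect to the next state. Layer sizes and the sum aggregation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InteractionNetwork(nn.Module):
    # A sketch of one IN update step, not the paper's exact architecture.
    def __init__(self, d_obj=5, d_rel=1, d_ext=1, d_eff=16, d_hidden=64):
        super().__init__()
        # f_R: relation function, (o_sender, o_receiver, r_k) -> effect e_k
        self.f_R = nn.Sequential(
            nn.Linear(2 * d_obj + d_rel, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_eff))
        # f_O: object function, (o_i, x_i, aggregated effect a_i) -> next state
        self.f_O = nn.Sequential(
            nn.Linear(d_obj + d_ext + d_eff, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_obj))

    def forward(self, O, senders, receivers, R_attr, X):
        # O: (N, d_obj) object states; senders/receivers: (M,) long tensors;
        # R_attr: (M, d_rel) relation attributes; X: (N, d_ext) external effects
        e = self.f_R(torch.cat([O[senders], O[receivers], R_attr], dim=-1))
        # a: aggregation -- sum the incoming effects for each receiving object
        a = torch.zeros(O.shape[0], e.shape[-1]).index_add_(0, receivers, e)
        return self.f_O(torch.cat([O, X, a], dim=-1))   # next object states
```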

Visual Interaction Networks

  • 3 parts (composed in the sketch after this list)
    • Visual encoder (perceptual front-end based on a ConvNet)
      • Parses a dynamic visual scene into a set of factored latent object representations.
    • Dynamics predictor based on Interaction Networks
      • Produces a predicted physical trajectory of arbitrary length
    • State Decoder
      • a linear layer with input size $L_{code}$ and output size 4 (for a position/velocity vector)
  • Joint training
  • Applications

    • Model-based decision-making and planning
      • From raw sensory observations in complex physical environments
  • Predicts accurately for hundreds of time steps from just 6 input frames

  • Example: predicting 3 steps from 4 input frames
    • 2 Interaction Networks in the predictor (see the sketch under “Multi-Step” below)
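
As a hedged sketch of how the three parts could compose: encode the frames into latent object codes, roll the dynamics predictor forward, and decode each step into a position/velocity vector. The simple feed-back loop below is an assumption; the paper's predictor actually reads from a window of recent codes (see “Multi-Step” below).

```python
import torch
import torch.nn as nn

class VIN(nn.Module):
    def __init__(self, encoder, predictor, decoder):
        super().__init__()
        self.encoder = encoder      # visual encoder: frames -> latent object codes
        self.predictor = predictor  # IN-based dynamics predictor: codes -> next codes
        self.decoder = decoder      # state decoder: linear, L_code -> 4 (pos/vel)

    def forward(self, frames, rollout_steps):
        codes = self.encoder(frames)                # (N_object, L_code)
        trajectory = []
        for _ in range(rollout_steps):              # arbitrary-length rollout
            codes = self.predictor(codes)           # predicted next latent codes
            trajectory.append(self.decoder(codes))  # decode to position/velocity
        return torch.stack(trajectory)              # (rollout_steps, N_object, 4)
```

Because the decoder is applied at every rollout step, a loss on the predicted trajectory trains all three parts jointly, which is the “joint training” bullet above.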

Perceptual front-end based on ConvNet

Image Pair Encoder

  • 32 × 32 RGB video
  • [F1, F2, F3] is a sequence of frames.
  • Stack F1 and F2 along their color-channel dimension, e.g. two RGB images -> one 6-channel image
  • Apply two 2-layer convolutional nets
    • Stack the outputs
  • Apply a 2-layer size-preserving convolutional net
  • Inject two constant coordinate channels, representing the x- and y-coordinates of the feature matrix.
    • These two channels are a meshgrid with min value 0 and max value 1.
  • Apply 5 alternating convolutional and max-pooling layers, reducing to a 1×1×32 tensor
  • Apply this process on [F1,F2] and [F2,F3], obtaining S1 and S2.
  • Apply linear functions to S1 and S2
  • Apply a one-hidden-layer MLP
  • Generate an $N_{object} \times 64$ tensor of latent object codes (a sketch of two of these steps follows)
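
A minimal sketch of two of the steps above, assuming PyTorch: stacking a frame pair along the color channels and injecting the two constant coordinate channels. Names and shapes are illustrative.

```python
import torch

def stack_frame_pair(F1, F2):
    # two (B, 3, 32, 32) RGB frames -> one (B, 6, 32, 32) stacked image
    return torch.cat([F1, F2], dim=1)

def add_coord_channels(features):
    # features: (B, C, H, W) feature matrix from the convolutional layers
    B, _, H, W = features.shape
    xs = torch.linspace(0.0, 1.0, W).view(1, 1, 1, W).expand(B, 1, H, W)
    ys = torch.linspace(0.0, 1.0, H).view(1, 1, H, 1).expand(B, 1, H, W)
    # two constant channels: an x/y meshgrid with min value 0 and max value 1
    return torch.cat([features, xs, ys], dim=1)     # (B, C + 2, H, W)
```

The coordinate channels matter because convolutions are translation-invariant; without them the later pooling stack would lose the absolute positions that the object states need.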

Dynamics predictor based on "Interaction Networks" (RNN)

Multi-Step
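
A hedged sketch of the multi-step idea: one Interaction Network per temporal offset reads from a short history of latent codes, and a learned layer aggregates their candidate predictions into the next code. The offsets, the linear aggregator, and the code size are assumptions based on the notes above.

```python
import torch
import torch.nn as nn

class MultiStepPredictor(nn.Module):
    def __init__(self, interaction_nets, offsets, d_code=64):
        super().__init__()
        # one IN per temporal offset, e.g. 2 INs with offsets [1, 2]
        self.nets = nn.ModuleList(interaction_nets)   # each maps codes -> codes
        self.offsets = offsets
        self.aggregate = nn.Linear(len(offsets) * d_code, d_code)

    def forward(self, history):
        # history: list of (N_object, d_code) code tensors, most recent last
        candidates = [net(history[-k]) for net, k in zip(self.nets, self.offsets)]
        return self.aggregate(torch.cat(candidates, dim=-1))   # next codes
```

Applied recurrently, with each prediction appended to the history, this gives the RNN-style rollout named in the heading above.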

