Reasoning about relations between things

A simple neural network module for relational reasoning

  • We describe a Relation Network (RN) and show that it can perform at superhuman levels on a challenging visual question answering task.

Visual Interaction Networks

  • We describe a general purpose model that can predict the future state of a physical object based purely on visual observations.

Interaction Networks for Learning about Objects, Relations and Physics (2016)

  • The VIN predicts how the underlying relative states of the objects evolve
    • cf. generative models, which instead visually “imagine” the next few frames of a video
  • Takes graphs as input (see the sketch after this list)
  • Performs object- and relation-centric reasoning
  • Evaluated on its ability to reason about several challenging physical domains
    • n-body problems, rigid-body collision, and non-rigid dynamics.
  • Three powerful approaches
    • Structured models
    • Simulation
    • Deep learning
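
A minimal sketch of what "takes graphs as input" means in practice: an n-body scene becomes a set of per-object states plus directed relations between every ordered pair of objects. The 5-feature state layout and the variable names below are illustrative assumptions, not the paper's exact format.

```python
# Illustrative sketch: an n-body scene as a graph of objects and relations.
import itertools
import numpy as np

n = 3                                               # three bodies
O = np.random.randn(n, 5)                           # state: mass, x, y, vx, vy
pairs = list(itertools.permutations(range(n), 2))   # every ordered pair interacts
senders = np.array([i for i, _ in pairs])           # source object of each relation
receivers = np.array([j for _, j in pairs])         # target object of each relation
R_attr = np.ones((len(pairs), 1))                   # per-relation attribute
X = np.zeros((n, 1))                                # external effect on each object
```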

Interaction Network

  • Object-centric reasoning: the object function is applied to each object separately
  • Relation-centric reasoning
    • $o_i$: object state, $t$: time step, $e$: effect
    • $f_O$: object function
    • $f_R$: relation function
    • $r_i$: relation attribute (constant for the relationship), e.g. a spring constant
    • $x_i$: external effects

  • $a$: aggregation of the effects incoming to each object (a minimal sketch of the full update follows)
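
Putting the symbols together, a minimal PyTorch sketch of one Interaction Network step: $f_R$ maps each (sender, receiver, relation attribute) triple to an effect, the effects are summed per receiving object (the aggregation $a$), and $f_O$ maps each object, its external effects, and its aggregated effect to the next state. Layer sizes and the sum aggregation are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InteractionNetwork(nn.Module):
    # A sketch of one IN update step, not the paper's exact architecture.
    def __init__(self, d_obj=5, d_rel=1, d_ext=1, d_eff=16, d_hidden=64):
        super().__init__()
        # f_R: relation function, (o_sender, o_receiver, r_k) -> effect e_k
        self.f_R = nn.Sequential(
            nn.Linear(2 * d_obj + d_rel, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_eff))
        # f_O: object function, (o_i, x_i, aggregated effect a_i) -> next state
        self.f_O = nn.Sequential(
            nn.Linear(d_obj + d_ext + d_eff, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_obj))

    def forward(self, O, senders, receivers, R_attr, X):
        # O: (N, d_obj) object states; senders/receivers: (M,) long tensors;
        # R_attr: (M, d_rel) relation attributes; X: (N, d_ext) external effects
        e = self.f_R(torch.cat([O[senders], O[receivers], R_attr], dim=-1))
        # a: aggregation -- sum the incoming effects for each receiving object
        a = torch.zeros(O.shape[0], e.shape[-1]).index_add_(0, receivers, e)
        return self.f_O(torch.cat([O, X, a], dim=-1))   # next object states
```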

Visual Interaction Networks

  • 3 parts (composed in the sketch after this list)
    • Visual encoder (perceptual front-end based on a ConvNet)
      • Parses a dynamic visual scene into a set of factored latent object representations.
    • Dynamics predictor based on Interaction Networks
      • Produces a predicted physical trajectory of arbitrary length
    • State Decoder
      • a linear layer with input size $L_{code}$ and output size 4 (for a position/velocity vector)
  • Joint training
  • Applications

    • Model-based decision-making and planning
      • From raw sensory observations in complex physical environments
  • Predicts accurately for hundreds of time steps from just 6 input frames

  • Example: predicting 3 steps from 4 input frames
    • 2 Interaction Networks in the predictor (see the sketch under “Multi-Step” below)
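
As a hedged sketch of how the three parts could compose: encode the frames into latent object codes, roll the dynamics predictor forward, and decode each step into a position/velocity vector. The simple feed-back loop below is an assumption; the paper's predictor actually reads from a window of recent codes (see “Multi-Step” below).

```python
import torch
import torch.nn as nn

class VIN(nn.Module):
    def __init__(self, encoder, predictor, decoder):
        super().__init__()
        self.encoder = encoder      # visual encoder: frames -> latent object codes
        self.predictor = predictor  # IN-based dynamics predictor: codes -> next codes
        self.decoder = decoder      # state decoder: linear, L_code -> 4 (pos/vel)

    def forward(self, frames, rollout_steps):
        codes = self.encoder(frames)                # (N_object, L_code)
        trajectory = []
        for _ in range(rollout_steps):              # arbitrary-length rollout
            codes = self.predictor(codes)           # predicted next latent codes
            trajectory.append(self.decoder(codes))  # decode to position/velocity
        return torch.stack(trajectory)              # (rollout_steps, N_object, 4)
```

Because the decoder is applied at every rollout step, a loss on the predicted trajectory trains all three parts jointly, which is the “joint training” bullet above.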

Perceptual front-end based on ConvNet

Image Pair Encoder

  • 32 × 32 RGB video
  • [F1, F2, F3] is a sequence of frames.
  • Stack F1 and F2 along their color-channel dimension, e.g. two RGB images -> one 6-channel image
  • Apply two 2-layer convolutional nets
    • Stack the outputs
  • Apply a 2-layer size-preserving convolutional net
  • Inject two constant coordinate channels, representing the x- and y-coordinates of the feature matrix.
    • These two channels are a meshgrid with min value 0 and max value 1.
  • Apply 5 alternating convolutional and max-pooling layers, reducing to a 1×1×32 tensor
  • Apply this process on [F1,F2] and [F2,F3], obtaining S1 and S2.
  • Apply linear functions to S1 and S2
  • Apply a one-hidden-layer MLP
  • Generate an $N_{object} \times 64$ tensor of latent object codes (a sketch of two of these steps follows)
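
A minimal sketch of two of the steps above, assuming PyTorch: stacking a frame pair along the color channels and injecting the two constant coordinate channels. Names and shapes are illustrative.

```python
import torch

def stack_frame_pair(F1, F2):
    # two (B, 3, 32, 32) RGB frames -> one (B, 6, 32, 32) stacked image
    return torch.cat([F1, F2], dim=1)

def add_coord_channels(features):
    # features: (B, C, H, W) feature matrix from the convolutional layers
    B, _, H, W = features.shape
    xs = torch.linspace(0.0, 1.0, W).view(1, 1, 1, W).expand(B, 1, H, W)
    ys = torch.linspace(0.0, 1.0, H).view(1, 1, H, 1).expand(B, 1, H, W)
    # two constant channels: an x/y meshgrid with min value 0 and max value 1
    return torch.cat([features, xs, ys], dim=1)     # (B, C + 2, H, W)
```

The coordinate channels matter because convolutions are translation-invariant; without them the later pooling stack would lose the absolute positions that the object states need.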

Dynamics predictor based on "Interaction Networks" (RNN)

Multi-Step
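
A hedged sketch of the multi-step idea: one Interaction Network per temporal offset reads from a short history of latent codes, and a learned layer aggregates their candidate predictions into the next code. The offsets, the linear aggregator, and the code size are assumptions based on the notes above.

```python
import torch
import torch.nn as nn

class MultiStepPredictor(nn.Module):
    def __init__(self, interaction_nets, offsets, d_code=64):
        super().__init__()
        # one IN per temporal offset, e.g. 2 INs with offsets [1, 2]
        self.nets = nn.ModuleList(interaction_nets)   # each maps codes -> codes
        self.offsets = offsets
        self.aggregate = nn.Linear(len(offsets) * d_code, d_code)

    def forward(self, history):
        # history: list of (N_object, d_code) code tensors, most recent last
        candidates = [net(history[-k]) for net, k in zip(self.nets, self.offsets)]
        return self.aggregate(torch.cat(candidates, dim=-1))   # next codes
```

Applied recurrently, with each prediction appended to the history, this gives the RNN-style rollout named in the heading above.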

