Deep Reinforcement Learning for Partially Observable Parameterized Environments

presented by Microsoft Research

Atari: MDP or POMDP?

It depends on how much information is available at each time-step.

  • "Flickering" (randomly obscured frames) makes the environment partially observable: the game state must be inferred from partial observations.
  • Flickering markedly decreases DQN performance in the Atari environment.
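A minimal sketch of how a flickering observation could be simulated (the function name, the 0.5 default, and the use of `None` as the blank observation are my own assumptions, not from the talk):

```python
import random

def flicker(frame, p_obscure=0.5, blank=None, rng=random):
    """With probability p_obscure, replace the frame with a blank
    observation; otherwise pass the true frame through unchanged."""
    if rng.random() < p_obscure:
        return blank
    return frame
```

Under this corruption the agent's observation no longer determines the state, which is exactly what turns the game from an MDP into a POMDP.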

New research: Deep Recurrent Q-Network

Addition of the LSTM provides clear benefits:

  • Improved performance over 4-frame and 10-frame DQN (DRQN is unrolled over 10 frames)
  • The LSTM infers velocity from individual frames (note: amazing!)

Important Note: DRQN doesn't always beat DQN scores; case in point: the Beam Rider environment.

Deep Deterministic Policy Gradients

Inverted Gradients

  • Allows parameters to approach the bounds of their range without exceeding them; parameters don't get "stuck" at the top or bottom of the range.
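A minimal sketch of one common formulation of the inverted-gradient rule (the function name and the sign convention, positive gradient meaning "increase the parameter", are my own assumptions): the gradient is down-scaled by the parameter's remaining headroom in whichever direction the update pushes, so it shrinks to zero exactly at the boundary instead of saturating there.

```python
def inverted_gradient(grad, p, p_min, p_max):
    """Scale grad by the fraction of the range remaining in the
    update direction: (p_max - p) when increasing p, (p - p_min)
    when decreasing p.  At a bound, the outward gradient is zero,
    so the parameter never overshoots or gets pinned."""
    width = p_max - p_min
    if grad > 0:  # update would increase p: headroom above
        return grad * (p_max - p) / width
    return grad * (p - p_min) / width  # headroom below
```

For example, with bounds [-1, 1], a parameter sitting at the midpoint keeps half of its gradient, while one sitting at the upper bound receives no further upward push but can still move back down.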