| notebook.community

Deep Reinforcement Learning for Partially Observable Parameterized Environments

presented by Microsoft Research

Atari: MDP or POMDP?

It depends on how much information is available at each time-step.

When a single observation does not reveal the full game state, the state must be inferred from partial observations.
For example, "flickering" markedly decreases performance in the Atari environment.
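
As a rough illustration of flickering, the sketch below assumes it means that each frame is replaced by a blank (all-zero) frame with some probability before the agent sees it. The function name `flicker`, the parameter `p_obscure`, and the default value of 0.5 are illustrative assumptions, not taken from the talk.

```python
import random

def flicker(frame, p_obscure=0.5, rng=random):
    """Return the frame, or an all-zero frame with probability p_obscure.

    `frame` is a 2D list of pixel values. `p_obscure` and its default are
    assumptions chosen for illustration.
    """
    if rng.random() < p_obscure:
        # The agent receives no information about the state at this step.
        return [[0 for _ in row] for row in frame]
    return frame

# Usage: pass every emulator frame through flicker() before the agent sees it.
rng = random.Random(0)
frame = [[1, 2], [3, 4]]
observations = [flicker(frame, 0.5, rng) for _ in range(5)]
```

Under this assumption, a single observation is often blank, so an agent that only looks at the current frame (or a short stack of frames) cannot always recover the game state.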

New research: Deep Recurrent Q-Network

Addition of the LSTM provides clear benefits:

Improved performance over 4-frame and 10-frame DQN (DRQN is 10-frame)
LSTM infers velocity (note: amazing!)

Important Note: DRQN doesn't always beat DQN scores; a case in point is the Beam Rider environment.

DRQN has been extended:

Deep Deterministic Policy Gradients

Model-free deep actor-critic architecture (paper: T. P. Lillicrap et al., 2015)

Inverted Gradients

Allows parameters to approach the bounds of their range without exceeding them; parameters don't get "stuck" at the top or bottom of the range.
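
One way to sketch the idea: scale each parameter's gradient by how much room the parameter has left in the direction the gradient wants to move it. The sketch assumes a positive gradient means "increase the parameter"; the function name `invert_gradient` and the scalar interface are illustrative, not from the talk.

```python
def invert_gradient(grad, p, p_min, p_max):
    """Scale (and, outside the bounds, flip) a gradient for a bounded parameter.

    Assumption: grad > 0 means the update would increase p.
    """
    width = p_max - p_min
    if grad > 0:
        # Moving up: shrink the step as p approaches p_max.
        # If p is already above p_max, the factor is negative and the
        # gradient is inverted, pushing p back into range.
        return grad * (p_max - p) / width
    # Moving down: shrink the step as p approaches p_min.
    return grad * (p - p_min) / width

# Usage with a parameter bounded in [-1, 1]:
mid_range = invert_gradient(1.0, 0.0, -1.0, 1.0)   # 0.5: half the range is left
at_top = invert_gradient(1.0, 1.0, -1.0, 1.0)      # 0.0: no room left to increase
past_top = invert_gradient(1.0, 1.5, -1.0, 1.0)    # -0.25: pushed back down
```

Because the scaling depends on the sign of the gradient, a parameter sitting at the top of its range still receives the full gradient when the update wants to decrease it, which is why it does not get stuck there.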