RL data is mostly sequential; it is not i.i.d. data, and consecutive samples are highly correlated.
Reward R(t) is a scalar feedback signal.
Indicates how well the agent is doing at step t.
The agent's job is to maximize cumulative reward.
Even if you have multiple goals, you usually need to combine them with some weighting scheme so that the result is a single scalar.
RL is a framework for sequential decision making: actions may have long-term consequences, and you may sacrifice short-term gains for larger long-term rewards.
From the agent's perspective, at time step t the agent receives an observation o(t), receives a reward r(t), and executes an action a(t). From the environment's perspective, it receives the action a(t), emits the next observation o(t+1), and emits a scalar reward r(t+1).
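A minimal sketch of this interaction loop in Python; `env` and `agent` here are hypothetical objects, not from any particular library:

```python
# Sketch of the agent-environment loop described above.
# `env` and `agent` are hypothetical objects with an assumed interface.

def run_episode(env, agent, max_steps=1000):
    obs = env.reset()                          # environment emits the first observation
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs)                # agent executes action a(t) given o(t)
        obs, reward, done = env.step(action)   # env emits o(t+1) and scalar reward r(t+1)
        agent.observe(obs, reward, done)       # agent receives the new observation and reward
        total_reward += reward
        if done:
            break
    return total_reward
```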
History is the sequence of observations, actions and rewards.
State is the information you decide to keep so that you can decide your next action.
Major components of an RL agent:
Policy: the behaviour function, what the agent decides to do. Policies can be deterministic or stochastic.
Value function: how good each state / action is. It is a prediction of future reward, used to evaluate the goodness/badness of states and therefore to select between actions. There are two kinds of value function, state-value and action-value, and future rewards are discounted (a small sketch of computing a discounted return follows this list).
Model: the agent's representation of the environment; a model predicts what the environment will do next.
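As a concrete illustration of discounting, a tiny Python sketch that computes the return G_t = r(t+1) + gamma*r(t+2) + ... from a list of rewards; the value function is the expectation of this quantity:

```python
# Sketch: compute the discounted return from a list of rewards collected after time t.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # accumulate from the end so each step multiplies by gamma once
        g = r + gamma * g
    return g

# Example: three reward steps of 1.0 each with gamma = 0.9
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71
```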
Two fundamental problems in RL: Learning and planning.
Prediction: Evaluate the future, given a policy.
Control: Optimize the future, i.e. find the best policy.
In order to solve a control problem you need to solve a prediction problem.
Markov Decision Process
The MDP formalism assumes a fully observable environment; there are partially observable MDPs (POMDPs) as well.
A Markov Decision Process is a tuple <S, A, P, R, gamma>.
S is a finite set of states, A is a finite set of actions, R is a reward function, and gamma is a discount factor. P is the state-transition probability: the probability of ending up in state s' given the current state s and the chosen action. In some cases the outcome of an action is deterministic, but most often you get a distribution over next states; the dynamics are not deterministic.
A policy is a distribution over actions given states.
A policy fully defines the behaviour of an agent.
We will assume that policies are stationary (time-independent)
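A minimal Python sketch of how such a tuple and a stationary stochastic policy could be represented with plain dictionaries; all states, actions, probabilities and rewards here are made up for illustration:

```python
# Illustrative tabular MDP <S, A, P, R, gamma> and a stationary stochastic policy pi(a|s).
import random

S = ["s0", "s1"]
A = ["left", "right"]
gamma = 0.9

# P[s][a] is a list of (next_state, probability) pairs: transitions are stochastic.
P = {
    "s0": {"left": [("s0", 0.8), ("s1", 0.2)], "right": [("s1", 1.0)]},
    "s1": {"left": [("s0", 1.0)],              "right": [("s1", 1.0)]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "s0": {"left": 0.0, "right": 1.0},
    "s1": {"left": 0.0, "right": 0.5},
}

# A stationary stochastic policy: a distribution over actions for each state.
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```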
Model Free RL
In most problems the MDP is unknown.
Model-free RL can be done through Monte-Carlo learning, which works but can take a long time. In Monte-Carlo you run the episode to the end, compute the return, and only then update, as sketched below.
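A sketch of first-visit Monte-Carlo prediction under these assumptions; `sample_episode` is a hypothetical function that returns a finished episode as a list of (state, reward) pairs:

```python
# Sketch of first-visit Monte-Carlo prediction: wait until the episode ends,
# compute the return from each state, and update V(s) toward that return.
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes=1000, gamma=0.99, alpha=0.05):
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()              # finished episode: list of (state, reward) pairs
        g = 0.0
        returns = []
        for state, reward in reversed(episode): # accumulate returns backwards through time
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()                       # back to time order for the first-visit check
        seen = set()
        for state, g in returns:
            if state not in seen:               # first-visit MC: update each state once per episode
                seen.add(state)
                V[state] += alpha * (g - V[state])  # update only after the episode has ended
    return V
```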
Another method is Temporal-Difference learning, a moving-average style approach: the value estimate is nudged toward a bootstrapped one-step target after every step, and variants such as TD(lambda) ultimately use a weighted average of multi-step returns. A sketch of the basic TD(0) update follows.
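A sketch of the TD(0) update, applied after every single step rather than at the end of the episode; `V` is assumed to be a defaultdict(float):

```python
# Sketch of TD(0) prediction: update V(s) toward the bootstrapped target
# r + gamma * V(s') instead of waiting for the full return. With a constant
# step size alpha, the estimate behaves like an exponentially weighted
# moving average of past targets.

def td0_update(V, state, reward, next_state, done, gamma=0.99, alpha=0.05):
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])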
These methods can also be used to directly solve the control problem (model-free control).
Exploration vs Exploitation problem.
Lecture 2 - Reinforcement Learning
Papers and reading:
- Sutton & Barto, Reinforcement Learning: An Introduction
- Q-learning
- Nature DQN paper
- Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
Optimal value function: the expected sum of discounted rewards when starting from state s and acting optimally. The policy can be stochastic.
Q values: the expected utility when starting in s, taking action a, and acting optimally thereafter (standard definitions written out below).
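Written out, using gamma for the discount factor:

```latex
% Standard definitions of the optimal value functions.
V^*(s)   = \max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_0 = s,\ \pi \right]
Q^*(s,a) = \mathbb{E}\!\left[ r_0 + \gamma\, V^*(s_1) \;\middle|\; s_0 = s,\ a_0 = a \right]
V^*(s)   = \max_{a} Q^*(s,a)
```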
Sampling-based approximation: because sometimes you cannot explore all states.
Q Learning.
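A sketch of the tabular Q-learning update rule; here `Q` is assumed to be a defaultdict(float) keyed by (state, action), and `actions` is the list of available actions:

```python
# Sketch of the tabular Q-learning update: move Q(s, a) toward the sampled
# target r + gamma * max_a' Q(s', a'). It is off-policy: the max over next
# actions is used regardless of which action is actually taken next.

def q_learning_update(Q, state, action, reward, next_state, done,
                      actions, gamma=0.99, alpha=0.1):
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```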
Exploration and Exploitation
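The simplest way to trade these off is epsilon-greedy action selection; a sketch, using the same (state, action)-keyed Q table as above:

```python
# Epsilon-greedy: with probability epsilon explore a random action,
# otherwise exploit the current Q estimates.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit
```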
Deep RL
Use DNNs to represent the value function, policy and model. Optimize the loss function by stochastic gradient descent.
DQN in Atari [uses a neural network to represent the Q function]
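A minimal PyTorch sketch of a DQN-style Q-network and TD loss; this is only the core idea, not the full Nature DQN, which also uses experience replay, a separate target network, the Huber loss, and Atari frame preprocessing:

```python
# Sketch: represent Q(s, .) with a small neural network and train it with a
# TD-style regression loss via stochastic gradient descent.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),   # one Q value per action
        )

    def forward(self, obs):
        return self.net(obs)

def dqn_loss(q_net, batch, gamma=0.99):
    obs, actions, rewards, next_obs, dones = batch                # tensors
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)     # Q(s, a) actually taken
    with torch.no_grad():
        next_q = q_net(next_obs).max(dim=1).values                # max_a' Q(s', a')
        target = rewards + gamma * (1.0 - dones) * next_q
    return nn.functional.mse_loss(q, target)
```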
Gorila (General Reinforcement Learning Architecture)
DPN: use a NN to represent the policy, trained with policy gradients.
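A minimal PyTorch sketch of a policy network and the basic REINFORCE policy-gradient loss; this is a generic illustration of the policy-gradient idea, not the specific architecture referenced in the lecture:

```python
# Sketch: represent the policy pi(a|s) with a neural network and train it by
# scaling log-probabilities of taken actions with their observed returns.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_loss(policy, obs, actions, returns):
    dist = policy(obs)
    log_probs = dist.log_prob(actions)          # log pi(a|s) for the actions taken
    return -(log_probs * returns).mean()        # minimize negative of the policy-gradient objective
```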