We don't have the MDP itself, but we assume the environment is one. This means:
We can't use methods that solve a known MDP (e.g. Value Iteration), because those need the transition model and the reward function. Instead, we learn by trial and error from samples $(s, a, r', s')$.
Prediction learning is both general-purpose and scalable. The target arrives on its own just by waiting, so there is no need for human labeling.
It is used in Reinforcement Learning to predict future reward, and TD-Learning underlies algorithms such as Q-Learning. It takes advantage of the Markov property of states: its predictions are based only on the current state.
It's like learning a guess from another guess. In short, it works thanks to a one-step lookahead.
$$V(S_{t}) \gets V(S_{t}) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_{t}) \right]$$
$R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$ - TD error
$\alpha$ - learning rate
$\gamma$ - discount factor
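A minimal sketch of this update in code, as tabular TD(0) prediction of a fixed (here: random) policy. The environment choice and the hyperparameter values are illustrative assumptions; only the update line mirrors the formula above.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')          # illustrative small tabular environment
V = np.zeros(env.observation_space.n)    # one value estimate per state
alpha, gamma = 0.1, 0.95                 # learning rate and discount factor (assumed values)

for _ in range(1000):
    s = env.reset()
    done = False
    while not done:
        a = env.action_space.sample()    # the fixed policy we are evaluating (random here)
        s1, r, done, _ = env.step(a)
        # learn a guess from a guess: one-step lookahead target r + gamma * V[s1]
        V[s] += alpha * (r + gamma * V[s1] - V[s])
        s = s1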
Policy from value function: $$\pi(s) = \arg\max_{a}\sum_{s'}T(s,a,s')V(s')$$
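To see why this needs a model, here is a hypothetical sketch: the transition table `T[s, a, s']` is an assumption (filled with random placeholder probabilities below), and `V` is the value table from the sketch above.

nS, nA = env.observation_space.n, env.action_space.n
# Hypothetical known model: T[s, a, s'] = transition probability.
# Random placeholder values here - in practice this table is exactly what we don't have.
T = np.random.rand(nS, nA, nS)
T /= T.sum(axis=2, keepdims=True)

# Greedy policy from V: requires summing over the model for every action
policy = np.array([np.argmax([np.sum(T[s, a, :] * V) for a in range(nA)]) for s in range(nS)])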
It's better to learn Q-values, because then we don't need a model to get a policy: $$\pi(s) = \arg\max_{a}Q(s,a)$$
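For contrast, the same greedy step from a Q-table needs no model at all; the validation cell at the end of this notebook picks actions exactly this way. `Q` is assumed to be the `[n_states, n_actions]` table learned below.

# Greedy policy straight from the Q-table - no transition model needed
policy = np.argmax(Q, axis=1)    # one greedy action index per state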
In [5]:
import gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make('FrozenLake-v0')

# Initialize Q-table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Set learning hyperparameters
lr = .8
y = .95
num_episodes = 2000

# Create list to store rewards
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    done = False
    rSum = 0
    # The Q-table learning algorithm
    for _ in range(99):
        # Choose an action greedily from the Q-table, with exploration noise
        # that decays as training progresses (scale 10/(i+1))
        a = np.argmax(Q[s,:] + np.random.randn(1, env.action_space.n) * (10. / (i + 1)))
        # Get new state and reward from environment
        s1, r, done, _ = env.step(a)
        # Update Q-table with new knowledge
        Q[s,a] = Q[s,a] + lr * (r + y * np.max(Q[s1,:]) - Q[s,a])
        rSum += r
        s = s1
        if done:
            break
    rList.append(rSum)

# Calculate list of running (exponentially weighted) averages
rRunTmp = .0
rRun = []
for r in rList:
    rRunTmp = r / 100 + rRunTmp * 99 / 100
    rRun.append(rRunTmp)
plt.plot(rRun)

print("OpenAI Gym FrozenLake:")
print("SFFF (S: starting point, safe)")
print("FHFH (F: frozen surface, safe)")
print("FFFH (H: hole, fall to your doom)")
print("HFFG (G: goal, where the frisbee is located)")
print("")
print("Average score over all episodes: {}".format(sum(rList)/num_episodes))
print("")
print("Final Q-Table Values")
print(Q)
In [6]:
# Set number of validation episodes
num_episodes = 1000

# Create list to store rewards
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    done = False
    rSum = 0
    # Follow the learned policy greedily
    for _ in range(99):
        # Choose an action by greedily picking from the Q-table
        a = np.argmax(Q[s,:])
        # Get new state and reward from environment
        s1, r, done, _ = env.step(a)
        rSum += r
        s = s1
        if done:
            break
    rList.append(rSum)

print("Average score over all episodes: {}".format(sum(rList)/num_episodes))