Imagine a position in a tic-tac-toe game (noughts and crosses). How do you choose the best next action?
Which resulting state are you most likely to win from? Guess how likely you are to win from each state. Is a win certain, likely, or only possible?
Set of possible states, $\mathcal{S}$.
Set of possible actions, $\mathcal{A}$.
We want to choose the action that we predict will result in the best possible future from the current state. Need a value that represents the future outcome.
What should the value represent?
With the correct values, multi-step decision problems are reduced to single-step decision problems. Just pick action with best value. Guaranteed to find optimal multi-step solution (dynamic programming).
The utility or cost of a single action taken from a state is the reinforcement for that action from that state. The value of that state-action pair is the expected value of the full return, that is, of the sum of reinforcements that will follow when that action is taken.
Say we are in state $s_t$ at time $t$. Upon taking action $a_t$ from that state we observe the one step reinforcement $r_{t+1}$, and the next state $s_{t+1}$.
Say this continues until we reach a goal state, $K$ steps later. What is the return $R_t$ from state $s_t$?
$$ \begin{align*} R_t = \sum_{k=0}^K r_{t+k+1} \end{align*} $$Use the returns to choose the best action.
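To make this concrete, here is a minimal sketch, with made-up reinforcement values, of computing the return as a simple sum:
import numpy as np
reinforcements = [2.0, 1.0, 3.0, 1.0]   # hypothetical r_{t+1}, ..., r_{t+K+1} observed before reaching the goal
R = np.sum(reinforcements)              # the return R_t is just their sum
print(R)                                # 7.0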
Right...are we maximizing or minimizing? What does the reinforcement represent? Let's say it is energy used that we want to minimize. $a_1$, $a_2$, or $a_3$?
Where do the values come from?
What is the estimate of the return $R$ from state B?
Here is a simple maze.
From any position, how do you decide whether to move up, right, down, or left?
Right. Need an estimate of the number of steps to reach the goal. This will be the return $R$. How do we formulate this in terms of reinforcements?
Yep. $r_t = 1$ for every step. Then $R_t = \sum_{k=0}^K r_{t+k+1}$ will sum those 1's to produce the number of steps to goal from each state.
The Monte-Carlo way assigns a state's value as the average of the number of steps to the goal observed from that state over the trials tried so far.
The TD way updates a state's value toward (1 + the estimated value of the next state).
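To see the difference, here is a small sketch with made-up numbers for a single state's estimated steps to goal; rho is the learning rate used later in these notes:
rho = 0.1             # learning rate
V_B = 10.0            # current estimate of steps to goal from state B
V_next = 6.0          # current estimate from the state that followed B on this trial
observed_steps = 8    # steps actually taken from B to the goal on this trial
V_B_mc = V_B + rho * (observed_steps - V_B)   # Monte-Carlo: move toward the full observed outcome
V_B_td = V_B + rho * (1 + V_next - V_B)       # TD: move toward one step plus the next state's estimate
print(V_B_mc, V_B_td)                         # 9.8 9.7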
Should we do Monte-Carlo update or Temporal-Difference updates? Take a look at this comparison on the maze problem.
Recall that the state-action value function is a function of both state and action and its value is a prediction of the expected sum of future reinforcements.
We will call the state-action value function $Q$, after Learning from Delayed Rewards, by C. Watkins, PhD Thesis, University of Cambridge, Cambridge, UK, 1989.
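Written out, using the return defined above, the Q value is the expected return when action $a_t$ is taken from state $s_t$:
$$ \begin{align*} Q(s_t,a_t) = E\left[ \sum_{k=0}^K r_{t+k+1} \;\middle|\; s_t, a_t \right] \end{align*} $$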
We can select our current belief of what the optimal action, $a_t^o$, is in state $s_t$ by
$$ \begin{align*} a_t^o = \underset{a}{\operatorname{argmax}}\; Q(s_t,a) \end{align*} $$or
$$ \begin{align*} a_t^o = \underset{a}{\operatorname{argmin}}\; Q(s_t,a) \end{align*} $$For the maze world,
$$ \begin{align*} a_t^o = \underset{a}{\operatorname{argmin}}\; Q(s_t,a) \end{align*} $$looks like this:
A bit more mathematically, let the current state be given by the position in $x$ and $y$ coordinates, and let the actions be the integers 1 to 4, one for each possible move.
Now, let's try to do this in python.
How do we implement the Q function? For the maze problem, we know we can enumerate all of the states (positions) and all of the actions.
So, let's just store the Q function in table form.
The Q table will need three dimensions, for $x$, $y$, and the action.
How do we look up the Q values for a state?
Q values are steps to goal, so we are minimizing. Select right or down action.
We are finally ready for python. How can we make a three-dimensional table of Q values, if $x$ and $y$ have 10 possible values and we have 4 actions?
import numpy as np
Q = np.zeros((10,10,4))   # one Q value for each x, y, and action
How should we initialize the table? The line above initializes all values to zero. What effect will this have as the Q values for actions taken are updated to estimate steps to goal?
Actions not yet tried will have the lowest (0) Q value, forcing the agent to try all actions from all states, resulting in lots of exploration.
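A quick sketch of this effect, with a made-up state, action, and learning rate: once a tried action's Q value rises above zero, np.argmin prefers the still-zero, untried actions.
rho = 0.1                                   # learning rate
x, y = 3, 5                                 # an arbitrary state
Q[x, y, 2] += rho * (1 + 0.0 - Q[x, y, 2])  # TD update after trying action 2 (next state's Q still 0)
print(Q[x, y, :])                           # [0.  0.  0.1 0. ]
print(np.argmin(Q[x, y, :]))                # 0, an action not yet tried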
What must we do after observing $s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$, and $a_{t+1}$?
Calculate the temporal-difference error $r_{t+1} + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)$ and use it to update the Q value stored for $s_t$ and $a_t$:
$$ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r_{t+1} + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \end{align*} $$And, in python? Assume position, or state, $(2,3)$ is implemented as ''state = np.array([2,3])''.
r = 1                                            # reinforcement of 1 for taking one step
Qold = Q[stateOld[0], stateOld[1], actionOld]    # Q(s_t, a_t)
Qnew = Q[state[0], state[1], action]             # Q(s_{t+1}, a_{t+1})
TDerror = r + Qnew - Qold
Q[stateOld[0], stateOld[1], actionOld] = Qold + rho * TDerror
This is performed for every pair of steps $t$ and $t+1$, until the final step, which must be handled differently. When $s_{t+1}$ is the goal, no reinforcements follow $r_{t+1}$, so there is no next Q value to add. The update
$$ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r_{t+1} + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \end{align*} $$becomes
$$ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r_{t+1} - Q(s_t,a_t)) \end{align*} $$In python, add a test for being at the goal. Let ''maze'' be a character array containing a ''G'' at the goal position.
r = 1
Qold = Q[stateOld[0], stateOld[1], actionOld]
Qnew = Q[state[0], state[1], action]
if maze[state[0]+1, state[1]+1] == 'G':   # +1 offsets because maze includes the surrounding walls
    TDerror = r - Qold                    # next state is the goal, so no future Q value to add
else:
    TDerror = r + Qnew - Qold
Q[stateOld[0], stateOld[1], actionOld] = Qold + rho * TDerror
To choose the best action for state $(x, y)$ stored in the variable ''state'', we just need to do
a = np.argmin(Q[state[0],state[1],:])
and if we store the available actions as
actions = np.array([[0,1],[1,0],[0,-1],[-1,0]])
then the update to state based on action a is done by
state = state + actions[a,:]
For our agent to interact with its world, we must implement the following steps, which the comments in the code below refer to by number:
1. Initialize the Q table.
2. Choose a random starting state.
3. Repeat, for each interaction with the world:
   A. If the current state is the goal:
      a. Update Q for the previous state and action, using the final-step update.
      b. Choose a new random starting state.
   B. Otherwise:
      a. Choose the action with the lowest Q value for the current state.
      b. If this is not the first step of a trial, update Q for the previous state and action using the temporal-difference error.
      c. Remember the current state and action as the previous ones.
      d. Apply the action to obtain the new state.
In Python it would look something like the following for a 10x10 maze.
# Assumes the learning rate rho is defined and isGoal(s) tests whether s is the goal.
Q = np.zeros((10,10,4))                                   # 1.
actions = np.array([[0,1], [1,0], [0,-1], [-1,0]])        # changes in row and column for each action
s = np.random.randint(0, 10, 2)                           # 2.
steps = 0                                                 # steps taken so far in the current trial
for step in range(10000):                                 # 3. (or forever)
    if isGoal(s):                                         # 3.A.
        Q[sOld[0],sOld[1],aOld] += rho * (1 - Q[sOld[0],sOld[1],aOld])                       # 3.A.a
        s = np.random.randint(0, 10, 2)                   # 3.A.b
        steps = 0
    else:                                                 # 3.B
        a = np.argmin(Q[s[0],s[1],:])                     # 3.B.a
        steps += 1
        if steps > 1:
            Q[sOld[0],sOld[1],aOld] += rho * (1 + Q[s[0],s[1],a] - Q[sOld[0],sOld[1],aOld])  # 3.B.b
        sOld, aOld = s, a                                 # 3.B.c
        s = s + actions[a,:]                              # 3.B.d
First, we start with a text file that specifies a maze. Let's use the cell magic %%writefile to make the maze file.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
# import random
In [2]:
%%writefile maze1.txt
************
*          *
*          *
*          *
*          *
*          *
*   *****  *
*       *  *
*   G   *  *
*       *  *
*          *
************
In [3]:
with open('maze1.txt') as f:
    for line in f:
        print(line.strip())
In [4]:
mazelist = []
with open('maze1.txt') as f:
    for line in f:
        mazelist.append(line.strip())
mazelist
Out[4]:
In [6]:
maze = np.array(mazelist).view('U1').reshape((len(mazelist), len(mazelist[0])))
print(maze.shape)
maze
Out[6]:
In [7]:
for i in range(maze.shape[0]):
    for j in range(maze.shape[1]):
        print(maze[i,j], end='')
    print()
We need some functions: one to draw the Q surface over the two-dimensional state space (position in the maze), and one to display the greedy policy implied by the Q values.
In [8]:
### Draw Q surface, showing minimum Q value for each state

def showQ(Q, title, ax):
    (m, n, _) = Q.shape
    gridsize = max(m, n)
    xs = np.floor(np.linspace(0, m - 0.5, gridsize)).astype(int)
    ys = np.floor(np.linspace(0, n - 0.5, gridsize)).astype(int)
    xgrid, ygrid = np.meshgrid(xs, ys)
    points = np.vstack((xgrid.flat, ygrid.flat))
    Qmins = [np.min(Q[s1, s2, :]) for (s1, s2) in zip(points[0, :], points[1, :])]
    Qmins = np.asarray(Qmins).reshape(xgrid.shape)
    ax.plot_surface(xgrid, ygrid, Qmins, rstride=1, cstride=1, color='yellow')
    ax.set_zlabel("Qmin")
    ax.set_title("Min %d Max %d" % tuple(np.round((np.min(Qmins), np.max(Qmins)))))

### Show current policy as one arrow per state

def showPolicy(Q):
    (m, n, _) = Q.shape
    bestactions = np.argmin(Q, axis=2)
    px, py = np.meshgrid(np.arange(m) - 0.5, np.arange(n) - 0.5)
    arrowx = actions[:, 0][bestactions]
    arrowy = actions[:, 1][bestactions]
    plt.quiver(px, py, arrowx, arrowy)
Construct two arrays: one to hold the tabular Q values updated by temporal differences, and one to hold the Q values updated by Monte Carlo. Set the Q values of invalid actions to np.inf. We have four possible actions from each position.
In [9]:
m, n = maze.shape
m -= 2   # for ceiling and floor
n -= 2   # for walls
Q = np.zeros((m, n, 4))
Qmc = np.zeros((m, n, 4))
actions = np.array([[0, 1], [1, 0], [0, -1], [-1, 0]])   # changes in row and column position of RL agent

### Set Q value of invalid actions to np.inf
for mi in range(m):
    for ni in range(n):
        for ai in range(4):
            r = mi + actions[ai, 0]
            c = ni + actions[ai, 1]
            if maze[r+1, c+1] == '*':   # showing ai was an invalid action
                Q[mi, ni, ai] = np.inf
                Qmc[mi, ni, ai] = np.inf
Now for some parameters. Let's run for 50,000 interactions with the maze environment, so 50,000 updates, and let $\rho$, the learning rate, be 0.2 and $\epsilon$, the random action probability, be 0.1.
In [10]:
nSteps = 50000
rho = 0.2
epsilon = 0.1
Now we need to keep a history, or trace, of the positions and actions taken during a trial, to be used for the Monte Carlo updates to Q.
In [11]:
trace = np.zeros((nSteps,3)) # for x, y, and a
In [12]:
from IPython.display import display, clear_output
Need some initializations before starting the loop.
In [13]:
fig = plt.figure(figsize=(10, 10))

s = np.array([1, 1])   # start position
a = 1                  # first action (index)
trials = []
steps = 0
goals = 0

for step in range(nSteps):

    trace[steps, :] = s.tolist() + [a]

    here = maze[s[0] + 1, s[1] + 1]
    if here == 'G':
        # Found the Goal!
        goals += 1
        Q[s[0], s[1], a] = 0
        if steps > 0:
            Q[sold[0], sold[1], aold] += rho * (1 - Q[sold[0], sold[1], aold])

        # Monte Carlo update, working backwards through the trace of this trial
        cost = 0
        for sai in range(steps, -1, -1):
            r, c, act = trace[sai, :].astype(int)
            Qmc[r, c, act] = (1 - rho) * Qmc[r, c, act] + rho * cost
            cost += 1

        s = np.array([np.random.randint(0, m), np.random.randint(0, n)])
        trials.append(steps)

    else:
        # Not the goal
        steps += 1
        Qfunc = Q   # Qfunc = Qmc to use the Monte Carlo policy to drive behavior

        # Pick next action a, epsilon-greedily among the valid actions
        if np.random.uniform() < epsilon:
            validActions = [a for (i, a) in enumerate(range(4)) if not np.isinf(Qfunc[s[0], s[1], i])]
            a = np.random.choice(validActions)
        else:
            a = np.argmin(Qfunc[s[0], s[1], :])

        if steps > 1:
            Q[sold[0], sold[1], aold] += rho * (1 + Q[s[0], s[1], a] - Q[sold[0], sold[1], aold])

        sold = s
        aold = a
        s = s + actions[a, :]

    if here == 'G' and goals % 100 == 0:
        fig.clf()
        ax = fig.add_subplot(3, 2, 1, projection='3d')
        showQ(Q, "TD", ax)
        ax = fig.add_subplot(3, 2, 2, projection='3d')
        showQ(Qmc, "Monte Carlo", ax)
        plt.subplot(3, 2, 3)
        showPolicy(Q)
        plt.title("Q Policy")
        plt.subplot(3, 2, 4)
        showPolicy(Qmc)
        plt.title("Monte Carlo Q Policy")
        plt.subplot(3, 2, 5)
        plt.plot(trace[:steps, 0], trace[:steps, 1], 'o-')
        plt.plot(trace[0, 0], trace[0, 1], 'ro')
        plt.xlim(0, m)
        plt.ylim(0, n)
        plt.title("Most Recent Trial")
        plt.subplot(3, 2, 6)
        plt.plot(trials, '-')
        plt.xlabel("Trial")
        plt.ylabel("Steps to Goal")
        clear_output(wait=True)
        display(fig);

    if here == 'G':
        steps = 0

clear_output(wait=True)