Temporal Difference Q-Learning formula:
$$Q(s, a) \leftarrow Q(s, a) + \alpha (R(s) + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a))$$
This is also known as SARSA when, instead of the maximum, the update uses the action actually selected in the new state from our current $Q(s, a)$ look-up table.
We also consider the Utility Function $U(s)$ as
$$U(s) = \max_{a} Q(s, a)$$
so at each step we want to maximize the utility of our state by choosing the best action each time:
$$Q(s, a) \leftarrow Q(s, a) + \alpha (R(s) + \gamma \cdot U(s') - Q(s, a))$$
When $s$ is a terminal state, we take its utility to be the final reward obtained.
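For intuition, here is a minimal numeric sketch of a single update step (the values of $Q(s,a)$, $R(s)$ and $U(s')$ below are illustrative, not taken from the problem solved later): with $\alpha = 0.5$ and $\gamma = 1$, starting from $Q(s,a) = 0.5$, $R(s) = 0$ and $U(s') = 0.8$, the new score is $0.5 + 0.5 \cdot (0 + 0.8 - 0.5) = 0.65$. In R:
q.sa <- 0.5; # current Q(s, a)
r <- 0; # reward R(s) for being in s
u.prime <- 0.8; # U(s') = max over a' of Q(s', a')
alpha <- 0.5; gamma <- 1.0;
q.sa + alpha * (r + gamma * u.prime - q.sa); # 0.65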
In [1]:
qlearn <- function(actions, rewards, s.initial, alpha, gamma, max.iters, q.scoring = NULL)
{
    # Auxiliary function to get the index of a given action
    get.action.index <- function(a)
    {
        which(sapply(actions, function(x) all(x == a)));
    }
    # Utility Function // Fitness Function
    u.function <- function(s, valid.actions)
    {
        best.q <- -Inf;
        best.a <- NULL;
        best.s <- s;
        # Shuffle the valid actions so ties are broken at random
        valid.actions <- sample(valid.actions);
        for (a.prime in valid.actions)
        {
            # Tentative Status
            s.prime <- a.prime + s;
            # Check score for Q(state', action')
            q.prime <- q.scoring[s.prime[1], s.prime[2], get.action.index(a.prime)];
            if (q.prime > best.q)
            {
                best.q <- q.prime;
                best.a <- a.prime;
                best.s <- s.prime; # ... so we do not compute best.s again later
            }
        }
        list(q = best.q, a = best.a, s = best.s);
    }
    # Viability Function
    v.function <- function(s)
    {
        valid.actions <- list();
        for (i in 1:length(actions))
        {
            a.prime <- actions[[i]];
            # Discard any action that would take us out of bounds
            if (s[1] + a.prime[1] > 0 & s[2] + a.prime[2] > 0 &
                s[1] + a.prime[1] <= dim(rewards)[1] & s[2] + a.prime[2] <= dim(rewards)[2])
                valid.actions[[length(valid.actions) + 1]] <- a.prime;
        }
        valid.actions;
    }
    # Initialize scoring (0.5 everywhere by default), state and action variables
    if (is.null(q.scoring)) q.scoring <- array(0.5, c(dim(rewards)[1], dim(rewards)[2], length(actions)));
    s <- s.initial;
    a <- actions[[1]];
    # Initialize counters for breaking the loop
    convergence.count <- 0;
    iteration.count <- 0;
    prev.status <- c(-1, -1); # To detect two-step periods...
    while (iteration.count < max.iters)
    {
        # Get valid actions
        valid.actions <- v.function(s);
        # Get maximum Q in current Status S
        best <- u.function(s, valid.actions);
        # Check convergence. We add a convergence threshold to stop
        convergence.count <- if (all(s == best$s) || all(prev.status == best$s)) convergence.count + 1 else 0;
        if (convergence.count > 5) break;
        # Solve Reward
        r <- rewards[s[1], s[2]];
        # Update State-Action Table
        action.index <- get.action.index(a);
        current.score <- q.scoring[s[1], s[2], action.index];
        q.scoring[s[1], s[2], action.index] <- current.score + alpha * (r + gamma * best$q - current.score);
        # Change the Status
        prev.status <- s;
        s <- best$s;
        a <- best$a;
        iteration.count <- iteration.count + 1;
        message(paste("Iteration: ", iteration.count,
                      " Best Position: ", paste(best$s, collapse = ","),
                      " Best Action: ", paste(best$a, collapse = ","),
                      " Best Value: ", best$q, sep = "")
        );
    }
    list(Final.Status = s,
         Final.Action = a,
         Final.Reward = rewards[s[1], s[2]],
         Iterations = iteration.count,
         Final.Score = q.scoring
    );
}
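The bounds check performed by v.function can also be tried in isolation. The snippet below is an illustrative standalone sketch (board.dim, moves and valid are local names, not part of qlearn) showing which of the nine moves survive from the corner cell (1,1) of a 4 x 4 board:
board.dim <- c(4, 4);
moves <- list(c(0,1), c(1,1), c(1,0), c(1,-1), c(0,-1), c(-1,-1), c(-1,0), c(-1,1), c(0,0));
s <- c(1, 1);
# Keep only the moves whose target cell stays within the board
valid <- Filter(function(a) all(s + a > 0) && all(s + a <= board.dim), moves);
length(valid); # 4 moves remain: c(0,1), c(1,1), c(1,0) and the NOP c(0,0)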
Consider a two-dimensional space of 4 x 4 cells, with the following reward for being in each cell:
|      | [,1] | [,2] | [,3]  | [,4] |
|------|------|------|-------|------|
| [1,] | +0   | +0   | +0    | +0   |
| [2,] | +0   | +0   | +0    | +0   |
| [3,] | +0   | +0   | -0.04 | +0   |
| [4,] | +0   | +0   | +1    | -0.1 |
There is a goal point at position [4,3], with a high reward for being on it; the neighbouring cells [3,3] and [4,4] carry small penalties, and every other cell gives no reward.
Our actions are the king moves from chess, plus a "No Operation" (NOP) move. Adding the NOP move lets the agent stay in the best position once it is found and exhaust the convergence counter until the loop breaks, finishing the game. Its drawback is that we could get stuck in a sub-optimal local position, whereas forcing the agent to always move could let it escape from one.
Problem Details:
In [2]:
# The 8 king moves plus the NOP move c(0,0)
actions <- list(c(0,1), c(1,1), c(1,0), c(1,-1), c(0,-1), c(-1,-1), c(-1,0), c(-1,1), c(0,0));
# 4 x 4 reward matrix: goal at [4,3], small penalties at [3,3] and [4,4]
rewards <- matrix(0, ncol = 4, nrow = 4);
rewards[3,3] <- -0.04;
rewards[4,3] <- +1;
rewards[4,4] <- -0.1;
# Random initial position on the board
s.initial <- c(sample(1:4,1), sample(1:4,1));
result <- qlearn(actions, rewards, s.initial, alpha = 0.5, gamma = 1.0, max.iters = 50);
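Because both s.initial and the tie-breaking shuffle inside u.function use sample, results differ between runs. As an optional sketch (the seed value is arbitrary and not part of the original run), fixing the seed beforehand makes the experiment reproducible:
set.seed(42);
s.initial <- c(sample(1:4,1), sample(1:4,1));
result <- qlearn(actions, rewards, s.initial, alpha = 0.5, gamma = 1.0, max.iters = 50);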
In [3]:
str(result)
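result$Final.Score is the 4 x 4 x 9 state-action table, so the greedy action for every cell can be read off with apply. This is an optional inspection sketch, not part of the original notebook; note that cells never visited keep their initial 0.5 scores, for which which.max simply returns the first action index:
# Index (into `actions`) of the highest-scoring action for each cell
greedy.policy <- apply(result$Final.Score, c(1, 2), which.max);
greedy.policy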