Q-Learning

Temporal Difference Q-Learning update formula:

$$Q(s, a) \leftarrow Q(s, a) + \alpha (R(s) + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a))$$

The closely related SARSA update is obtained when, instead of the maximum, we use the value of the next action actually selected from our current $Q(s,a)$ look-up table; with a purely greedy selection, as used below, the two coincide.
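
For contrast, using the same notation, the SARSA update replaces the maximum with the value of the action $a'$ actually taken in $s'$ (it is not used in the implementation below):

$$Q(s, a) \leftarrow Q(s, a) + \alpha (R(s) + \gamma \cdot Q(s', a') - Q(s, a))$$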

  • $Q(s,a)$ : the score of being in the current state and taking the current action
  • $\alpha$ : the learning rate
  • $R(s)$ : the reward for reaching state $s$
  • $\gamma$ : the discount factor applied to the expected score from the next state
  • $\max_{a'} Q(s',a')$ : the best expected score over the actions available from the next state (a single numeric update is sketched right after this list)
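
To make the update concrete, here is a minimal numeric sketch in R; the values are made up and not taken from the run below:

alpha <- 0.5; gamma <- 1.0;   # learning rate and discount factor
q.sa  <- 0.5;                 # current Q(s, a)
r     <- 0.0;                 # R(s), reward for being in s
q.max <- 1.0;                 # max over a' of Q(s', a')

q.sa <- q.sa + alpha * (r + gamma * q.max - q.sa);
q.sa                          # 0.75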

We also define the Utility Function $U(s)$ as

$$U(s) = \max_{a} Q(s,a)$$

so at each step we want to maximize the utility of our state by choosing the best available action:

$$Q(s, a) \leftarrow Q(s, a) + \alpha (R(s) + \gamma \cdot U(s') - Q(s, a))$$

When $s$ is a terminal state, we take its utility to be the final reward obtained.
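
As a quick illustration, here is a standalone sketch of $U(s)$ over a Q look-up table; the 4 x 4 x 9 array mirrors the q.scoring table used in the implementation below, but the values here are made up:

q.scoring <- array(0.5, c(4, 4, 9));          # hypothetical Q table: rows x columns x actions
q.scoring[2, 3, 4] <- 0.9;                    # pretend one state-action pair was already updated

s <- c(2, 3);                                 # current state (row, column)
u.s <- max(q.scoring[s[1], s[2], ]);          # U(s) = max over all actions a of Q(s, a)
u.s                                           # 0.9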

Implementation of the Q-Learning function

We create the qlearn function, including the utility function that will be used (in this case, an exhaustive search over every possible action).


In [1]:
qlearn <- function (actions, rewards, s.initial, alpha, gamma, max.iters, q.scoring = NULL)
{
    # Auxiliary function to get the index of a given action
    get.action.index <- function(a)
    {
        which(sapply(actions, function(x) all(x == a)));
    }
    
    # Utility Function // Fitness Function
    u.function <- function(s, valid.actions)
    {
        best.q <- -Inf;
        best.a <- -Inf;
        best.s <- s;
        
        valid.actions <- sample(valid.actions);    # Shuffle so ties between equal Q values are broken at random
        for (a.prime in valid.actions)
        {
            # Tentative Status
            s.prime <- a.prime + s;

            # Check score for Q(state', action')
            q.prime <- q.scoring[s.prime[1], s.prime[2], get.action.index(a.prime)];
            
            if (q.prime > best.q)
            {
                best.q <- q.prime;
                best.a <- a.prime;
                best.s <- s.prime;    # ... for not computing best.s again later
            }
        }
        
        list(q = best.q, a = best.a, s = best.s);
    }

    # Viability Function
    v.function <- function(s)
    {
        valid.actions <- list();
        for (i in 1:length(actions))
        {
            a.prime <- actions[[i]];

            # Check if current action brings us out of bounds
            if (s[1] + a.prime[1] > 0 & s[2] + a.prime[2] > 0 &
                s[1] + a.prime[1] <= dim(rewards)[1] & s[2] + a.prime[2] <= dim(rewards)[2])
                valid.actions[[length(valid.actions) + 1]] <- a.prime;
        }
        valid.actions;
    }
    
    # Initialize scoring, state and action variables
    if (is.null(q.scoring)) q.scoring <- array(0.5, c(dim(rewards), length(actions)));
    s <- s.initial;
    a <- actions[[1]];
    
    # Initialize counters for breaking loop
    convergence.count <- 0;
    iteration.count <- 0;
    prev.status <- c(-1,-1);    # To detect two-step periods...

    while (iteration.count < max.iters)
    {
        # Get valid actions
        valid.actions <- v.function(s);
        
        # Get maximum Q in current Status S
        best <- u.function(s, valid.actions);
        
        # Check convergence. We add a convergence threshold to stop
        convergence.count <- if (all(s == best$s) || all(prev.status == best$s)) convergence.count + 1 else 0;
        if (convergence.count > 5) break;
        
        # Solve Reward
        r <- rewards[s[1],s[2]];
        
        # Update State-Action Table
        action.index <- get.action.index(a);
        current.score <- q.scoring[s[1], s[2], action.index];
        q.scoring[s[1], s[2], action.index] <- current.score + alpha * (r + gamma * best$q - current.score);

        # Change the Status
        prev.status <- s;
        s <- best$s;
        a <- best$a;   
        iteration.count <- iteration.count + 1;

        message(paste("Iteration: ", iteration.count,
                      " Best Position: ", paste(best$s, collapse = ","),
                      " Best Action: ", paste(best$a, collapse = ","),
                      " Best Value: ", best$q, sep = "")
        );
    }
    
    list(Final.Status = s,
         Final.Action = a,
         Final.Reward = rewards[s[1],s[2]],
         Iterations = iteration.count,
         Final.Score = q.scoring
    );
}

Example Case

Consider a two-dimensional space of 4 x 4 cells, with the following rewards for being in each cell:

      [,1]  [,2]   [,3]  [,4]
[1,]    +0    +0     +0    +0
[2,]    +0    +0     +0    +0
[3,]    +0    +0  -0.04    +0
[4,]    +0    +0     +1  -0.1

There is a goal point, position [4,3], with a high reward for being on it, while leaving it yields no reward or a negative one.

Our actions are the king moves from chess, plus the No Operation (NOP) move. Adding the NOP move allows the agent to stay in the best position once found and exhaust the convergence counter until the loop breaks, finishing the game. The drawback of NOP is that we could get stuck in a sub-optimal local position, whereas forcing the agent to always move could let it escape from such positions.

Problem Details:

  • The space has dimensions 4 x 4
  • The goal is to reach [4,3] (we never tell the agent which cell is the goal, we only reward it better)
  • The start point is chosen at random
  • The reward depends only on the current position

In [2]:
# The 8 king moves plus the No Operation move c(0,0)
actions <- list(c(0,1), c(1,1), c(1,0), c(1,-1), c(0,-1), c(-1,-1), c(-1,0), c(-1,1), c(0,0));

# 4 x 4 reward grid: the goal cell [4,3] pays +1, two nearby cells are penalized
rewards <- matrix(0, ncol = 4, nrow = 4);
rewards[3,3] <- -0.04;
rewards[4,3] <- +1;
rewards[4,4] <- -0.1;

# Random start position
s.initial <- c(sample(1:4,1), sample(1:4,1));

result <- qlearn(actions, rewards, s.initial, alpha = 0.5, gamma = 1.0, max.iters = 50);


Iteration: 1 Best Position: 1,4 Best Action: -1,0 Best Value: 0.5
Iteration: 2 Best Position: 2,4 Best Action: 1,0 Best Value: 0.5
Iteration: 3 Best Position: 2,3 Best Action: 0,-1 Best Value: 0.5
Iteration: 4 Best Position: 2,4 Best Action: 0,1 Best Value: 0.5
Iteration: 5 Best Position: 2,4 Best Action: 0,0 Best Value: 0.5
Iteration: 6 Best Position: 2,4 Best Action: 0,0 Best Value: 0.5
Iteration: 7 Best Position: 3,3 Best Action: 1,-1 Best Value: 0.5
Iteration: 8 Best Position: 4,3 Best Action: 1,0 Best Value: 0.5
Iteration: 9 Best Position: 3,4 Best Action: -1,1 Best Value: 0.5
Iteration: 10 Best Position: 4,3 Best Action: 1,-1 Best Value: 0.5
Iteration: 11 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.5
Iteration: 12 Best Position: 4,1 Best Action: 1,-1 Best Value: 0.5
Iteration: 13 Best Position: 3,2 Best Action: -1,1 Best Value: 0.5
Iteration: 14 Best Position: 2,3 Best Action: -1,1 Best Value: 0.5
Iteration: 15 Best Position: 2,3 Best Action: 0,0 Best Value: 0.5
Iteration: 16 Best Position: 1,4 Best Action: -1,1 Best Value: 0.5
Iteration: 17 Best Position: 1,3 Best Action: 0,-1 Best Value: 0.5
Iteration: 18 Best Position: 1,3 Best Action: 0,0 Best Value: 0.5
Iteration: 19 Best Position: 2,3 Best Action: 1,0 Best Value: 0.5
Iteration: 20 Best Position: 3,4 Best Action: 1,1 Best Value: 0.5
Iteration: 21 Best Position: 4,3 Best Action: 1,-1 Best Value: 1
Iteration: 22 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.5
Iteration: 23 Best Position: 4,1 Best Action: 1,-1 Best Value: 0.5
Iteration: 24 Best Position: 3,1 Best Action: -1,0 Best Value: 0.5
Iteration: 25 Best Position: 2,2 Best Action: -1,1 Best Value: 0.5
Iteration: 26 Best Position: 2,2 Best Action: 0,0 Best Value: 0.5
Iteration: 27 Best Position: 3,1 Best Action: 1,-1 Best Value: 0.5
Iteration: 28 Best Position: 3,1 Best Action: 0,0 Best Value: 0.5
Iteration: 29 Best Position: 4,1 Best Action: 1,0 Best Value: 0.5
Iteration: 30 Best Position: 4,2 Best Action: 0,1 Best Value: 0.5
Iteration: 31 Best Position: 3,2 Best Action: -1,0 Best Value: 0.5
Iteration: 32 Best Position: 2,3 Best Action: -1,1 Best Value: 0.5
Iteration: 33 Best Position: 3,4 Best Action: 1,1 Best Value: 0.75
Iteration: 34 Best Position: 4,3 Best Action: 1,-1 Best Value: 1.25
Iteration: 35 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.5
Iteration: 36 Best Position: 2,3 Best Action: -1,1 Best Value: 0.625
Iteration: 37 Best Position: 3,4 Best Action: 1,1 Best Value: 1
Iteration: 38 Best Position: 4,3 Best Action: 1,-1 Best Value: 1.375
Iteration: 39 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.5625
Iteration: 40 Best Position: 2,3 Best Action: -1,1 Best Value: 0.8125
Iteration: 41 Best Position: 3,4 Best Action: 1,1 Best Value: 1.1875
Iteration: 42 Best Position: 4,3 Best Action: 1,-1 Best Value: 1.46875
Iteration: 43 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.6875
Iteration: 44 Best Position: 2,3 Best Action: -1,1 Best Value: 1
Iteration: 45 Best Position: 3,4 Best Action: 1,1 Best Value: 1.328125
Iteration: 46 Best Position: 4,3 Best Action: 1,-1 Best Value: 1.578125
Iteration: 47 Best Position: 3,2 Best Action: -1,-1 Best Value: 0.84375
Iteration: 48 Best Position: 2,3 Best Action: -1,1 Best Value: 1.1640625
Iteration: 49 Best Position: 3,4 Best Action: 1,1 Best Value: 1.453125
Iteration: 50 Best Position: 4,3 Best Action: 1,-1 Best Value: 1.7109375

In [3]:
str(result)


List of 5
 $ Final.Status: num [1:2] 4 3
 $ Final.Action: num [1:2] 1 -1
 $ Final.Reward: num 1
 $ Iterations  : num 50
 $ Final.Score : num [1:4, 1:4, 1:9] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
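
The learned table returned in result$Final.Score can be inspected directly. As a small follow-up sketch (assuming the result object from the run above), the utility $U(s)$ of each cell is the maximum of its scores over the action dimension, and the greedy action is the index attaining it:

# U(s) = max_a Q(s, a), taken over the 3rd (action) dimension of the Q table
utilities <- apply(result$Final.Score, c(1, 2), max);
round(utilities, 3)

# Index (into `actions`) of the greedy action for each cell; which.max keeps the first maximum on ties
greedy.action <- apply(result$Final.Score, c(1, 2), which.max);
greedy.action

Since qlearn also accepts a previously learned table through its q.scoring argument, result$Final.Score can be passed back in to resume training from where this run stopped.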