Reinforcement Learning

This notebook serves as the supporting material for the chapter Reinforcement Learning. It illustrates the use of the reinforcement package of the code repository. Here we'll examine how an agent can learn what to do, in the absence of labeled examples, from rewards and punishments.


In [1]:
%classpath add jar ../out/artifacts/aima_core_jar/aima-core.jar


Reinforcement learning refers to goal-oriented algorithms that learn how to attain a complex objective (goal) or to maximize some measure of cumulative reward over many steps; for example, to maximize the points won in a game over many moves. They can start from a blank slate and, under the right conditions, achieve superhuman performance.

Consider the problem of learning to play chess. A supervised learning agent would need to be told the correct move for each position it encounters, but such feedback is seldom available. In the absence of this feedback, the agent needs to know that something good has happened when it (perhaps accidentally) checkmates its opponent, and that something bad has happened when it gets checkmated. This kind of feedback is called a reward, or reinforcement. Reinforcement learning differs from supervised learning in that supervised training data comes labeled with the correct answer, so the model is trained on the answers themselves, whereas in reinforcement learning there are no such answers: the agent itself decides what to do to perform the given task and, in the absence of a training dataset, is bound to learn from its own experience.

Usually, in game playing, it is very hard for a human to provide accurate and consistent evaluations of a large number of positions. Therefore, the program is told when it has won or lost, and the agent uses this information to learn a reasonably accurate evaluation function.

Some Definitions

Let's have a look at some important concepts before proceeding further:

  • Reward ($R$): A reward is the feedback by which we measure the success or failure of an agent's actions. From any given state, an agent sends output in the form of actions to the environment, and the environment returns the agent's new state (which resulted from acting on the previous state) as well as any reward. Rewards effectively evaluate the agent's actions.
  • Policy ($\pi$): The policy is the strategy that the agent employs to determine the next action based on the current state. It maps states to the actions that promise the highest reward. The policy that yields the highest expected utility is known as an optimal policy; we use $\pi^*$ to denote an optimal policy.
  • Discount factor ($\gamma$): The discount factor is multiplied into future rewards to dampen their effect on the agent's choice of action. Why? It is designed to make future rewards worth less than immediate rewards. If $\gamma$ is 0.8 and there is a reward of 10 points after 3 time steps, the present value of that reward is $0.8^3 \times 10 = 5.12$. A discount factor of 1 would make future rewards worth just as much as immediate rewards. (A small sketch of this arithmetic follows this list.)
  • Transition model: The transition model describes the outcome of each action in each state. If the outcomes are stochastic, we write $P(s'|s,a)$ to denote the probability of reaching state $s'$ if action $a$ is done in state $s$. We'll assume the transitions are Markovian, i.e., the probability of reaching $s'$ from $s$ depends only on $s$ and not on the history of earlier states.
  • Utility ($U(s)$): The utility of a state is the expected sum of discounted rewards obtained when the policy $\pi$ is followed from that state onward.
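To make the discount-factor arithmetic concrete, here is a tiny Java sketch (not part of aima-core; the reward sequence is invented for the example) that computes a discounted return:

public class DiscountedReturn {
    // Sum of gamma^t * r_t over a finite reward sequence.
    static double discountedReturn(double[] rewards, double gamma) {
        double sum = 0.0;
        double discount = 1.0;
        for (double r : rewards) {
            sum += discount * r;
            discount *= gamma;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A reward of 10 received after 3 time steps, discounted by 0.8,
        // is worth 0.8^3 * 10 = 5.12 today.
        double[] rewards = {0.0, 0.0, 0.0, 10.0};
        System.out.println(discountedReturn(rewards, 0.8)); // prints approximately 5.12
    }
}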

Passive Reinforcement Learning

In passive learning, the agent's policy $\pi$ is fixed: in state $s$, it always executes the action $\pi(s)$. Its goal is to learn how good the policy is, that is, to learn the utility function $U^{\pi}(s)$. Note that the passive learning agent does not know the transition model $P(s'|s,a)$, which specifies the probability of reaching state $s'$ from state $s$ after doing action $a$; nor does it know the reward function $R(s)$, which specifies the reward for each state. The agent executes a set of trials in the environment using its policy $\pi$. In each trial, the agent begins at the start position and experiences a sequence of state transitions until it reaches one of the terminal states. Its percepts supply both the current state and the reward received in that state. The objective is to use the information about rewards to learn the expected utility $U^{\pi}(s)$ associated with each non-terminal state $s$.

The utility values obey the Bellman equation for a fixed policy $\pi$; that is, the utility of each state equals its own reward plus the expected discounted utility of its successor states:

$U^{\pi}(s) = R(s) + \gamma\sum_{s'}P(s' | s,\pi(s))U^\pi(s')$
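To see how this equation is used, here is a minimal Java sketch of iterative policy evaluation on a tiny, hand-made Markov chain induced by some fixed policy. The states, rewards, and transition probabilities are invented for the example (they are not the $4\times3$ world used later); the point is only the repeated application of the fixed-policy Bellman update.

import java.util.Arrays;

public class FixedPolicyEvaluation {
    public static void main(String[] args) {
        // Illustrative 3-state chain; state 2 is terminal (no successors).
        double[] R = {-0.04, -0.04, 1.0};   // reward of each state
        // P[s][sPrime] = probability of reaching sPrime from s under the fixed policy pi(s)
        double[][] P = {
            {0.1, 0.9, 0.0},
            {0.0, 0.1, 0.9},
            {0.0, 0.0, 0.0}
        };
        double gamma = 1.0;
        double[] U = new double[R.length];

        // Repeatedly apply U(s) <- R(s) + gamma * sum_sPrime P(sPrime|s,pi(s)) * U(sPrime).
        for (int k = 0; k < 100; k++) {
            double[] next = new double[U.length];
            for (int s = 0; s < U.length; s++) {
                double expected = 0.0;
                for (int sPrime = 0; sPrime < U.length; sPrime++) {
                    expected += P[s][sPrime] * U[sPrime];
                }
                next[s] = R[s] + gamma * expected;
            }
            U = next;
        }
        System.out.println(Arrays.toString(U)); // converged utility estimates under pi
    }
}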

Adaptive Dynamic Programming

An adaptive dynamic programming agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method. For a passive learning agent, this means plugging a learned transition model $P(s'|s,\pi(s))$ and the observed reward $R(s)$ into the Bellman equation to calculate the utilities of states.

Let's have a look at the pseudo code of Passive ADP agent:


In [17]:
%%python
from notebookUtils import *
pseudocode('Passive ADP Agent')


Out[17]:

AIMA3e

function Passive-ADP-Agent(percept) returns an action
inputs: percept, a percept indicating the current state s' and reward signal r'
persistent: π, a fixed policy
       mdp, an MDP with model P, rewards R, discount γ
       U, a table of utilities, initially empty
       Nsa, a table of frequencies for state-action pairs, initially zero
       Ns'|sa, a table of outcome frequencies given state-action pairs, initially zero
       s, a, the previous state and action, initially null
if s' is new then U[s'] ← r'; R[s'] ← r'
if s is not null then
   increment Nsa[s, a] and Ns'|sa[s', s, a]
   for each t such that Ns'|sa[t, s, a] is nonzero do
     P(t | s, a) ← Ns'|sa[t, s, a] / Nsa[s, a]
U ← Policy-Evaluation(π, U, mdp)
if s'.Terminal? then s, a ← null else s, a ← s', π[s']
return a


Figure ?? A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??.

Let's see our Passive ADP agent in action! Consider a $4\times3$ cell world with $[1,1]$ as the starting position. The policy $\pi$ for the $4\times3$ world is constructed explicitly (as the fixedPolicy map) in the code below. This policy happens to be optimal with rewards of $R(s)=-0.04$ in the non-terminal states and no discounting.


In [45]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveADPAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

// The 4x3 cell world of Fig 17.1, and an environment over it whose trials start at [1,1]
// and use the stochastic transition model of that figure.
CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
// The fixed policy to be evaluated (the optimal policy for this world, as noted above).
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
// Passive ADP agent that learns U^pi for the fixed policy, using modified policy evaluation.
PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(
                                                                fixedPolicy,
                                                                cw.getCells(), 
                                                                cw.getCellAt(1, 1), 
                                                                MDPFactory.createActionsFunctionForFigure17_1(cw),
                                                                new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));
cwe.addAgent(padpa);
padpa.reset();
cwe.executeTrials(2000);
System.out.println("Cell"  + " \t:\t" + "Expected Utility");
System.out.println("----------------------------------");
Map<Cell<Double>, Double> U = padpa.getUtility();
for(int i = 1; i<=4; i++){
    for(int j = 1; j<=3; j++){
        if(i==2 && j==2) continue; //Ignore wall
        System.out.println("[" + i + "," + j + "]"  + " \t:\t" + U.get(cw.getCellAt(i,j)));
    }
}


Cell 	:	Expected Utility
----------------------------------
[1,1] 	:	0.7081191789618011
[1,2] 	:	0.763739805626589
[1,3] 	:	0.8138902449639785
[2,1] 	:	0.6584738143530444
[2,3] 	:	0.8703399959976966
[3,1] 	:	null
[3,2] 	:	0.6759847275839783
[3,3] 	:	0.9205199990813694
[4,1] 	:	null
[4,2] 	:	-1.0
[4,3] 	:	1.0
Out[45]:
null

Note that the cells $[3,1]$ and $[4,1]$ are never reached when starting at $[1,1]$ and following this policy under the default transition model (the agent moves in the intended direction 80% of the time and at each right angle to the intended direction 10% of the time), so their utilities remain null.

The learning curves of the Passive ADP agent for the $4\times3$ world (given the optimal policy) are shown below.


In [42]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveADPAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

int numRuns = 20;
int numTrialsPerRun = 100;
int rmseTrialsToReport = 100;
int reportEveryN = 1;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
PassiveADPAgent<Cell<Double>, CellWorldAction> padpa = new PassiveADPAgent<Cell<Double>, CellWorldAction>(
                                                                fixedPolicy,
                                                                cw.getCells(), 
                                                                cw.getCellAt(1, 1), 
                                                                MDPFactory.createActionsFunctionForFigure17_1(cw),
                                                                new ModifiedPolicyEvaluation<Cell<Double>, CellWorldAction>(10,1.0));
cwe.addAgent(padpa);
Map<Integer, List<Map<Cell<Double>, Double>>> runs = new HashMap<Integer, List<Map<Cell<Double>, Double>>>();
for (int r = 0; r < numRuns; r++) {
    padpa.reset();
    List<Map<Cell<Double>, Double>> trials = new ArrayList<Map<Cell<Double>, Double>>();
    for (int t = 0; t < numTrialsPerRun; t++) {
        cwe.executeTrial();
        if (0 == t % reportEveryN) {
            Map<Cell<Double>, Double> u = padpa.getUtility();
            trials.add(u);
        }
    }
    runs.put(r, trials);
}

def T = [];
def v4_3 = [];
def v3_3 = [];
def v1_3 = [];
def v1_1 = [];
def v3_2 = [];
def v2_1 = [];
double tmp = 0.0;
for (int t = 0; t < (numTrialsPerRun/reportEveryN); t++) {
    T.add(t);
    Map<Cell<Double>, Double> u = runs.get(numRuns - 1).get(t);
    tmp = (u.containsKey(cw.getCellAt(4, 3)) ? u.get(cw.getCellAt(4, 3)) : 0.0);
    v4_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 3)) ? u.get(cw.getCellAt(3, 3)) : 0.0);
    v3_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 3)) ? u.get(cw.getCellAt(1, 3)) : 0.0);
    v1_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 1)) ? u.get(cw.getCellAt(1, 1)) : 0.0);
    v1_1.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 2)) ? u.get(cw.getCellAt(3, 2)) : 0.0);
    v3_2.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(2, 1)) ? u.get(cw.getCellAt(2, 1)) : 0.0);
    v2_1.add(tmp);
}

def p1 = new Plot(title: "Learning Curve", yLabel: "Utility estimates", xLabel: "Number of trials");
p1 << new Line(x: T, y: v4_3, displayName: "v4_3")
p1 << new Line(x: T, y: v3_3, displayName: "v3_3")
p1 << new Line(x: T, y: v1_3, displayName: "v1_3")
p1 << new Line(x: T, y: v1_1, displayName: "v1_1")
p1 << new Line(x: T, y: v3_2, displayName: "v3_2")
p1 << new Line(x: T, y: v2_1, displayName: "v2_1")

def trails = [];
def rmseValues = [];
for (int t = 0; t < rmseTrialsToReport; t++) {
    trails.add(t);
    double xSsquared = 0;
    for (int r = 0; r < numRuns; r++) {
        Map<Cell<Double>, Double> u = runs.get(r).get(t);
        Double val1_1 = u.get(cw.getCellAt(1, 1));
        xSsquared += Math.pow(0.705 - val1_1, 2);
    }
    double rmse = Math.sqrt(xSsquared/runs.size());
    rmseValues.add(rmse);
}
def p2 = new Plot(yLabel: "RMS error in utility", xLabel: "Number of trials");
p2 << new Line(x: trails, y: rmseValues)
OutputCell.HIDDEN

  • The first figure shows the utility estimates for some of the states as a function of the number of trials. Notice the large changes occurring around the 63rd trial in this run: this is the first time the agent falls into the $-1$ terminal state at $[4,2]$.
  • The second plot shows the root-mean-square error in the estimate for $U(1,1)$, averaged over 20 runs of 100 trials each.

Temporal-difference Learning

Rather than solving the underlying MDP as in the preceding section, another approach is to use the observed transitions to adjust the utilities of the observed states so that they agree with the constraint equations. When a transition occurs from state $s$ to state $s'$, we apply the following update rule to $U^\pi(s)$:

$U^{\pi}(s) \leftarrow U^{\pi}(s) + \alpha(R(s) + \gamma U^\pi(s') - U^{\pi}(s) )$

Here $\alpha$ is the learning rate parameter. Because this update rule uses the difference in the utilities of the successive states, it is often called the temporal-difference equation.

In the case of passive learning, the equilibrium is given by the equation $U^{\pi}(s) = R(s) + \gamma\sum_{s'}P(s' | s,\pi(s))U^\pi(s')$. The temporal-difference update does, in fact, move the agent toward this equilibrium, but notice that the update involves only the observed successor $s'$, whereas the actual equilibrium conditions involve all possible next states. One might think that this causes an improperly large change in $U^\pi(s)$ when a very rare transition occurs; but, in fact, because rare transitions occur only rarely, the average value of $U^\pi(s)$ will converge to the correct value. Furthermore, if we change $\alpha$ from a fixed parameter to a function that decreases as the number of times a state has been visited increases, then $U^\pi(s)$ itself will converge to the correct value.
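The update itself is a one-liner. The following sketch (illustrative only, not the aima-core PassiveTDAgent) applies it to a single observed transition and uses a simple $\alpha(n) = 1/n$ schedule chosen just for the example; note that the Passive TD agent pseudocode below initializes a newly seen state's utility to its observed reward, whereas this sketch starts at zero for brevity.

import java.util.HashMap;
import java.util.Map;

public class TdUpdateSketch {
    private final Map<String, Double> U = new HashMap<>();   // utility estimates
    private final Map<String, Integer> Ns = new HashMap<>(); // visit counts per state
    private final double gamma = 1.0;

    // Step size that decreases with the number of visits to s (illustrative choice).
    private double alpha(int n) {
        return 1.0 / n;
    }

    // Apply U(s) <- U(s) + alpha(Ns[s]) * (R(s) + gamma * U(s') - U(s)).
    void observeTransition(String s, double rewardOfS, String sPrime) {
        U.putIfAbsent(s, 0.0);
        U.putIfAbsent(sPrime, 0.0);
        int n = Ns.merge(s, 1, Integer::sum);
        double tdError = rewardOfS + gamma * U.get(sPrime) - U.get(s);
        U.put(s, U.get(s) + alpha(n) * tdError);
    }
}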

Let's have a look at the pseudo-code of the Passive TD agent.


In [47]:
%%python
from notebookUtils import *
pseudocode('Passive TD Agent')


Out[47]:

AIMA3e

function Passive-TD-Agent(percept) returns an action
inputs: percept, a percept indicating the current state s' and reward signal r'
persistent: π, a fixed policy
       U, a table of utilities, initially empty
       Ns, a table of frequencies for states, initially zero
       s, a, r, the previous state, action, and reward, initially null

if s' is new then U[s'] ← r'
if s is not null then
   increment Ns[s]
   U[s] ← U[s] + α(Ns[s])(r + γ U[s'] - U[s])
if s'.Terminal? then s, a, r ← null else s, a, r ← s', π[s'], r'
return a


Figure ?? A passive reinforcement learning agent that learns utility estimates using temporal differences. The step-size function α(n) is chosen to ensure convergence, as described in the text.

Let's test this agent on the $4\times3$ cell world discussed above.


In [49]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveTDAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
// Passive TD agent for the same fixed policy, with learning rate 0.2 and discount factor 1.0.
PassiveTDAgent<Cell<Double>, CellWorldAction> ptda = new PassiveTDAgent<Cell<Double>, CellWorldAction>(fixedPolicy, 0.2, 1.0);
cwe.addAgent(ptda);
ptda.reset();
cwe.executeTrials(10000);
System.out.println("Cell"  + " \t:\t" + "Expected Utility");
System.out.println("----------------------------------");
Map<Cell<Double>, Double> U = ptda.getUtility();
for(int i = 1; i<=4; i++){
    for(int j = 1; j<=3; j++){
        if(i==2 && j==2) continue; //Ignore wall
        System.out.println("[" + i + "," + j + "]"  + " \t:\t" + U.get(cw.getCellAt(i,j)));
    }
}


Cell 	:	Expected Utility
----------------------------------
[1,1] 	:	0.7008970901464446
[1,2] 	:	0.7981190806547613
[1,3] 	:	0.8382389196871616
[2,1] 	:	0.6452315395688609
[2,3] 	:	0.9154049338048013
[3,1] 	:	null
[3,2] 	:	0.66972055686578
[3,3] 	:	0.9594893300045573
[4,1] 	:	null
[4,2] 	:	-1.0
[4,3] 	:	1.0
Out[49]:
null

The learning curves of the Passive TD agent for the $4\times3$ cell world are shown below.


In [55]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.PassiveTDAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.probability.mdp.impl.ModifiedPolicyEvaluation;
import aima.core.util.JavaRandomizer;

import java.util.*;

int numRuns = 20;
int numTrialsPerRun = 1000;
int rmseTrialsToReport = 100;
int reportEveryN = 1;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
Map<Cell<Double>, CellWorldAction> fixedPolicy = new HashMap<Cell<Double>, CellWorldAction>();
fixedPolicy.put(cw.getCellAt(1, 1), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(1, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(2, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(2, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(3, 1), CellWorldAction.Left);
fixedPolicy.put(cw.getCellAt(3, 2), CellWorldAction.Up);
fixedPolicy.put(cw.getCellAt(3, 3), CellWorldAction.Right);
fixedPolicy.put(cw.getCellAt(4, 1), CellWorldAction.Left);
PassiveTDAgent<Cell<Double>, CellWorldAction> ptda = new PassiveTDAgent<Cell<Double>, CellWorldAction>(fixedPolicy, 0.2, 1.0);
cwe.addAgent(ptda);
Map<Integer, List<Map<Cell<Double>, Double>>> runs = new HashMap<Integer, List<Map<Cell<Double>, Double>>>();
for (int r = 0; r < numRuns; r++) {
    ptda.reset();
    List<Map<Cell<Double>, Double>> trials = new ArrayList<Map<Cell<Double>, Double>>();
    for (int t = 0; t < numTrialsPerRun; t++) {
        cwe.executeTrial();
        if (0 == t % reportEveryN) {
            Map<Cell<Double>, Double> u = ptda.getUtility();
            trials.add(u);
        }
    }
    runs.put(r, trials);
}

def T = [];
def v4_3 = [];
def v3_3 = [];
def v1_3 = [];
def v1_1 = [];
def v3_2 = [];
def v2_1 = [];
double tmp = 0.0;
for (int t = 0; t < (numTrialsPerRun/reportEveryN); t++) {
    T.add(t);
    Map<Cell<Double>, Double> u = runs.get(numRuns - 1).get(t);
    tmp = (u.containsKey(cw.getCellAt(4, 3)) ? u.get(cw.getCellAt(4, 3)) : 0.0);
    v4_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 3)) ? u.get(cw.getCellAt(3, 3)) : 0.0);
    v3_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 3)) ? u.get(cw.getCellAt(1, 3)) : 0.0);
    v1_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 1)) ? u.get(cw.getCellAt(1, 1)) : 0.0);
    v1_1.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 2)) ? u.get(cw.getCellAt(3, 2)) : 0.0);
    v3_2.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(2, 1)) ? u.get(cw.getCellAt(2, 1)) : 0.0);
    v2_1.add(tmp);
}

def p1 = new Plot(title: "Learning Curve", yLabel: "Utility estimates", xLabel: "Number of trials");
p1 << new Line(x: T, y: v4_3, displayName: "v4_3")
p1 << new Line(x: T, y: v3_3, displayName: "v3_3")
p1 << new Line(x: T, y: v1_3, displayName: "v1_3")
p1 << new Line(x: T, y: v1_1, displayName: "v1_1")
p1 << new Line(x: T, y: v3_2, displayName: "v3_2")
p1 << new Line(x: T, y: v2_1, displayName: "v2_1")

def trails = [];
def rmseValues = [];
for (int t = 0; t < rmseTrialsToReport; t++) {
    trails.add(t);
    double xSsquared = 0;
    for (int r = 0; r < numRuns; r++) {
        Map<Cell<Double>, Double> u = runs.get(r).get(t);
        Double val1_1 = u.get(cw.getCellAt(1, 1));
        xSsquared += Math.pow(0.705 - val1_1, 2);
    }
    double rmse = Math.sqrt(xSsquared/runs.size());
    rmseValues.add(rmse);
}
def p2 = new Plot(yLabel: "RMS error in utility", xLabel: "Number of trials");
p2 << new Line(x: trails, y: rmseValues)
OutputCell.HIDDEN

  • The first figure shows the utility estimates for some of the states as a function of the number of trials.
  • The second plot shows the root-mean-square error in the estimate for $U(1,1)$, averaged over 20 runs of 1000 trials each. Only the first 100 trials are shown to enable comparison with the ADP agent.

The passive TD agent does not learn quite as fast as the ADP agent and shows much higher variability, but it is much simpler and requires much less computation per observation. Notice that TD does not need the transition model to perform its updates: the environment supplies the connection between neighboring states in the form of observed transitions.

The ADP and TD approaches are closely related. Both try to make local adjustments to the utility estimates in order to make each state "agree" with its successors. One difference is that TD adjusts a state to agree with its observed successor, whereas ADP adjusts the state to agree with all of the successors that might occur, weighted by their probabilities.

Active Reinforcement Learning

A passive learning agent has a fixed policy that determines its behavior, whereas an active agent must decide what actions to take. To do so, it first needs to learn a complete model with outcome probabilities for all actions, rather than just a model for the fixed policy. Next, it needs to take into account the fact that it has a choice of actions. The utilities it needs to learn are those defined by the optimal policy; these obey the Bellman equation:

$U(s) = R(s) + \gamma \max_{a}\sum_{s'}P(s'|s, a)U(s')$

The final issue is what to do at each step. Having obtained a utility function $U$ that is optimal for the learned model, the agent can extract an optimal action by a one-step look-ahead that maximizes the expected utility; alternatively, if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends.
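A minimal sketch of the one-step look-ahead, using plain arrays for the learned model and the utilities (purely illustrative; the integer state and action indices and the array layout are assumptions of this sketch, not aima-core types):

public class OneStepLookahead {
    // Choose the action that maximizes the expected utility of the next state,
    // argmax_a sum_sPrime P(sPrime|s,a) * U(sPrime).
    static int bestAction(int s, double[][][] P, double[] U) {
        int best = 0;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < P[s].length; a++) {
            double expected = 0.0;
            for (int sPrime = 0; sPrime < U.length; sPrime++) {
                expected += P[s][a][sPrime] * U[sPrime];
            }
            if (expected > bestValue) {
                bestValue = expected;
                best = a;
            }
        }
        return best;
    }
}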

But an agent that simply follows the recommendation of the optimal policy for the learned model at each step fails to learn the true utilities or the true optimal policy! What happens instead is that the agent greedily follows these recommendations and converges to a policy that proceeds to a terminal state along one particular route; it never learns the utilities of the other states and thus never finds a potentially better path. We call such an agent a greedy agent.

An agent therefore must make a tradeoff between exploitation, to maximize its reward as reflected in its current utility estimates, and exploration, to maximize its long-term well-being. Technically, any scheme that eventually leads to optimal behavior needs to be greedy in the limit of infinite exploration, or GLIE. A GLIE scheme must try each action in each state an unbounded number of times to avoid having a finite probability that an optimal action is missed because of an unusually bad series of outcomes. An agent using such a scheme will eventually learn the true environment model. There are several GLIE schemes; one of the simplest is to have the agent choose a random action a fraction $1/t$ of the time and follow the greedy policy otherwise. While this does converge to an optimal policy, it can be extremely slow. A more sensible approach would give some weight to actions that the agent has not tried very often, while tending to avoid actions that are believed to be of low utility. This can be implemented by altering the Bellman equation above so that it assigns a higher utility estimate to relatively unexplored state-action pairs:

$U^{+}(s) \leftarrow R(s) + \gamma \max_{a}f\left(\sum_{s'}P(s'|s, a)U^{+}(s'), N(s,a)\right)$

Here,

  • $U^{+}(s)$ denotes an optimistic estimate of the utility of the state $s$.
  • $N(s,a)$ is the number of times action $a$ has been tried in state $s$.
  • $f(u, n)$ is called the exploration function. It determines how greed (a preference for high values of $u$) is traded off against curiosity (a preference for actions that have not been tried often, i.e., have low $n$). This function should be increasing in $u$ and decreasing in $n$. A small sketch of one such exploration function follows this list.
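One simple exploration function from the chapter is $f(u, n) = R^{+}$ if $n < N_{e}$ and $u$ otherwise, where $R^{+}$ is an optimistic estimate of the best possible reward obtainable in any state and $N_{e}$ is a fixed count. A minimal sketch (illustrative, not the aima-core implementation; the last two numeric arguments passed to QLearningAgent further below appear to play the roles of $N_{e}$ and $R^{+}$, but check the constructor to be sure):

public class ExplorationFunction {
    // f(u, n): optimistically assume reward rPlus for actions tried fewer than ne times;
    // otherwise trust the current utility estimate u.
    static double f(double u, int n, double rPlus, int ne) {
        return (n < ne) ? rPlus : u;
    }
}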

Now that we have an active ADP agent, let's discuss how to construct an active temporal difference learning agent. There's a method called Q-learning, which learns an action-utility representation. We'll use the notation $Q(s,a)$ to denote the value of doing action $a$ in state $s$. Q-values are directly related to utility values as follows:

$U(s) = \max_{a}Q(s,a)$

Note that a TD agent that learns a Q-function does not need a model of the form $P(s'|s,a)$, either for learning or for action selection; for this reason, Q-learning is called a model-free method. The update equation for TD Q-learning is

$Q(s,a) \leftarrow Q(s,a) + \alpha(R(s) + \gamma \max_{a'}Q(s',a')-Q(s,a))$

which is calculated whenever action $a$ is executed in state $s$ leading to state $s'$.
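As a concrete rendering of this update (and nothing more; the aima-core QLearningAgent used below is the full exploratory agent), here is a minimal tabular sketch keyed by state-action pairs, with a fixed learning rate for simplicity:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TabularQUpdate {
    private final Map<String, Double> Q = new HashMap<>(); // keyed by "state|action"
    private final double alpha = 0.2;  // learning rate (illustrative value)
    private final double gamma = 1.0;  // discount factor

    private double q(String s, String a) {
        return Q.getOrDefault(s + "|" + a, 0.0);
    }

    // Q(s,a) <- Q(s,a) + alpha * (R(s) + gamma * max_a' Q(s',a') - Q(s,a)),
    // applied whenever action a, taken in state s, leads to state s'.
    void update(String s, String a, double rewardOfS, String sPrime, List<String> actionsInSPrime) {
        double maxNext = actionsInSPrime.isEmpty() ? 0.0 : Double.NEGATIVE_INFINITY;
        for (String aPrime : actionsInSPrime) {
            maxNext = Math.max(maxNext, q(sPrime, aPrime));
        }
        double updated = q(s, a) + alpha * (rewardOfS + gamma * maxNext - q(s, a));
        Q.put(s + "|" + a, updated);
    }
}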

Let's have a look at the pseudo code of the Q-learning agent:


In [3]:
%%python
from notebookUtils import *
pseudocode('Q Learning Agent')


Out[3]:

AIMA3e

function Q-Learning-Agent(percept) returns an action
inputs: percept, a percept indicating the current state s' and reward signal r'
persistent: Q, a table of action values indexed by state and action, initially zero
       Nsa, a table of frequencies for state-action pairs, initially zero
       s, a, r, the previous state, action, and reward, initially null

if Terminal?(s) then Q[s, None] ← r'
if s is not null then
   increment Nsa[s, a]
   Q[s, a] ← Q[s, a] + α(Nsa[s, a])(r + γ maxa' Q[s', a'] - Q[s, a])
s, a, r ← s', argmaxa' f(Q[s', a'], Nsa[s', a']), r'
return a


Figure ?? An exploratory Q-learning agent. It is an active learner that learns the value Q(s, a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.

Let's test the Q-learning agent on the $4\times3$ cell world discussed above.


In [11]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.QLearningAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.util.JavaRandomizer;

import java.util.*;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
// Exploratory Q-learning agent; the numeric arguments configure the learning rate, the
// discount factor, and the exploration behavior (cf. the exploration function above).
QLearningAgent<Cell<Double>, CellWorldAction> qla = new QLearningAgent<Cell<Double>, CellWorldAction>(MDPFactory.createActionsFunctionForFigure17_1(cw), CellWorldAction.None, 0.2, 1.0, 5, 2.0);
cwe.addAgent(qla);
qla.reset();
cwe.executeTrials(100000);
System.out.println("Cell"  + " \t:\t" + "Expected Utility");
System.out.println("----------------------------------");
Map<Cell<Double>, Double> U = qla.getUtility();
for(int i = 1; i<=4; i++){
    for(int j = 1; j<=3; j++){
        if(i==2 && j==2) continue; //Ignore wall
        System.out.println("[" + i + "," + j + "]"  + " \t:\t" + U.get(cw.getCellAt(i,j)));
    }
}


Cell 	:	Expected Utility
----------------------------------
[1,1] 	:	0.7447970297243381
[1,2] 	:	0.7824342221178695
[1,3] 	:	0.8183443976230272
[2,1] 	:	0.6603248055209805
[2,3] 	:	0.881618254644549
[3,1] 	:	0.5780849497361085
[3,2] 	:	0.41519898633959246
[3,3] 	:	0.9503633060769898
[4,1] 	:	-0.048891852584282955
[4,2] 	:	-1.0
[4,3] 	:	1.0
Out[11]:
null

The learning curves of the Q-Learning agent for the $4\times3$ cell world are shown below.


In [10]:
import aima.core.environment.cellworld.*;
import aima.core.learning.reinforcement.agent.QLearningAgent;
import aima.core.learning.reinforcement.example.CellWorldEnvironment;
import aima.core.probability.example.MDPFactory;
import aima.core.util.JavaRandomizer;

import java.util.*;

int numRuns = 20;
int numTrialsPerRun = 10000;
int rmseTrialsToReport = 500;
int reportEveryN = 20;

CellWorld<Double> cw = CellWorldFactory.createCellWorldForFig17_1();
CellWorldEnvironment cwe = new CellWorldEnvironment(
            cw.getCellAt(1, 1),
            cw.getCells(),
            MDPFactory.createTransitionProbabilityFunctionForFigure17_1(cw),
            new JavaRandomizer());
QLearningAgent<Cell<Double>, CellWorldAction> qla = new QLearningAgent<Cell<Double>, CellWorldAction>(MDPFactory.createActionsFunctionForFigure17_1(cw), CellWorldAction.None, 0.2, 1.0, 5, 2.0);
cwe.addAgent(qla);
Map<Integer, List<Map<Cell<Double>, Double>>> runs = new HashMap<Integer, List<Map<Cell<Double>, Double>>>();
for (int r = 0; r < numRuns; r++) {
    qla.reset();
    List<Map<Cell<Double>, Double>> trials = new ArrayList<Map<Cell<Double>, Double>>();
    for (int t = 0; t < numTrialsPerRun; t++) {
        cwe.executeTrial();
        if (0 == t % reportEveryN) {
            Map<Cell<Double>, Double> u = qla.getUtility();
            trials.add(u);
        }
    }
    runs.put(r, trials);
}

def T = [];
def v4_3 = [];
def v3_3 = [];
def v1_3 = [];
def v1_1 = [];
def v3_2 = [];
def v2_1 = [];
double tmp = 0.0;
for (int t = 0; t < (numTrialsPerRun/reportEveryN); t++) {
    T.add(t);
    Map<Cell<Double>, Double> u = runs.get(numRuns - 1).get(t);
    tmp = (u.containsKey(cw.getCellAt(4, 3)) ? u.get(cw.getCellAt(4, 3)) : 0.0);
    v4_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 3)) ? u.get(cw.getCellAt(3, 3)) : 0.0);
    v3_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 3)) ? u.get(cw.getCellAt(1, 3)) : 0.0);
    v1_3.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(1, 1)) ? u.get(cw.getCellAt(1, 1)) : 0.0);
    v1_1.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(3, 2)) ? u.get(cw.getCellAt(3, 2)) : 0.0);
    v3_2.add(tmp);
    tmp = (u.containsKey(cw.getCellAt(2, 1)) ? u.get(cw.getCellAt(2, 1)) : 0.0);
    v2_1.add(tmp);
}

def p1 = new Plot(title: "Learning Curve", yLabel: "Utility estimates", xLabel: "Number of trials");
p1 << new Line(x: T, y: v4_3, displayName: "v4_3")
p1 << new Line(x: T, y: v3_3, displayName: "v3_3")
p1 << new Line(x: T, y: v1_3, displayName: "v1_3")
p1 << new Line(x: T, y: v1_1, displayName: "v1_1")
p1 << new Line(x: T, y: v3_2, displayName: "v3_2")
p1 << new Line(x: T, y: v2_1, displayName: "v2_1")

def trails = [];
def rmseValues = [];
for (int t = 0; t < rmseTrialsToReport; t++) {
    trails.add(t);
    double xSsquared = 0;
    for (int r = 0; r < numRuns; r++) {
        Map<Cell<Double>, Double> u = runs.get(r).get(t);
        Double val1_1 = u.get(cw.getCellAt(1, 1));
        xSsquared += Math.pow(0.705 - val1_1, 2);
    }
    double rmse = Math.sqrt(xSsquared/runs.size());
    rmseValues.add(rmse);
}
def p2 = new Plot(yLabel: "RMS error in utility", xLabel: "Number of trials");
p2 << new Line(x: trails, y: rmseValues)
OutputCell.HIDDEN

The final approach we'll consider here is policy search. Policy-search methods operate directly on a representation of the policy, attempting to improve it based on the observed performance. Since a policy $\pi$ is a function that maps states to actions, we are interested primarily in parameterized representations of $\pi$ that have far fewer parameters than there are states in the state space. For example, we could represent $\pi$ by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value:

$\pi(s) = \arg\max_a\hat{Q}_\theta(s,a)$

Policy search adjusts the parameters $\theta$ to improve the policy. One problem with a policy representation of this kind is that the policy is a discontinuous function of its parameters when the actions are discrete, i.e., there will be values of $\theta$ such that an infinitesimal change in $\theta$ causes the policy to switch from one action to another. This means that the value of the policy may also change discontinuously, which makes gradient-based search difficult.

For this reason, policy search methods often use a stochastic policy representation $\pi_{\theta}(s,a)$, which specifies the probability of selecting action $a$ in state $s$. One popular representation is the softmax function:

$\pi_{\theta}(s,a) = e^{\hat{Q}_\theta(s,a)}/\sum _{a'}e^{\hat{Q}_\theta(s,a')}$

Softmax becomes nearly deterministic if one action is much better than the others, but it always gives a differentiable function of $\theta$, hence, the value of the policy is always a differentiable function of $\theta$.
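A minimal sketch of such a softmax policy over estimated Q-values (illustrative code; the array of Q-values for the actions available in a state is an assumption of the example):

public class SoftmaxPolicy {
    // pi_theta(s, a) = exp(Q_theta(s,a)) / sum_a' exp(Q_theta(s,a')).
    // Subtracting the maximum before exponentiating leaves the result unchanged
    // but avoids numerical overflow.
    static double[] actionProbabilities(double[] qValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double q : qValues) max = Math.max(max, q);
        double sum = 0.0;
        double[] probs = new double[qValues.length];
        for (int i = 0; i < qValues.length; i++) {
            probs[i] = Math.exp(qValues[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }
}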