A Simple Demo of the Various Algorithms

Before attempting to handle the Off-Policy case with General Value Functions, it would be instructive to first examine how the algorithms perform in the On-Policy case with (more or less) fixed parameters.


In [2]:
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

In [26]:
import algos
import features
import parametric
import policy
import chicken
from agents import OnPolicyAgent
from rlbench import *

Assessing Learning Algorithms

In theory, it is possible to solve directly for the value functions that the learning algorithms seek, but in practice a close approximation will suffice.
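
For a chain this small, the exact solution is just the fixed point of the Bellman equation, $v_\pi = r_\pi + \gamma P_\pi v_\pi$. A minimal sketch of the direct solve, assuming we had the policy-averaged transition matrix P_pi and expected reward vector r_pi (the chicken environment does not expose these here, which is why the cells below build the target values from long LSTD runs instead):

import numpy as np

def exact_values(P_pi, r_pi, gm):
    # Solve v = r_pi + gm * P_pi v, i.e. (I - gm * P_pi) v = r_pi
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gm * np.asarray(P_pi), np.asarray(r_pi))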


In [12]:
# define the experiment
num_states = 8
num_features = 8

# set up environment
env = chicken.Chicken(num_states)

# set up policy: deterministic (always action 0) for states s < 4,
# uniformly random over both actions elsewhere
pol_pi = policy.FixedPolicy({s: {0: 1} if s < 4 else {0: 0.5, 1: 0.5} for s in env.states})

# set feature mapping
# phi = features.RandomBinary(num_features, num_features // 2, random_seed=101011)
phi = features.Int2Unary(num_states)

# run the algorithms for enough time to get reliable convergence
num_steps = 100000

# the TD(1) solution should minimize the mean-squared error
update_params = {
    'gm': 0.9,
    'gm_p': 0.9,
    'lm': 1.0,
}
lstd_1 = OnPolicyAgent(algos.LSTD(phi.length), pol_pi, phi, update_params)
run_episode(lstd_1, env, num_steps)
mse_values = lstd_1.get_values(env.states)

# the TD(0) solution should minimize the MSPBE
update_params = {
    'gm': 0.9,
    'gm_p': 0.9,
    'lm': 0.0,
}
lstd_0 = OnPolicyAgent(algos.LSTD(phi.length), pol_pi, phi, update_params)
run_episode(lstd_0, env, num_steps)
mspbe_values = lstd_0.get_values(env.states)
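
As a reminder of why these two runs serve as targets: with λ = 1 the LSTD fixed point coincides with the Monte-Carlo solution, which minimizes the mean-squared value error $\overline{VE}(\theta) = \sum_s d_\pi(s)\,\bigl(v_\pi(s) - \theta^\top \phi(s)\bigr)^2$, whereas with λ = 0 it minimizes the mean-squared projected Bellman error $\mathrm{MSPBE}(\theta) = \lVert \hat{v}_\theta - \Pi T_\pi \hat{v}_\theta \rVert^2_{d_\pi}$. If the feature mapping can represent the true values exactly (as a full tabular encoding would), the two solutions coincide; the commented-out RandomBinary features are one way to make them differ.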

What do the target values look like?


In [30]:
# Plot the states against their target values
xvals = list(sorted(env.states))
y_mse = [mse_values[s] for s in xvals]
y_mspbe = [mspbe_values[s] for s in xvals]

# Mean-squared-error-optimal values
plt.bar(xvals, y_mse)
plt.title('MSE-optimal state values')
plt.show()

# MSPBE-optimal values
plt.bar(xvals, y_mspbe)
plt.title('MSPBE-optimal state values')
plt.show()


Actual Testing

We have a number of algorithms that we can try:


In [16]:
algos.algo_registry


Out[16]:
{'ETD': algos.ETD,
 'GTD': algos.GTD,
 'GTD2': algos.GTD2,
 'LSTD': algos.LSTD,
 'TD': algos.TD,
 'TDC': algos.TDC}
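
Any entry can be pulled out of the registry by name and constructed from the length of the feature vector, just as the loop below does for every algorithm (shown here only as a quick illustration):

td_alg = algos.algo_registry['TD'](phi.length)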

Each of these algorithms is wrapped in an OnPolicyAgent, which takes care of the function approximation and supplies the appropriate parameters to the learning algorithm.
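
Roughly, the agent's per-step job looks like the following hypothetical sketch (the real implementation lives in agents.py and its interface may differ):

class OnPolicyAgentSketch:
    # Hypothetical illustration of the agent's role, not the agents.py code.
    def __init__(self, algo, pol, phi, update_params):
        self.algo = algo                   # e.g. algos.TD(phi.length)
        self.pol = pol                     # behaviour (= target) policy
        self.phi = phi                     # feature mapping, e.g. Int2Unary
        self.params = dict(update_params)  # gm, lm, alpha, ... held fixed

    def observe(self, s, r, sp):
        # the algorithm only ever sees feature vectors, never raw states
        x, xp = self.phi(s), self.phi(sp)
        # every update receives the transition plus the fixed parameters
        self.algo.update({'x': x, 'xp': xp, 'r': r, **self.params})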


In [39]:
# set up algorithm parameters
update_params = {
    'alpha': 0.01,
    'beta': 0.001,
    'gm': 0.9,
    'gm_p': 0.9,
    'lm': 0.0,
    'lm_p': 0.0,
    'interest': 1.0,
}


# Run all available algorithms 
max_steps = 10000
for name, alg in algos.algo_registry.items():    
    # Set up the agent and record its error against each set of target values
    agent = OnPolicyAgent(alg(phi.length), pol_pi, phi, update_params)
    mse_lst = run_errors(agent, env, max_steps, mse_values)
    mspbe_lst = run_errors(agent, env, max_steps, mspbe_values)

    # Plot the error (log scale makes the convergence behaviour easier to see)
    xdata = np.arange(max_steps)
#     plt.plot(xdata, mse_lst)
#     plt.plot(xdata, mspbe_lst)
    plt.plot(xdata, np.log(mspbe_lst))
    
    # Add information to the graph
    plt.title(name)
    plt.xlabel('Timestep')
    plt.ylabel('log(MSPBE)')
    plt.show()



In [38]:
# square root of the mean target value -- presumably a rough scale reference
# for the error curves above
np.sqrt(np.mean(y_mspbe))


Out[38]:
0.42394620314597697
