"Value Function Approximation", David Silver lecture 6

1. Introduction

Value Function Approximation

  • Estimate value function with function approximation
$$ \hat{v}(s, \mathbf{w}) \approx v_\pi(s) \text{, or}\\ \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a) $$
  • generalize from seen states to unseen states
  • update parameter $\mathbf{w}$ using MC or TD learning

Types of Value Function Approximation

$$ \begin{align} s & \mapsto \hat{v}(s, \mathbf{w}) \\ s, a & \mapsto \hat{q}(s, a, \mathbf{w}) \\ s & \mapsto \hat{q}(s, a_1, \mathbf{w}) \dots \hat{q}(s, a_m, \mathbf{w}) \end{align} $$
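
A minimal sketch of these three architectures, assuming linear approximators and hypothetical feature vectors `x_s` (per state) and `x_sa` (per state-action pair); none of the names below come from the lecture:

```python
import numpy as np

def v_hat(x_s, w):
    """State-value architecture: s -> v_hat(s, w)."""
    return x_s @ w                     # scalar

def q_hat(x_sa, w):
    """Action-in architecture: (s, a) -> q_hat(s, a, w)."""
    return x_sa @ w                    # scalar

def q_hat_all(x_s, W):
    """Action-out architecture: s -> q_hat(s, a_1, w), ..., q_hat(s, a_m, w)."""
    return x_s @ W                     # vector of length m (W has shape n_features x m)

# toy usage
x = np.array([1.0, 0.0, 2.0, -1.0])
w, W = np.ones(4), np.ones((4, 3))
print(v_hat(x, w), q_hat_all(x, W))    # one value vs. one value per action
```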

Which function approximator?

There are many function approximators, eg:

  • linear combinations of features
  • neural network
  • decision tree
  • nearest neighbor
  • fourier / wavelet bases
  • ...

Which function approximator? (continued)

We consider differentiable function approximators, ie:

  • linear combinations of features
  • neural network

Furthermore, we require a training method that is suitable for non-stationary, non-iid data.

2. Incremental Methods

Gradient Descent

  • Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
  • define the gradient of $J(\mathbf{w})$ to be:
$$ \nabla_\mathbf{w}J(\mathbf{w}) = \begin{pmatrix} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_1} \\ \vdots \\ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_n} \end{pmatrix} $$
  • to find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient:
$$ \Delta \mathbf{w} = - \frac{1}{2} \alpha \nabla_\mathbf{w} J(\mathbf{w}) $$

... where $\alpha$ is a step-size parameter
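
A toy illustration (not from the lecture) of the $\Delta \mathbf{w} = -\frac{1}{2}\alpha \nabla_\mathbf{w} J(\mathbf{w})$ convention, using a made-up quadratic objective:

```python
import numpy as np

# Hypothetical quadratic objective J(w) = ||A w - b||^2 with a known minimiser.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def grad_J(w):
    return 2.0 * A.T @ (A @ w - b)

w, alpha = np.zeros(2), 0.1
for _ in range(100):
    w += -0.5 * alpha * grad_J(w)      # delta_w = -1/2 * alpha * grad_w J(w)

print(w)                               # approaches [0.5, -1.0], the minimiser of J
```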

Value function approximation by stochastic gradient descent

  • goal: find parameter vector $\mathbf{w}$ minimizing mean-squared error between approximate value function $\hat{v}(s, \mathbf{w})$ and true value function $v_\pi(s)$:
$$ J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] $$
  • gradient descent finds a local minimum:
$$ \begin{align} \Delta \mathbf{w} & = - \frac{1}{2} \alpha \nabla_\mathbf{w} J(\mathbf{w}) \\ & = \alpha \mathbb{E}_\pi\left[ (v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) \right]\\ \end{align} $$
  • stochastic gradient descent samples the gradient
$$ \Delta \mathbf{w} = \alpha(v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) $$
  • expected update is equal to full gradient update
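
A minimal sketch of one stochastic-gradient step, assuming a hypothetical oracle supplying $v_\pi(S)$ and a made-up differentiable approximator `v_hat`:

```python
import numpy as np

def sgd_step(w, S, v_pi_S, v_hat, grad_v_hat, alpha):
    """One sampled update: delta_w = alpha * (v_pi(S) - v_hat(S, w)) * grad_w v_hat(S, w)."""
    error = v_pi_S - v_hat(S, w)
    return w + alpha * error * grad_v_hat(S, w)

# toy approximator v_hat(S, w) = w[0] + w[1] * S, with its gradient (1, S)
v_hat = lambda S, w: w[0] + w[1] * S
grad_v_hat = lambda S, w: np.array([1.0, S])

w = np.zeros(2)
for S, v_pi_S in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:   # made-up (state, value) samples
    w = sgd_step(w, S, v_pi_S, v_hat, grad_v_hat, alpha=0.1)
print(w)
```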

21:15 to here

Feature vectors

  • represent state by a feature vector
$$ \mathbf{x}(S) = \begin{pmatrix} x_1(S) \\ \vdots \\ x_n(S) \end{pmatrix} $$
  • for example:
    • distance of robot from landmarks
    • trends in the stock market
    • piece and pawn configurations in chess
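
As a toy example of such a feature map (hypothetical landmark positions, not from the lecture):

```python
import numpy as np

# Feature vector for a 2-D robot state: one feature per (made-up) landmark,
# the Euclidean distance from the robot to that landmark.
landmarks = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

def x(state_xy):
    return np.linalg.norm(landmarks - np.asarray(state_xy), axis=1)

print(x((1.0, 2.0)))   # length-3 feature vector x(S)
```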

Linear value function approximation

  • represent value function by a linear combination of features
$$ \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^T \mathbf{w} = \sum_{j=1}^n x_j(S)\,\mathbf{w}_j $$
  • objective function is quadratic in parameters $\mathbf{w}$:
$$ J(\mathbf{w}) = \mathbb{E}_\pi\left[ (v_\pi(S) - \mathbf{x}(S)^T \mathbf{w})^2 \right] $$
  • stochastic gradient descent converges on global optimum
  • update rule is particularly simple:
$$ \begin{align} \nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) & = \mathbf{x}(S) \\ \Delta \mathbf{w} & = \alpha(v_\pi(S) - \hat{v}(S, \mathbf{w}))\, \mathbf{x}(S) \end{align} $$
  • "Update = step-size x prediction error x feature value"

Table lookup features

  • table lookup is a special case of linear value function approximation
  • using table lookup features
$$ \mathbf{x}^\text{table}(S) = \begin{pmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{pmatrix} $$
  • parameter vector $\mathbf{w}$ gives values of each individual state
$$ \hat{v}(S, \mathbf{w}) = \begin{pmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{pmatrix} \cdot \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} $$
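
A quick sketch showing that with these one-hot features the linear update only touches the weight of the visited state, i.e. it reduces to tabular learning (the state space and numbers below are made up):

```python
import numpy as np

n_states = 5

def x_table(s):
    """Table-lookup features: indicator vector with 1 at position s."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)
s, target, alpha = 2, 1.0, 0.5
w += alpha * (target - x_table(s) @ w) * x_table(s)   # only w[s] changes
print(w)                                              # [0. 0. 0.5 0. 0.]
```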

30:53 to here

Incremental prediction algorithms

  • have assumed true value function $v_\pi(s)$ given by supervisor/oracle
  • but in RL there is no supervisor, only rewards
  • in practice, we substitute a target for $v_\pi(s)$:

    • for MC, the target is the return $G_t$

    $$ \Delta \mathbf{w} = \alpha(G_t - \hat{v}(S_t, \mathbf{w}))\, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

    • for TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:

    $$ \Delta \mathbf{w} = \alpha(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

    • for TD($\lambda$), the target is $\lambda$-return $G_t^\lambda$:

    $$ \Delta \mathbf{w} = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

Monte-Carlo with value function approximation

  • return $G_t$ is an unbiased, noisy sample of true value $v_\pi(S_t)$
  • can therefore apply supervised learning to 'training data':
$$ \langle S_1, G_1 \rangle, \langle S_2, G_2\rangle, \dots, \langle S_T, G_T \rangle $$
  • for example, using linear Monte-Carlo policy evaluation:
$$ \begin{align} \Delta \mathbf{w} & = \alpha(G_t - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha(G_t - \hat{v}(S_t, \mathbf{w})) \, \mathbf{x}(S_t) \end{align} $$
  • Monte-Carlo evaluation converges to a local optimum
  • even when using non-linear value function approximation
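
A sketch of linear Monte-Carlo policy evaluation along these lines; the `episodes` format and the feature function `x` are hypothetical placeholders:

```python
import numpy as np

def mc_evaluate(episodes, x, n_features, alpha=0.01, gamma=1.0):
    """Linear every-visit Monte-Carlo policy evaluation (sketch).

    episodes: list of trajectories [(S_0, R_1), (S_1, R_2), ...] generated by pi.
    x:        feature function mapping a state to a length n_features vector.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        # iterate backwards so G accumulates the discounted return from each state
        for S_t, R_next in reversed(episode):
            G = R_next + gamma * G
            w += alpha * (G - x(S_t) @ w) * x(S_t)    # delta_w = alpha (G_t - v_hat) x(S_t)
    return w
```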

TD Learning with Value Function Approximation

  • the TD-target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of true value $v_\pi(S_t)$
  • can still apply supervised learning to 'training data':
$$ \langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle, \dots, \langle S_{T-1}, R_T \rangle $$
  • for example, using linear TD(0):
$$ \begin{align} \Delta \mathbf{w} & = \alpha(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}))\, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha \delta \mathbf{x}(S_t) \end{align} $$

where:

$$ \delta = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) $$
  • linear TD(0) converges (close) to global optimum
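
A sketch of linear TD(0) policy evaluation; the `transitions` stream and feature function `x` are hypothetical placeholders:

```python
import numpy as np

def td0_evaluate(transitions, x, n_features, alpha=0.01, gamma=1.0):
    """Linear TD(0) policy evaluation (sketch).

    transitions: stream of (S_t, R_{t+1}, S_{t+1}, done) tuples generated by pi.
    """
    w = np.zeros(n_features)
    for S_t, R_next, S_next, done in transitions:
        v_next = 0.0 if done else x(S_next) @ w          # bootstrap from v_hat(S_{t+1}, w)
        delta = R_next + gamma * v_next - x(S_t) @ w     # TD error
        w += alpha * delta * x(S_t)                      # delta_w = alpha * delta * x(S_t)
    return w
```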

TD($\lambda$) with Value Function Approximation

  • the $\lambda$-return $G_t^\lambda$ is also a biased sample of true value $v_\pi(s)$
  • can again apply supervised learning to 'training data':
$$ \langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, \dots, \langle S_{T-1}, G_{T-1}^\lambda \rangle $$
  • forward view linear TD($\lambda$)
$$ \begin{align} \Delta \mathbf{w} & = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \mathbf{x}(S_t) \end{align} $$
  • backward view linear TD($\lambda$):
$$ \Delta \mathbf{w} = \alpha \delta_t E_t $$

where:

$$ \begin{align} \delta_t & = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\ E_t & = \gamma \lambda E_{t-1} + \mathbf{x}(S_t) \end{align} $$

Forward view and backward view TD($\lambda$) are equivalent.
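
A sketch of backward-view linear TD($\lambda$) with accumulating eligibility traces, matching the update above; the `episodes` format and feature function `x` are hypothetical placeholders. Note the trace is reset at the start of each episode:

```python
import numpy as np

def td_lambda_evaluate(episodes, x, n_features, alpha=0.01, gamma=1.0, lam=0.9):
    """Backward-view linear TD(lambda) policy evaluation (sketch).

    episodes: list of trajectories [(S_t, R_{t+1}, S_{t+1}, done), ...] generated by pi.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        E = np.zeros(n_features)                          # eligibility trace E_0 = 0
        for S_t, R_next, S_next, done in episode:
            v_next = 0.0 if done else x(S_next) @ w
            delta = R_next + gamma * v_next - x(S_t) @ w  # TD error delta_t
            E = gamma * lam * E + x(S_t)                  # E_t = gamma*lambda*E_{t-1} + x(S_t)
            w += alpha * delta * E                        # delta_w = alpha * delta_t * E_t
    return w
```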

49:00 to here