"Value Function Approximation", David Silver lecture 6

1. Introduction

Value Function Approximation

  • Estimate value function with function approximation
$$ \hat{v}(s, \mathbf{w}) \approx v_\pi(s) \text{, or}\\ \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a) $$
  • generalize from seen states to unseen states
  • update parameter $\mathbf{w}$ using MC or TD learning

Types of Value Function Approximation

$$ \begin{align} s & \mapsto \hat{v}(s, \mathbf{w}) \\ s, a & \mapsto \hat{q}(s, a, \mathbf{w}) \\ s & \mapsto \hat{q}(s, a_1, \mathbf{w}) \dots \hat{q}(s, a_m, \mathbf{w}) \end{align} $$
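
A minimal sketch of these three architectures, assuming linear approximators and hypothetical feature vectors `x_s` (per state) and `x_sa` (per state-action pair); none of the names below come from the lecture:

```python
import numpy as np

def v_hat(x_s, w):
    """State-value architecture: s -> v_hat(s, w)."""
    return x_s @ w                     # scalar

def q_hat(x_sa, w):
    """Action-in architecture: (s, a) -> q_hat(s, a, w)."""
    return x_sa @ w                    # scalar

def q_hat_all(x_s, W):
    """Action-out architecture: s -> q_hat(s, a_1, w), ..., q_hat(s, a_m, w)."""
    return x_s @ W                     # vector of length m (W has shape n_features x m)

# toy usage
x = np.array([1.0, 0.0, 2.0, -1.0])
w, W = np.ones(4), np.ones((4, 3))
print(v_hat(x, w), q_hat_all(x, W))    # one value vs. one value per action
```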

Which function approximator?

There are many function approximators, eg:

  • linear combinations of features
  • neural network
  • decision tree
  • nearest neighbor
  • fourier / wavelet bases
  • ...

Which function approximator? (continued)

We consider differentiable function approximators, ie:

  • linear combinations of features
  • neural network

Furthermore, we require a training method that is suitable for non-stationary, non-iid data.

2. Incremental Methods

Gradient Descent

  • Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
  • define the gradient of $J(\mathbf{w})$ to be:
$$ \nabla_\mathbf{w}J(\mathbf{w}) = \begin{pmatrix} \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_1} \\ \vdots \\ \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}_n} \end{pmatrix} $$
  • to find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient:
$$ \Delta \mathbf{w} = - \frac{1}{2} \alpha \nabla_\mathbf{w} J(\mathbf{w}) $$

... where $\alpha$ is a step-size parameter
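
A toy illustration (not from the lecture) of the $\Delta \mathbf{w} = -\frac{1}{2}\alpha \nabla_\mathbf{w} J(\mathbf{w})$ convention, using a made-up quadratic objective:

```python
import numpy as np

# Hypothetical quadratic objective J(w) = ||A w - b||^2 with a known minimiser.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def grad_J(w):
    return 2.0 * A.T @ (A @ w - b)

w, alpha = np.zeros(2), 0.1
for _ in range(100):
    w += -0.5 * alpha * grad_J(w)      # delta_w = -1/2 * alpha * grad_w J(w)

print(w)                               # approaches [0.5, -1.0], the minimiser of J
```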

Value function approximation by stochastic gradient descent

  • goal: find parameter vector $\mathbf{w}$ minimizing mean-squared error between approximate value function $\hat{v}(s, \mathbf{w})$ and true value function $v_\pi(s)$:
$$ J(\mathbf{w}) = \mathbb{E}_\pi \left[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \right] $$
  • gradient descent finds a local minimum:
$$ \begin{align} \Delta \mathbf{w} & = - \frac{1}{2} \alpha \nabla_\mathbf{w} J(\mathbf{w}) \\ & = \alpha \mathbb{E}_\pi\left[ (v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) \right]\\ \end{align} $$
  • stochastic gradient descent samples the gradient
$$ \Delta \mathbf{w} = \alpha(v_\pi(S) - \hat{v}(S,\mathbf{w}))\nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) $$
  • expected update is equal to full gradient update
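
A minimal sketch of one stochastic-gradient step, assuming a hypothetical oracle supplying $v_\pi(S)$ and a made-up differentiable approximator `v_hat`:

```python
import numpy as np

def sgd_step(w, S, v_pi_S, v_hat, grad_v_hat, alpha):
    """One sampled update: delta_w = alpha * (v_pi(S) - v_hat(S, w)) * grad_w v_hat(S, w)."""
    error = v_pi_S - v_hat(S, w)
    return w + alpha * error * grad_v_hat(S, w)

# toy approximator v_hat(S, w) = w[0] + w[1] * S, with its gradient (1, S)
v_hat = lambda S, w: w[0] + w[1] * S
grad_v_hat = lambda S, w: np.array([1.0, S])

w = np.zeros(2)
for S, v_pi_S in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]:   # made-up (state, value) samples
    w = sgd_step(w, S, v_pi_S, v_hat, grad_v_hat, alpha=0.1)
print(w)
```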

21:15 to here

Feature vectors

  • represent state by a feature vector
$$ \mathbf{x}(S) = \begin{pmatrix} x_1(S) \\ \vdots \\ x_n(S) \end{pmatrix} $$
  • for example:
    • distance of robot from landmarks
    • trends in the stock market
    • piece and pawn configurations in chess
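
As a toy example of such a feature map (hypothetical landmark positions, not from the lecture):

```python
import numpy as np

# Feature vector for a 2-D robot state: one feature per (made-up) landmark,
# the Euclidean distance from the robot to that landmark.
landmarks = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

def x(state_xy):
    return np.linalg.norm(landmarks - np.asarray(state_xy), axis=1)

print(x((1.0, 2.0)))   # length-3 feature vector x(S)
```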

Linear value function approximation

  • represent value function by a linear combination of features
$$ \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^T \mathbf{w} = \sum_{j=1}^n x_j(S)\,\mathbf{w}_j $$
  • objective function is quadratic in parameters $\mathbf{w}$:
$$ J(\mathbf{w}) = \mathbb{E}_\pi\left[ (v_\pi(S) - \mathbf{x}(S)^T \mathbf{w})^2 \right] $$
  • stochastic gradient descent converges on global optimum
  • update rule is particularly simple:
$$ \begin{align} \nabla_\mathbf{w} \hat{v}(S, \mathbf{w}) & = \mathbf{x}(S) \\ \Delta \mathbf{w} & = \alpha(v_\pi(S) - \hat{v}(S, \mathbf{w}))\, \mathbf{x}(S) \end{align} $$
  • "Update = step-size x prediction error x feature value"

Table lookup features

  • table lookup is a special case of linear value function approximation
  • using table lookup features
$$ \mathbf{x}^\text{table}(S) = \begin{pmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{pmatrix} $$
  • parameter vector $\mathbf{w}$ gives values of each individual state
$$ \hat{v}(S, \mathbf{w}) = \begin{pmatrix} 1(S=s_1) \\ \vdots \\ 1(S=s_n) \end{pmatrix} \cdot \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} $$
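
A quick sketch showing that with these one-hot features the linear update only touches the weight of the visited state, i.e. it reduces to tabular learning (the state space and numbers below are made up):

```python
import numpy as np

n_states = 5

def x_table(s):
    """Table-lookup features: indicator vector with 1 at position s."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

w = np.zeros(n_states)
s, target, alpha = 2, 1.0, 0.5
w += alpha * (target - x_table(s) @ w) * x_table(s)   # only w[s] changes
print(w)                                              # [0. 0. 0.5 0. 0.]
```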

30:53 to here

Incremental prediction algorithms

  • have assumed true value function $v_\pi(s)$ given by supervisor/oracle
  • but in RL there is no supervisor, only rewards
  • in practice, we substitute a target for $v_\pi(s)$:

    • for MC, the target is the return $G_t$

    $$ \Delta \mathbf{w} = \alpha(G_t - \hat{v}(S_t, \mathbf{w}))\, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

    • for TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:

    $$ \Delta \mathbf{w} = \alpha(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

    • for TD($\lambda$), the target is $\lambda$-return $G_t^\lambda$:

    $$ \Delta \mathbf{w} = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) $$

Monte-Carlo with value function approximation

  • return $G_t$ is an unbiased, noisy sample of true value $v_\pi(S_t)$
  • can therefore apply supervised learning to 'training data':
$$ \langle S_1, G_1 \rangle, \langle S_2, G_2\rangle, \dots, \langle S_T, G_T \rangle $$
  • for example, using linear Monte-Carlo policy evaluation:
$$ \begin{align} \Delta \mathbf{w} & = \alpha(G_t - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha(G_t - \hat{v}(S_t, \mathbf{w})) \, \mathbf{x}(S_t) \end{align} $$
  • Monte-Carlo evaluation converges to a local optimum
  • even when using non-linear value function approximation
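
A sketch of linear Monte-Carlo policy evaluation along these lines; the `episodes` format and the feature function `x` are hypothetical placeholders:

```python
import numpy as np

def mc_evaluate(episodes, x, n_features, alpha=0.01, gamma=1.0):
    """Linear every-visit Monte-Carlo policy evaluation (sketch).

    episodes: list of trajectories [(S_0, R_1), (S_1, R_2), ...] generated by pi.
    x:        feature function mapping a state to a length n_features vector.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        G = 0.0
        # iterate backwards so G accumulates the discounted return from each state
        for S_t, R_next in reversed(episode):
            G = R_next + gamma * G
            w += alpha * (G - x(S_t) @ w) * x(S_t)    # delta_w = alpha (G_t - v_hat) x(S_t)
    return w
```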

TD Learning with Value Function Approximation

  • the TD-target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of true value $v_\pi(S_t)$
  • can still apply supervised learning to 'training data':
$$ \langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle, \dots, \langle S_{T-1}, R_T \rangle $$
  • for example, using linear TD(0):
$$ \begin{align} \Delta \mathbf{w} & = \alpha(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}))\, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha \delta \mathbf{x}(S_t) \end{align} $$

where:

$$ \delta = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) $$
  • linear TD(0) converges (close) to global optimum
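
A sketch of linear TD(0) policy evaluation; the `transitions` stream and feature function `x` are hypothetical placeholders:

```python
import numpy as np

def td0_evaluate(transitions, x, n_features, alpha=0.01, gamma=1.0):
    """Linear TD(0) policy evaluation (sketch).

    transitions: stream of (S_t, R_{t+1}, S_{t+1}, done) tuples generated by pi.
    """
    w = np.zeros(n_features)
    for S_t, R_next, S_next, done in transitions:
        v_next = 0.0 if done else x(S_next) @ w          # bootstrap from v_hat(S_{t+1}, w)
        delta = R_next + gamma * v_next - x(S_t) @ w     # TD error
        w += alpha * delta * x(S_t)                      # delta_w = alpha * delta * x(S_t)
    return w
```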

TD($\lambda$) with Value Function Approximation

  • the $\lambda$-return $G_t^\lambda$ is also a biased sample of true value $v_\pi(s)$
  • can again apply supervised learning to 'training data':
$$ \langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, \dots, \langle S_{T-1}, G_{T-1}^\lambda \rangle $$
  • forward view linear TD($\lambda$)
$$ \begin{align} \Delta \mathbf{w} & = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \nabla_\mathbf{w} \hat{v}(S_t, \mathbf{w}) \\ & = \alpha(G_t^\lambda - \hat{v}(S_t, \mathbf{w})) \, \mathbf{x}(S_t) \end{align} $$
  • backward view linear TD($\lambda$):
$$ \Delta \mathbf{w} = \alpha \delta_t E_t $$

where:

$$ \begin{align} \delta_t & = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \\ E_t & = \gamma \lambda E_{t-1} + \mathbf{x}(S_t) \end{align} $$

Forward view and backward view TD($\lambda$) are equivalent.
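
A sketch of backward-view linear TD($\lambda$) with accumulating eligibility traces, matching the update above; the `episodes` format and feature function `x` are hypothetical placeholders. Note the trace is reset at the start of each episode:

```python
import numpy as np

def td_lambda_evaluate(episodes, x, n_features, alpha=0.01, gamma=1.0, lam=0.9):
    """Backward-view linear TD(lambda) policy evaluation (sketch).

    episodes: list of trajectories [(S_t, R_{t+1}, S_{t+1}, done), ...] generated by pi.
    """
    w = np.zeros(n_features)
    for episode in episodes:
        E = np.zeros(n_features)                          # eligibility trace E_0 = 0
        for S_t, R_next, S_next, done in episode:
            v_next = 0.0 if done else x(S_next) @ w
            delta = R_next + gamma * v_next - x(S_t) @ w  # TD error delta_t
            E = gamma * lam * E + x(S_t)                  # E_t = gamma*lambda*E_{t-1} + x(S_t)
            w += alpha * delta * E                        # delta_w = alpha * delta_t * E_t
    return w
```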

49:00 to here