Fundamental Concepts

The Agent-Environment Interface

The agent-enviroment interface can be summarized as a sequence $\cdots S_tA_tR_{t+1}S_{t+1}A_{t+1}\cdots$.

The tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ can be described as:

At time $t$, the agent observes environment's state $S_t$ and takes action $A_t$;
This action drives the environment to change its state to $S_{t+1}$, and the agent receives reward $R_{t+1}$ from the enviroment;
The Agent takes action $A_{t+1}$ and the cycle continues.

The agent's goal is to maximize its total reward in this process.

If the process is finite, we say the agent is doing episodic tasks, and the total reward is defined as $R_1 + \ldots + R_T$.
If the process is infinite, we say the agent is doing continous task, and the total reward is defined as $R_1 + \gamma R_2 + \gamma^2 R_3 + \ldots$, where $ 0 < \gamma < 1 $ is a discount factor that ensures the sum is finite.

For convenience, we use the unified notation $\sum_{k}\gamma^{k} R_{k+1}$, where $ 0 < \gamma \leq 1$, $\gamma = 1$ iff for episodic tasks.

Often, we are interesed in the return $G_t$, which is defined as the total reward after time $t$. This can be viewed as the feedback to the agent's action $A_t$ taken on state $S_t$. By definition, $G_t = R_{t+1} + \gamma R_{t+2} + \ldots$. Note the recurrsion relation $G_t = R_{t+1} + \gamma G_{t+1}$.

Markov Property

This says $S_{t+1}$ and $R_{t+1}$ only depends on $S_t, A_t$, but not on any earlier events. Thus we may write $P(S_{t+1}, R_{t+1} | S_t, A_t )$. Note we can further factorize this as $P(S_{t+1}, R_{t+1} | S_t, A_t ) = P(S_{t+1} | S_t, A_t) P(R_{t+1} | S_{t}, A_{t}, S_{t+1}).$ Note that $R_{t+1}$ has dependency on three variables.

We can simplify the notation by writing $s, a, s'$ to denote $S_t, A_t, S_{t+1}$. Two more notations:

$p(s' | s, a)$ denotes the transition probablity $(s, a) \rightarrow s' $
$r(s, a, s')$ denotes the expectation $\mathbb{E}(R_{t+1} | S_{t} = s, A_{t} = a, S_{t+1} = s')$.

Value Functions, Bellman Equation

We cannot talk about values without the policy. A policy maps the state to action. This can be deterministic or probabilistic. Usually, we denote the policy by $\pi(s)$, sometimes it's a distribution $p(a|s)$. For deterministic policy, it's just a value.

We can define value for a state or a state-action pair. Informally, the value of a state is the expected return when starting from $s$ and following $\pi$ thereafter. The value of a state-pair is the expected return when starting from $s$ and take action $a$. $$v_{\pi}(s) = \mathbb{E}_{\pi} (G_t | S_t = s )$$ $$q_{\pi}(s, a) = \mathbb{E}_{\pi}(G_t | S_t = s, A_t = a)$$ The notation is a bit sloppy here: $t$ appears on the RHS but not on the LHS. So the LHS is really $v_{\pi}(S_t = s)$ and $q_{\pi}(S_t = s, A_t = a)$.

\begin{align} q(s, a) & = \mathbb{E}[ G_t | s, a ] \\ & = \mathbb{E}[ R_{t+1} + \gamma G_{t+1} | s, a ] \\ & = \sum_{s'} p(s' | s, a) \mathbb{E} [R_{t+1} + \gamma G_{t+1} | s, a, s'] \\ & = \sum_{s'} p(s' | s, a) (r(s, a, s') + \gamma v(s')) \\ v(s) & = \sum_a p(a | s) q(s, a) \\ & = \sum_a p(a | s) \sum_{s'} p(s' | s, a) (r(s, a, s') + \gamma v(s')) \end{align}

The last equation that reveals the recurssion relation of $v(s)$ and $v(s')$ is known as the Bellman equation.

The Bellman equation has two implications:

It expresses $v(s)$ as a solution of a linear system.
It expresses $v(s)$ as a fixed point of a mapping. This suggests we can use iterative methods to solve for $v$. This gives the basic form of policy evaluation.

Optimal Value Functions

There exists an optimal policy $\pi_{*}$ such that for any other policy $\pi$, we have $v_{\pi_{*}}(s) \geq v_{\pi}(s), \forall s \in S$.

\begin{align} v_*(s) & = \max_{\pi} v_{\pi}(s) \\ q_*(s, a) & = \max_{\pi} q_{\pi}(s, a) \end{align}

$v_*(s)$ satisfies the Bellman optimal equation. The derivation of the Bellman optimal equation is similar to the regular Bellman equation. The main difference is: $v_*(s) = \max_a q(s, a)$ instead of averaging over the distribution $p(a|s)$.

Policy Iteration

Policy iteration alternates between policy evaluation and policy improvement:

Start with an arbitrary policy $\pi$.
In policy evaluation, we use an iterative method based on the Bellman equation to find $v(s)$. This gives us $q(s, a)$ too.
In policy improvment, we set $\pi(s) = \text{argmax}_a q(s, a)$.

Value Iteration

Valute iteration is an approximation of policy iteration. In policy evaluation, we iterate until convergence. In value iteration, we only iterate once.

This gives a special form of iteration: $$v(s) = \max_a q(s, a) = \max_a \sum_{s'} p(s' | s, a) (r(s, a, s') + \gamma v(s')) $$

At the end of iteration, we output policy $\pi(s) = \text{argmax}_a q(s, a)$.

Valute Iteration Example (Gambler's Problem)

A gambler bets on the outcome of a coin flip. If head, he wins the stake; otherwise, he loses the stake. The game ends if the gambler reaches \$100 dollars or \$0 dollars. The gabmler can only bet on integer amounts.

The states are $\{1, \ldots, 99\}$. For state $s$, availabe actions are betting from $\{1, \ldots, \min \{s, 100 - s\} \}$. We give +1 reward when we reach \$100 and give zero reward for all other transitions. We set discount factor $\gamma = 1$. With this setting, the value of each state (under a policy) is the winning probabiliy starting from that state. This is because $v(s) = \mathbb{E}[G_t | s]$ and $G_t$ is either 0 or 1. Thus the expectation evaluates to the probability of winning.

Let $p$ denote the probability of getting a head. Then we have transition $(s, a) \rightarrow s + a$ with probability $p$ and $(s, a) \rightarrow s - a$ with probability $1-p$. $r(s, a, s') = 1$ iff $s' = 100$.

The value iteration is $$ v(s) = \max_a [p (r(s, a, s + a) + v(s + a)) + (1-p) v(s-a)]$$



In [1]:

    
import numpy as np
P_HEAD = 0.4
WIN_STATE = 100
EPSILON = 1e-10

def r(s, a, s_prime):
    return 1 if s_prime == WIN_STATE else 0

# v[0..100]. v[0] and v[100] are held at 0.
def value_iteration(v):
    w = np.zeros(len(v))
    for s in range(1, WIN_STATE, 1):
        q = []
        for a in range(1, min(s, 100 - s) + 1, 1):
            qa = P_HEAD * (r(s, a, s + a) + v[s + a]) + (1 - P_HEAD) * v[s - a]
            q.append(qa)
        w[s] = max(q)
    return w

def solve(epsilon = EPSILON, max_iter = 1000):
    v = np.zeros(WIN_STATE + 1)
    for i in range(max_iter):
        w = value_iteration(v)
        delta = np.linalg.norm(w - v, np.inf)
        v = w
        if delta < epsilon:
            print 'converged after {} iterations'.format(i+1)
            break
    return v

def optimal_policy(v):
    actions = []
    for s in range(1, WIN_STATE, 1):
        best_a, best_qa = 0, 0
        for a in range(1, min(s, 100 - s) + 1, 1):
            qa = P_HEAD * (r(s, a, s + a) + v[s + a]) + (1 - P_HEAD) * v[s - a]
            ## A more robust max function (taking account of numerical errors)
            if qa - best_qa > 1e-6:
                best_qa = qa
                best_a = a
        actions.append(best_a)
    return actions



In [2]:

    
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

w = solve()
plt.plot(w[0:99]);
plt.title('Value functions')
plt.show()









    



converged after 34 iterations



In [3]:

    
actions = optimal_policy(w)
plt.bar(range(1, 100), actions)
plt.title('Optimal actions')









    Out[3]:





<matplotlib.text.Text at 0x105acf6d0>