Planning Algorithms

Do you remember how in lessons 2 and 3 we discussed algorithms that solve MDPs? That is, algorithms that find a policy given an exact representation of the environment. In this section, we will explore two such algorithms: Value Iteration and Policy Iteration.


In [1]:
import numpy as np
import pandas as pd
import tempfile
import pprint
import json
import sys
import gym

from gym import wrappers
from subprocess import check_output
from IPython.display import HTML

Value Iteration

The Value Iteration algorithm uses dynamic programming: it breaks the problem into overlapping sub-problems and leverages that optimal substructure to speed up computation.
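Concretely, each sweep applies the Bellman optimality update to every state and repeats until the values stop changing. This is the update the code below implements (with terminal transitions cut off from bootstrapping):

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[\, r(s, a, s') + \gamma \, V_k(s') \,\big]$$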

Let me show you what value iteration looks like:


In [2]:
def value_iteration(S, A, P, gamma=.99, theta=0.0000001):

    # start from an arbitrary value function
    V = np.random.random(len(S))
    for i in range(100000):
        old_V = V.copy()

        # one sweep of the Bellman optimality update over every state
        Q = np.zeros((len(S), len(A)), dtype=float)
        for s in S:
            for a in A:
                for prob, s_prime, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))
            V[s] = Q[s].max()
        # stop once the values have (approximately) converged
        if np.all(np.abs(old_V - V) < theta):
            break

    # the optimal policy is greedy with respect to the optimal values
    pi = np.argmax(Q, axis=1)
    return pi, V
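Before we plug in a real environment, here is a tiny made-up example, purely illustrative, that shows the (probability, next_state, reward, done) format value_iteration expects (the names toy_P, toy_pi, toy_V are just for this sketch):

# hypothetical 2-state, 2-action MDP in the same nested-dict format as gym's P
toy_P = {
    0: {0: [(1.0, 0, 0.0, False)],                        # stay put, no reward
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},  # risky jump to the goal
    1: {0: [(1.0, 1, 0.0, True)],                         # state 1 is terminal
        1: [(1.0, 1, 0.0, True)]},
}
toy_pi, toy_V = value_iteration(range(2), range(2), toy_P)
print(toy_pi, toy_V)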

As we can see, value iteration expects a set of states, e.g. (0, 1, 2, 3, 4), a set of actions, e.g. (0, 1), and a set of transition probabilities that represent the dynamics of the environment. Let's take a look at these variables:


In [3]:
mdir = tempfile.mkdtemp()
env = gym.make('FrozenLake-v0')
env = wrappers.Monitor(env, mdir, force=True)


[2017-04-26 00:30:53,851] Making new env: FrozenLake-v0

In [4]:
S = range(env.env.observation_space.n)
A = range(env.env.action_space.n)
P = env.env.env.P

In [5]:
S


Out[5]:
range(0, 16)

In [6]:
A


Out[6]:
range(0, 4)

In [9]:
P[10]


Out[9]:
{0: [(0.3333333333333333, 6, 0.0, False),
  (0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 14, 0.0, False)],
 1: [(0.3333333333333333, 9, 0.0, False),
  (0.3333333333333333, 14, 0.0, False),
  (0.3333333333333333, 11, 0.0, True)],
 2: [(0.3333333333333333, 14, 0.0, False),
  (0.3333333333333333, 11, 0.0, True),
  (0.3333333333333333, 6, 0.0, False)],
 3: [(0.3333333333333333, 11, 0.0, True),
  (0.3333333333333333, 6, 0.0, False),
  (0.3333333333333333, 9, 0.0, False)]}

You see, the world we are looking at, "FrozenLake-v0", has 16 different states and 4 different actions. P[10] gives us a peek into the dynamics of the world. In this case, if you are in state 10 (hence P[10]) and you take action 0 (see dictionary key 0), you have an equal probability of 0.3333 of landing in state 6, 9, or 14. None of those transitions gives you any reward, and none of them is terminal.

In contrast, we can see that taking action 2 might transition you to state 11, which is terminal.
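If it helps, here is a small (hypothetical) helper that pretty-prints the transitions for a given state, assuming the same nested P structure we just saw:

def describe_transitions(P, s):
    # walk every action available in state s and print its possible outcomes
    for a, transitions in P[s].items():
        for prob, s_prime, reward, done in transitions:
            print("state {}, action {} -> state {} (prob {:.2f}, reward {}, terminal={})".format(
                s, a, s_prime, prob, reward, done))

describe_transitions(P, 10)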

Get the hang of it? Let's run it!


In [10]:
pi, V = value_iteration(S, A, P)

Now, value iteration calculates two important things. First, it calculates V, which tells us how much return we should expect from each state if we always act optimally. Second, it gives us pi, which is the optimal policy given V. Let's take a deeper look:


In [12]:
V


Out[12]:
array([  9.82775479e-006,   4.77561742e-007,   8.29890013e-006,
         7.77646736e-006,   5.68794576e-006,   0.00000000e+000,
         3.38430298e-208,   0.00000000e+000,   8.92176447e-007,
         5.28039771e-006,   3.09721331e-006,   0.00000000e+000,
         0.00000000e+000,   9.53731304e-006,   9.80392157e-001,
         0.00000000e+000])

In [13]:
pi


Out[13]:
array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])

See? This policy basically says: in state 0, take action 0; in state 1, take action 3; in state 2, take action 0; and so on. Got it?

Now we have our "directions", or "map". With this policy in hand, we can solve the environment as we interact with it.
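One way to make the map easier to read is to lay the policy out on the 4x4 grid as arrows (a small sketch; in FrozenLake, action 0 is left, 1 is down, 2 is right, and 3 is up):

# render the policy as a 4x4 grid of arrows (0=left, 1=down, 2=right, 3=up)
arrows = np.array(['<', 'v', '>', '^'])
print(arrows[pi].reshape(4, 4))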

Let's try it out!


In [14]:
for _ in range(10000):
    state = env.reset()
    while True:
        state, reward, done, info = env.step(pi[state])
        if done:
            break


[2017-04-26 00:40:00,747] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video010000.json
[2017-04-26 00:40:01,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video011000.json
[2017-04-26 00:40:01,752] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video012000.json
[2017-04-26 00:40:02,236] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video013000.json
[2017-04-26 00:40:02,732] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video014000.json
[2017-04-26 00:40:03,235] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video015000.json
[2017-04-26 00:40:03,745] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video016000.json
[2017-04-26 00:40:04,234] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video017000.json
[2017-04-26 00:40:04,748] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video018000.json
[2017-04-26 00:40:05,262] Starting new video recorder writing to /tmp/tmpti4utrmj/openaigym.video.0.56.video019000.json

That was the agent interacting with the environment. Let's take a look at some of the episodes:


In [16]:
last_video = env.videos[-1][0]
out = check_output(["asciinema", "upload", last_video])
out = out.decode("utf-8").replace('\n', '').replace('\r', '')
print(out)


https://asciinema.org/a/6rphgm3w1rbkjvoo2rq6tpjqu

You can follow that link or, better yet, let's show it right in the notebook:


In [18]:
castid = out.split('/')[-1]
html_tag = """
<script type="text/javascript" 
    src="https://asciinema.org/a/{0}.js" 
    id="asciicast-{0}" 
    async data-autoplay="true" data-size="big">
</script>
"""
html_tag = html_tag.format(castid)
HTML(data=html_tag)


Out[18]:

Interesting, right? Have you figured out the world yet?
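In case the recording is hard to follow, you can also print the grid directly. The underlying FrozenLakeEnv keeps the layout in a desc attribute, reached the same way we reached P above (adjust the unwrapping if your gym version differs):

# print the 4x4 lake layout: S=start, F=frozen, H=hole, G=goal
print(env.env.env.desc)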

So, 'S' is the starting state and 'G' is the goal. 'F' tiles are frozen and 'H' tiles are holes. Your goal is to go from S to G without falling into any H. The problem is that the frozen tiles are slippery, so oftentimes you are better off trying moves that seem counter-intuitive; because they keep you away from the 'H's, they make sense in the end. For example, on the 'F' in the second row, first column, you can see how hard our agent was trying to go left!! Smashing its head against the wall?? Silly. But why?


In [19]:
P[4]


Out[19]:
{0: [(0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 8, 0.0, False)],
 1: [(0.3333333333333333, 4, 0.0, False),
  (0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 5, 0.0, True)],
 2: [(0.3333333333333333, 8, 0.0, False),
  (0.3333333333333333, 5, 0.0, True),
  (0.3333333333333333, 0, 0.0, False)],
 3: [(0.3333333333333333, 5, 0.0, True),
  (0.3333333333333333, 0, 0.0, False),
  (0.3333333333333333, 4, 0.0, False)]}

See how action 0 (left) doesn't have any transition leading to a terminal state??

All other actions give you a 0.333333 chance each of pushing you into the hole in state '5'!!! So it actually makes sense to keep going left until the slippery ice pushes you down to state 8.
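To see it in numbers, we can recompute the action-values for state 4 from V, using the same update as inside value_iteration (a quick sketch):

# Q-values for state 4 under the computed V: left (action 0) avoids the hole at 5
gamma = 0.99
Qs = np.zeros(len(A))
for a in A:
    for prob, s_prime, reward, done in P[4][a]:
        Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))
print(Qs, Qs.argmax())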

Cool right?


In [20]:
pi


Out[20]:
array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])

See how the "prescribed" action is 0 (left) on the policy calculated by value iteration?

How about the values?


In [21]:
V


Out[21]:
array([  9.82775479e-006,   4.77561742e-007,   8.29890013e-006,
         7.77646736e-006,   5.68794576e-006,   0.00000000e+000,
         3.38430298e-208,   0.00000000e+000,   8.92176447e-007,
         5.28039771e-006,   3.09721331e-006,   0.00000000e+000,
         0.00000000e+000,   9.53731304e-006,   9.80392157e-001,
         0.00000000e+000])

These are the expected returns from each state when acting optimally.
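Reshaping V onto the 4x4 grid makes it easier to see how the values grow as you get closer to the goal (a quick sketch):

# view the state values laid out on the 4x4 lake grid
print(V.reshape(4, 4))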


In [22]:
P[15]


Out[22]:
{0: [(1.0, 15, 0, True)],
 1: [(1.0, 15, 0, True)],
 2: [(1.0, 15, 0, True)],
 3: [(1.0, 15, 0, True)]}

State '15' is the goal, and it is terminal: whatever action you take, you stay there and the episode is done. The +1 reward for reaching it is paid on the transitions that land you there (notice how V[14] is close to 1), and Value Iteration propagates that signal all the way back to the start state, which is why we see non-zero values across the grid.
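If you are curious where that +1 actually lives, peek at the transitions out of state 14, the tile right next to the goal; you should see entries that land in state 15 carrying a reward of 1.0 (a quick check, using the pprint module imported at the top):

# transitions from state 14: the moves that land on state 15 carry the +1 reward
pprint.pprint(P[14])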

Cool? Good.


In [39]:
env.close()

If you want to submit to OpenAI Gym, get your API Key and paste it here:


In [9]:
gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')


[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data
[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)
[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video
[2017-04-01 16:55:45,620] 
****************************************************
You successfully uploaded your evaluation on FrozenLake-v0 to
OpenAI Gym! You can find it at:

    https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg

****************************************************

Policy Iteration

There is another method called policy iteration. It is composed of two routines: policy evaluation and policy improvement. Policy evaluation computes the value function of the current policy, and policy improvement uses that value function to act greedily, something similar to a one-step value iteration, producing a policy that is slightly better, or at least no worse. The cycle stops when the policy no longer changes, that is, when it has converged.

These two functions cycling together are what policy iteration is about.

Can you implement this algorithm yourself? Try it. Make sure to look at the solution notebook in case you get stuck.

I will give you the policy evaluation and policy improvement methods; you build the policy iteration loop, cycling between evaluation and improvement until there are no changes to the policy.


In [24]:
def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):

    # iteratively compute the state-value function of the given policy pi
    V = np.zeros(len(S))
    while True:
        delta = 0
        for s in S:
            v = V[s]

            # expected return of taking pi[s] in s and following pi afterwards
            V[s] = 0
            for prob, dst, reward, done in P[s][pi[s]]:
                V[s] += prob * (reward + gamma * V[dst] * (not done))

            delta = max(delta, np.abs(v - V[s]))
        # stop once the largest change in any state is below theta
        if delta < theta:
            break
    return V

In [1]:
def policy_improvement(pi, V, S, A, P, gamma=.99):
    # make the policy greedy with respect to the current value function
    for s in S:
        old_a = pi[s]

        # action-values for state s under the current V
        Qs = np.zeros(len(A), dtype=float)
        for a in A:
            for prob, s_prime, reward, done in P[s][a]:
                Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))
        pi[s] = np.argmax(Qs)
        V[s] = np.max(Qs)
    return pi, V

In [27]:
def policy_iteration(S, A, P, gamma=.99):
    pi = np.random.choice(A, len(S))
    """ YOU COMPLETE THIS METHOD """
    return pi
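If you get stuck (do try it on your own first!), here is one possible way to wire the two routines together. It is only a sketch, under the assumption that you cycle until the policy is stable; the solution notebook may do it differently:

def policy_iteration_sketch(S, A, P, gamma=.99):
    # one possible body for the skeleton above: alternate evaluation and improvement
    pi = np.random.choice(A, len(S))
    while True:
        old_pi = pi.copy()
        V = policy_evaluation(pi, S, A, P, gamma)
        pi, V = policy_improvement(pi, V, S, A, P, gamma)
        # stop when the greedy policy no longer changes
        if np.all(old_pi == pi):
            break
    return pi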

After you implement the algorithm, you can run it and calculate the optimal policy:


In [29]:
mdir = tempfile.mkdtemp()
env = gym.make('FrozenLake-v0')
env = wrappers.Monitor(env, mdir, force=True)

S = range(env.env.observation_space.n)
A = range(env.env.action_space.n)
P = env.env.env.P

pi = policy_iteration(S, A, P)
print(pi)


[2017-04-26 00:54:49,917] Making new env: FrozenLake-v0
[2017-04-26 00:54:49,919] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmppra935u6')
[0 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]

And, of course, interact with the environment following the "directions", that is, the policy:


In [30]:
for _ in range(10000):
    state = env.reset()
    while True:
        state, reward, done, info = env.step(pi[state])
        if done:
            break


[2017-04-26 00:55:44,764] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000000.json
[2017-04-26 00:55:44,767] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000001.json
[2017-04-26 00:55:44,772] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000008.json
[2017-04-26 00:55:44,788] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000027.json
[2017-04-26 00:55:44,810] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000064.json
[2017-04-26 00:55:44,838] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000125.json
[2017-04-26 00:55:44,891] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000216.json
[2017-04-26 00:55:44,958] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000343.json
[2017-04-26 00:55:45,043] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000512.json
[2017-04-26 00:55:45,155] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video000729.json
[2017-04-26 00:55:45,295] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video001000.json
[2017-04-26 00:55:45,889] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video002000.json
[2017-04-26 00:55:46,418] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video003000.json
[2017-04-26 00:55:46,934] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video004000.json
[2017-04-26 00:55:47,441] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video005000.json
[2017-04-26 00:55:47,963] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video006000.json
[2017-04-26 00:55:48,473] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video007000.json
[2017-04-26 00:55:48,989] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video008000.json
[2017-04-26 00:55:49,492] Starting new video recorder writing to /tmp/tmp0oe_0gtp/openaigym.video.2.56.video009000.json

In [32]:
last_video = env.videos[-1][0]
out = check_output(["asciinema", "upload", last_video])
out = out.decode("utf-8").replace('\n', '').replace('\r', '')
print(out)


https://asciinema.org/a/c6phe9z2ntyy3y3lfflzwqwiy

In [34]:
castid = out.split('/')[-1]
html_tag = """
<script type="text/javascript" 
    src="https://asciinema.org/a/{0}.js" 
    id="asciicast-{0}" 
    async data-autoplay="true" data-size="big">
</script>
"""
html_tag = html_tag.format(castid)
HTML(data=html_tag)


Out[34]:

Similar to before. The policies could differ slightly if there is a state in which more than one action ends up with the same value.
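If you are curious, you can check for such ties by rebuilding Q from V and counting how many actions reach the maximum in each state (a rough sketch; np.isclose is used because the values only match up to the convergence threshold):

# states where more than one action is (numerically) tied for the best value
gamma = 0.99
Q = np.zeros((len(S), len(A)))
for s in S:
    for a in A:
        for prob, s_prime, reward, done in P[s][a]:
            Q[s][a] += prob * (reward + gamma * V[s_prime] * (not done))
print([s for s in S if np.isclose(Q[s], Q[s].max()).sum() > 1])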


In [35]:
V


Out[35]:
array([  9.82775479e-006,   4.77561742e-007,   8.29890013e-006,
         7.77646736e-006,   5.68794576e-006,   0.00000000e+000,
         3.38430298e-208,   0.00000000e+000,   8.92176447e-007,
         5.28039771e-006,   3.09721331e-006,   0.00000000e+000,
         0.00000000e+000,   9.53731304e-006,   9.80392157e-001,
         0.00000000e+000])

In [37]:
pi


Out[37]:
array([0, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])

That's it, let's wrap up.


In [38]:
env.close()


[2017-04-26 00:57:28,406] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmp0oe_0gtp')

If you want to submit to OpenAI Gym, get your API Key and paste it here:


In [134]:
gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')


[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data
[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)
[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video
[2017-04-01 20:40:56,451] 
****************************************************
You successfully uploaded your evaluation on FrozenLake-v0 to
OpenAI Gym! You can find it at:

    https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ

****************************************************

Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. Where is the intelligence in following given directions!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of two of the three most fundamental paradigms of reinforcement learning algorithms.

In the next notebooks we start looking into slightly more complicated environments. We will also learn about algorithms that learn while interacting with the environment, also called "online" learning.