In [1]:
```
import numpy as np
import pandas as pd
import tempfile
import pprint
import json
import sys
import gym
from gym import wrappers
from subprocess import check_output
from IPython.display import HTML
```

The Value Iteration algorithm uses dynamic programming: it breaks the problem into overlapping sub-problems and exploits that optimal substructure to speed up computation.

Let me show you what value iteration looks like:

In [2]:
```
def value_iteration(S, A, P, gamma=.99, theta=0.0000001):
    V = np.random.random(len(S))
    for i in range(100000):
        old_V = V.copy()
        Q = np.zeros((len(S), len(A)), dtype=float)
        for s in S:
            for a in A:
                for prob, s_prime, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))
            V[s] = Q[s].max()
        if np.all(np.abs(old_V - V) < theta):
            break
    pi = np.argmax(Q, axis=1)
    return pi, V
```
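Before pointing the function at FrozenLake, here is a quick sanity check on a hypothetical two-state MDP (made up for illustration, not part of the environment below): state 0 has a "stay" action and a "finish" action that pays +1 and terminates.

```python
import numpy as np

def value_iteration(S, A, P, gamma=.99, theta=0.0000001):
    # Same algorithm as the cell above, restated so this snippet runs standalone.
    V = np.random.random(len(S))
    for i in range(100000):
        old_V = V.copy()
        Q = np.zeros((len(S), len(A)), dtype=float)
        for s in S:
            for a in A:
                for prob, s_prime, reward, done in P[s][a]:
                    Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))
            V[s] = Q[s].max()
        if np.all(np.abs(old_V - V) < theta):
            break
    pi = np.argmax(Q, axis=1)
    return pi, V

# Hypothetical dynamics: action 0 stays in state 0 for no reward,
# action 1 moves to the terminal state 1 for a +1 reward.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
pi, V = value_iteration(range(2), range(2), P)
print(pi, V)  # pi[0] should be 1 and V[0] should be 1.0
```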

In [3]:
```
mdir = tempfile.mkdtemp()
env = gym.make('FrozenLake-v0')
env = wrappers.Monitor(env, mdir, force=True)
```

In [4]:
```
S = range(env.env.observation_space.n)
A = range(env.env.action_space.n)
P = env.env.env.P
```
In [5]:
```
S
```

Out[5]:
```
```

In [6]:
```
A
```

Out[6]:
```
```

In [9]:
```
P[10]
```

Out[9]:
```
```

The world we are looking at, "FrozenLake-v0", has 16 different states and 4 different actions. `P[10]` gives us a peek into the dynamics of the world. For example, if you are in state 10 (hence `P[10]`) and take action 0 (see dictionary key 0), you have an equal probability of 0.3333 of landing in state 6, 9 or 14. None of those transitions gives you any reward, and none of them is terminal.

In contrast, we can see that taking action 2 might transition you to state 11, which **is** terminal.
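To make those tuples easier to read, here is a tiny helper that formats one state-action transition list. The fragment below is hand-typed from the transitions just described (only action 0 is shown), so treat it as illustrative rather than a dump of the real `P`:

```python
# Hand-typed fragment of the dynamics for state 10, action 0,
# based on the description above (illustrative, not the real dump).
P10_action0 = [(0.3333, 6, 0.0, False),
               (0.3333, 9, 0.0, False),
               (0.3333, 14, 0.0, False)]

def describe(transitions):
    # One readable line per (prob, next_state, reward, done) tuple.
    return ["p={:.4f} -> state {} (reward {}, {})".format(
                p, s_prime, r, "terminal" if done else "non-terminal")
            for p, s_prime, r, done in transitions]

for line in describe(P10_action0):
    print(line)
```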

Get the hang of it? Let's run it!

In [10]:
```
pi, V = value_iteration(S, A, P)
```

Value iteration gives us two things. First, it gives us `V`, which tells us how much we should expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:

In [12]:
```
V
```

Out[12]:
```
```

In [13]:
```
pi
```

Out[13]:
```
```

See? This policy basically says: in state `0`, take action `0`; in state `1`, take action `3`; in state `2`, take action `0`; and so on. Got it?

Now we have the "directions", or the "map". With this, we can just follow the policy and solve the environment as we interact with it.

Let's try it out!

In [14]:
```
for _ in range(10000):
    state = env.reset()
    while True:
        state, reward, done, info = env.step(pi[state])
        if done:
            break
```

That was the agent interacting with the environment. Let's take a look at some of the episodes:

In [16]:
```
last_video = env.videos[-1][0]
out = check_output(["asciinema", "upload", last_video])
out = out.decode("utf-8").replace('\n', '').replace('\r', '')
print(out)
```

You can follow that link or, better yet, display it right here in the notebook:

In [18]:
```
castid = out.split('/')[-1]
html_tag = """
<script type="text/javascript"
    src="https://asciinema.org/a/{0}.js"
    id="asciicast-{0}"
    async data-autoplay="true" data-size="big">
</script>
"""
html_tag = html_tag.format(castid)
HTML(data=html_tag)
```

Out[18]:
```
```

Interesting, right? Did you get the world yet?

'S' is the starting state, 'G' the goal. 'F' squares are frozen, and 'H' squares are holes. Your goal is to go from S to G without falling into any H. The problem is that the frozen surface is slippery, so often you are better off trying moves that seem counter-intuitive; because they keep you away from the 'H's, they make sense in the end. For example, on the 'F' in the second row, first column, you can see how hard our agent was trying to go left, smashing its head against the wall. Silly, right? But why?

In [19]:
```
P[4]
```

Out[19]:
```
```

See how action 0 (left) doesn't have any transition leading to a terminal state?

All the other actions give you a 0.3333 chance each of pushing you into the hole in state 5! So it actually makes sense to keep going left until the environment slips you down to state 8.

Cool right?

In [20]:
```
pi
```

Out[20]:
```
```

See how the "prescribed" action is 0 (left) in the policy calculated by value iteration?

How about the values?

In [21]:
```
V
```

Out[21]:
```
```

These are the expected returns from each state.

In [22]:
```
P[15]
```

Out[22]:
```
```

See how state '15' gives you a reward of +1? This signal gets propagated all the way back to the start state by value iteration, and that is what produces the values across all states.

Cool? Good.
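You can watch that propagation on a hypothetical three-state chain (made up for illustration, not from this environment): entering the terminal state pays +1, and each extra step away from it discounts the value by gamma:

```python
import numpy as np

gamma = 0.99
# Hypothetical 3-state chain: s0 -> s1 -> s2, where the transition
# into the terminal state s2 pays +1, much like entering the goal.
# Single action per state, so each entry is a list of
# (prob, next_state, reward, done) tuples.
P = {
    0: [(1.0, 1, 0.0, False)],
    1: [(1.0, 2, 1.0, True)],
    2: [(1.0, 2, 0.0, True)],
}
V = np.zeros(3)
for _ in range(100):  # plain value-iteration sweeps
    for s in P:
        V[s] = sum(p * (r + gamma * V[s2] * (not d)) for p, s2, r, d in P[s])
print(V)  # the +1 shows up at s1 undiscounted and at s0 discounted once
```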

In [39]:
```
env.close()
```

If you want to submit to OpenAI Gym, get your API Key and paste it here:

In [9]:
```
gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')
```

There is another method called policy iteration. It is composed of two sub-methods: policy evaluation and policy improvement. Policy evaluation computes the value function of the current policy; policy improvement then updates the policy greedily with respect to those values, applying something similar to a single step of value iteration to obtain a policy that is slightly better, and definitely not worse.

These two functions, cycling together until the policy stops changing, are what policy iteration is about.

Can you implement this algorithm yourself? Try it. Make sure to look at the solution notebook in case you get stuck.

I will give you the policy evaluation and policy improvement methods; you build policy iteration, cycling between evaluation and improvement until there are no changes to the policy.

In [24]:
```
def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):
    V = np.zeros(len(S))
    while True:
        delta = 0
        for s in S:
            v = V[s]
            V[s] = 0
            for prob, dst, reward, done in P[s][pi[s]]:
                V[s] += prob * (reward + gamma * V[dst] * (not done))
            delta = max(delta, np.abs(v - V[s]))
        if delta < theta:
            break
    return V
```
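As a quick sanity check on a hypothetical two-state MDP (made up for illustration): under a policy that takes the +1 "finish" action in state 0, that state should be worth exactly 1:

```python
import numpy as np

def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):
    # Same algorithm as the cell above, restated so this snippet runs standalone.
    V = np.zeros(len(S))
    while True:
        delta = 0
        for s in S:
            v = V[s]
            V[s] = 0
            for prob, dst, reward, done in P[s][pi[s]]:
                V[s] += prob * (reward + gamma * V[dst] * (not done))
            delta = max(delta, np.abs(v - V[s]))
        if delta < theta:
            break
    return V

# Hypothetical dynamics: action 0 stays in state 0, action 1 moves to
# the terminal state 1 for a +1 reward.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
V = policy_evaluation([1, 0], range(2), range(2), P)
print(V)  # state 0 is worth 1.0 under this policy
```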

In [1]:
```
def policy_improvement(pi, V, S, A, P, gamma=.99):
    for s in S:
        old_a = pi[s]
        Qs = np.zeros(len(A), dtype=float)
        for a in A:
            for prob, s_prime, reward, done in P[s][a]:
                Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))
        pi[s] = np.argmax(Qs)
        V[s] = np.max(Qs)
    return pi, V
```
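And one greedy improvement step on the same hypothetical two-state MDP: starting from an "always stay" policy with an all-zero value estimate, a single call already flips state 0 to the +1 action:

```python
import numpy as np

def policy_improvement(pi, V, S, A, P, gamma=.99):
    # Same algorithm as the cell above, restated so this snippet runs standalone.
    for s in S:
        Qs = np.zeros(len(A), dtype=float)
        for a in A:
            for prob, s_prime, reward, done in P[s][a]:
                Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))
        pi[s] = np.argmax(Qs)
        V[s] = np.max(Qs)
    return pi, V

# Hypothetical dynamics: action 0 stays in state 0, action 1 moves to
# the terminal state 1 for a +1 reward.
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
pi, V = policy_improvement(np.zeros(2, dtype=int), np.zeros(2),
                           range(2), range(2), P)
print(pi, V)  # the improved policy takes action 1 in state 0
```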

In [27]:
```
def policy_iteration(S, A, P, gamma=.99):
    pi = np.random.choice(A, len(S))
    """ YOU COMPLETE THIS METHOD """
    return pi
```

After you implement the algorithm, you can run it and calculate the optimal policy:

In [29]:
```
mdir = tempfile.mkdtemp()
env = gym.make('FrozenLake-v0')
env = wrappers.Monitor(env, mdir, force=True)
S = range(env.env.observation_space.n)
A = range(env.env.action_space.n)
P = env.env.env.P
pi = policy_iteration(S, A, P)
print(pi)
```

And, of course, interact with the environment by following the "directions", that is, the policy:

In [30]:
```
for _ in range(10000):
    state = env.reset()
    while True:
        state, reward, done, info = env.step(pi[state])
        if done:
            break
```

In [32]:
```
last_video = env.videos[-1][0]
out = check_output(["asciinema", "upload", last_video])
out = out.decode("utf-8").replace('\n', '').replace('\r', '')
print(out)
```

In [34]:
```
castid = out.split('/')[-1]
html_tag = """
<script type="text/javascript"
    src="https://asciinema.org/a/{0}.js"
    id="asciicast-{0}"
    async data-autoplay="true" data-size="big">
</script>
"""
html_tag = html_tag.format(castid)
HTML(data=html_tag)
```

Out[34]:
```
```

In [35]:
```
V
```

Out[35]:
```
```

In [37]:
```
pi
```

Out[37]:
```
```

That's it, let's wrap up.

In [38]:
```
env.close()
```

If you want to submit to OpenAI Gym, get your API Key and paste it here:

In [134]:
```
gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')
```

Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is intelligent about following given directions?! What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of reinforcement learning algorithms.

In the next notebooks we start looking into slightly more complicated environments, and we will learn about algorithms that learn while interacting with the environment, also called "online" learning.