Algorithmic Analysis for Value and Policy Iteration

Chapter 2 of Algorithms for Sequential Decision Making (Michael Littman's PhD thesis)


In [3]:
%matplotlib inline

In [4]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Some intuition for geometric series
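
For reference, the closed form checked numerically below: for $0 < \beta < 1$, $\sum_{i=1}^{\infty}\beta^i = \frac{\beta}{1-\beta}$ (writing $S$ for the sum, $S = \beta + \beta S$, hence $S = \frac{\beta}{1-\beta}$).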


In [5]:
def series(discount_factor, n=1e6):
    # Truncated geometric series sum_{i=1}^{n} discount_factor^i;
    # for large n this approaches discount_factor / (1 - discount_factor)
    assert 0 < discount_factor < 1, "Discount factor must be >0 and <1"
    return sum(discount_factor ** i for i in range(1, int(n) + 1))

In [ ]:
betas = [0.1, 0.25, 0.5, 0.75, 0.9, 0.99]

In [7]:
# By calculation
plt.plot(betas, [series(beta) for beta in betas]);



In [8]:
# By formula (beta / (1 - beta))
plt.plot(betas, [beta / (1 - beta) for beta in betas]);


Value Iteration

Lemma 2.2: determining the optimal infinite-horizon policy via value iteration takes a number of iterations proportional to $\frac{1}{1-\beta}\log\frac{1}{1-\beta}$ in the worst case
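
To make concrete what one "iteration" means here, a minimal sketch of tabular value iteration (my own function name and stopping rule, not code from the thesis), assuming the MDP is given as NumPy arrays P of shape $(|S|, |A|, |S|)$ and R of shape $(|S|, |A|)$:

In [ ]:
import numpy as np

def value_iteration(P, R, beta, tol=1e-8, max_iters=1_000_000):
    # One backup per iteration: Q(s, a) = R(s, a) + beta * sum_s' P(s, a, s') * V(s')
    V = np.zeros(P.shape[0])
    for i in range(max_iters):
        Q = R + beta * P @ V          # shape (|S|, |A|)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, i + 1       # value estimate and iterations used
        V = V_new
    return V, max_iters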


In [62]:
def action1(beta, t):
    # t-step discounted value of an action that yields reward -1 at every step
    return sum(-beta ** i for i in range(t))

def action2(beta, t):
    # Infinite-horizon value of an action whose -1 rewards start two steps later:
    # -beta^2 / (1 - beta); t is unused but kept so both share a signature
    return -beta ** 2 / (1 - beta)

In [92]:
betas = [0.1, 0.25, 0.5, 0.75, 0.9, 0.99]
steps = 20

rows = []
for beta in betas:
    for step in range(steps):
        rows.append({
            'step': step,
            'beta': beta,
            'a1': action1(beta, step),
            'a2': action2(beta, step)
        })
        
df = pd.DataFrame(rows)
df_ = pd.melt(df, id_vars=['step', 'beta'])

At some step, depending on $\beta$, the finite-horizon values reveal that $a_2$ is superior; the closer $\beta$ is to 1, the more iterations this takes, which is what Lemma 2.2 quantifies.


In [113]:
g = sns.FacetGrid(df_, col='beta', hue='variable', sharey=False)
g.map(sns.pointplot, 'step', 'value');
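
As a rough numeric check of the lemma, we can find the first step at which $a_2$ overtakes $a_1$ and compare it with $\frac{1}{1-\beta}\log\frac{1}{1-\beta}$ (a sketch reusing action1/action2 from above; crossover_step is a helper introduced here, and the lemma only claims proportionality, not equality):

In [ ]:
import math

def crossover_step(beta, max_steps=100_000):
    # First step t at which a2's (constant) value exceeds a1's t-step value
    for t in range(max_steps):
        if action2(beta, t) > action1(beta, t):
            return t
    return None

for beta in betas:
    bound = 1 / (1 - beta) * math.log(1 / (1 - beta))
    print(f"beta={beta}: crossover at step {crossover_step(beta)}, "
          f"1/(1-beta) * log(1/(1-beta)) = {bound:.1f}")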


Policy Iteration

  • There are $|A|^{|S|}$ distinct policies
  • Each iteration of PI strictly improves the policy
  • Since a strictly improving sequence cannot revisit a policy, PI terminates in at most an exponential number of steps
  • Each step of PI (see the sketch after this list):
    • VI-like policy improvement, $O(|A||S|^{2})$
    • Policy evaluation, $O(|S|^{3})$ (solving a linear system)
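
A minimal sketch of tabular policy iteration showing both steps (my own function and variable names, not code from the thesis); as above, it assumes the MDP is given as arrays P of shape $(|S|, |A|, |S|)$ and R of shape $(|S|, |A|)$:

In [ ]:
import numpy as np

def policy_iteration(P, R, beta, max_iters=1000):
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Policy evaluation: solve (I - beta * P_pi) V = R_pi, an O(|S|^3) linear solve
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - beta * P_pi, R_pi)
        # Policy improvement: one VI-like backup over all (s, a) pairs, O(|A||S|^2)
        Q = R + beta * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V

In practice the loop usually exits after far fewer than $|A|^{|S|}$ improvements; the exponential count is only the worst-case guarantee from the bullets above.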