Bandit based Monte-Carlo planning

  • Upper Confidence Tree (UCT)
  • UCT applies bandit ideas to guide Monte-Carlo planning
  • Used to find a near-optimal action in large state-space MDPs when we have a generative model of the MDP
  • The main idea is to sample actions selectively using UCB1; the resulting algorithm is called Upper Confidence bounds applied to Trees (UCT)
    • If we can restrict sampling to 1/2 of the actions, the work is reduced by a factor of (1/2)^D, where D is the lookahead depth
    • An action is considered sub-optimal (by definition) if its value is less than that of the best action
    • Action values depend on the values of successor states
    • We need the estimation error of these values to decay quickly
      • This leads to the typical exploration-exploitation dilemma
  • The sampling probability of actions at child nodes changes over time, so the payoff sequences drift; UCT replaces UCB1's bias term with one that takes this non-stationarity into account.
  • Action selection at each internal node of the tree is framed as a multi-armed bandit problem (see the sketch below).
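
A minimal sketch of treating each internal node as a bandit, assuming a hypothetical TreeNode class with per-action visit counts and running mean payoffs; the plain UCB1-style bonus and the exploration constant c are placeholders, since the actual UCT bias term is modified to handle the drifting payoffs noted above.

```python
import math

class TreeNode:
    """An internal node of the lookahead tree, treated as a multi-armed bandit."""

    def __init__(self, actions, c=1.0):
        self.c = c                                       # exploration constant (assumed)
        self.visits = 0                                  # times this node was reached
        self.action_visits = {a: 0 for a in actions}     # times each action was tried here
        self.action_values = {a: 0.0 for a in actions}   # running mean payoff per action

    def select_action(self):
        # Try every action once, then maximize mean payoff + UCB1-style bonus.
        for a, n in self.action_visits.items():
            if n == 0:
                return a
        return max(
            self.action_visits,
            key=lambda a: self.action_values[a]
            + self.c * math.sqrt(math.log(self.visits) / self.action_visits[a]),
        )

    def update(self, action, payoff):
        # Fold one sampled long-term payoff into the running statistics.
        self.visits += 1
        self.action_visits[action] += 1
        n = self.action_visits[action]
        self.action_values[action] += (payoff - self.action_values[action]) / n
```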

Kearns et al.'s sampling-based lookahead search:

  • Root is the initial state where we start the planning
  • Tree with node labels alternating between states and state-actions
  • Each node labeled by a state is followed by a fixed # of nodes associated with the actions available at that state
  • Each node labeled by a state-action pair is followed by a fixed # of state-labelled nodes, sampled from the generative model of the MDP
  • Sampled rewards are stored on the edges connecting state-action nodes to their successor state nodes
  • The value of a state-action node is the average, over the edges originating at the node, of the sampled reward on the edge plus the value of the successor state node
  • The value of a state node is the max of the values of its children (see the sketch after this list)
  • Fixed sized trees are sufficient under certain conditions
    • Depth is a function of ε and γ
    • Width is a function of K (number of actions), ε and γ
  • Promising but expensive
  • This work achieves 2 characteristics that this vanilla MC method lacks:
    1. Small error probability if stopped prematurely
    2. Convergence to the best action eventually
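
A minimal sketch of this stage-wise lookahead, assuming a hypothetical generative model sample(state, action) -> (next_state, reward); depth and width stand in for the ε- and γ-dependent sizes, and gamma is the discount factor.

```python
def sparse_sampling_value(state, depth, width, actions, sample, gamma=0.95):
    """Estimate V(state) with a fixed-depth, fixed-width lookahead tree.

    sample(state, action) -> (next_state, reward) is an assumed generative model.
    Every state node expands all K actions and every state-action node gets
    `width` sampled successors, so the work grows like (K * width) ** depth.
    """
    if depth == 0:
        return 0.0
    q_values = []
    for a in actions:
        total = 0.0
        for _ in range(width):
            next_state, reward = sample(state, a)
            # Reward on the edge plus the value of the sampled successor state node.
            total += reward + gamma * sparse_sampling_value(
                next_state, depth - 1, width, actions, sample, gamma)
        q_values.append(total / width)   # state-action node: average over its edges
    return max(q_values)                 # state node: max over its action children
```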

Rollout-based planning

  • The Kearns method builds the tree stage-wise
  • Rollout-based builds the lookahead tree by repeatedly sampling episodes from the initial state
  • If we keep encountering the same states, we can bias the choice of the actions to follow and potentially converge faster (see the sketch below)
  • If we don't see the same states, this method degenerates to the non-selective vanilla method
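
A minimal sketch of how revisiting a state lets later episodes bias their action choice; the (depth, state, action)-keyed tables and the greedy fallback are illustrative assumptions (UCT itself adds a UCB-style exploration bonus rather than choosing greedily).

```python
import random
from collections import defaultdict

# Statistics accumulated across episodes, keyed by (depth, state, action).
visit_counts = defaultdict(int)       # how often the pair was tried at that depth
value_estimates = defaultdict(float)  # running mean of the observed returns

def choose_action(depth, state, actions):
    """Bias the choice toward actions that looked good in earlier episodes.

    If (depth, state) has never been seen, all counts are zero and the choice
    is uniform -- exactly the non-selective, vanilla behaviour.
    """
    untried = [a for a in actions if visit_counts[(depth, state, a)] == 0]
    if untried:
        return random.choice(untried)
    return max(actions, key=lambda a: value_estimates[(depth, state, a)])
```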

Algorithm scheme

  • Generate episodes from the initial state and return the action with the highest estimated long-term reward
  • After each episode, update the value estimate at the given depth and increment the counter for each visited state-action (see the sketch below)
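
A minimal sketch of this scheme, assuming a generative model sample(state, action) -> (next_state, reward), hashable states, and a pluggable select_action helper (e.g. uniform, greedy, or UCB-based); it is not the paper's exact pseudocode.

```python
from collections import defaultdict

def plan(root_state, actions, sample, select_action,
         n_episodes=1000, horizon=50, gamma=1.0):
    """Generic rollout-based planning scheme (a sketch, not the paper's code)."""
    counts = defaultdict(int)    # (depth, state, action) -> visit counter
    values = defaultdict(float)  # (depth, state, action) -> estimated long-term reward

    for _ in range(n_episodes):
        # 1. Generate an episode from the initial state, recording what was visited.
        state, path, rewards = root_state, [], []
        for depth in range(horizon):
            action = select_action(depth, state, actions, counts, values)
            next_state, reward = sample(state, action)
            path.append((depth, state, action))
            rewards.append(reward)
            state = next_state

        # 2. Update: walk the episode backwards, increment the counters and fold
        #    the observed long-term reward into the running mean at each depth.
        ret = 0.0
        for (depth, s, a), reward in zip(reversed(path), reversed(rewards)):
            ret = reward + gamma * ret
            key = (depth, s, a)
            counts[key] += 1
            values[key] += (ret - values[key]) / counts[key]

    # 3. Return the root action with the highest estimated long-term reward.
    return max(actions, key=lambda a: values[(0, root_state, a)])
```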

UCB1

  • Regret is the loss incurred by not always picking the best arm
  • For a large class of payoff distributions there is no policy whose regret grows slower than O(ln n). For these distributions the exploration-exploitation dilemma is considered resolved if the regret growth rate is within a constant factor of this best possible rate
  • Keeps track of the average reward per arm and chooses the arm with the best upper confidence bound (see the sketch below)
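
A minimal sketch of the UCB1 rule, which plays the arm maximizing the empirical mean plus the bonus sqrt(2 ln n / n_j); the function and argument names are placeholders.

```python
import math

def ucb1_choose(arm_counts, arm_means, total_plays):
    """UCB1: pick the arm maximizing  mean_j + sqrt(2 * ln(n) / n_j).

    arm_counts[j] -- times arm j has been played so far
    arm_means[j]  -- average reward observed from arm j
    total_plays   -- n, the total number of plays across all arms
    """
    for j, n_j in enumerate(arm_counts):
        if n_j == 0:
            return j   # play every arm once before applying the bound
    return max(
        range(len(arm_counts)),
        key=lambda j: arm_means[j]
        + math.sqrt(2 * math.log(total_plays) / arm_counts[j]),
    )
```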
