- Upper Confidence Tree (UCT)
- UCT applies bandit ideas to guide Monte-Carlo planning
- Used to find a near-optimal action in large state-space MDPs when a generative model of the MDP is available
- The main idea is to sample actions selectively using UCB1; the resulting algorithm is called Upper Confidence Bounds applied to Trees (UCT)
- If sampling can be restricted to half of the actions at each node, the work is reduced by a factor of (1/2)^D, where D is the lookahead depth
- An action is considered sub-optimal (by definition) if its value is less than that of the best action
- Action values depend on the value of successor states
- We need the estimation error of these values to decay quickly
- This leads to the typical exploration-exploitation dilemma
- The sampling probability of actions at child nodes changes over time, so the payoff sequences drift; UCT replaces UCB1's bias term with one that takes this drift into account
- Action selection is framed as a multi-armed bandit problem at each internal node of the tree
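A minimal sketch of this per-node selection rule, assuming a node object with per-action `visits` and `value` (running mean return) statistics; these names and the exploration constant `c` are illustrative, not taken from the paper:

```python
import math

def uct_select(node, c=1.0):
    """Choose an action at an internal tree node with a UCB1-style bound."""
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_bound = None, float("-inf")
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # try every action at least once
        # mean payoff plus an exploration bonus that shrinks as this
        # action is sampled more often relative to its siblings
        bound = child.value + c * math.sqrt(math.log(total_visits) / child.visits)
        if bound > best_bound:
            best_action, best_bound = action, bound
    return best_action
```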
Kearns et al. sampling-based lookahead search:
- Root is the initial state where we start the planning
- Tree with node labels alternating between states and state-actions
- Each state-labelled node is followed by a fixed number of nodes associated with the actions available at that state
- Each state-action node is followed by a fixed number of state-labelled nodes, obtained by sampling from the generative model of the MDP
- Sampled rewards are stored on the edges connecting state-action nodes to their successor state nodes
- The value of a state-action node is computed as the average, over the edges originating at that node, of the edge reward plus the value of the successor node
- State node value is the max of the values of its children
- Trees of a fixed size, independent of the number of states, are sufficient for near-optimal action selection
- Depth is a function of ε and γ
- Width is a function of K (number of actions), ε and γ
- Promising, but computationally expensive (a sketch of the procedure follows this list)
- This work (UCT) improves on this vanilla MC method in two respects:
- Small error probability if stopped prematurely
- Convergence to the best action eventually
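A rough sketch of the Kearns et al. sparse-sampling recursion described in the list above; the generative-model interface (`model.actions`, `model.sample`) and the parameter names are assumptions for illustration:

```python
def sparse_sample_value(model, state, depth, width, gamma):
    """Estimate the value of `state` with a fixed-depth, fixed-width lookahead."""
    if depth == 0:
        return 0.0
    best_q = float("-inf")
    for action in model.actions(state):
        total = 0.0
        # a fixed number (`width`) of sampled successors per state-action node;
        # the sampled reward lives on the edge to each successor
        for _ in range(width):
            next_state, reward = model.sample(state, action)
            total += reward + gamma * sparse_sample_value(
                model, next_state, depth - 1, width, gamma)
        q_value = total / width          # average of edge reward + successor value
        best_q = max(best_q, q_value)    # state value = max over its action children
    return best_q
```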
Rollout-based planning
- The Kearns et al. method builds the tree stage-wise
- Rollout-based planning builds the lookahead tree by repeatedly sampling episodes from the initial state
- If we keep encountering the same states, we can bias the choice of the actions to follow and potentially converge faster
- If we don't see the same states this method degenerates to the non-selective vanilla method
Algorithm scheme
- Generate episodes and return the action with the highest estimated long-term reward
- After each episode, update the value estimates at the visited depths and increment the counters for the visited state-action pairs (see the sketch below)
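A minimal sketch of this rollout-based scheme, assuming a generative model with `actions(state)` and `sample(state, action)` methods and hashable states; statistics are keyed by (depth, state, action), and the UCB1-style selection rule with constant `c` is an illustrative choice rather than the paper's exact bias term:

```python
import math
import random
from collections import defaultdict

def uct_plan(model, root_state, n_episodes, max_depth, gamma=1.0, c=1.0):
    """Rollout-based planning: sample episodes from the root, update per-depth
    statistics, and return the root action with the highest estimated return."""
    counts = defaultdict(int)    # visit counts, keyed by (depth, state, action)
    values = defaultdict(float)  # running mean of returns, same keys

    def select(state, depth):
        actions = model.actions(state)
        untried = [a for a in actions if counts[(depth, state, a)] == 0]
        if untried:
            return random.choice(untried)   # sample unseen actions first
        total = sum(counts[(depth, state, a)] for a in actions)
        return max(actions, key=lambda a: values[(depth, state, a)]
                   + c * math.sqrt(math.log(total) / counts[(depth, state, a)]))

    for _ in range(n_episodes):
        # sample one episode from the root
        state, visited, rewards = root_state, [], []
        for depth in range(max_depth):
            action = select(state, depth)
            next_state, reward = model.sample(state, action)
            visited.append((depth, state, action))
            rewards.append(reward)
            state = next_state
        # propagate the sampled return back up, updating means and counters
        ret = 0.0
        for (depth, s, a), r in zip(reversed(visited), reversed(rewards)):
            ret = r + gamma * ret
            counts[(depth, s, a)] += 1
            values[(depth, s, a)] += (ret - values[(depth, s, a)]) / counts[(depth, s, a)]

    # return the action at the root with the highest estimated long-term reward
    return max(model.actions(root_state), key=lambda a: values[(0, root_state, a)])
```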
UCB1
- Regret is the loss incurred by not always pulling the best arm
- For a large class of payoff distributions there is no policy whose regret grows slower than O(ln n); a policy is said to resolve the exploration-exploitation dilemma if its regret growth rate stays within a constant factor of this best possible rate
- UCB1 keeps track of the average reward of each arm and chooses the arm with the highest upper confidence bound (sketched below)
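A minimal UCB1 sketch following the standard formulation (mean reward plus a sqrt(2 ln n / n_i) bonus); the class and method names are illustrative:

```python
import math

class UCB1:
    """Track per-arm average rewards and pick the arm with the best upper
    confidence bound: mean_i + sqrt(2 * ln(n) / n_i)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # times each arm has been pulled
        self.means = [0.0] * n_arms   # running mean reward per arm
        self.total = 0                # total number of pulls

    def select(self):
        # pull every arm once before applying the confidence bound
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2.0 * math.log(self.total) / self.counts[i]))

    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```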