- Upper Confidence Tree (UCT)
- UCT applies bandit ideas to guide Monte-Carlo planning
- Used to find a near-optimal action in large state-space MDPs when a generative model of the MDP is available
- The main idea is to sample actions selectively using UCB1; the resulting algorithm is called Upper Confidence Bounds applied to Trees (UCT)
- If sampling can be restricted to half of the actions at each node, the work is reduced by a factor of (1/2)^D, where D is the lookahead depth
- An action is considered sub-optimal (by definition) if its value is less than that of the best action
- Action values depend on the value of successor states
- We need the estimation error of these values to decay quickly
- This leads to the typical exploration-exploitation dilemma
- The sampling probability of actions at child nodes changes over time, so the payoff sequences drift; UCT replaces UCB1's bias term with one that takes this drift into account
- Action selection is framed as a multi-armed bandit problem at each internal node of the tree
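A minimal sketch of this per-node selection rule, assuming a node object with per-action `visits` and `value` (running mean return) statistics; these names and the exploration constant `c` are illustrative, not taken from the paper:

```python
import math

def uct_select(node, c=1.0):
    """Choose an action at an internal tree node with a UCB1-style bound."""
    total_visits = sum(child.visits for child in node.children.values())
    best_action, best_bound = None, float("-inf")
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # try every action at least once
        # mean payoff plus an exploration bonus that shrinks as this
        # action is sampled more often relative to its siblings
        bound = child.value + c * math.sqrt(math.log(total_visits) / child.visits)
        if bound > best_bound:
            best_action, best_bound = action, bound
    return best_action
```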
Kearns et al. sampling-based lookahead search:
- Root is the initial state where we start the planning
- Tree with node labels alternating between states and state-actions
- Each state-labelled node is followed by a fixed number of nodes associated with the actions available at that state
- Each state-action node is followed by a fixed number of state-labelled nodes, obtained by sampling from the generative model of the MDP
- Sampled rewards are stored on the edges connecting state-action nodes to their successor state nodes
- The value of a state-action node is computed as the average, over the edges originating at that node, of the edge reward plus the value of the successor node
- State node value is the max of the values of its children
- Trees of a fixed size, independent of the number of states, are sufficient for near-optimal action selection
- Depth is a function of ε and γ
- Width is a function of K (number of actions), ε and γ
- Promising, but computationally expensive (a sketch of the procedure follows this list)
- This work (UCT) improves on this vanilla MC method in two respects:
- Small error probability if stopped prematurely
- Convergence to the best action eventually
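A rough sketch of the Kearns et al. sparse-sampling recursion described in the list above; the generative-model interface (`model.actions`, `model.sample`) and the parameter names are assumptions for illustration:

```python
def sparse_sample_value(model, state, depth, width, gamma):
    """Estimate the value of `state` with a fixed-depth, fixed-width lookahead."""
    if depth == 0:
        return 0.0
    best_q = float("-inf")
    for action in model.actions(state):
        total = 0.0
        # a fixed number (`width`) of sampled successors per state-action node;
        # the sampled reward lives on the edge to each successor
        for _ in range(width):
            next_state, reward = model.sample(state, action)
            total += reward + gamma * sparse_sample_value(
                model, next_state, depth - 1, width, gamma)
        q_value = total / width          # average of edge reward + successor value
        best_q = max(best_q, q_value)    # state value = max over its action children
    return best_q
```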
Rollout-based planning
- The Kearns et al. method builds the tree stage-wise
- Rollout-based planning builds the lookahead tree by repeatedly sampling episodes from the initial state
- If we keep encountering the same states, we can bias the choice of the actions to follow and potentially converge faster
- If we don't see the same states this method degenerates to the non-selective vanilla method
Algorithm scheme
- Generate episodes and return the action with the highest estimated long-term reward
- After each episode, update the value estimates at the visited depths and increment the counters for the visited state-action pairs (see the sketch below)
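A minimal sketch of this rollout-based scheme, assuming a generative model with `actions(state)` and `sample(state, action)` methods and hashable states; statistics are keyed by (depth, state, action), and the UCB1-style selection rule with constant `c` is an illustrative choice rather than the paper's exact bias term:

```python
import math
import random
from collections import defaultdict

def uct_plan(model, root_state, n_episodes, max_depth, gamma=1.0, c=1.0):
    """Rollout-based planning: sample episodes from the root, update per-depth
    statistics, and return the root action with the highest estimated return."""
    counts = defaultdict(int)    # visit counts, keyed by (depth, state, action)
    values = defaultdict(float)  # running mean of returns, same keys

    def select(state, depth):
        actions = model.actions(state)
        untried = [a for a in actions if counts[(depth, state, a)] == 0]
        if untried:
            return random.choice(untried)   # sample unseen actions first
        total = sum(counts[(depth, state, a)] for a in actions)
        return max(actions, key=lambda a: values[(depth, state, a)]
                   + c * math.sqrt(math.log(total) / counts[(depth, state, a)]))

    for _ in range(n_episodes):
        # sample one episode from the root
        state, visited, rewards = root_state, [], []
        for depth in range(max_depth):
            action = select(state, depth)
            next_state, reward = model.sample(state, action)
            visited.append((depth, state, action))
            rewards.append(reward)
            state = next_state
        # propagate the sampled return back up, updating means and counters
        ret = 0.0
        for (depth, s, a), r in zip(reversed(visited), reversed(rewards)):
            ret = r + gamma * ret
            counts[(depth, s, a)] += 1
            values[(depth, s, a)] += (ret - values[(depth, s, a)]) / counts[(depth, s, a)]

    # return the action at the root with the highest estimated long-term reward
    return max(model.actions(root_state), key=lambda a: values[(0, root_state, a)])
```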
UCB1
- Regret is the loss incurred by not always pulling the best arm
- For a large class of payoff distributions there is no policy whose regret grows slower than O(ln n); a policy is said to resolve the exploration-exploitation dilemma if its regret growth rate stays within a constant factor of this best possible rate
- UCB1 keeps track of the average reward of each arm and chooses the arm with the highest upper confidence bound (sketched below)
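A minimal UCB1 sketch following the standard formulation (mean reward plus a sqrt(2 ln n / n_i) bonus); the class and method names are illustrative:

```python
import math

class UCB1:
    """Track per-arm average rewards and pick the arm with the best upper
    confidence bound: mean_i + sqrt(2 * ln(n) / n_i)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # times each arm has been pulled
        self.means = [0.0] * n_arms   # running mean reward per arm
        self.total = 0                # total number of pulls

    def select(self):
        # pull every arm once before applying the confidence bound
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2.0 * math.log(self.total) / self.counts[i]))

    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```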