Reinforcement Learning in Python

Section 1 - Introduction and Outline

Notes

  • Reinforcement learning is more different from supervised and unsupervised machine learning than those two are from each other.
  • Both supervised and unsupervised machine learning work on pre-collected data, labelled or unlabelled.
  • Reinforcement learning interfaces with an environment (simulated environment or through sensors).
  • The objective of RL is to minimize the cost / maximize the reward.
  • The agent takes actions, and feedback signals (in the form of a reward) are automatically given to the agent by the environment.
  • This course will only look at finite state space environments.

Tic-tac-toe example

1) What is the number of states?

  • The board has 9 locations.
  • Each location on the board has 3 possibilities: empty, X and O.
  • The example is simplified by not terminating the game when a player wins.
  • With that simplification, the game has the following number of states: $$ 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 = 3^9 $$
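A quick check of the arithmetic (simply evaluating 3 to the 9th power):

```python
# Each of the 9 board locations can be empty, X or O, so the simplified
# state count is 3 raised to the power of 9.
num_states = 3 ** 9
print(num_states)  # 19683
```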

Important vocabulary

  • Agent: the thing we develop; it interacts with the environment, hopefully in an intelligent manner.
  • Environment: the real or simulated world providing sensory data to the agent in the form of states, a list of available actions, and a reward.
  • State: a configuration of the environment sensed by the agent.

Notes

  • The agent tries to maximize not only its immediate reward but future rewards as well.
  • The reward must be programmed intelligently to avoid unintended consequences.
  • The reward is always a real number.
  • SAR = State, Action, Reward
  • Every game is a sequence of states, actions and rewards
  • Convention: start in state S(t), take action A(t), receive a reward of R(t+1)
  • R(t+1) is always a result of the agent taking action A(t) in state S(t)
  • S(t) and A(t) result in environment changing to state S(t+1)
  • Triple [S(t), A(t), S(t+1)] is denoted as (s, a, s')
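As an illustration of this convention, an episode can be stored as a plain list of transitions. This is only a sketch; the state, action and reward values below are made up.

```python
from collections import namedtuple

# One transition: in state s the agent takes action a, receives reward r
# (this is R(t+1)) and the environment moves to state s_prime (S(t+1)).
Transition = namedtuple("Transition", ["s", "a", "r", "s_prime"])

# A hypothetical three-step episode stored as a list of (s, a, r, s') tuples.
episode = [
    Transition(s=0, a=1, r=0.0, s_prime=2),
    Transition(s=2, a=0, r=0.0, s_prime=5),
    Transition(s=5, a=1, r=1.0, s_prime=7),  # reward received at the end of the game
]

total_reward = sum(t.r for t in episode)
print(total_reward)  # 1.0
```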

Section 2 - Return of the Multi-Armed Bandit

Problem Setup

  • You go to a casino and choose between 3 slot machines.
  • Each machine returns a binary reward (0 or 1).
  • The win rates are unknown; they could be, for example, 0.1, 0.3 and 0.5 respectively.
  • The goal is to maximize your winnings.
  • The best choice / tactic can only be discovered by collecting data.
  • It is important to balance exploration (collecting the data) and exploitation (using the best-known tactic).

Epsilon Greedy Strategy

  • A big advantage of this strategy is its simplicity.
  • Choose a small number epsilon as probability of exploration (typically 5% or 10%).
  • For each round we generate a random number.
  • If the number is less than epsilon then we pull a random arm.
  • If the number is greater than epsilon, then we pull the arm with the currently best estimate.
  • In the long run, this theoretically allows us to explore each arm.
  • The problem with this approach is that with a fixed epsilon, e.g. 10%, we keep spending 10% of the time exploring even when it is no longer needed, and thus keep choosing suboptimal arms (see the sketch below).
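A minimal sketch of the selection rule described above, assuming we already keep one estimated mean reward per arm (the `estimates` values below are made up):

```python
import random

def choose_arm(estimates, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit
    the arm with the highest current estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))  # explore: pull a random arm
    return max(range(len(estimates)), key=lambda j: estimates[j])  # exploit

# Example usage with made-up estimates for the three slot machines.
estimates = [0.12, 0.28, 0.47]
print("pulling arm", choose_arm(estimates, epsilon=0.1))
```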

Estimating Bandit Rewards

  • Bandit rewards are assumed to be non-binary (this approach works for a binary reward setup as well).
  • Reward tracking can be done with the mean value of the rewards from the previous episodes.
  • Mean value is given as: $$ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_{i} $$

  • This way of calculating the mean value is not scalable, since we would need to store all the results from the previous episodes.

  • This problem can be solved as follows: $$ \bar{X}_{N} = \frac{1}{N} \sum_{i=1}^{N} X_{i} = \frac{1}{N} \sum_{i=1}^{N-1} X_{i} + \frac{1}{N} X_{N} = \frac{N-1}{N} \bar{X}_{N-1} + \frac{1}{N} X_{N} $$
$$ \bar{X}_{N} = (1 - \frac{1}{N}) \bar{X}_{N-1} + \frac{1}{N} X_{N} $$
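A sketch of the constant-memory update from the last equation; only the current mean and the sample count need to be stored per arm:

```python
def update_mean(mean, n, x):
    """Incremental mean: X_bar_N = (1 - 1/N) * X_bar_{N-1} + (1/N) * X_N.
    `mean` is the estimate after n samples, `x` is the (n+1)-th sample."""
    n += 1
    return (1 - 1 / n) * mean + (1 / n) * x, n

# Feeding in rewards one at a time gives the same result as averaging them all at once.
rewards = [1, 0, 0, 1, 1]
mean, n = 0.0, 0
for r in rewards:
    mean, n = update_mean(mean, n, r)
print(mean, sum(rewards) / len(rewards))  # 0.6 0.6
```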

Code

Relevant files are:

  • 1-epsilon_greedy_strategy.py
  • 2_one-armed_bandit.py

Optimistic Initial Values

  • A simple approach to solving the explore-exploit dilemma is picking a high initial value for each arm's estimate.
  • We need some prior knowledge about the process (an upper bound on the reward) to pick such a value.
  • As data is collected, the estimated value of each arm goes down toward its true value.
  • All the estimates eventually converge to the true values, and because under-explored arms still carry optimistically high estimates, the purely greedy algorithm is forced to pick them (see the sketch below).
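A minimal sketch of the idea, assuming binary rewards and three arms with made-up true win rates; the only change compared to plain greedy selection is the optimistic starting estimate:

```python
import random

true_win_rates = [0.1, 0.3, 0.5]  # unknown to the agent; used only to simulate pulls
estimates = [10.0] * 3            # optimistic initial estimates, far above any real reward
counts = [1] * 3                  # the optimistic value counts as one (fake) sample,
                                  # so it decays gradually instead of being overwritten

for _ in range(5000):
    # Pure greedy selection: the optimism itself drives the exploration.
    j = max(range(3), key=lambda k: estimates[k])
    reward = 1.0 if random.random() < true_win_rates[j] else 0.0
    counts[j] += 1
    estimates[j] = (1 - 1 / counts[j]) * estimates[j] + (1 / counts[j]) * reward

print(estimates)  # each estimate drifts down toward its true win rate
```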

Code

Relevant files are:

  • 3_one-armed_bandit_optimistic_init_values.py

UCB1

  • The Chernoff-Hoeffding bound states that the confidence bound tightens exponentially with the number of samples we collect: $$ P\left\{ \left| \bar{X} - \mu \right| \geq \varepsilon \right\} \leq 2\exp\left\{ -2\varepsilon^{2}N \right\} $$

  • That leads to another, simpler expression, the UCB1 score: $$ X_{UCB1,j} = \bar{X}_{j} + \sqrt{\frac{2\ln N}{N_{j}}} $$ where:

    • N = total number of plays across all bandits
    • N_j = number of times bandit j has been played
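A sketch of the UCB1 selection rule built from the formula above (the simulated win rates are placeholders, and each arm is played once up front so that N_j is never zero):

```python
import math
import random

true_win_rates = [0.1, 0.3, 0.5]  # unknown to the agent; used only to simulate pulls
means = [0.0] * 3                 # sample-mean reward per arm
counts = [0] * 3                  # N_j: number of times arm j has been pulled

def pull(j):
    """Simulate one pull of arm j and update its running mean."""
    reward = 1.0 if random.random() < true_win_rates[j] else 0.0
    counts[j] += 1
    means[j] = (1 - 1 / counts[j]) * means[j] + (1 / counts[j]) * reward

for j in range(3):            # play each arm once so every N_j > 0
    pull(j)

for total in range(3, 5000):  # total = N, the overall number of pulls so far
    ucb = [means[j] + math.sqrt(2 * math.log(total) / counts[j]) for j in range(3)]
    pull(max(range(3), key=lambda j: ucb[j]))

print(means, counts)  # the best arm should accumulate the most pulls
```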

Code

Relevant files are:

  • 4-ucb1.py
