Reinforcement Learning in Python

Section 1 - Introduction and Outline

Notes

  • Reinforcement learning is more different from supervised and unsupervised machine learning than those two are from each other.
  • Both supervised and unsupervised machine learning work on pre-collected data, labelled or unlabelled.
  • Reinforcement learning interfaces with an environment (simulated environment or through sensors).
  • The objective of RL is to minimize the cost / maximize the reward.
  • The agent takes actions, and feedback signals (in the form of a reward) are automatically given to the agent by the environment.
  • This course will only look at finite state space environments.

Tic-tac-toe example

1) What is the number of states?

  • The board has 9 locations.
  • Each location on the board has 3 possibilities: empty, X and O.
  • The example is simplified by not terminating the game when a player wins.
  • With that simplification, the game has the following number of states: $$ 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 \times 3 = 3^9 $$
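A quick check of the arithmetic (simply evaluating 3 to the 9th power):

```python
# Each of the 9 board locations can be empty, X or O, so the simplified
# state count is 3 raised to the power of 9.
num_states = 3 ** 9
print(num_states)  # 19683
```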

Important vocabulary

  • Agent: the thing we develop; it interacts with the environment, hopefully in an intelligent manner.
  • Environment: the real or simulated world providing sensory data to the agent in the form of states, a list of available actions, and a reward.
  • State: a configuration of the environment sensed by the agent.

Notes

  • The agent tries to maximize not only its immediate reward but future rewards as well.
  • The reward must be programmed intelligently to avoid unintended consequences.
  • The reward is always a real number.
  • SAR = State, Action, Reward
  • Every game is a sequence of states, actions and rewards
  • Convention: start in state S(t), take action A(t), receive a reward of R(t+1)
  • R(t+1) is always a result of the agent taking action A(t) in state S(t)
  • S(t) and A(t) result in environment changing to state S(t+1)
  • Triple [S(t), A(t), S(t+1)] is denoted as (s, a, s')
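As an illustration of this convention, an episode can be stored as a plain list of transitions. This is only a sketch; the state, action and reward values below are made up.

```python
from collections import namedtuple

# One transition: in state s the agent takes action a, receives reward r
# (this is R(t+1)) and the environment moves to state s_prime (S(t+1)).
Transition = namedtuple("Transition", ["s", "a", "r", "s_prime"])

# A hypothetical three-step episode stored as a list of (s, a, r, s') tuples.
episode = [
    Transition(s=0, a=1, r=0.0, s_prime=2),
    Transition(s=2, a=0, r=0.0, s_prime=5),
    Transition(s=5, a=1, r=1.0, s_prime=7),  # reward received at the end of the game
]

total_reward = sum(t.r for t in episode)
print(total_reward)  # 1.0
```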

Section 2 - Return of the Multi-Armed Bandit

Problem Setup

  • You go to a casino and choose between 3 slot machines.
  • Each machine returns a binary reward (0 or 1).
  • The win rates are unknown; they could be, for example, 0.1, 0.3 and 0.5 respectively.
  • The goal is to maximize your winnings.
  • The best choice / tactic can only be discovered by collecting data.
  • It is important to balance exploration (collecting the data) and exploitation (using the best-known tactic).

Epsilon Greedy Strategy

  • A big advantage of this strategy is its simplicity.
  • Choose a small number epsilon as probability of exploration (typically 5% or 10%).
  • For each round we generate a random number.
  • If the number is less than epsilon then we pull a random arm.
  • If the number is greater than epsilon, then we pull the arm with the currently best estimate.
  • In the long run, this theoretically allows us to explore each arm.
  • The problem with this approach is that with a fixed epsilon, e.g. 10%, we keep spending 10% of the time exploring even when it is no longer needed, and thus keep choosing suboptimal arms (see the sketch below).
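A minimal sketch of the selection rule described above, assuming we already keep one estimated mean reward per arm (the `estimates` values below are made up):

```python
import random

def choose_arm(estimates, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit
    the arm with the highest current estimate."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))  # explore: pull a random arm
    return max(range(len(estimates)), key=lambda j: estimates[j])  # exploit

# Example usage with made-up estimates for the three slot machines.
estimates = [0.12, 0.28, 0.47]
print("pulling arm", choose_arm(estimates, epsilon=0.1))
```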

Estimating Bandit Rewards

  • Bandit rewards are assumed to be non-binary (this approach works for a binary reward setup as well).
  • Reward tracking can be done with the mean value of the rewards from the previous episodes.
  • Mean value is given as: $$ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_{i} $$

  • This way of calculating the mean value is not scalable, since we would need to store all the results from the previous episodes.

  • This problem can be solved as follows: $$ \bar{X}_{N} = \frac{1}{N} \sum_{i=1}^{N} X_{i} = \frac{1}{N} \sum_{i=1}^{N-1} X_{i} + \frac{1}{N} X_{N} = \frac{N-1}{N} \bar{X}_{N-1} + \frac{1}{N} X_{N} $$
$$ \bar{X}_{N} = (1 - \frac{1}{N}) \bar{X}_{N-1} + \frac{1}{N} X_{N} $$
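A sketch of the constant-memory update from the last equation; only the current mean and the sample count need to be stored per arm:

```python
def update_mean(mean, n, x):
    """Incremental mean: X_bar_N = (1 - 1/N) * X_bar_{N-1} + (1/N) * X_N.
    `mean` is the estimate after n samples, `x` is the (n+1)-th sample."""
    n += 1
    return (1 - 1 / n) * mean + (1 / n) * x, n

# Feeding in rewards one at a time gives the same result as averaging them all at once.
rewards = [1, 0, 0, 1, 1]
mean, n = 0.0, 0
for r in rewards:
    mean, n = update_mean(mean, n, r)
print(mean, sum(rewards) / len(rewards))  # 0.6 0.6
```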

Code

Relevant files are:

  • 1-epsilon_greedy_strategy.py
  • 2_one-armed_bandit.py

Optimistic Initial Values

  • A simple approach to solving the explore-exploit dilemma is picking a high initial value for each arm's estimate.
  • We need some prior knowledge about the process (an upper bound on the reward) to pick such a value.
  • As data is collected, the estimated value of each arm goes down toward its true value.
  • All the estimates eventually converge to the true values, and because under-explored arms still carry optimistically high estimates, the purely greedy algorithm is forced to pick them (see the sketch below).
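A minimal sketch of the idea, assuming binary rewards and three arms with made-up true win rates; the only change compared to plain greedy selection is the optimistic starting estimate:

```python
import random

true_win_rates = [0.1, 0.3, 0.5]  # unknown to the agent; used only to simulate pulls
estimates = [10.0] * 3            # optimistic initial estimates, far above any real reward
counts = [1] * 3                  # the optimistic value counts as one (fake) sample,
                                  # so it decays gradually instead of being overwritten

for _ in range(5000):
    # Pure greedy selection: the optimism itself drives the exploration.
    j = max(range(3), key=lambda k: estimates[k])
    reward = 1.0 if random.random() < true_win_rates[j] else 0.0
    counts[j] += 1
    estimates[j] = (1 - 1 / counts[j]) * estimates[j] + (1 / counts[j]) * reward

print(estimates)  # each estimate drifts down toward its true win rate
```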

Code

Relevant files are:

  • 3_one-armed_bandit_optimistic_init_values.py

UCB1

  • The Chernoff-Hoeffding bound states that the confidence bound tightens exponentially with the number of samples we collect: $$ P\left\{ \left| \bar{X} - \mu \right| \geq \varepsilon \right\} \leq 2\exp\left\{ -2\varepsilon^{2}N \right\} $$

  • That leads to another, simpler expression, the UCB1 score: $$ X_{UCB1,j} = \bar{X}_{j} + \sqrt{\frac{2\ln N}{N_{j}}} $$ where:

    • N = total number of plays across all bandits
    • N_j = number of times bandit j has been played
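A sketch of the UCB1 selection rule built from the formula above (the simulated win rates are placeholders, and each arm is played once up front so that N_j is never zero):

```python
import math
import random

true_win_rates = [0.1, 0.3, 0.5]  # unknown to the agent; used only to simulate pulls
means = [0.0] * 3                 # sample-mean reward per arm
counts = [0] * 3                  # N_j: number of times arm j has been pulled

def pull(j):
    """Simulate one pull of arm j and update its running mean."""
    reward = 1.0 if random.random() < true_win_rates[j] else 0.0
    counts[j] += 1
    means[j] = (1 - 1 / counts[j]) * means[j] + (1 / counts[j]) * reward

for j in range(3):            # play each arm once so every N_j > 0
    pull(j)

for total in range(3, 5000):  # total = N, the overall number of pulls so far
    ucb = [means[j] + math.sqrt(2 * math.log(total) / counts[j]) for j in range(3)]
    pull(max(range(3), key=lambda j: ucb[j]))

print(means, counts)  # the best arm should accumulate the most pulls
```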

Code

Relevant files are:

  • 4-ucb1.py
