Machine Learning Engineer Nanodegree

Capstone

Project: Creating a custom RL algorithm

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

Introduction

Markov Decision Processes (MDPs) are useful for a wide variety of problems, but they have a key weakness: they assume the agent can perfectly sense all information relevant to the problem. This limits the range of problems they can solve optimally to those set in controlled and simplified environments where it is easy to add sensors.

Partially Observable MDPs (POMDPs) lack this weakness, broadening the range of problems an agent can solve at the cost of being far harder to solve. Recently, Cooperative Inverse Reinforcement Learning problems, a useful class of problems relevant to value alignment because they can be used to analyze the incentives of human-machine interaction, were shown to be reducible from a Multi-Agent MDP (MMDP) to a POMDP with a single agent [1]. Value alignment is arguably one of the most important problems to solve in AI, because even if we get everything else right, getting value alignment wrong would undo it all. I think it is reasonable to suspect that a superhuman AI would resist any change to its utility function, and if that is the case then, as Norbert Wiener (1960) wrote, “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively . . . we had better be quite sure that the purpose put into the machine is the purpose which we really desire.” The value alignment problem is the problem of how to create a reward or utility function that benefits humans, and whose logical endgame, if it were plugged into a superhuman AI, does not turn the universe into paperclips or something else humans deem of low value.

Problem Statement

The problem class I set out to solve is the POMDP. A POMDP is an environment consisting of a set of states, a transition function giving the next state's probability given an action and the previous state, and a set of observations available to the agent. A POMDP is reducible to a continuous MDP over the belief space, making it solvable by the Witness algorithm [2][3][4]. I wanted to try my hand at reducing the problem even further. I came up with a simple way of approximating a solution to a POMDP by arbitrarily splitting the belief space into discrete segments. This turns it into an MDP that can be solved with a Q-learning algorithm. I could not find a paper on this, but the approach is so simple that I would be surprised if no one else had tried it before. In any case, my key insight here is that a POMDP introduces the question "Which questions are the most relevant to solving a given problem?" Success of this approach can be measured by the average reward per trial. There is a possibility that the agent may never reach the goal state, so I have added a time limit: if time runs out, I end the simulation and the trial counts as a failure.
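
Below is a minimal sketch of the trial loop just described, with a step limit so that a trial that never reaches the goal counts as a failure. The environment/agent interface and the step-limit value are illustrative assumptions, not the project's actual code.

In [ ]:
# Minimal sketch of one evaluation trial with a time limit (assumed interface).
def run_trial(env, agent, max_steps=100):
    """Run one trial; return (total_reward, reached_goal)."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:                       # goal reached: the trial is a success
            return total_reward, True
    return total_reward, False         # time ran out: the trial counts as a failure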

Datasets and Inputs

I constructed two partially observable environments. Each environment feeds sense data and a reward to the agent. The first environment is the classic tiger problem. The agent stands between two doors, and behind one of them is a tiger. The agent cannot sense the tiger directly, making this a partially observable environment. The agent wants to get away from the tiger by opening the door with no tiger behind it; if she opens the door with the tiger, the tiger will maul her. There are three actions the agent can take: she can listen for the tiger (but the sound of the tiger is scary, so she would not want to do it too much), she can open the door on her right, or she can open the door on her left. I also added a hallway to my version of the environment.
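
A minimal sketch of the classic tiger problem follows (without the extra hallway mentioned above). The reward values and the 0.85 listening accuracy are illustrative assumptions, not the project's exact numbers.

In [ ]:
import random

class TigerEnv:
    """Sketch of the tiger problem: listen, open-left, or open-right."""
    ACTIONS = ['listen', 'open-left', 'open-right']

    def reset(self):
        self.tiger = random.choice(['left', 'right'])
        return None                              # nothing has been heard yet

    def step(self, action):
        if action == 'listen':
            # Hear the tiger behind the correct door most of the time.
            if random.random() < 0.85:
                heard = self.tiger
            else:
                heard = 'left' if self.tiger == 'right' else 'right'
            return heard, -1.0, False            # listening carries a small penalty
        opened = 'left' if action == 'open-left' else 'right'
        reward = -100.0 if opened == self.tiger else 10.0
        return None, reward, True                # the trial ends once a door is opened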

The second environment has four states arranged in a 4x1 grid. Cell (0,3) is the goal state. Each cell has a color: if the agent is in the goal state she observes green, everywhere else she observes blue. The agent can move left or right. If she moves left in the leftmost cell she does not move, and if she moves right in the rightmost cell she does not move. If she moves left or right while in the goal state, she is teleported to a random blue state with uniform probability (about a 1/3 chance for each blue cell).
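
A minimal sketch of this 4x1 grid follows. The reward values and the choice to end a trial once the goal is reached are assumptions for illustration.

In [ ]:
import random

class GridEnv:
    """Sketch of the 4x1 grid: cell 3 is the goal ('green'), the rest are 'blue'."""
    ACTIONS = ['left', 'right']
    GOAL = 3

    def reset(self):
        self.pos = random.choice([0, 1, 2])      # start in a random blue cell
        return 'blue'

    def step(self, action):
        if self.pos == self.GOAL:
            self.pos = random.choice([0, 1, 2])  # teleport: ~1/3 chance per blue cell
        elif action == 'left':
            self.pos = max(0, self.pos - 1)      # bumping the left wall does nothing
        else:
            self.pos = min(self.GOAL, self.pos + 1)
        done = self.pos == self.GOAL
        obs = 'green' if done else 'blue'
        reward = 1.0 if done else -0.1
        return obs, reward, done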

Solution Statement

The agent will have a model of the environment represented as an array of probabilities, with one entry per state, whose elements sum to 1. The agent will update this array whenever it receives a new observation from the environment. The transition function also has to be known to the agent in order to update the belief state.
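
A minimal sketch of that update (a standard Bayes-filter step) is shown below. Here actions and observations are integer indices, T[a][s, s'] is the known transition model, and O[a][s', o] is the observation model; the names and array shapes are assumptions for illustration.

In [ ]:
import numpy as np

def update_belief(belief, action, observation, T, O):
    """Bayes filter: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    predicted = T[action].T @ belief               # predicted next-state distribution
    new_belief = O[action][:, observation] * predicted
    return new_belief / new_belief.sum()           # renormalize so the entries sum to 1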

The array of probabilities is used to construct the states of the MDP by splitting the belief space into discrete regions of arbitrary size. These regions all lie on the same hyperplane, which is embedded in a space with as many dimensions as there are states, and the belief is a point on this hyperplane. When the belief enters one of these regions, the agent is said to have transitioned to that region.

For some resolution $r$ and some number of states $n$ in a POMDP, the number of regions equals $r^{n-1}$. So if the resolution is 3 and there are 2 states, there are $3^{2-1} = 3$ regions; if the resolution is 3 and there are 4 possible states, there are $3^{4-1} = 27$ regions.
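
One simple way to realize this scheme is to bucket the first $n-1$ belief components into $r$ bins each (the last component is implied by the sum-to-one constraint) and read the bins as digits of a base-$r$ number. This particular binning is an illustrative assumption, not necessarily the exact scheme used in the project.

In [ ]:
import numpy as np

def belief_to_region(belief, resolution):
    """Map a belief vector to a discrete region index in [0, resolution**(n-1))."""
    bins = np.minimum((np.asarray(belief[:-1]) * resolution).astype(int),
                      resolution - 1)              # clip the edge case where a component is 1.0
    index = 0
    for b in bins:                                 # treat the bins as base-`resolution` digits
        index = index * resolution + int(b)
    return index

# With resolution 3 and 4 states there are 3**3 = 27 possible regions.
print(belief_to_region([0.1, 0.2, 0.3, 0.4], resolution=3))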

I project the high-dimensional belief space onto a two-dimensional regular polygon so it is easier to visualize and debug. A visualization of this polygon, showing which belief region the agent currently occupies, is displayed while the simulation is running.
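
A sketch of one such projection is below: each state sits at a vertex of a regular polygon and the belief is drawn at the probability-weighted average of the vertices. The actual plotting code may differ; this only illustrates the idea.

In [ ]:
import numpy as np

def belief_to_point(belief):
    """Project a belief vector onto a regular n-gon (one vertex per state)."""
    n = len(belief)
    angles = 2 * np.pi * np.arange(n) / n          # vertex angles of the regular n-gon
    vertices = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.asarray(belief) @ vertices           # 2D point inside the polygon

print(belief_to_point([0.25, 0.25, 0.25, 0.25]))   # a uniform belief maps to (roughly) the origin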


Benchmark Model

A Q-learner is then applied to this MDP to solve it to the best of its ability.

I will then contrast this with the benchmark of running a Q-learning algorithm directly on the POMDP.

I will borrow the reliability metric from the Smartcab project to gauge its success. I will then compare the vanilla Q-learning algorithm against the updated algorithm to see how they both perform on the same metric.

Reliability is measured using a letter-grade system as follows:

Grade   Reliability
A+      Agent solves the POMDP for 100% of trials.
A       Agent solves the POMDP for at least 90% of trials.
B       Agent solves the POMDP for at least 80% of trials.
C       Agent solves the POMDP for at least 70% of trials.
D       Agent solves the POMDP for at least 60% of trials.
F       Agent fails to solve the POMDP on time for at least 60% of trials.
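
For reference, a small helper that maps a success rate (fraction of trials in which the agent solved the POMDP) to these grades might look like the sketch below; the function name is mine.

In [ ]:
def reliability_grade(success_rate):
    """Map a success rate in [0, 1] to the letter grades in the table above."""
    if success_rate >= 1.0:
        return 'A+'
    for grade, threshold in [('A', 0.9), ('B', 0.8), ('C', 0.7), ('D', 0.6)]:
        if success_rate >= threshold:
            return grade
    return 'F'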

In [2]:
# Import the visualization code
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

Basic POMDP agent 4x1 simulation results with no learning


In [18]:
# Load the 'sim_no-learning' log file from the initial simulation results
vs.plot_trials('sim_no-learning.csv')


Basic Q-learning agent 4x1 simulation results with no learning


In [14]:
# Load the 'qsim_no-learning' file from the default Q-Learning simulation
vs.plot_trials('qsim_no-learning.csv')



In [19]:
# Load the 'qsim_default-learning' file from the default Q-Learning simulation
vs.plot_trials('qsim_default-learning.csv')



In [22]:
# Load the 'sim_default-learning' file from the modified (belief-based) agent simulation
vs.plot_trials('sim_default-learning.csv')



In [23]:
# Load the 'qtiger_sim_no-learning' file from the default Q-Learning agent on the tiger problem
vs.plot_trials('qtiger_sim_no-learning.csv')



In [24]:
# Load the 'tiger_sim_no-learning' file from the modified (belief-based) agent on the tiger problem
vs.plot_trials('tiger_sim_no-learning.csv')



In [26]:
# Load the 'qtiger_sim_default-learning' file from the default Q-Learning agent on the tiger problem
vs.plot_trials('qtiger_sim_default-learning.csv')



In [ ]:
# Load the 'tiger_sim_default-learning' file from the modified (belief-based) agent on the tiger problem
vs.plot_trials('tiger_sim_default-learning.csv')

Question 6

Using the visualization above that was produced from your default Q-Learning simulation, provide an analysis and make observations about the driving agent like in Question 3. Note that the simulation should have also produced the Q-table in a text file which can help you make observations about the agent's learning. Some additional things you could consider:

  • Are there any observations that are similar between the basic driving agent and the default Q-Learning agent?
  • Approximately how many training trials did the driving agent require before testing? Does that number make sense given the epsilon-tolerance?
  • Is the decaying function you implemented for $\epsilon$ (the exploration factor) accurately represented in the parameters panel?
  • As the number of training trials increased, did the number of bad actions decrease? Did the average reward increase?
  • How does the safety and reliability rating compare to the initial driving agent?

Answer: It took 20 trials before the testing phase began, which makes sense given the default tolerance of 0.05. The epsilon decay function decays at a constant rate, so I would expect the parameter panel to show a straight line with a downward slope.
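
For reference, a constant-rate decay of this kind might look like the sketch below; the 0.05 step per trial is an assumption that matches the roughly 20-trial training phase described above.

In [ ]:
def linear_epsilon(trial, initial=1.0, step=0.05):
    """Constant-rate epsilon decay: a straight downward line in the parameters panel."""
    return max(0.0, initial - step * trial)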

The only real similarity between the initial driving agent and the new one is the safety and reliability rating, which they both failed.

As the number of training trials increased the average reward did increase.


Improve the Q-Learning Driving Agent

The third step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step will require a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes: In theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act. When improving on your Q-Learning implementation, consider the implications it creates and whether it is logistically sensible to make a particular adjustment.

Improved Q-Learning Simulation Results

To obtain results from the improved Q-Learning implementation, you will need to adjust the following flags and setup:

  • 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
  • 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
  • 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
  • 'learning' - Set this to 'True' to tell the driving agent to use your Q-Learning implementation.
  • 'optimized' - Set this to 'True' to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

  • 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
  • 'alpha' - Set this to a real number between 0 - 1 to adjust the learning rate of the Q-Learning algorithm.
  • 'epsilon' - Set this to a real number between 0 - 1 to adjust the starting exploration factor of the Q-Learning algorithm.
  • 'tolerance' - set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

Furthermore, use a decaying function of your choice for $\epsilon$ (the exploration factor). Note that whichever function you use, it must decay to 'tolerance' at a reasonable rate. The Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for $t$, the number of trials):

$$ \epsilon = a^t, \textrm{for } 0 < a < 1 \hspace{50px}\epsilon = \frac{1}{t^2}\hspace{50px}\epsilon = e^{-at}, \textrm{for } 0 < a < 1 \hspace{50px} \epsilon = \cos(at), \textrm{for } 0 < a < 1$$
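
Written out as Python functions of the trial number $t$ (for $t \geq 1$), these example schedules look like the sketch below; the constants are placeholders.

In [ ]:
import numpy as np

# Example epsilon-decay schedules; a is a free constant in (0, 1).
def eps_power(t, a=0.99):
    return a ** t

def eps_inverse_square(t):
    return 1.0 / t ** 2

def eps_exp(t, a=0.01):
    return np.exp(-a * t)

def eps_cosine(t, a=0.01):          # only sensible while a * t stays below pi/2
    return np.cos(a * t)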

You may also use a decaying function for $\alpha$ (the learning rate) if you so choose, however this is typically less common. If you do so, be sure that it adheres to the inequality $0 \leq \alpha \leq 1$.

If you have difficulty getting your implementation to work, try setting the 'verbose' flag to True to help debug. Flags that have been set here should be returned to their default setting when debugging. It is important that you understand what each flag does and how it affects the simulation!

Once you have successfully completed the improved Q-Learning simulation, run the code cell below to visualize the results. Note that log files are overwritten when identical simulations are run, so be careful with what log file is being loaded!


In [5]:
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
vs.plot_trials('sim_improved-learning.csv')


Question 7

Using the visualization above that was produced from your improved Q-Learning simulation, provide a final analysis and make observations about the improved driving agent like in Question 6. Questions you should answer:

  • What decaying function was used for epsilon (the exploration factor)?
  • Approximately how many training trials were needed for your agent before beginning testing?
  • What epsilon-tolerance and alpha (learning rate) did you use? Why did you use them?
  • How much improvement was made with this Q-Learner when compared to the default Q-Learner from the previous section?
  • Would you say that the Q-Learner results show that your driving agent successfully learned an appropriate policy?
  • Are you satisfied with the safety and reliability ratings of the Smartcab?

Answer:

I used a modified sigmoid function, chosen somewhat arbitrarily. I ran around 93,000 training trials before testing began, although I suspect that only somewhere between 30,000 and 40,000 trials were really necessary, because that is where the reliability rate seems to plateau. The epsilon-tolerance I used hardly mattered because of the nature of the decay function: I kept the tolerance at its default level, and whenever I wanted more trials to occur I simply changed a few custom variables in the decay function.
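
A sigmoid-shaped decay of this kind might look like the sketch below; the midpoint and steepness constants are placeholders, not the values actually used in the project.

In [ ]:
import numpy as np

def sigmoid_epsilon(trial, midpoint=45000, steepness=1e-4):
    """Sigmoid-shaped epsilon decay: stays near 1 early, then falls toward 0."""
    return 1.0 / (1.0 + np.exp(steepness * (trial - midpoint)))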

A comparison between the two decay functions is hard to make because the resolution of the graph is so small. What I can say is that the optimized decay function allows enough exploration that the Smartcab is able to learn the targeted strategy. Accidents and violations dropped off quickly with the optimized decay function and eventually disappeared completely, while the unoptimized decay function does not show this behavior. That being said, there was a significant improvement in the safety and reliability ratings. I am mildly satisfied with the A+ safety rating and A reliability rating, but admit that an A+ rating on reliability would be much better.

Define an Optimal Policy

Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, you can concretely define what it is the agent is trying to learn, and that is the U.S. right-of-way traffic laws. Since these laws are known information, you can further define, for each state the Smartcab is occupying, the optimal action for the driving agent based on these laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly" not only by the reward (penalty) it receives, but also by pure observation. If the agent drives through a red light, we both see it receive a negative reward but also know that it is not the correct behavior. This can be used to your advantage for verifying whether the policy your driving agent has learned is the correct one, or if it is a suboptimal policy.

Question 8

Provide a few examples (using the states you've defined) of what an optimal policy for this problem would look like. Afterwards, investigate the 'sim_improved-learning.txt' text file to see the results of your improved Q-Learning algorithm. For each state that has been recorded from the simulation, is the policy (the action with the highest value) correct for the given state? Are there any states where the policy is different than what would be expected from an optimal policy? Provide an example of a state and all state-action rewards recorded, and explain why it is the correct policy.

Answer: There are a lot of states, but it is fairly clear what the optimal policy looks like: move forward only when the light is green and no cars from the right or left are moving forward; only move left when the light is green and oncoming traffic is not moving right or forward; only move right on a red light when traffic on the left is not moving forward; and always move when the light is green, except when oncoming traffic turns left.

If I were to see the memory of the Q-learner and I knew it had an optimal strategy, I would expect to see states like these ... ('left', 'red', 'left', None, None, 8) -- forward : -9.86 -- right : -0.08 -- None : 0.65 -- left : -9.98

This is a good policy because the waypoint is to the left, but the agent can't move left because of the red light.

('forward', 'green', 'forward', 'forward', 'forward', 26) -- forward : 1.93 -- right : 1.90 -- None : 0.00 -- left : 0.00

This is a good policy because the waypoint is forward, the light is green, and there is no traffic in the way.

('left', 'green', 'right', None, 'right', 24) -- forward : 0.00 -- right : 0.42 -- None : 0.54 -- left : -20.84

This is a good policy because the waypoint is left and the light is green, but there is traffic in the way. ...

In reality, I see that the agent found the optimal policy for some states; in others, it did not seem to find it. A lot of actions still had their default values, which means the agent never got a chance to learn the reward for them.


Optional: Future Rewards - Discount Factor, 'gamma'

Curiously, as part of the Q-Learning algorithm, you were asked to not use the discount factor, 'gamma' in the implementation. Including future rewards in the algorithm is used to aid in propagating positive rewards backwards from a future state to the current state. Essentially, if the driving agent is given the option to make several actions to arrive at different states, including future rewards will bias the agent towards states that could provide even more rewards. An example of this would be the driving agent moving towards a goal: With all actions and rewards equal, moving towards the goal would theoretically yield better rewards if there is an additional reward for reaching the goal. However, even though in this project, the driving agent is trying to reach a destination in the allotted time, including future rewards will not benefit the agent. In fact, if the agent were given many trials to learn, it could negatively affect Q-values!
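
For comparison, the two update rules might be sketched as below (with Q as a dict of per-state action-value dicts; the names are illustrative): the rule used here omits the discounted future term, while the standard rule includes it.

In [ ]:
def q_update_no_gamma(Q, state, action, reward, alpha):
    """Update used when future rewards are ignored (gamma = 0)."""
    Q[state][action] += alpha * (reward - Q[state][action])

def q_update_with_gamma(Q, state, action, reward, next_state, alpha, gamma):
    """Standard Q-learning update with a discounted future-reward term."""
    target = reward + gamma * max(Q[next_state].values())
    Q[state][action] += alpha * (target - Q[state][action])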

Optional Question 9

There are two characteristics about the project that invalidate the use of future rewards in the Q-Learning algorithm. One characteristic has to do with the Smartcab itself, and the other has to do with the environment. Can you figure out what they are and why future rewards won't work for this project?

Answer:

Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.


In [ ]:


In [ ]: