In [2]:
import gym
import numpy as np
In [3]:
env = gym.make("Taxi-v2")
In [37]:
env.reset() # reset the env; returns the initial (random) state as an integer
Out[37]:
In [38]:
env.observation_space.n # number of possible values in this state space
Out[38]:
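Taxi-v2 has 500 discrete states: 25 taxi positions on the 5x5 grid, times 5 passenger locations (the four marked cells plus inside the taxi), times 4 destinations. A quick sanity check of that factorisation (a minimal sketch based on the standard Taxi-v2 layout):
In [ ]:
assert 5 * 5 * 5 * 4 == env.observation_space.n # 25 taxi positions * 5 passenger locations * 4 destinations = 500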
In [39]:
env.action_space.n # number of possible actions
# print(env.action_space)
# 0 = down
# 1 = up
# 2 = right
# 3 = left
# 4 = pickup
# 5 = drop-off
Out[39]:
In [40]:
env.render()
# In this environment the yellow square represents the taxi, the pipe character ("|") represents a wall, the blue letter marks the pick-up location, and the purple letter marks the drop-off location. The taxi turns green when it has a passenger aboard.
In [57]:
env.env.s = 114 # manually set the underlying env to the encoded state 114
env.render()
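State 114 is just an encoded integer. The underlying TaxiEnv (reached through env.env, as above) provides a decode helper that unpacks it into taxi row, taxi column, passenger-location index and destination index. A minimal sketch, assuming gym's internal decode method is available in this version:
In [ ]:
# unpack the encoded state into its components
taxi_row, taxi_col, passenger_loc, destination = env.env.decode(114)
print(taxi_row, taxi_col, passenger_loc, destination)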
In [60]:
state, reward, done, info = env.step(1) # take action 1 (up)
In [59]:
env.render()
The environment is considered solved when you successfully pick up a passenger and drop them off at their desired location. Taxi-v2's reward scheme is -1 per step, -10 for an illegal pickup or drop-off, and +20 for a successful drop-off, so upon solving it you receive a reward of 20 and done equals True.
In [63]:
def taxiRandomSearch(env):
    """ Randomly pick an action and keep guessing until the env is solved
    :param env: Gym Taxi-v2 env
    :return: number of steps required to solve the Gym Taxi-v2 env
    """
    state = env.reset()
    stepCounter = 0
    reward = None
    while reward != 20:  # reward 20 means that the env has been solved
        state, reward, done, info = env.step(env.action_space.sample())
        stepCounter += 1
    return stepCounter
In [75]:
print(taxiRandomSearch(env))
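The step count varies wildly from run to run, so averaging over a few runs gives a better feel for how inefficient random search is. A small sketch:
In [ ]:
# average the number of steps over ten random-search episodes
print(np.mean([taxiRandomSearch(env) for _ in range(10)]))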
In short, the problem is solved many times in a row (each run is called an episode), and after every step the Q-table (the agent's memory) is updated so that later episodes can be solved more efficiently.
In [83]:
Q = np.zeros([env.observation_space.n, env.action_space.n]) # Q-table (memory): stores the estimated value of taking each action in each state
G = 0 # accumulated reward for each episode
alpha = 0.618 # learning rate
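The update applied in the training loop below is the standard Q-learning (Bellman) update, written here with the discount factor left at 1, exactly as in the code:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \max_{a'} Q(s', a') - Q(s, a) \right]

where s is the current state, a the chosen action, r the reward received, and s' the resulting state.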
In [84]:
Q[114]
Out[84]:
In [85]:
def taxiQlearning(env):
    """ basic Q-learning algo
    :param env: Gym Taxi-v2 env
    :return: None
    """
    for episode in range(1, 1001):
        stepCounter = 0
        done = False
        G, reward = 0, 0
        state = env.reset()
        while not done:
            action = np.argmax(Q[state])  # 1: find action with highest value/reward at the given state
            state2, reward, done, info = env.step(action)  # 2: take that 'best action' and store the future state
            Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])  # 3: update the q-value using the Bellman equation
            G += reward
            state = state2
            stepCounter += 1
        if episode % 50 == 0:
            print('Episode {} Total Reward: {}'.format(episode, G))
            print('Steps required for this episode: %i' % stepCounter)
In [86]:
taxiQlearning(env)
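After training, the Q-table can be used as a greedy policy: from any reset state, repeatedly take the action with the highest Q-value. A minimal evaluation sketch (the helper name taxiGreedyRun is not from the original notebook; Taxi-v2's built-in 200-step limit guarantees the loop ends even if the policy is imperfect):
In [ ]:
def taxiGreedyRun(env):
    """ Follow the learned greedy policy for one episode and return the step count """
    state = env.reset()
    done = False
    steps = 0
    while not done:
        state, reward, done, info = env.step(np.argmax(Q[state]))
        steps += 1
    return steps

print(taxiGreedyRun(env))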
In [88]:
print(Q[14])
In [ ]: