In [2]:
import gym
import numpy as np

Gym Taxi-v2 environment

Let's have a look at the problem and how the env has been set up.


In [3]:
env = gym.make("Taxi-v2")


[2017-08-04 13:06:34,933] Making new env: Taxi-v2

In [37]:
env.reset()   # reset the env and return the initial state


Out[37]:
209

In [38]:
env.observation_space.n   # number of possible states (500 = 25 taxi positions x 5 passenger locations x 4 destinations)


Out[38]:
500

In [39]:
env.action_space.n   # number of possible actions
# print(env.action_space)
# 0 = south (down)
# 1 = north (up)
# 2 = east (right)
# 3 = west (left)
# 4 = pickup
# 5 = drop-off


Out[39]:
6

In [40]:
env.render()
# In this environment the yellow square represents the taxi, "|" represents a wall,
# the blue letter is the pick-up location, and the purple letter is the drop-off location.
# The taxi turns green when it has a passenger aboard.


+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+


In [57]:
env.env.s = 114   # manually set the underlying env state
env.render()


+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
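
To see what state 114 actually encodes, a quick check could use the decode helper on the wrapped env (this assumes the underlying env is gym's TaxiEnv, which exposes decode()):

taxi_row, taxi_col, passenger_idx, destination_idx = env.env.decode(114)   # unpack the state into its components
print(taxi_row, taxi_col, passenger_idx, destination_idx)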


In [60]:
state, reward, done, info = env.step(1)   # take action 1 (move north / up)

In [59]:
env.render()


+---------+
|R: | : :G|
| : : : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+
  (North)

The environment is considered solved when you successfully pick up a passenger and drop them off at their desired location. Upon doing this, you will receive a reward of 20 and done will equal True.

A first naive solution

At every step, randomly choose one of the six available actions.

A core part of evaluating any agent's performance is to compare it against a completely random agent.


In [63]:
def taxiRandomSearch(env):
    """ Randomly pick an action and keep guessing until the env is solved
    :param env: Gym Taxi-v2 env
    :return: number of steps required to solve the Gym Taxi-v2 env
    """
    state = env.reset()
    stepCounter = 0
    reward = None
    while reward != 20:  # reward 20 means that the env has been solved
        state, reward, done, info = env.step(env.action_space.sample())
        stepCounter += 1
    return stepCounter

In [75]:
print(taxiRandomSearch(env))


2430
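
A single run is noisy, so as a rough baseline one might average the random agent over a number of episodes (a small sketch; the episode count of 100 is an arbitrary choice):

episodes = 100   # arbitrary number of trial episodes for the baseline
steps = [taxiRandomSearch(env) for _ in range(episodes)]
print("average steps for the random agent: %.1f" % (sum(steps) / len(steps)))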

Let's build in some memory to remember actions and their associated rewards

  • the memory is a Q action-value table: a NumPy array of shape 500 x 6 (number of states x number of actions)

In short, the problem is solved multiple times (each time called an episode) and the Q-table (memory) is updated to improve the algorithm's efficiency and performance.
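
For reference, the update applied at every step in the cell below is the standard tabular Q-learning rule (written here with an implicit discount factor of 1, as in the code):

    Q(state, action) <- Q(state, action) + alpha * (reward + max_a' Q(next_state, a') - Q(state, action))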


In [83]:
Q = np.zeros([env.observation_space.n, env.action_space.n])   # memory: estimated value of every (state, action) pair, initialised to zero
G = 0   # accumulated reward for each episode
alpha = 0.618   # learning rate

In [84]:
Q[114]


Out[84]:
array([ 0.,  0.,  0.,  0.,  0.,  0.])

In [85]:
def taxiQlearning(env):
    """ Basic Q-learning algorithm
    :param env: Gym Taxi-v2 env
    :return: None
    """
    for episode in range(1, 1001):
        stepCounter = 0
        done = False
        G, reward = 0, 0
        state = env.reset()
        while not done:
            action = np.argmax(Q[state])  # 1: pick the action with the highest value for the current state (purely greedy, no exploration)
            state2, reward, done, info = env.step(action)  # 2: take that action and observe the next state and reward
            Q[state, action] += alpha * (reward + np.max(Q[state2]) - Q[state, action])  # 3: Q-learning update (no discount factor, terminal flag not used)
            G += reward
            state = state2
            stepCounter += 1
        if episode % 50 == 0:
            print('Episode {} Total Reward: {}'.format(episode, G))
            print('Steps required for this episode: %i' % stepCounter)

In [86]:
taxiQlearning(env)


Episode 50 Total Reward: 5
Steps required for this episode: 16
Episode 100 Total Reward: -40
Steps required for this episode: 61
Episode 150 Total Reward: -47
Steps required for this episode: 68
Episode 200 Total Reward: 13
Steps required for this episode: 8
Episode 250 Total Reward: 14
Steps required for this episode: 7
Episode 300 Total Reward: 7
Steps required for this episode: 14
Episode 350 Total Reward: 6
Steps required for this episode: 15
Episode 400 Total Reward: 9
Steps required for this episode: 12
Episode 450 Total Reward: 9
Steps required for this episode: 12
Episode 500 Total Reward: 7
Steps required for this episode: 14
Episode 550 Total Reward: 9
Steps required for this episode: 12
Episode 600 Total Reward: 10
Steps required for this episode: 11
Episode 650 Total Reward: 4
Steps required for this episode: 17
Episode 700 Total Reward: 9
Steps required for this episode: 12
Episode 750 Total Reward: 9
Steps required for this episode: 12
Episode 800 Total Reward: 8
Steps required for this episode: 13
Episode 850 Total Reward: 10
Steps required for this episode: 11
Episode 900 Total Reward: 9
Steps required for this episode: 12
Episode 950 Total Reward: 6
Steps required for this episode: 15
Episode 1000 Total Reward: 10
Steps required for this episode: 11

In [88]:
print(Q[14])   # learned action values for one example state


[ 787.81767312   -9.27         -8.74731466   -9.27        -12.36        -12.36      ]
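
With the table trained, here is a minimal sketch of how the learned Q-table could be used to run one greedy episode (assuming Q and env are as defined above; the 200-step cap is an arbitrary safeguard in case the greedy policy gets stuck in a rarely visited state):

state = env.reset()
done = False
steps, total_reward = 0, 0
while not done and steps < 200:
    action = np.argmax(Q[state])   # always follow the learned greedy policy
    state, reward, done, info = env.step(action)
    total_reward += reward
    steps += 1
print('Greedy episode finished in %i steps with total reward %i' % (steps, total_reward))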

In [ ]: