Berater Environment

  • The observation space consists of a single value: self.observation_space = spaces.Discrete(1). I assume it is meant to represent the current position on the graph (S, A, B, C); for that I would use spaces.Discrete(4).

  • With this observation space the agent only "knows" its position, but nothing else about the rest of the "board". I would give the agent the whole board here (a short sketch of both variants follows after this list).

  • The default "mlp" policy is used here. As far as I can tell, that is a fully connected 2-layer NN with 64 neurons per layer.
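
For illustration, a minimal sketch of the two observation variants discussed above: position only vs. the "whole board" observation used further down in this notebook (the bounds are simply the ones from the BeraterEnv below).

from gym import spaces
import numpy

# Variant 1: position only - one of the four nodes S, A, B, C
observation_space = spaces.Discrete(4)

# Variant 2: "whole board" - current position plus one (customer reward - edge cost)
# value per edge, exactly as in BeraterEnv.getObservation below
observation_space = spaces.Box(low=numpy.array([0, -1000, -1000, -1000, -1000, -1000, -1000]),
                               high=numpy.array([3, 1000, 1000, 1000, 1000, 1000, 1000]),
                               dtype=numpy.float32)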

I tried playing this through with the "whole board" observation in the notebook; attached is my updated version. Training seems to be done after <= 60k steps and consistently reaches an average total reward of ~0.73. When I calculate it by hand I arrive at a similar value (see below). Please take a look.
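
For reference, the hand calculation behind the ~0.73: the best tour visits every customer exactly once, e.g. S -> A -> B -> C -> S, collects 3 x 1000 customer reward and pays 100 + 250 + 250 + 200 = 800 in edge costs, everything normalized by the optimum of 3000.

customer_reward = 3 * 1000               # A, B, C
tour_cost = 100 + 250 + 250 + 200        # S->A, A->B, B->C, C->S
optimum = 3000
print((customer_reward - tour_cost) / optimum)   # ~0.73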

Currently the costs on the edges of the graph are fixed. It could be interesting to re-draw them randomly for each episode. After training, the agent might then be able to find a good solution for "arbitrary" costs; a rough sketch follows below.
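
A minimal sketch of what that could look like, assuming the costs are re-drawn in reset() and kept symmetric per edge (the helper name randomize_map and the cost range are placeholders, not part of the current notebook):

import random

def randomize_map(self):
    # draw a new symmetric cost for every edge at the start of an episode
    edges = [('S', 'A'), ('S', 'B'), ('S', 'C'), ('A', 'B'), ('A', 'C'), ('B', 'C')]
    self.map = {node: [] for node in ['S', 'A', 'B', 'C']}
    for a, b in edges:
        cost = random.randint(100, 400)
        self.map[a].append((b, cost))
        self.map[b].append((a, cost))

Since the edge values are already part of the Box observation, the agent would at least get to see the new costs; the Box bounds just have to cover the chosen cost range.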

I also tried out several policy architectures, all of the "mlp" type with 1-5 layers and 100-4500 neurons per layer. In the end I picked the one that was "best" and "simplest": 1 layer, 500 neurons with tanh as the activation function (see the call sketched below).
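
As far as I can tell, baselines forwards extra keyword arguments of ppo2.learn to the network builder, so the 1-layer / 500-neuron / tanh variant would roughly look like the call below (treat the parameter names num_layers, num_hidden and activation as my reading of the baselines "mlp" builder, not something verified here):

import tensorflow as tf

model = ppo2.learn(network='mlp', env=wrapped_env,
                   num_layers=1, num_hidden=500, activation=tf.tanh,
                   total_timesteps=60000)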

Open Questions

  • Why does the observation space have exactly this format?
  • How do we make this more general?
    • Do we train just one system and apply it to variations?
    • Do we train for each setting?

Installation (required for Colab)


In [0]:
# !pip install -e git+https://github.com/openai/baselines#egg=berater

Important for Colab: comment out the line above and restart the runtime after installation.


In [0]:
cnt = 0  # counter used to build a fresh environment id each time the registration cell is re-run

In [0]:
import numpy
import gym
from gym.utils import seeding
from gym import spaces

def state_name_to_int(state):
    state_name_map = {
        'S': 0,
        'A': 1,
        'B': 2,
        'C': 3,
    }
    return state_name_map[state]

def int_to_state_name(state_as_int):
    state_map = {
        0: 'S',
        1: 'A',
        2: 'B',
        3: 'C'
    }
    return state_map[state_as_int]
    
class BeraterEnv(gym.Env):
    """
    The Berater Problem

    Actions: 
    There are 3 discrete deterministic actions:
    - 0: First Direction
    - 1: Second Direction
    - 2: Third Direction / Go home
    """
    metadata = {'render.modes': ['ansi']}
    
    num_envs = 1
    showStep = False
    showDone = True
    showRender = False
    envEpisodeModulo = 100

    def __init__(self):
        self.map = {
            'S': [('A', 100), ('B', 400), ('C', 200 )],
            'A': [('B', 250), ('C', 400), ('S', 100 )],
            'B': [('A', 250), ('C', 250), ('S', 400 )],
            'C': [('A', 400), ('B', 250), ('S', 200 )]
        }
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=numpy.array([0,-1000,-1000,-1000,-1000,-1000,-1000]),
                                             high=numpy.array([3,1000,1000,1000,1000,1000,1000]),
                                             dtype=numpy.float32)


        self.totalReward = 0
        self.stepCount = 0
        self.isDone = False

        self.envReward = 0
        self.envEpisodeCount = 0
        self.envStepCount = 0

        self.reset()
        self.optimum = self.calculate_customers_reward()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, actionArg):
        paths = self.map[self.state]
        action = actionArg
        destination, cost = paths[action]
        lastState = self.state
        lastObState = state_name_to_int(lastState)
        customerReward = self.customer_reward[destination]

        info = {"from": self.state, "to": destination}

        self.state = destination
        reward = (-cost + self.customer_reward[destination]) / self.optimum
        self.customer_visited(destination)
        done = destination == 'S' and self.all_customers_visited()

        stateAsInt = state_name_to_int(self.state)
        self.totalReward += reward
        self.stepCount += 1
        self.envReward += reward
        self.envStepCount += 1

        if self.showStep:
            print( "Episode: " + ("%4.0f  " % self.envEpisodeCount) + 
                   " Step: " + ("%4.0f  " % self.stepCount) + 
                   #lastState + ':' + str(lastObState) + ' --' + str(action) + '-> ' + self.state + ':' + str(stateAsInt) +
                   lastState + ' --' + str(action) + '-> ' + self.state + 
                   ' R=' + ("% 2.2f" % reward) + ' totalR=' + ("% 3.2f" % self.totalReward) + 
                   ' cost=' + ("%4.0f" % cost) + ' customerR=' + ("%4.0f" % customerReward) + ' optimum=' + ("%4.0f" % self.optimum)      
                   )

        if done and not self.isDone:
            self.envEpisodeCount += 1
            if BeraterEnv.showDone or (self.envEpisodeCount%BeraterEnv.envEpisodeModulo) == 0:
                episodes = BeraterEnv.envEpisodeModulo
                if (self.envEpisodeCount % BeraterEnv.envEpisodeModulo != 0):
                    episodes = self.envEpisodeCount % BeraterEnv.envEpisodeModulo
                print( "Done: " + 
                        ("episodes=%6.0f  " % self.envEpisodeCount) + 
                        ("avgSteps=%6.2f  " % (self.envStepCount/episodes)) + 
                        ("avgTotalReward=% 3.2f" % (self.envReward/episodes) )
                        )
                if (self.envEpisodeCount%BeraterEnv.envEpisodeModulo) == 0:
                    self.envReward = 0
                    self.envStepCount = 0

        self.isDone = done
        observation = self.getObservation(stateAsInt)

        return observation, reward, done, info

    def getObservation(self, position):
        result = numpy.array([ position, 
                               self.getEdgeObservation('S','A'),
                               self.getEdgeObservation('S','B'),
                               self.getEdgeObservation('S','C'),
                               self.getEdgeObservation('A','B'),
                               self.getEdgeObservation('A','C'),
                               self.getEdgeObservation('B','C'),
                              ],
                             dtype=numpy.float32)
        return result

    def getEdgeObservation(self, source, target):
        reward = self.customer_reward[target] 
        cost = self.getCost(source,target)
        result = reward - cost

        return result

    def getCost(self, source, target):
        # look up the cost of travelling from source to target
        for destination, cost in self.map[source]:
            if destination == target:
                return cost

    def customer_visited(self, customer):
        self.customer_reward[customer] = 0

    def all_customers_visited(self):
        return self.calculate_customers_reward() == 0

    def calculate_customers_reward(self):
        # remaining (unvisited) customer reward; 0 once all customers have been visited
        return sum(self.customer_reward.values())

    def reset(self):
        self.totalReward = 0
        self.stepCount = 0
        self.isDone = False
        reward_per_customer = 1000
        self.customer_reward = {
            'S': 0,
            'A': reward_per_customer,
            'B': reward_per_customer,
            'C': reward_per_customer,
        }

        self.state = 'S'
        # return an initial observation that matches self.observation_space
        return self.getObservation(state_name_to_int(self.state))

    def render(self, mode='human'):
        if BeraterEnv.showRender:
            print( ("steps=%4.0f  " % self.stepCount) + ' totalR=' + ("% 3.2f" % self.totalReward) + ' done=' + str(self.isDone))

Register Environment


In [4]:
from gym.envs.registration import register

cnt += 1
id = "Berater-v{}".format(cnt)
register(
    id=id,
    entry_point=BeraterEnv
)   

print(id)


Berater-v1

Try out Environment


In [5]:
BeraterEnv.showStep = True
BeraterEnv.showDone = True

env = gym.make(id)
observation = env.reset()
print(env)

for t in range(1000):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        env.render()
        break
env.close()


<BeraterEnv<Berater-v1>>
Episode:    0   Step:    1  S --0-> A R= 0.30 totalR= 0.30 cost= 100 customerR=1000 optimum=3000
Episode:    0   Step:    2  A --1-> C R= 0.20 totalR= 0.50 cost= 400 customerR=1000 optimum=3000
Episode:    0   Step:    3  C --0-> A R=-0.13 totalR= 0.37 cost= 400 customerR=   0 optimum=3000
Episode:    0   Step:    4  A --1-> C R=-0.13 totalR= 0.23 cost= 400 customerR=   0 optimum=3000
Episode:    0   Step:    5  C --1-> B R= 0.25 totalR= 0.48 cost= 250 customerR=1000 optimum=3000
Episode:    0   Step:    6  B --2-> S R=-0.13 totalR= 0.35 cost= 400 customerR=   0 optimum=3000
Done: episodes=     1  avgSteps=  6.00  avgTotalReward= 0.35

Train model


In [6]:
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

BeraterEnv.showStep = False
BeraterEnv.showDone = False

wrapped_env = DummyVecEnv([lambda: gym.make(id)])
model = ppo2.learn(network='mlp', env=wrapped_env, total_timesteps=60000)


Logging to /tmp/openai-2018-11-26-10-33-07-701445
Done: episodes=   100  avgSteps=  8.18  avgTotalReward= 0.29
Done: episodes=   200  avgSteps=  8.66  avgTotalReward= 0.22
-------------------------------------
| approxkl           | 0.0018272153 |
| clipfrac           | 0.020507812  |
| eplenmean          | nan          |
| eprewmean          | nan          |
| explained_variance | -0.4         |
| fps                | 430          |
| nupdates           | 1            |
| policy_entropy     | 1.0966616    |
| policy_loss        | -0.014113387 |
| serial_timesteps   | 2048         |
| time_elapsed       | 4.75         |
| total_timesteps    | 2048         |
| value_loss         | 0.054653596  |
-------------------------------------
Done: episodes=   300  avgSteps=  8.43  avgTotalReward= 0.25
Done: episodes=   400  avgSteps=  7.89  avgTotalReward= 0.29
Done: episodes=   500  avgSteps=  7.78  avgTotalReward= 0.31
Done: episodes=   600  avgSteps=  6.82  avgTotalReward= 0.39
Done: episodes=   700  avgSteps=  6.90  avgTotalReward= 0.40
Done: episodes=   800  avgSteps=  7.13  avgTotalReward= 0.38
Done: episodes=   900  avgSteps=  6.00  avgTotalReward= 0.46
Done: episodes=  1000  avgSteps=  6.49  avgTotalReward= 0.43
Done: episodes=  1100  avgSteps=  6.29  avgTotalReward= 0.44
Done: episodes=  1200  avgSteps=  6.24  avgTotalReward= 0.45
Done: episodes=  1300  avgSteps=  6.23  avgTotalReward= 0.47
Done: episodes=  1400  avgSteps=  5.91  avgTotalReward= 0.48
Done: episodes=  1500  avgSteps=  6.09  avgTotalReward= 0.45
Done: episodes=  1600  avgSteps=  5.92  avgTotalReward= 0.47
Done: episodes=  1700  avgSteps=  5.95  avgTotalReward= 0.46
Done: episodes=  1800  avgSteps=  6.05  avgTotalReward= 0.46
Done: episodes=  1900  avgSteps=  5.20  avgTotalReward= 0.54
Done: episodes=  2000  avgSteps=  5.39  avgTotalReward= 0.52
Done: episodes=  2100  avgSteps=  5.23  avgTotalReward= 0.53
Done: episodes=  2200  avgSteps=  5.42  avgTotalReward= 0.52
Done: episodes=  2300  avgSteps=  5.07  avgTotalReward= 0.54
Done: episodes=  2400  avgSteps=  5.18  avgTotalReward= 0.54
Done: episodes=  2500  avgSteps=  5.38  avgTotalReward= 0.52
Done: episodes=  2600  avgSteps=  5.30  avgTotalReward= 0.53
Done: episodes=  2700  avgSteps=  5.01  avgTotalReward= 0.55
Done: episodes=  2800  avgSteps=  4.91  avgTotalReward= 0.57
Done: episodes=  2900  avgSteps=  5.07  avgTotalReward= 0.54
Done: episodes=  3000  avgSteps=  4.96  avgTotalReward= 0.55
Done: episodes=  3100  avgSteps=  4.87  avgTotalReward= 0.57
Done: episodes=  3200  avgSteps=  5.05  avgTotalReward= 0.53
Done: episodes=  3300  avgSteps=  4.93  avgTotalReward= 0.55
Done: episodes=  3400  avgSteps=  4.84  avgTotalReward= 0.58
-------------------------------------
| approxkl           | 0.0015142018 |
| clipfrac           | 0.013671875  |
| eplenmean          | nan          |
| eprewmean          | nan          |
| explained_variance | 0.755        |
| fps                | 504          |
| nupdates           | 10           |
| policy_entropy     | 0.7535325    |
| policy_loss        | -0.014194089 |
| serial_timesteps   | 20480        |
| time_elapsed       | 40.7         |
| total_timesteps    | 20480        |
| value_loss         | 0.00904465   |
-------------------------------------
Done: episodes=  3500  avgSteps=  4.80  avgTotalReward= 0.59
Done: episodes=  3600  avgSteps=  4.75  avgTotalReward= 0.59
Done: episodes=  3700  avgSteps=  4.60  avgTotalReward= 0.59
Done: episodes=  3800  avgSteps=  4.63  avgTotalReward= 0.57
Done: episodes=  3900  avgSteps=  4.67  avgTotalReward= 0.59
Done: episodes=  4000  avgSteps=  4.60  avgTotalReward= 0.61
Done: episodes=  4100  avgSteps=  4.56  avgTotalReward= 0.60
Done: episodes=  4200  avgSteps=  4.70  avgTotalReward= 0.59
Done: episodes=  4300  avgSteps=  4.40  avgTotalReward= 0.63
Done: episodes=  4400  avgSteps=  4.43  avgTotalReward= 0.64
Done: episodes=  4500  avgSteps=  4.40  avgTotalReward= 0.62
Done: episodes=  4600  avgSteps=  4.49  avgTotalReward= 0.61
Done: episodes=  4700  avgSteps=  4.37  avgTotalReward= 0.65
Done: episodes=  4800  avgSteps=  4.40  avgTotalReward= 0.63
Done: episodes=  4900  avgSteps=  4.32  avgTotalReward= 0.64
Done: episodes=  5000  avgSteps=  4.48  avgTotalReward= 0.63
Done: episodes=  5100  avgSteps=  4.42  avgTotalReward= 0.63
Done: episodes=  5200  avgSteps=  4.32  avgTotalReward= 0.63
Done: episodes=  5300  avgSteps=  4.32  avgTotalReward= 0.64
Done: episodes=  5400  avgSteps=  4.35  avgTotalReward= 0.65
Done: episodes=  5500  avgSteps=  4.23  avgTotalReward= 0.66
Done: episodes=  5600  avgSteps=  4.26  avgTotalReward= 0.64
Done: episodes=  5700  avgSteps=  4.32  avgTotalReward= 0.63
Done: episodes=  5800  avgSteps=  4.22  avgTotalReward= 0.66
Done: episodes=  5900  avgSteps=  4.16  avgTotalReward= 0.67
Done: episodes=  6000  avgSteps=  4.13  avgTotalReward= 0.68
Done: episodes=  6100  avgSteps=  4.26  avgTotalReward= 0.66
Done: episodes=  6200  avgSteps=  4.24  avgTotalReward= 0.66
Done: episodes=  6300  avgSteps=  4.14  avgTotalReward= 0.68
Done: episodes=  6400  avgSteps=  4.17  avgTotalReward= 0.68
Done: episodes=  6500  avgSteps=  4.25  avgTotalReward= 0.67
Done: episodes=  6600  avgSteps=  4.10  avgTotalReward= 0.68
Done: episodes=  6700  avgSteps=  4.15  avgTotalReward= 0.68
Done: episodes=  6800  avgSteps=  4.17  avgTotalReward= 0.69
Done: episodes=  6900  avgSteps=  4.12  avgTotalReward= 0.69
Done: episodes=  7000  avgSteps=  4.17  avgTotalReward= 0.68
Done: episodes=  7100  avgSteps=  4.18  avgTotalReward= 0.69
Done: episodes=  7200  avgSteps=  4.16  avgTotalReward= 0.69
Done: episodes=  7300  avgSteps=  4.11  avgTotalReward= 0.70
Done: episodes=  7400  avgSteps=  4.18  avgTotalReward= 0.69
Done: episodes=  7500  avgSteps=  4.13  avgTotalReward= 0.69
Done: episodes=  7600  avgSteps=  4.16  avgTotalReward= 0.68
Done: episodes=  7700  avgSteps=  4.10  avgTotalReward= 0.70
Done: episodes=  7800  avgSteps=  4.08  avgTotalReward= 0.71
Done: episodes=  7900  avgSteps=  4.10  avgTotalReward= 0.70
Done: episodes=  8000  avgSteps=  4.07  avgTotalReward= 0.70
Done: episodes=  8100  avgSteps=  4.07  avgTotalReward= 0.71
-------------------------------------
| approxkl           | 0.0010771031 |
| clipfrac           | 0.013793945  |
| eplenmean          | nan          |
| eprewmean          | nan          |
| explained_variance | 0.975        |
| fps                | 511          |
| nupdates           | 20           |
| policy_entropy     | 0.34031704   |
| policy_loss        | -0.017940775 |
| serial_timesteps   | 40960        |
| time_elapsed       | 80.6         |
| total_timesteps    | 40960        |
| value_loss         | 0.0010044485 |
-------------------------------------
Done: episodes=  8200  avgSteps=  4.07  avgTotalReward= 0.71
Done: episodes=  8300  avgSteps=  4.04  avgTotalReward= 0.71
Done: episodes=  8400  avgSteps=  4.07  avgTotalReward= 0.72
Done: episodes=  8500  avgSteps=  4.08  avgTotalReward= 0.71
Done: episodes=  8600  avgSteps=  4.08  avgTotalReward= 0.71
Done: episodes=  8700  avgSteps=  4.01  avgTotalReward= 0.72
Done: episodes=  8800  avgSteps=  4.10  avgTotalReward= 0.71
Done: episodes=  8900  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  9000  avgSteps=  4.08  avgTotalReward= 0.72
Done: episodes=  9100  avgSteps=  4.02  avgTotalReward= 0.72
Done: episodes=  9200  avgSteps=  4.09  avgTotalReward= 0.71
Done: episodes=  9300  avgSteps=  4.01  avgTotalReward= 0.72
Done: episodes=  9400  avgSteps=  4.04  avgTotalReward= 0.72
Done: episodes=  9500  avgSteps=  4.02  avgTotalReward= 0.72
Done: episodes=  9600  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  9700  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes=  9800  avgSteps=  4.06  avgTotalReward= 0.72
Done: episodes=  9900  avgSteps=  4.07  avgTotalReward= 0.72
Done: episodes= 10000  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10100  avgSteps=  4.04  avgTotalReward= 0.73
Done: episodes= 10200  avgSteps=  4.04  avgTotalReward= 0.72
Done: episodes= 10300  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10400  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 10500  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 10600  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 10700  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 10800  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 10900  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11000  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 11100  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 11200  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes= 11300  avgSteps=  4.06  avgTotalReward= 0.73
Done: episodes= 11400  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11500  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 11600  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 11700  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 11800  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 11900  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12000  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 12100  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 12200  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12300  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 12400  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 12500  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 12600  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 12700  avgSteps=  4.00  avgTotalReward= 0.73

Enjoy model


In [8]:
import numpy as np 

observation = wrapped_env.reset()
state = np.zeros((1, 2*128))  # recurrent state, only relevant for LSTM policies; the mlp policy ignores S and M
dones = np.zeros((1))

BeraterEnv.showStep = True
BeraterEnv.showDone = False

for t in range(1000):
    actions, _, state, _ = model.step(observation, S=state, M=dones)
    observation, reward, done, info = wrapped_env.step(actions)
    if done[0]:  # the vectorized env returns an array of done flags
        print("Episode finished after {} timesteps".format(t+1))
        break
wrapped_env.close()


Episode: 12732   Step:    1  S --2-> C R= 0.27 totalR= 0.27 cost= 200 customerR=1000 optimum=3000
Episode: 12732   Step:    2  C --1-> B R= 0.25 totalR= 0.52 cost= 250 customerR=1000 optimum=3000
Episode: 12732   Step:    3  B --0-> A R= 0.25 totalR= 0.77 cost= 250 customerR=1000 optimum=3000
Episode: 12732   Step:    4  A --2-> S R=-0.03 totalR= 0.73 cost= 100 customerR=   0 optimum=3000
Episode finished after 4 timesteps

In [0]: