Berater Environment v2

Changes from v1

  1. change of observation space
    • v1 used just one discrete value, the position on the graph: spaces.Discrete(1)
    • v2 gives the agent the complete graph including traversal costs to enable learning: spaces.Box (see the sketch below)
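
For illustration, here is a sketch of what one such Box observation contains, based on the graph and customer rewards hard-coded further below: the current position as an integer, followed by one value per edge, namely the still-open customer reward at the target minus the traversal cost.

import numpy

# Sketch of the observation layout produced by getObservation() below.
# Entry 0: current position (S=0, A=1, B=2, C=3)
# Entries 1-6: customer_reward[target] - cost for the edges S-A, S-B, S-C, A-B, A-C, B-C
example_observation = numpy.array(
    [0,            # at 'S'
     1000 - 100,   # S-A
     1000 - 400,   # S-B
     1000 - 200,   # S-C
     1000 - 250,   # A-B
     1000 - 400,   # A-C
     1000 - 250],  # B-C
    dtype=numpy.float32)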

Next Steps

  1. choose the traversal costs randomly for each episode
    • aim: the agent will (hopefully) be able to work with any costs (see the sketch below for one possible approach)
  2. train on a different graph with each episode
    • aim: the agent can work on any graph
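
One possible way to implement the first step, sketched here with an assumed helper make_random_map that is not part of the environment below, is to redraw every edge cost before each episode while keeping the graph symmetric; reset() could then replace self.map with such a freshly drawn graph.

import random

# Hypothetical sketch: redraw edge costs for each episode (not implemented in the code below)
def make_random_map():
    edges = [('S', 'A'), ('S', 'B'), ('S', 'C'), ('A', 'B'), ('A', 'C'), ('B', 'C')]
    graph = {node: [] for node in ['S', 'A', 'B', 'C']}
    for a, b in edges:
        cost = random.choice([100, 200, 250, 400])
        graph[a].append((b, cost))
        graph[b].append((a, cost))
    return graph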

Installation (required for Colab)


In [1]:
!pip install -e git+https://github.com/openai/baselines#egg=baselines


Obtaining berater from git+https://github.com/openai/baselines#egg=berater
  Cloning https://github.com/openai/baselines to ./src/berater
  Running setup.py (path:/content/src/berater/setup.py) egg_info for package berater produced metadata for project name baselines. Fix your #egg=berater fragments.
Collecting gym (from baselines)
  Downloading https://files.pythonhosted.org/packages/d4/22/4ff09745ade385ffe707fb5f053548f0f6a6e7d5e98a2b9d6c07f5b931a7/gym-0.10.9.tar.gz (1.5MB)
    100% |████████████████████████████████| 1.5MB 16.6MB/s 
Requirement already satisfied: scipy in /usr/local/lib/python3.6/dist-packages (from baselines) (1.1.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from baselines) (4.28.1)
Requirement already satisfied: joblib in /usr/local/lib/python3.6/dist-packages (from baselines) (0.13.0)
Requirement already satisfied: dill in /usr/local/lib/python3.6/dist-packages (from baselines) (0.2.8.2)
Collecting progressbar2 (from baselines)
  Downloading https://files.pythonhosted.org/packages/4f/6f/acb2dd76f2c77527584bd3a4c2509782bb35c481c610521fc3656de5a9e0/progressbar2-3.38.0-py2.py3-none-any.whl
Collecting cloudpickle (from baselines)
  Downloading https://files.pythonhosted.org/packages/fc/87/7b7ef3038b4783911e3fdecb5c566e3a817ce3e890e164fc174c088edb1e/cloudpickle-0.6.1-py2.py3-none-any.whl
Collecting click (from baselines)
  Downloading https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl (81kB)
    100% |████████████████████████████████| 81kB 25.4MB/s 
Requirement already satisfied: opencv-python in /usr/local/lib/python3.6/dist-packages (from baselines) (3.4.4.19)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from gym->baselines) (1.14.6)
Requirement already satisfied: requests>=2.0 in /usr/local/lib/python3.6/dist-packages (from gym->baselines) (2.18.4)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from gym->baselines) (1.11.0)
Collecting pyglet>=1.2.0 (from gym->baselines)
  Downloading https://files.pythonhosted.org/packages/1c/fc/dad5eaaab68f0c21e2f906a94ddb98175662cc5a654eee404d59554ce0fa/pyglet-1.3.2-py2.py3-none-any.whl (1.0MB)
    100% |████████████████████████████████| 1.0MB 18.7MB/s 
Collecting python-utils>=2.3.0 (from progressbar2->baselines)
  Downloading https://files.pythonhosted.org/packages/eb/a0/19119d8b7c05be49baf6c593f11c432d571b70d805f2fe94c0585e55e4c8/python_utils-2.3.0-py2.py3-none-any.whl
Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym->baselines) (2.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym->baselines) (2018.10.15)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym->baselines) (1.22)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.0->gym->baselines) (3.0.4)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from pyglet>=1.2.0->gym->baselines) (0.16.0)
Building wheels for collected packages: gym
  Running setup.py bdist_wheel for gym ... - \ | / done
  Stored in directory: /root/.cache/pip/wheels/6c/3a/0e/b86dee98876bb56cdb482cc1f72201035e46d1baf69d10d028
Successfully built gym
Installing collected packages: pyglet, gym, python-utils, progressbar2, cloudpickle, click, baselines
  Running setup.py develop for baselines
Successfully installed baselines click-7.0 cloudpickle-0.6.1 gym-0.10.9 progressbar2-3.38.0 pyglet-1.3.2 python-utils-2.3.0

Important for Colab: comment out the pip install line above and restart the runtime after the installation has finished.


In [0]:
# counter for registering a fresh environment id on every run
cnt = 0

In [0]:
import numpy
import gym
from gym.utils import seeding
from gym import spaces

def state_name_to_int(state):
    state_name_map = {
        'S': 0,
        'A': 1,
        'B': 2,
        'C': 3,
    }
    return state_name_map[state]

def int_to_state_name(state_as_int):
    state_map = {
        0: 'S',
        1: 'A',
        2: 'B',
        3: 'C'
    }
    return state_map[state_as_int]
    
class BeraterEnv(gym.Env):
    """
    The Berater Problem

    Actions: 
    There are 3 discrete deterministic actions:
    - 0: First Direction
    - 1: Second Direction
    - 2: Third Direction / Go home
    """
    metadata = {'render.modes': ['ansi']}
    
    num_envs = 1
    showStep = False
    showDone = True
    showRender = False
    envEpisodeModulo = 100

    def __init__(self):
        self.map = {
            'S': [('A', 100), ('B', 400), ('C', 200 )],
            'A': [('B', 250), ('C', 400), ('S', 100 )],
            'B': [('A', 250), ('C', 250), ('S', 400 )],
            'C': [('A', 400), ('B', 250), ('S', 200 )]
        }
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=numpy.array([0,-1000,-1000,-1000,-1000,-1000,-1000]),
                                             high=numpy.array([3,1000,1000,1000,1000,1000,1000]),
                                             dtype=numpy.float32)


        self.totalReward = 0
        self.stepCount = 0
        self.isDone = False

        self.envReward = 0
        self.envEpisodeCount = 0
        self.envStepCount = 0

        self.reset()
        self.optimum = self.calculate_customers_reward()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, actionArg):
        paths = self.map[self.state]
        action = actionArg
        destination, cost = paths[action]
        lastState = self.state
        lastObState = state_name_to_int(lastState)
        customerReward = self.customer_reward[destination]

        info = {"from": self.state, "to": destination}

        self.state = destination
        reward = (-cost + self.customer_reward[destination]) / self.optimum
        self.customer_visited(destination)
        done = destination == 'S' and self.all_customers_visited()

        stateAsInt = state_name_to_int(self.state)
        self.totalReward += reward
        self.stepCount += 1
        self.envReward += reward
        self.envStepCount += 1

        if self.showStep:
            print( "Episode: " + ("%4.0f  " % self.envEpisodeCount) + 
                   " Step: " + ("%4.0f  " % self.stepCount) + 
                   #lastState + ':' + str(lastObState) + ' --' + str(action) + '-> ' + self.state + ':' + str(stateAsInt) +
                   lastState + ' --' + str(action) + '-> ' + self.state + 
                   ' R=' + ("% 2.2f" % reward) + ' totalR=' + ("% 3.2f" % self.totalReward) + 
                   ' cost=' + ("%4.0f" % cost) + ' customerR=' + ("%4.0f" % customerReward) + ' optimum=' + ("%4.0f" % self.optimum)      
                   )

        if done and not self.isDone:
            self.envEpisodeCount += 1
            if BeraterEnv.showDone or (self.envEpisodeCount%BeraterEnv.envEpisodeModulo) == 0:
                episodes = BeraterEnv.envEpisodeModulo
                if (self.envEpisodeCount % BeraterEnv.envEpisodeModulo != 0):
                    episodes = self.envEpisodeCount % BeraterEnv.envEpisodeModulo
                print( "Done: " + 
                        ("episodes=%6.0f  " % self.envEpisodeCount) + 
                        ("avgSteps=%6.2f  " % (self.envStepCount/episodes)) + 
                        ("avgTotalReward=% 3.2f" % (self.envReward/episodes) )
                        )
                if (self.envEpisodeCount%BeraterEnv.envEpisodeModulo) == 0:
                    self.envReward = 0
                    self.envStepCount = 0

        self.isDone = done
        observation = self.getObservation(stateAsInt)

        return observation, reward, done, info

    def getObservation(self, position):
        result = numpy.array([ position, 
                               self.getEdgeObservation('S','A'),
                               self.getEdgeObservation('S','B'),
                               self.getEdgeObservation('S','C'),
                               self.getEdgeObservation('A','B'),
                               self.getEdgeObservation('A','C'),
                               self.getEdgeObservation('B','C'),
                              ],
                             dtype=numpy.float32)
        return result

    def getEdgeObservation(self, source, target):
        reward = self.customer_reward[target] 
        cost = self.getCost(source,target)
        result = reward - cost

        return result

    def getCost(self, source, target):
        paths = self.map[source]
        for destination, cost in paths:
            if destination == target:
                result = cost
                break

        return result

    def customer_visited(self, customer):
        self.customer_reward[customer] = 0

    def all_customers_visited(self):
        return self.calculate_customers_reward() == 0

    def calculate_customers_reward(self):
        return sum(self.customer_reward.values())

    def reset(self):
        # print("Reset")
        
        self.totalReward = 0
        self.stepCount = 0
        self.isDone = False
        reward_per_customer = 1000
        self.customer_reward = {
            'S': 0,
            'A': reward_per_customer,
            'B': reward_per_customer,
            'C': reward_per_customer,
        }

        self.state = 'S'
        return self.getObservation(state_name_to_int(self.state))

    def render(self, mode='human'):
        if BeraterEnv.showRender:
            print( ("steps=%4.0f  " % self.stepCount) + ' totalR=' + ("% 3.2f" % self.totalReward) + ' done=' + str(self.isDone))

Register Environment


In [3]:
from gym.envs.registration import register

# bump the version suffix so that re-running this cell registers a fresh id
cnt += 1
id = "Berater-v{}".format(cnt)
register(
    id=id,
    entry_point=BeraterEnv
)   

print(id)


Berater-v1

Try out Environment


In [4]:
BeraterEnv.showStep = True
BeraterEnv.showDone = True

env = gym.make(id)
observation = env.reset()
print(env)

for t in range(1000):
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        env.render()
        break
env.close()


<BeraterEnv<Berater-v1>>
Episode:    0   Step:    1  S --0-> A R= 0.30 totalR= 0.30 cost= 100 customerR=1000 optimum=3000
Episode:    0   Step:    2  A --1-> C R= 0.20 totalR= 0.50 cost= 400 customerR=1000 optimum=3000
Episode:    0   Step:    3  C --0-> A R=-0.13 totalR= 0.37 cost= 400 customerR=   0 optimum=3000
Episode:    0   Step:    4  A --1-> C R=-0.13 totalR= 0.23 cost= 400 customerR=   0 optimum=3000
Episode:    0   Step:    5  C --1-> B R= 0.25 totalR= 0.48 cost= 250 customerR=1000 optimum=3000
Episode:    0   Step:    6  B --2-> S R=-0.13 totalR= 0.35 cost= 400 customerR=   0 optimum=3000
Done: episodes=     1  avgSteps=  6.00  avgTotalReward= 0.35
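
The rewards in the trace follow directly from the formula in step(): the customer reward of the destination minus the traversal cost, normalized by the optimum of 3000. For example, reproducing the first and third step above with plain arithmetic:

# first step S --0-> A: (-cost + customer_reward) / optimum
print((-100 + 1000) / 3000)   # 0.30
# third step C --0-> A revisits an already served customer, so it only costs money
print((-400 + 0) / 3000)      # -0.13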

Train model


In [5]:
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

BeraterEnv.showStep = False
BeraterEnv.showDone = False

wrapped_env = DummyVecEnv([lambda: gym.make(id)])
model = ppo2.learn(network='mlp', env=wrapped_env, total_timesteps=60000)


Logging to /tmp/openai-2018-12-03-16-18-54-072955
Done: episodes=   100  avgSteps=  8.49  avgTotalReward= 0.24
Done: episodes=   200  avgSteps=  8.85  avgTotalReward= 0.22
-------------------------------------
| approxkl           | 0.0015304903 |
| clipfrac           | 0.0061035156 |
| eplenmean          | nan          |
| eprewmean          | nan          |
| explained_variance | -0.45        |
| fps                | 436          |
| nupdates           | 1            |
| policy_entropy     | 1.0969917    |
| policy_loss        | -0.010166587 |
| serial_timesteps   | 2048         |
| time_elapsed       | 4.69         |
| total_timesteps    | 2048         |
| value_loss         | 0.04548084   |
-------------------------------------
Done: episodes=   300  avgSteps=  8.23  avgTotalReward= 0.25
Done: episodes=   400  avgSteps=  7.49  avgTotalReward= 0.34
Done: episodes=   500  avgSteps=  7.42  avgTotalReward= 0.33
Done: episodes=   600  avgSteps=  7.14  avgTotalReward= 0.37
Done: episodes=   700  avgSteps=  7.54  avgTotalReward= 0.32
Done: episodes=   800  avgSteps=  7.03  avgTotalReward= 0.38
Done: episodes=   900  avgSteps=  7.07  avgTotalReward= 0.38
Done: episodes=  1000  avgSteps=  6.82  avgTotalReward= 0.40
Done: episodes=  1100  avgSteps=  6.73  avgTotalReward= 0.40
Done: episodes=  1200  avgSteps=  6.03  avgTotalReward= 0.46
Done: episodes=  1300  avgSteps=  6.40  avgTotalReward= 0.44
Done: episodes=  1400  avgSteps=  6.75  avgTotalReward= 0.40
Done: episodes=  1500  avgSteps=  5.77  avgTotalReward= 0.49
Done: episodes=  1600  avgSteps=  5.95  avgTotalReward= 0.46
Done: episodes=  1700  avgSteps=  5.74  avgTotalReward= 0.49
Done: episodes=  1800  avgSteps=  5.86  avgTotalReward= 0.47
Done: episodes=  1900  avgSteps=  5.48  avgTotalReward= 0.51
Done: episodes=  2000  avgSteps=  5.86  avgTotalReward= 0.46
Done: episodes=  2100  avgSteps=  5.29  avgTotalReward= 0.53
Done: episodes=  2200  avgSteps=  5.21  avgTotalReward= 0.55
Done: episodes=  2300  avgSteps=  5.21  avgTotalReward= 0.54
Done: episodes=  2400  avgSteps=  4.98  avgTotalReward= 0.56
Done: episodes=  2500  avgSteps=  5.50  avgTotalReward= 0.51
Done: episodes=  2600  avgSteps=  4.81  avgTotalReward= 0.58
Done: episodes=  2700  avgSteps=  4.84  avgTotalReward= 0.57
Done: episodes=  2800  avgSteps=  4.97  avgTotalReward= 0.54
Done: episodes=  2900  avgSteps=  4.86  avgTotalReward= 0.56
Done: episodes=  3000  avgSteps=  4.80  avgTotalReward= 0.56
Done: episodes=  3100  avgSteps=  4.43  avgTotalReward= 0.60
Done: episodes=  3200  avgSteps=  4.54  avgTotalReward= 0.59
Done: episodes=  3300  avgSteps=  4.57  avgTotalReward= 0.60
-------------------------------------
| approxkl           | 0.0017361455 |
| clipfrac           | 0.017944336  |
| eplenmean          | nan          |
| eprewmean          | nan          |
| explained_variance | 0.876        |
| fps                | 521          |
| nupdates           | 10           |
| policy_entropy     | 0.68430805   |
| policy_loss        | -0.019066116 |
| serial_timesteps   | 20480        |
| time_elapsed       | 39.7         |
| total_timesteps    | 20480        |
| value_loss         | 0.0043647513 |
-------------------------------------
Done: episodes=  3400  avgSteps=  4.46  avgTotalReward= 0.61
Done: episodes=  3500  avgSteps=  4.45  avgTotalReward= 0.61
Done: episodes=  3600  avgSteps=  4.40  avgTotalReward= 0.63
Done: episodes=  3700  avgSteps=  4.25  avgTotalReward= 0.63
Done: episodes=  3800  avgSteps=  4.50  avgTotalReward= 0.60
Done: episodes=  3900  avgSteps=  4.44  avgTotalReward= 0.62
Done: episodes=  4000  avgSteps=  4.42  avgTotalReward= 0.64
Done: episodes=  4100  avgSteps=  4.34  avgTotalReward= 0.63
Done: episodes=  4200  avgSteps=  4.32  avgTotalReward= 0.63
Done: episodes=  4300  avgSteps=  4.47  avgTotalReward= 0.61
Done: episodes=  4400  avgSteps=  4.27  avgTotalReward= 0.64
Done: episodes=  4500  avgSteps=  4.37  avgTotalReward= 0.63
Done: episodes=  4600  avgSteps=  4.19  avgTotalReward= 0.66
Done: episodes=  4700  avgSteps=  4.24  avgTotalReward= 0.64
Done: episodes=  4800  avgSteps=  4.26  avgTotalReward= 0.65
Done: episodes=  4900  avgSteps=  4.19  avgTotalReward= 0.67
Done: episodes=  5000  avgSteps=  4.22  avgTotalReward= 0.66
Done: episodes=  5100  avgSteps=  4.17  avgTotalReward= 0.67
Done: episodes=  5200  avgSteps=  4.28  avgTotalReward= 0.65
Done: episodes=  5300  avgSteps=  4.13  avgTotalReward= 0.68
Done: episodes=  5400  avgSteps=  4.11  avgTotalReward= 0.67
Done: episodes=  5500  avgSteps=  4.15  avgTotalReward= 0.67
Done: episodes=  5600  avgSteps=  4.17  avgTotalReward= 0.67
Done: episodes=  5700  avgSteps=  4.18  avgTotalReward= 0.68
Done: episodes=  5800  avgSteps=  4.12  avgTotalReward= 0.68
Done: episodes=  5900  avgSteps=  4.05  avgTotalReward= 0.68
Done: episodes=  6000  avgSteps=  4.12  avgTotalReward= 0.68
Done: episodes=  6100  avgSteps=  4.09  avgTotalReward= 0.69
Done: episodes=  6200  avgSteps=  4.14  avgTotalReward= 0.68
Done: episodes=  6300  avgSteps=  4.09  avgTotalReward= 0.69
Done: episodes=  6400  avgSteps=  4.15  avgTotalReward= 0.70
Done: episodes=  6500  avgSteps=  4.05  avgTotalReward= 0.71
Done: episodes=  6600  avgSteps=  4.11  avgTotalReward= 0.69
Done: episodes=  6700  avgSteps=  4.09  avgTotalReward= 0.71
Done: episodes=  6800  avgSteps=  4.13  avgTotalReward= 0.69
Done: episodes=  6900  avgSteps=  4.02  avgTotalReward= 0.71
Done: episodes=  7000  avgSteps=  4.04  avgTotalReward= 0.71
Done: episodes=  7100  avgSteps=  4.07  avgTotalReward= 0.72
Done: episodes=  7200  avgSteps=  4.07  avgTotalReward= 0.71
Done: episodes=  7300  avgSteps=  4.09  avgTotalReward= 0.71
Done: episodes=  7400  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  7500  avgSteps=  4.07  avgTotalReward= 0.72
Done: episodes=  7600  avgSteps=  4.05  avgTotalReward= 0.71
Done: episodes=  7700  avgSteps=  4.04  avgTotalReward= 0.72
Done: episodes=  7800  avgSteps=  4.07  avgTotalReward= 0.72
Done: episodes=  7900  avgSteps=  4.04  avgTotalReward= 0.72
Done: episodes=  8000  avgSteps=  4.04  avgTotalReward= 0.71
Done: episodes=  8100  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes=  8200  avgSteps=  4.05  avgTotalReward= 0.72
Done: episodes=  8300  avgSteps=  4.04  avgTotalReward= 0.72
--------------------------------------
| approxkl           | 0.00043752656 |
| clipfrac           | 0.004638672   |
| eplenmean          | nan           |
| eprewmean          | nan           |
| explained_variance | 0.985         |
| fps                | 521           |
| nupdates           | 20            |
| policy_entropy     | 0.1696136     |
| policy_loss        | -0.0118810395 |
| serial_timesteps   | 40960         |
| time_elapsed       | 78.5          |
| total_timesteps    | 40960         |
| value_loss         | 0.000630336   |
--------------------------------------
Done: episodes=  8400  avgSteps=  4.01  avgTotalReward= 0.72
Done: episodes=  8500  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes=  8600  avgSteps=  4.06  avgTotalReward= 0.72
Done: episodes=  8700  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes=  8800  avgSteps=  4.06  avgTotalReward= 0.72
Done: episodes=  8900  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  9000  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  9100  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes=  9200  avgSteps=  4.03  avgTotalReward= 0.72
Done: episodes=  9300  avgSteps=  4.05  avgTotalReward= 0.72
Done: episodes=  9400  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes=  9500  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes=  9600  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes=  9700  avgSteps=  4.00  avgTotalReward= 0.72
Done: episodes=  9800  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes=  9900  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10000  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 10100  avgSteps=  4.04  avgTotalReward= 0.73
Done: episodes= 10200  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 10300  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 10400  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 10500  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 10600  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10700  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10800  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 10900  avgSteps=  4.03  avgTotalReward= 0.73
Done: episodes= 11000  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11100  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11200  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11300  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 11400  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 11500  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 11600  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 11700  avgSteps=  4.04  avgTotalReward= 0.73
Done: episodes= 11800  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 11900  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 12000  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12100  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12200  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12300  avgSteps=  4.02  avgTotalReward= 0.73
Done: episodes= 12400  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12500  avgSteps=  4.01  avgTotalReward= 0.73
Done: episodes= 12600  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12700  avgSteps=  4.00  avgTotalReward= 0.73
Done: episodes= 12800  avgSteps=  4.02  avgTotalReward= 0.73

Enjoy model


In [6]:
import numpy as np 

observation = wrapped_env.reset()
state = np.zeros((1, 2*128))
dones = np.zeros((1))

BeraterEnv.showStep = True
BeraterEnv.showDone = False

for t in range(1000):
    actions, _, state, _ = model.step(observation, S=state, M=dones)
    observation, reward, done, info = wrapped_env.step(actions)
    if done:
        print("Episode finished after {} timesteps".format(t+1))
        break
wrapped_env.close()


Episode: 12890   Step:    1  S --2-> C R= 0.27 totalR= 0.27 cost= 200 customerR=1000 optimum=3000
Episode: 12890   Step:    2  C --1-> B R= 0.25 totalR= 0.52 cost= 250 customerR=1000 optimum=3000
Episode: 12890   Step:    3  B --0-> A R= 0.25 totalR= 0.77 cost= 250 customerR=1000 optimum=3000
Episode: 12890   Step:    4  A --2-> S R=-0.03 totalR= 0.73 cost= 100 customerR=   0 optimum=3000
Episode finished after 4 timesteps
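
The learned policy walks a minimum-cost round trip S -> C -> B -> A -> S. Its total reward of about 0.73 matches the plateau reached during training, as this small check (plain arithmetic, not from the original run) confirms:

# total reward of the round trip S -> C -> B -> A -> S
costs = 200 + 250 + 250 + 100     # traversal costs along the trip
rewards = 1000 + 1000 + 1000      # every customer served exactly once
print((rewards - costs) / 3000)   # ~0.733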

In [0]: