"For our purposes, it is convenient to place the boundary of the learning agent not at the limit of its physical body, but at the limit of its control." - 3.2

3.1 - The Agent-Environment Interface

Exercise 3.2

Is the reinforcement learning framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

The framework is a poor fit for tasks where failure is not an option (i.e., the problem cannot be reset). One example would be learning to walk near a cliff, where failure means death and there is no way to restart the episode.

Another exception would be a task whose objective cannot easily be reduced to a single number. For example, in teaching a car to drive we would want to minimize travel time while maximizing safety. With the current framework, all of these desirable outcomes would have to be collapsed into a single scalar reward returned by the environment.

Exercise 3.4

Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for $-1$ upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

For the episodic task with these rewards, the return is $R_t = -\gamma^{K - t - 1}$, where $K$ is the time step at which failure occurs, since the episode ends there. In the continuing formulation, failures keep accruing, so the return is $R_t = -\sum_i \gamma^{K_i - t - 1}$, where the sum runs over all failure times $K_i > t$.
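A tiny sketch comparing the two returns from time $t$, using made-up failure times and $\gamma = 0.9$:

```python
# Hedged sketch: episodic vs. continuing return from time t, with gamma = 0.9
# and hypothetical failure times K_i (the times at which the reward -1 arrives).
gamma, t = 0.9, 0
failure_times = [20, 35, 60]                                           # made up for illustration

episodic_return = -gamma ** (failure_times[0] - t - 1)                 # episode ends at the first failure
continuing_return = -sum(gamma ** (K - t - 1) for K in failure_times)  # later failures also count
print(episodic_return, continuing_return)
```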

3.6 - MDPs

Exercise 3.7

Assuming a finite MDP with a finite number of reward values, write an equation for the transition probabilities and the expected rewards in terms of the joint conditional distribution in (3.5) $Pr(s_{t+1} = s', r_{t+1} = r| s_t, a_t)$

$P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a) = \sum_r Pr(s_{t+1} = s', r_{t+1} = r| s_t = s, a_t = a)$

$R^a_{ss'} = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s'] = \frac{\sum_r r \cdot Pr(s_{t+1} = s', r_{t+1} = r| s_t = s, a_t = a)}{P^a_{ss'}}$
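A small sketch recovering both quantities from a made-up joint distribution over two next states and two reward values (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical joint distribution Pr(s_{t+1}=s', r_{t+1}=r | s_t=s, a_t=a) for one
# fixed (s, a): rows index the two possible next states, columns the two reward values.
rewards = np.array([0.0, 1.0])
joint = np.array([[0.1, 0.3],    # s' = 0
                  [0.4, 0.2]])   # s' = 1

P = joint.sum(axis=1)            # P^a_{ss'}: marginalize the reward out of the joint
R = (joint @ rewards) / P        # R^a_{ss'}: expected reward, normalized by P^a_{ss'}
print(P)                         # [0.4 0.6]
print(R)                         # [0.75 0.333...]
```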

3.7 - Value Functions

Exercise 3.8

What is the Bellman equation for action values?

$ \begin{equation} \begin{split} Q^{\pi}(s, a) &= E_{\pi} [R_t | s_t = s, a_t = a]\\ &= E_{\pi} [\sum_{k=0}^{\infty} \gamma^k r_{t + k + 1} | s_t = s, a_t = a ]\\ &= E_{\pi} [r_{t + 1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t + k + 2} | s_t = s, a_t = a]\\ &= \sum_{s'} P^a_{ss'} R^a_{ss'} + \gamma \sum_{s'} P^a_{ss'} E_{\pi} [\sum_{k=0}^{\infty} \gamma^k r_{t + k + 2} | s_{t+1} = s']\\ &= \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^{\pi}(s')]\\ &= \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \sum_{a'} \pi(s', a') Q^{\pi}(s', a')] \end{split} \end{equation} $
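A minimal sketch that treats the last equation as a fixed-point update on a made-up two-state, two-action MDP (the transition probabilities, rewards, and policy below are all invented for illustration):

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[a, s, s'] are transition probabilities,
# R[a, s, s'] the expected rewards, pi[s, a] a fixed stochastic policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[-1.0, 1.0], [0.5, 0.0]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
gamma = 0.9

Q = np.zeros((2, 2))                       # Q[s, a]
for _ in range(500):                       # iterate the Bellman equation to its fixed point
    V = (pi * Q).sum(axis=1)               # V(s') = sum_a' pi(s', a') Q(s', a')
    Q = ((P * R).sum(axis=2) + gamma * (P @ V)).T   # Q(s, a) = sum_s' P [R + gamma V(s')]
print(Q)
```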

Exercise 3.9

The Bellman equation (3.10) must hold for each state for the value function shown in Figure 3.5b. As an example, show numerically that this equation holds for the center state, valued at +0.7.

$V^{\pi}(s_{mid}) = \frac{1}{4} [(0 + 0.9 \cdot 0.7) + (0 + 0.9 \cdot 2.3) + (0 + 0.9 \cdot 0.4) + (0 + 0.9 \cdot (-0.4))] = 0.675 \approx 0.7$
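The same check in code, plugging in the neighboring values quoted above (2.3, 0.7, 0.4 and -0.4 from Figure 3.5b), zero rewards, $\gamma = 0.9$, and the equiprobable random policy:

```python
# Numeric check of the Bellman equation (3.10) at the center state of Figure 3.5b.
gamma = 0.9
neighbor_values = [0.7, 2.3, 0.4, -0.4]   # values of the four neighboring states
v_center = sum(0.25 * (0 + gamma * v) for v in neighbor_values)
print(v_center)                           # 0.675, which agrees with +0.7 to one decimal place
```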

Exercise 3.10

In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.2), that adding a constant $C$ to all the rewards adds a constant, $K$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $K$ in terms of $C$ and $\gamma$?

We have from previously that $V^{\pi}(s) = E_{\pi} [\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} | s_t = s]$.

Let's add a constant $C$ to all rewards so that $r'_{t+1} = r_{t+1} + C$.

Then, $$V'^{\pi}(s) = E_{\pi} [\sum_{k=0}^{\infty} \gamma^k (r_{t+k+1} + C) | s_t = s]$$ $$ = V^{\pi}(s) + E_{\pi}[ \sum_{k=0}^{\infty} \gamma^k C | s_t = s ]$$ $$ = V^{\pi}(s) + K $$

We have,

$ \begin{equation} \begin{split} K &= E_{\pi}[\sum_{k=0}^{\infty} \gamma^k C | s_t = s]\\ &= \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left(C \frac{1}{1 - \gamma}\right)\\ &= \frac{C}{1 - \gamma} \end{split} \end{equation} $

where the last step follows because the policy probabilities and the transition probabilities each sum to 1.
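A quick numeric check of $K = \frac{C}{1-\gamma}$ with made-up values $C = 2$ and $\gamma = 0.9$:

```python
C, gamma = 2.0, 0.9                                   # made-up constants for illustration
K_series = sum(C * gamma ** k for k in range(1000))   # truncated geometric series
K_closed = C / (1 - gamma)
print(K_series, K_closed)                             # both ~ 20.0
```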

Exercise 3.11

Now consider adding a constant C to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

For episodic tasks we have,

$$K = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \sum_{k=0}^{N-1} C \gamma^k$$

where $N$ is the number of time steps remaining until the episode ends. Unlike the continuing case, $K$ now depends on $N$ and, for positive $C$, grows with episode length, so adding a constant does change an episodic task: the agent is rewarded simply for prolonging the episode. In maze running, for example, a large enough $C$ makes wandering the maze indefinitely preferable to heading straight for the goal, as the quick illustration below shows.
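A tiny made-up illustration (undiscounted for simplicity, goal reward +1, constant $C = 1$ added to every reward): a 10-step run to the goal versus 100 steps of wandering that eventually reaches it:

```python
# Made-up maze-running illustration: with gamma = 1, a goal reward of +1, and
# C = 1 added to every per-step reward, the slow 100-step episode now scores higher.
C, goal_reward = 1.0, 1.0
fast_return = goal_reward + 10 * C     # reach the goal in 10 steps   -> 11.0
slow_return = goal_reward + 100 * C    # wander for 100 steps first   -> 101.0
print(fast_return, slow_return)
```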

Exercise 3.12

The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

$ \begin{equation} \begin{split} V^{\pi} (s) =& E_{\pi} [Q^{\pi} (s, a) | s_t = s]\\ =& \sum_a \pi(s, a) Q^{\pi} (s, a) \end{split} \end{equation} $

Exercise 3.13

The value of an action, $Q^{\pi}(s, a)$, can be divided into two parts: the expected next reward, which does not depend on the policy $\pi$, and the expected sum of the remaining rewards, which depends on the next state and the policy. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state-action pair) and branching to the possible next states:

$ \begin{equation} \begin{split} Q^{\pi} (s, a) =& E[ r_{t+1} | s_t = s, a_t = a] + \gamma E_{\pi}[ V^{\pi}(s') | s_t = s, a_t = a] \\ Q^{\pi} (s, a) =& \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^{\pi}(s')] \end{split} \end{equation} $

3.8 - Optimal Value Functions

Exercise 3.16

Give the Bellman equation for $Q^*$ for the recycling robot. Actions: s = search, w = wait, r = recharge; states: h = high, l = low.

$Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$

$Q^*(h, s) = \alpha (R^s + \gamma \max_{a'} Q^*(h, a')) + (1 - \alpha) (R^s + \gamma \max_{a'} Q^*(l, a'))$

$Q^*(l, s) = \beta (R^s + \gamma \max_{a'} Q^*(l, a')) + (1 - \beta) (-3 + \gamma \max_{a'} Q^*(h, a'))$

$Q^*(l, w) = R^w + \gamma \max_{a'} Q^*(l, a')$

$Q^*(h, w) = R^w + \gamma \max_{a'} Q^*(h, a')$

$Q^*(l, r) = 0 + \gamma \max_{a'} Q^*(h, a')$

Recharging is not an available action in the high state ($A(h) = \{s, w\}$), so there is no $Q^*(h, r)$.
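A small sketch that solves these equations by value iteration; $\alpha$, $\beta$, $R^s$, $R^w$, and $\gamma$ below are made-up numbers, used purely for illustration:

```python
# Hedged sketch: solve the recycling-robot Bellman optimality equations above
# by synchronous value iteration, with invented parameters.
alpha, beta = 0.9, 0.4
R_search, R_wait = 3.0, 1.0
gamma = 0.9

Q = {('h', 's'): 0.0, ('h', 'w'): 0.0,
     ('l', 's'): 0.0, ('l', 'w'): 0.0, ('l', 'r'): 0.0}

def max_q(state):
    # max over the actions actually available in `state`
    return max(q for (s, a), q in Q.items() if s == state)

for _ in range(1000):
    Q = {
        ('h', 's'): alpha * (R_search + gamma * max_q('h')) + (1 - alpha) * (R_search + gamma * max_q('l')),
        ('l', 's'): beta * (R_search + gamma * max_q('l')) + (1 - beta) * (-3 + gamma * max_q('h')),
        ('h', 'w'): R_wait + gamma * max_q('h'),
        ('l', 'w'): R_wait + gamma * max_q('l'),
        ('l', 'r'): 0 + gamma * max_q('h'),
    }

for sa, q in sorted(Q.items()):
    print(sa, round(q, 3))
```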

Exercise 3.17

Figure 3.8 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.2) to express this value symbolically, and then to compute it to three decimal places.

$V^*(s) = \max_{a \in A(s)} \sum_{s'} P^a_{ss'}[R^a_{ss'} + \gamma V^*(s')]$

So we have,

$V^*(s) = \max_a E[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t + k+2} | s_t = s, a_t = a]$

$ = \max_a [10 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 0 + \gamma^4 \cdot 0 + \gamma^5 \cdot 10 + \cdots]$

$ = 10(\sum_{k=0}^\infty \gamma^{5k})$

$ = 10 \frac{1}{1 - \gamma^5} \approx 24.419$
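As a quick check in code, with $\gamma = 0.9$ as in the gridworld example:

```python
gamma = 0.9                       # discount rate used in the gridworld example
print(10 / (1 - gamma ** 5))      # 24.419428... -> 24.419 to three decimal places
```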

Cart-pole balancing


In [1]:
# https://raw.githubusercontent.com/stober/cartpole/master/src/__init__.py

import numpy as np
import math
import random

class CartPole(object):

    def __init__(self):

        # Constants
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.total_mass = (self.masspole + self.masscart)
        self.length = 0.5 # actually half the pole's length
        self.polemass_length = (self.masspole * self.length)
        self.force_mag = 10.0
        self.tau = 0.02 # seconds between state updates
        self.fourthirds = 1.3333333333333

    def cart_pole(self, action, x = 0.0, xdot = 0.0, theta = 0.0, thetadot = 0.0):
        # action must be binary
        
        force = self.force_mag if action > 0 else -self.force_mag

        costheta = math.cos(theta)
        sintheta = math.sin(theta)

        tmp = (force + self.polemass_length * (thetadot ** 2) * sintheta) / self.total_mass
        thetaacc = (self.gravity * sintheta - costheta * tmp) / (self.length * (self.fourthirds - self.masspole * costheta ** 2 / self.total_mass))
        xacc = tmp - self.polemass_length * thetaacc * costheta / self.total_mass

        
        # Update the four state variables, using Euler's method
        x += self.tau * xdot
        xdot += self.tau * xacc
        theta += self.tau * thetadot
        thetadot += self.tau * thetaacc
        
        return [x, xdot, theta, thetadot]

    def get_box(self, x = 0.0, xdot = 0.0, theta = 0.0, thetadot = 0.0):
        #          get_box:  Given the current state, returns a number from 1 to 162
        #   designating the region of the state space encompassing the current state.
        #   Returns a value of -1 if a failure state is encountered.
        one_degree = 0.0174532 # 2pi/360
        six_degrees = 0.1047192
        twelve_degrees = 0.2094384
        fifty_degrees = 0.87266

        if (x < -2.4 or x > 2.4) or (theta < -twelve_degrees or theta > twelve_degrees):
            return -1

        if x < -0.8:
            box = 0
        elif x < 0.8:
            box = 1
        else:
            box = 2

        if xdot < -0.5:
            pass
        elif xdot < 0.5:
            box += 3
        else:
            box += 6

        if theta < -six_degrees:
            pass
        elif theta < -one_degree:
            box += 9
        elif theta < 0:
            box += 18
        elif theta < one_degree:
            box += 27
        elif theta < six_degrees:
            box += 36
        else:
            box += 45

        if thetadot < -fifty_degrees:
            pass
        elif thetadot < fifty_degrees:
            box += 54
        else:
            box += 108

        return box
        
    def prob_push_right(self, s):
        return (1.0 / (1.0 + np.exp(-max(-50.0, min(s, 50.0)))))

In [17]:
cp = CartPole()

N_BOXES = 162
MAX_FAILURES = 100
MAX_STEPS = 100000
LAMBDAw = 0.9         #/* Decay rate for w eligibility trace. */
LAMBDAv = 0.8         # /* Decay rate for v eligibility trace. */
GAMMA = 0.95          # /* Discount factor for critic. */
ALPHA = 1000          # /* Learning rate for action weights, w. */
BETA = 0.5            # /* Learning rate for critic weights, v. */

failures = steps = 0
w = []; v = []; xbar = []; e = []

# Initialize action and heuristic critic weights and traces.
for i in range(N_BOXES):
    w.append(0.0); v.append(0.0); xbar.append(0.0); e.append(0.0)

# Starting state is (0 0 0 0)
x = x_dot = theta = theta_dot = 0.0

# Find box in state space containing start state
box = cp.get_box(x, x_dot, theta, theta_dot)

# Iterate through the action-learn loop. ---*/
while steps < MAX_STEPS and failures < MAX_FAILURES:
    # Choose action randomly, biased by current weight
    y = (random.uniform(0, 1) < cp.prob_push_right(w[box]))
    
    # Update traces.
    e[box] += (1.0 - LAMBDAw) * (y - 0.5)
    xbar[box] += (1.0 - LAMBDAv)
    
    # Remember prediction of failure for current state
    oldp = v[box]
    
    # /*--- Apply action to the simulated cart-pole ---*/
    x, x_dot, theta, theta_dot = cp.cart_pole(y, x, x_dot, theta, theta_dot)
    
    # /*--- Get box of state space containing the resulting state. ---*/
    box = cp.get_box(x, x_dot, theta, theta_dot)
    
    if box < 0:
        # /*--- Failure occurred. ---*/
        failed = 1
        failures += 1
        print("Trial %d was %d steps.\n" %(failures, steps))
        steps = 0

        # /*--- Reset state to (0 0 0 0).  Find the box. ---*/
        x = x_dot = theta = theta_dot = 0.0
        box = cp.get_box(x, x_dot, theta, theta_dot)

        # /*--- Reinforcement upon failure is -1. Prediction of failure is 0. ---*/
        r = -1.0
        p = 0.0
    else:
        # /*--- Not a failure. ---*/
        failed = 0
        
        # /*--- Reinforcement is 0. Prediction of failure given by v weight. ---*/
        r = 0
        p = v[box]
    
    steps += 1
    

    # /*--- Heuristic reinforcement is:   current reinforcement
    #      + gamma * new failure prediction - previous failure prediction ---*/
    rhat = r + GAMMA * p - oldp

    for i in range(N_BOXES):
        # /*--- Update all weights. ---*/
        
        w[i] += ALPHA * rhat * e[i]
        v[i] += BETA * rhat * xbar[i]
        
        # (the original C code has a no-op here, "if (v[i] < -1.0) v[i] = v[i];",
        # presumably an intended clip of the failure prediction at -1; omitted)
        
        if failed == 1:
            #/*--- If failure, zero all traces. ---*/
            e[i] = 0.0
            xbar[i] = 0.0
        else:
            # /*--- Otherwise, update (decay) the traces. ---*/
            e[i] *= LAMBDAw
            xbar[i] *= LAMBDAv

if (failures == MAX_FAILURES):
    print("Pole not balanced. Stopping after %d failures." %(failures))
else:
    print("Pole balanced successfully for at least %d steps\n" %(steps))


Trial 1 was 17 steps.

Trial 2 was 14 steps.

Trial 3 was 10 steps.

Trial 4 was 10 steps.

Trial 5 was 17 steps.

Trial 6 was 71 steps.

Trial 7 was 23 steps.

Trial 8 was 26 steps.

Trial 9 was 88 steps.

Trial 10 was 39 steps.

Trial 11 was 15 steps.

Trial 12 was 43 steps.

Trial 13 was 121 steps.

Trial 14 was 30 steps.

Trial 15 was 28 steps.

Trial 16 was 171 steps.

Trial 17 was 193 steps.

Trial 18 was 210 steps.

Trial 19 was 222 steps.

Trial 20 was 185 steps.

Trial 21 was 207 steps.

Trial 22 was 99 steps.

Trial 23 was 85 steps.

Trial 24 was 185 steps.

Trial 25 was 213 steps.

Trial 26 was 230 steps.

Trial 27 was 283 steps.

Trial 28 was 281 steps.

Trial 29 was 234 steps.

Trial 30 was 292 steps.

Trial 31 was 732 steps.

Trial 32 was 455 steps.

Trial 33 was 321 steps.

Trial 34 was 264 steps.

Trial 35 was 380 steps.

Trial 36 was 294 steps.

Trial 37 was 318 steps.

Trial 38 was 241 steps.

Trial 39 was 1554 steps.

Trial 40 was 1792 steps.

Trial 41 was 2523 steps.

Trial 42 was 1555 steps.

Trial 43 was 56268 steps.

Trial 44 was 12440 steps.

Trial 45 was 84286 steps.

Pole balanced successfully for at least 100000 steps


In [10]:
# Use OpenAI Gym

import gym
import random
env = gym.make('CartPole-v0')

cp = CartPole()
N_BOXES = 162
MAX_FAILURES = 100
MAX_STEPS = 100000
LAMBDAw = 0.9         #/* Decay rate for w eligibility trace. */
LAMBDAv = 0.8         # /* Decay rate for v eligibility trace. */
GAMMA = 0.95          # /* Discount factor for critic. */
ALPHA = 1000          # /* Learning rate for action weights, w. */
BETA = 0.5            # /* Learning rate for critic weights, v. */

failures = steps = 0
w = []; v = []; xbar = []; e = []

# Initialize action and heuristic critic weights and traces.
for i in range(N_BOXES):
    w.append(0.0); v.append(0.0); xbar.append(0.0); e.append(0.0)

# Starting state is (0 0 0 0)
observation = env.reset()
x, x_dot, theta, theta_dot = observation

# Find box in state space containing start state
box = cp.get_box(x, x_dot, theta, theta_dot)

# Iterate through the action-learn loop. ---*/
while steps < MAX_STEPS and failures < MAX_FAILURES:
    env.render()
    
    # Choose action randomly, biased by current weight
    y = (random.uniform(0, 1) < cp.prob_push_right(w[box]))
    
    # Update traces.
    e[box] += (1.0 - LAMBDAw) * (y - 0.5)
    xbar[box] += (1.0 - LAMBDAv)
    
    # Remember prediction of failure for current state
    oldp = v[box]
    
    # /*--- Apply action to the simulated cart-pole ---*/
    observation, reward, done, info = env.step(y)
    x, x_dot, theta, theta_dot = observation
    
    # /*--- Get box of state space containing the resulting state. ---*/
    box = cp.get_box(x, x_dot, theta, theta_dot)
    
    if done:
        # /*--- Failure occurred. ---*/
        failed = 1
        failures += 1
        print("Trial %d was %d steps.\n" %(failures, steps))
        steps = 0

        # /*--- Reset state to (0 0 0 0).  Find the box. ---*/
        observation = env.reset()
        x, x_dot, theta, theta_dot = observation
        box = cp.get_box(x, x_dot, theta, theta_dot)

        # /*--- Reinforcement upon failure is -1. Prediction of failure is 0. ---*/
        r = -1
        p = 0.0
    else:
        # /*--- Not a failure. ---*/
        failed = 0
        
        # /*--- Reinforcement is 0. Prediction of failure given by v weight. ---*/
        r = 0
        p = v[box]
    
    steps += 1
    

    # /*--- Heuristic reinforcement is:   current reinforcement
    #      + gamma * new failure prediction - previous failure prediction ---*/
    rhat = r + GAMMA * p - oldp

    for i in range(N_BOXES):
        # /*--- Update all weights. ---*/
        
        w[i] += ALPHA * rhat * e[i]
        v[i] += BETA * rhat * xbar[i]
        
        # (the original C code has a no-op here, "if (v[i] < -1.0) v[i] = v[i];",
        # presumably an intended clip of the failure prediction at -1; omitted)
        
        if failed == 1:
            #/*--- If failure, zero all traces. ---*/
            e[i] = 0.0
            xbar[i] = 0.0
        else:
            # /*--- Otherwise, update (decay) the traces. ---*/
            e[i] *= LAMBDAw
            xbar[i] *= LAMBDAv

if (failures == MAX_FAILURES):
    print("Pole not balanced. Stopping after %d failures." %(failures))
else:
    print("Pole balanced successfully for at least %d steps\n" %(steps))


[2016-07-19 01:55:07,269] Making new env: CartPole-v0
Trial 1 was 20 steps.

Trial 2 was 12 steps.

Trial 3 was 9 steps.

Trial 4 was 155 steps.

Trial 5 was 16 steps.

Trial 6 was 103 steps.

Trial 7 was 34 steps.

Trial 8 was 38 steps.

Trial 9 was 12 steps.

Trial 10 was 27 steps.

Trial 11 was 10 steps.

Trial 12 was 58 steps.

Trial 13 was 188 steps.

Trial 14 was 144 steps.

Trial 15 was 186 steps.

Trial 16 was 101 steps.

Trial 17 was 122 steps.

Trial 18 was 40 steps.

Trial 19 was 26 steps.

Trial 20 was 48 steps.

Trial 21 was 16 steps.

Trial 22 was 11 steps.

Trial 23 was 714 steps.

Trial 24 was 412 steps.

Trial 25 was 620 steps.

Trial 26 was 470 steps.

Trial 27 was 580 steps.

Trial 28 was 498 steps.

Trial 29 was 523 steps.

Trial 30 was 736 steps.

Trial 31 was 597 steps.

Trial 32 was 825 steps.

Trial 33 was 759 steps.

Trial 34 was 342 steps.

Trial 35 was 482 steps.

Trial 36 was 598 steps.

Trial 37 was 601 steps.

Trial 38 was 614 steps.

Trial 39 was 868 steps.

Trial 40 was 1189 steps.

Trial 41 was 1358 steps.

Trial 42 was 427 steps.

Trial 43 was 785 steps.

Trial 44 was 387 steps.

Trial 45 was 287 steps.

Trial 46 was 3246 steps.

Trial 47 was 10808 steps.

Trial 48 was 15820 steps.

Trial 49 was 22176 steps.

Trial 50 was 4548 steps.

Trial 51 was 11423 steps.

Trial 52 was 4957 steps.

Trial 53 was 37452 steps.

Trial 54 was 6652 steps.

Pole balanced successfully for at least 100000 steps



In [1]:
import gym
import random
env = gym.make('CartPole-v0')

# env.monitor.start('/tmp/cartpole-experiment-3', force=True)

bestSteps = 0
best = [0, 0, 0, 0]
alpha = 1

for i_episode in range(80):

    test = [best[i] + (random.random() - 0.5) * alpha for i in range(4)]

    score = 0
    for ep in range(10):  # <-- key point: evaluate each candidate over 10 episodes to reduce noise
        observation = env.reset()
        for t in range(200): # <-- CartPole-v0 caps an episode at 200 steps, so score accumulates over the 10 episodes
            env.render()
            if sum(observation[i] * test[i] for i in range(4)) > 0:
                action = 1
            else:
                action = 0
            observation, reward, done, info = env.step(action)
            if done:
                break

        score += t

    if bestSteps < score:
        bestSteps = score
        best = test
        alpha *= .9

    print "test:", "[%+1.2f %+1.2f %+1.2f %+1.2f]" % tuple(test), score, 
    print "best:", "[%+1.2f %+1.2f %+1.2f %+1.2f]" % tuple(best), bestSteps, alpha


print "best", best, bestSteps

# env.monitor.close()


[2016-07-19 01:43:02,117] Making new env: CartPole-v0
test: [-0.47 +0.06 +0.29 -0.32] 85 best: [-0.47 +0.06 +0.29 -0.32] 85 0.9
test: [-0.81 -0.11 +0.70 -0.55] 88 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.49 -0.13 +0.60 -0.53] 85 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.98 -0.30 +1.09 -0.47] 88 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.50 +0.21 +0.46 -0.53] 86 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.59 -0.15 +0.57 -0.91] 84 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-1.01 -0.45 +0.43 -0.57] 83 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.68 +0.05 +0.43 -0.18] 82 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-1.05 +0.07 +0.80 -0.15] 88 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.97 -0.07 +1.00 -0.36] 83 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.82 +0.15 +0.85 -0.47] 88 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.91 +0.29 +0.65 -0.24] 83 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.96 -0.18 +0.34 -0.44] 86 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-0.66 -0.18 +0.40 -0.34] 86 best: [-0.81 -0.11 +0.70 -0.55] 88 0.81
test: [-1.11 +0.17 +0.99 -0.32] 89 best: [-1.11 +0.17 +0.99 -0.32] 89 0.729
test: [-1.16 +0.43 +0.99 -0.02] 87 best: [-1.11 +0.17 +0.99 -0.32] 89 0.729
test: [-0.80 -0.11 +0.79 -0.03] 543 best: [-0.80 -0.11 +0.79 -0.03] 543 0.6561
test: [-0.91 -0.26 +0.93 -0.21] 317 best: [-0.80 -0.11 +0.79 -0.03] 543 0.6561
test: [-0.99 -0.32 +0.67 +0.28] 618 best: [-0.99 -0.32 +0.67 +0.28] 618 0.59049
test: [-0.72 -0.45 +0.39 +0.03] 560 best: [-0.99 -0.32 +0.67 +0.28] 618 0.59049
test: [-1.24 -0.19 +0.53 +0.31] 856 best: [-1.24 -0.19 +0.53 +0.31] 856 0.531441
test: [-1.36 -0.41 +0.77 +0.15] 727 best: [-1.24 -0.19 +0.53 +0.31] 856 0.531441
test: [-1.04 -0.13 +0.51 +0.25] 610 best: [-1.24 -0.19 +0.53 +0.31] 856 0.531441
test: [-1.47 -0.18 +0.78 +0.40] 728 best: [-1.24 -0.19 +0.53 +0.31] 856 0.531441
test: [-1.24 +0.00 +0.27 +0.41] 962 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.32 -0.12 +0.30 +0.19] 618 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.27 -0.06 +0.21 +0.27] 607 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.25 -0.20 +0.39 +0.62] 740 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.07 +0.02 +0.10 +0.43] 721 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.11 -0.19 +0.31 +0.59] 794 best: [-1.24 +0.00 +0.27 +0.41] 962 0.4782969
test: [-1.28 -0.12 +0.27 +0.61] 1068 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.33 +0.02 +0.46 +0.72] 939 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.38 -0.20 +0.10 +0.76] 1018 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.15 -0.24 +0.22 +0.67] 827 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.39 -0.12 +0.35 +0.70] 907 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.07 -0.20 +0.19 +0.80] 1048 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.24 -0.14 +0.43 +0.53] 941 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.10 -0.28 +0.38 +0.64] 899 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.47 -0.06 +0.15 +0.66] 920 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.43 +0.03 +0.30 +0.40] 640 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.11 +0.10 +0.14 +0.53] 970 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.23 +0.03 +0.12 +0.60] 1033 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.13 -0.17 +0.41 +0.71] 822 best: [-1.28 -0.12 +0.27 +0.61] 1068 0.43046721
test: [-1.39 +0.04 +0.31 +0.78] 1099 best: [-1.39 +0.04 +0.31 +0.78] 1099 0.387420489
test: [-1.23 +0.14 +0.37 +0.88] 1555 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.15 +0.28 +0.40 +1.06] 1539 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.07 +0.12 +0.50 +0.83] 1214 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.30 +0.21 +0.50 +0.80] 1064 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.30 +0.12 +0.24 +0.76] 1089 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.39 -0.02 +0.35 +0.96] 848 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.38 +0.10 +0.51 +0.96] 1334 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.21 +0.14 +0.45 +0.72] 1173 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.24 +0.27 +0.54 +0.76] 1172 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.21 +0.27 +0.54 +0.95] 1433 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.11 +0.16 +0.39 +0.93] 1553 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.20 +0.17 +0.33 +1.02] 1346 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.07 +0.26 +0.51 +0.85] 1344 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.12 +0.20 +0.21 +0.97] 963 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.40 +0.10 +0.39 +1.00] 934 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.27 +0.15 +0.21 +0.81] 1026 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.35 +0.19 +0.42 +0.96] 1062 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.20 +0.31 +0.30 +0.73] 1162 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.24 +0.23 +0.24 +0.99] 1357 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.24 +0.29 +0.27 +0.90] 1245 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.28 +0.07 +0.28 +0.99] 1056 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.22 -0.02 +0.50 +0.89] 1012 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.26 +0.16 +0.40 +0.97] 996 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.07 +0.26 +0.38 +0.76] 1166 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.11 +0.17 +0.43 +0.93] 1201 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.37 +0.08 +0.20 +0.78] 1068 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.07 +0.02 +0.24 +0.90] 1518 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.24 +0.28 +0.40 +0.92] 1266 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.28 +0.23 +0.28 +1.00] 1214 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.16 +0.29 +0.23 +0.77] 1232 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.23 +0.26 +0.21 +0.78] 1328 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.25 +0.20 +0.23 +0.98] 917 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.24 +0.06 +0.30 +0.91] 1234 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.10 +0.07 +0.29 +0.83] 1089 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.20 +0.06 +0.29 +1.03] 1375 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
test: [-1.19 +0.27 +0.30 +0.88] 1378 best: [-1.23 +0.14 +0.37 +0.88] 1555 0.3486784401
best [-1.2322893214490043, 0.14150894065716918, 0.36790056804535143, 0.881835892179857] 1555