Preliminary Setup


In [124]:
%matplotlib inline
from cs282rl.domains import hiv_treatment
from cs282rl.domains.hiv_treatment import HIVTreatment, visualize_hiv_history 
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, BaggingClassifier

In [166]:
num_patients = 50
eps = 50
state_histories, action_histories, reward_histories = \
HIVTreatment.generate_batch(num_patients=num_patients)
visualize_hiv_history(state_histories[0], action_histories[0])
plt.show()



In [167]:
batch = (state_histories, action_histories, reward_histories)

Question A: Algorithm Description

The general approach and algorithms written and evaluated in this problem set were based on the fitted Q discussion in the paper Tree-Based Batch Mode Reinforcement Learning by Damien Ernst, Pierre Geurts, and Louis Wehenkel, which was presented in class. In the paper, Ernst et al. discuss the determination of an optimal control policy which, in batch mode, is achievable by approximating the Q-function from a set of four-tuples consisting of the current state, action, reward, and next state. The paper studies several implementations of this framework, but the most compelling approaches, as outlined in its conclusions, are the ensemble methods of extremely randomized and totally randomized trees. These are state-of-the-art ensemble regressors that have been designed and implemented many times, so the work of this Practical avoids reimplementing them by drawing on the scikit-learn library to provide the regressors.

Doing this allowed me to focus fully on the central algorithm that approximates the Q-function from these four-tuples, which Ernst et al. refer to as the fitted Q algorithm. In past problems we have examined, the state space was fairly small, finite, and discrete. For this HIV problem, however, though the action space was a discrete collection of 4 distinct actions, the state space was continuous over the provided T1, T1*, T2, T2*, V, and E variables. As a result of this continuous nature, traditional Q-learning methods such as those from Practical 1 could not be employed, since the Q-function could no longer be represented in discrete, tabular form. The fitted Q algorithm presented by Ernst et al., on the other hand, extends the value-iteration idea behind kernel-based reinforcement learning and fits an approximation architecture to the Q-function using the aforementioned four-tuples.
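
To make the batch representation concrete, the minimal sketch below (a hypothetical helper, not part of the implementation later in this notebook) flattens the generated histories into the four-tuples (x_t, u_t, r_t, x_t+1) that fitted Q consumes. It assumes each history covers at least days+1 time steps, as in the batches generated above.

def make_four_tuples(batch, days=200):
    # batch = (state_histories, action_histories, reward_histories), as generated above.
    state_histories, action_histories, reward_histories = batch
    four_tuples = []
    for patient in range(len(state_histories)):
        for d in range(days):
            s = state_histories[patient][d]           # continuous 6-dimensional state
            a = action_histories[patient][d]          # one of the 4 discrete actions
            r = reward_histories[patient][d]          # immediate reward
            s_next = state_histories[patient][d + 1]  # successor state
            four_tuples.append((s, a, r, s_next))
    return four_tuples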

The idea of this algorithm is reasonably straightforward: it iterates over some number of episodes, where at each step N it creates a Q_N function. At the first step, a batch-mode regression algorithm (such as the extremely and totally randomized tree regressors mentioned above and explained in detail below) is applied to a training set where X is the set of state-action pairs and Y is the corresponding immediate rewards. Once this regressor is created, for each subsequent Q_N the training targets are refreshed via value iteration using the reward plus the discounted maximum of the previous function, Q_N-1, over the possible actions. In other words, q = r + gamma * max_a Q_N-1(x_t+1, a), where a is the action, r is the reward, gamma is the discount factor, and x_t+1 is the next state.
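
As an illustration of this update, the sketch below (illustrative only, under the assumption that Q_prev is a list of four per-action regressors as in the implementation further down, and that four_tuples holds (s, a, r, s_next) transitions) performs one value-iteration refresh of the training targets.

import numpy as np

def refresh_targets(four_tuples, Q_prev, gamma=0.98):
    # One fitted Q step: q = r + gamma * max_a Q_{N-1}(s_next, a); Q_0 is treated as zero.
    targets = []
    for (s, a, r, s_next) in four_tuples:
        fitted = [Q_prev[act] for act in range(4) if Q_prev[act] is not None]
        if not fitted:
            targets.append(r)
        else:
            best = max(q.predict([s_next])[0] for q in fitted)
            targets.append(r + gamma * best)
    return np.array(targets)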

After several iterations of this, there exists a model for the state-action space which allows us to create a new policy and maximize the overall reward. In the same way that we find the maximal q value in the iteration step described above, we can select the maximizing action at any given state in the continuous space and improve greatly on the baseline approaches. However, the policy must be epsilon-greedy to avoid inadvertent convergence to one of the baseline behaviors. During the development and testing of this algorithm, particularly the multiple fitted Q algorithm described below, this convergence occurred several times, so an epsilon component was added to make the policy somewhat random and avoid this behavior. Additionally, it is important to note that no dynamic stopping conditions were implemented in this practical: there was a lot of variation in the Q_N functions, and Ernst et al. themselves warn that because the iterations may not converge, it sometimes makes sense to experiment with a variety of hard-coded stopping conditions and use the most effective one. In this practical, as discussed in the results section, about 30 episodes was enough to attain near-optimal results.
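
The epsilon-greedy safeguard mirrors the policy defined inside run_fittedq later in this notebook; a minimal standalone sketch (assuming Q is the list of four per-action regressors) is:

import random

def epsilon_greedy_action(Q, state, epsilon=0.15):
    # With probability epsilon explore at random; otherwise act greedily w.r.t. the fitted Q.
    if random.random() < epsilon:
        return random.randrange(4)
    values = [Q[a].predict([state])[0] if Q[a] is not None else -float("inf")
              for a in range(4)]
    return values.index(max(values))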

For the actual regressors, as mentioned before, several were experimented with, including extremely randomized trees (Extra-Trees) and totally randomized trees. The former, which was ultimately deemed the most effective regressor in this domain, works by building M trees from the complete original training set. At each node, the algorithm selects K cut directions at random and, for each of these, a cut-point at random. It then computes a score for each of the K resulting tests and selects the one that is maximal. Totally randomized trees follow a similar procedure, but the parameter K is set equal to 1, so that the tests at the different nodes are chosen independently of the outputs of the training set.
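
Both variants can be obtained from scikit-learn's ExtraTreesRegressor: by default K candidate splits are drawn according to max_features, while setting max_features=1 scores only a single random candidate per node, which (to my understanding) approximates totally randomized trees. This is just a configuration sketch, not a claim about the exact settings used by Ernst et al.

from sklearn.ensemble import ExtraTreesRegressor

# Extra-Trees: each node draws K random cut directions (K = max_features) and keeps the best.
extra_trees = ExtraTreesRegressor(n_estimators=50)

# "Totally randomized" behaviour: only one random candidate split per node (K = 1).
totally_randomized = ExtraTreesRegressor(n_estimators=50, max_features=1)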

The implementation we see below is a full, from-scratch implementation of fitted Q, which uses the scikit-learn ensemble methods to create the regressions. Because the action space is discrete, the implementation actually creates an array of 4 Q_N regressors, each of which corresponds to a single action; a state's value under each action can then be predicted and maximized over, which is precisely what maxreg() does. Though this improves the performance of the algorithm by discretizing the action space and avoiding ambiguous action results, it does so at the cost of computation time (predicting over 4 regressors and maximizing requires much more computation than predicting over a single regressor) and of regressor quality (since we separate the states into discrete action-specific regressors, we have 4 smaller and less coherent regressors instead of 1 comprehensive one, so each of the small ones may not be as effective). This was a design decision I made early on and decided to maintain, but it could be interesting to see whether results could have been improved with a single, large regression.

The final implementation of this Practical is the multiple fitted Q implementation, created on the suggestion of the Ernst et al. paper. This implementation runs fitted Q several times over compounded state, action, and reward histories, beginning with the histories obtained from the batch generated using a random policy. At the conclusion of each fitted Q run, a new batch is created from the resulting Q policy, and this batch is concatenated with all the previous histories. Thus, though the computation time increases exponentially with each fitted Q execution, sampling over the accumulated histories allows the algorithm to become much more precise: it is not bound to a specific random batch, but builds on a batch accumulated across successive regressions. This was fairly effective and, given enough iterations, was able to achieve very significant gains in average reward, as discussed in the results section.

Finally, in this implementation I evaluated many parameter combinations. Though it would have made sense to perform an actual, computation-based grid search over these parameters (and over others for Extra-Trees that I kept at their defaults), I elected to use the Ernst parameters as a baseline and approximate them instead, to save time and focus on the algorithm implementation. Some of the parameters evaluated in this practical were the number of episodes to run each fitted Q for, the number of estimators used in the regression, the regressor to be used, the epsilon for the epsilon-greedy policy, and the gamma for the discounted value iteration. Many of these values are presented in the code and plots below and are discussed in the results section.
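
For reference, a coarse grid search over the main parameters could look like the hypothetical sketch below; it assumes run_fittedq (defined later in this notebook) and the preliminary setup have already been executed, and it would be very expensive to run in full.

import itertools
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

epsilons = [0.05, 0.15, 0.25]
estimator_counts = [10, 50]
episode_counts = [10, 30]

best_mean, best_params = -float("inf"), None
for e_, n_est, n_eps in itertools.product(epsilons, estimator_counts, episode_counts):
    # Fit Q with this parameter combination and score the resulting policy on a fresh batch.
    _, pol, _, _ = run_fittedq(e_, n_eps, 200, 1, ExtraTreesRegressor, n_est)
    _, _, rh_ = HIVTreatment.generate_batch(num_patients=1, policy=pol)
    mean_reward = np.array(rh_).mean()
    if mean_reward > best_mean:
        best_mean, best_params = mean_reward, (e_, n_est, n_eps)
print "Best (epsilon, n_estimators, episodes):", best_params, "mean reward:", best_mean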

Question B (i): The Fitted-Q Algorithm


In [114]:
def maxreg(Q, xt1):
    # Maximum predicted Q-value over the 4 per-action regressors for next state xt1
    maxr = max([Q[action].predict(xt1) if Q[action] else -float("inf") for action in range(4)])
    # In the case nothing has been fit yet, just return 0
    if maxr == -float("inf"):
        maxr = 0
    return maxr

def fittedQ(batch, regressor, eps=30, days=200, num_patients=10, n_estimators = 10):
    # One regressor per action; Q_N[a] approximates Q(state, a)
    Q_N = [None for i in range(4)]
    gamma = 0.98
    trainX = [[] for i in range(4)]
    trainY = [[] for i in range(4)]
    for N in range(eps):
        if not N%5:
            print "Just finished episode", N
        for d in range(1, days):
            for patient in range(num_patients):
                # Four-tuple for this patient/day: (state, action, reward, next state)
                state = batch[0][patient][d]
                action = batch[1][patient][d]
                next_state = batch[0][patient][d+1]
                # Value-iteration target: r + gamma * max_a Q_{N-1}(next_state, a)
                reward = batch[2][patient][d] + gamma*maxreg(Q_N, next_state)

                trainX[action].append(state)
                trainY[action].append(reward)
                if not Q_N[action]:
                    Q_N[action] = regressor(n_estimators=n_estimators)
                # Refit this action's regressor on all (state, target) pairs seen so far
                Q_N[action].fit(np.array(trainX[action]), np.ravel(trainY[action]))
    return Q_N, batch

Basic testing below, feel free to ignore


In [169]:
myQ, _ = fittedQ(batch, ExtraTreesRegressor) 
myQ


Just finished day 10
Just finished day 20
Just finished day 30
Just finished day 40
Out[169]:
[ExtraTreesRegressor(bootstrap=False, compute_importances=None,
          criterion='mse', max_depth=None, max_features='auto',
          max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
          min_samples_split=2, n_estimators=10, n_jobs=1, oob_score=False,
          random_state=None, verbose=0),
 ExtraTreesRegressor(bootstrap=False, compute_importances=None,
          criterion='mse', max_depth=None, max_features='auto',
          max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
          min_samples_split=2, n_estimators=10, n_jobs=1, oob_score=False,
          random_state=None, verbose=0),
 ExtraTreesRegressor(bootstrap=False, compute_importances=None,
          criterion='mse', max_depth=None, max_features='auto',
          max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
          min_samples_split=2, n_estimators=10, n_jobs=1, oob_score=False,
          random_state=None, verbose=0),
 ExtraTreesRegressor(bootstrap=False, compute_importances=None,
          criterion='mse', max_depth=None, max_features='auto',
          max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
          min_samples_split=2, n_estimators=10, n_jobs=1, oob_score=False,
          random_state=None, verbose=0)]

In [170]:
def policy(state, rng):
    # Greedy policy w.r.t. the fitted per-action regressors (rng is unused here)
    lst = [myQ[action].predict(state) if myQ[action] else -float("inf") for action in range(4)]
    return lst.index(max(lst))

In [171]:
state_histories2, action_histories2, reward_histories2 = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)

In [172]:
test1 = np.array(reward_histories2)
print test1.mean()
plt.plot(test1.mean(axis=0))


32482.1719807
Out[172]:
[<matplotlib.lines.Line2D at 0x7f96c2c53a10>]

End of basic testing. Below, the code above is refactored into more callable and scalable components.


In [110]:
def run_fittedq(e, eps, days, num_patients, regressor, n_estimators, policy=hiv_treatment.random_policy, itr = 1, batch = None):
    # Generate a batch of histories under the incoming policy (random on the first call)
    state_histories, action_histories, reward_histories = \
    HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)
    
    # Organize the batch, concatenating with any previously accumulated histories
    if not batch:
        batch = (state_histories, action_histories, reward_histories)
    else:
        batch = (np.concatenate((state_histories, batch[0])), \
                 np.concatenate((action_histories, batch[1])), \
                 np.concatenate((reward_histories, batch[2])))
    # Fit Q
    newQ, batch = fittedQ(batch, regressor, eps=eps, days=days, num_patients=num_patients, n_estimators = n_estimators)
    
    # Epsilon-greedy policy derived from the newly fitted Q (replaces the incoming policy)
    def policy(state, rng):
        lst = [newQ[action].predict(state) if newQ[action] else -float("inf") for action in range(4)]
        if random.random() < e:
            return random.choice([i for i in range(4)])
        return lst.index(max(lst))
    
    print "Q", itr
    print "Q", itr, "reward mean: ", np.array(reward_histories).mean()
    plt.plot(np.array(reward_histories).mean(axis=0), label="Iteration "+str(itr))
    
    return newQ, policy, batch, len(batch[0])

In [5]:
def plot_baselines(num_patients, rewards = []):
    baseline_rewards = []
    # Random
    if len(rewards) < 5:
        state_histories, action_histories, reward_histories = \
        HIVTreatment.generate_batch(num_patients=num_patients)
        random_rewards = np.array(reward_histories)
    else:
        random_rewards = rewards[0]
    baseline_rewards.append(random_rewards)
    print "Random reward mean: ", random_rewards.mean()
    plt.plot(random_rewards.mean(axis=0), label='Random')
    
    # Always x
    for a in range(4):
        if len(rewards) < 5:
            state_histories, action_histories, reward_histories = \
            HIVTreatment.generate_batch(num_patients=num_patients, policy = hiv_treatment.always_do(a))
            action_rewards = np.array(reward_histories)
        else:
            action_rewards = rewards[a+1]
        baseline_rewards.append(action_rewards)
        print "Action", a, "reward mean: ", action_rewards.mean()
        plt.plot(action_rewards.mean(axis=0), label='Action '+str(a))
    plt.legend()
    return baseline_rewards

In [6]:
baseline_rewards = []

Question B (ii): Plots

Single run of fitted Q using 10 estimators against baseline.


In [139]:
e = .15
num_patients = 2
eps = 31
days = 200
iters = 1
regressor = ExtraTreesRegressor
n_estimators = 10

Q, policy, batch, num_patients = run_fittedq(e, eps, days, num_patients, regressor, n_estimators)

sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)
final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0))
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot against baseline")
baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Just finished episode 15
Just finished episode 20
Just finished episode 25
Just finished episode 30
Q 1
Q 1 reward mean:  47131.5072698
Fitted Q reward mean:  444322.882809
Random reward mean:  53902.7759666
Action 0 reward mean:  17161.7083799
Action 1 reward mean:  15951.6383545
Action 2 reward mean:  19890.6142787
Action 3 reward mean:  35073.090681

Single run of fitted Q using 50 estimators against baseline. Performs much better.


In [140]:
e = .15
num_patients = 2
eps = 31
days = 200
iters = 1
regressor = ExtraTreesRegressor
n_estimators = 50

Q, policy, batch, num_patients = run_fittedq(e, eps, days, num_patients, regressor, n_estimators)

sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)
final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0))
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot against baseline")
baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Just finished episode 15
Just finished episode 20
Just finished episode 25
Just finished episode 30
Q 1
Q 1 reward mean:  46046.7313936
Fitted Q reward mean:  6191713.94908
Random reward mean:  53902.7759666
Action 0 reward mean:  17161.7083799
Action 1 reward mean:  15951.6383545
Action 2 reward mean:  19890.6142787
Action 3 reward mean:  35073.090681

In [135]:
baseline_rewards = plot_baselines(num_patients, baseline_rewards)
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Baseline Plots")


Random reward mean:  53902.7759666
Action 0 reward mean:  17161.7083799
Action 1 reward mean:  15951.6383545
Action 2 reward mean:  19890.6142787
Action 3 reward mean:  35073.090681
Out[135]:
<matplotlib.text.Text at 0x7f82eac8c810>

Below are several experiments involving a single application of fitted Q. After experimenting with all of these, and based on the results of the paper (as discussed previously), I concluded that an epsilon of .15 and a total of 50 estimators would yield the best results. Though these are not necessarily optimal (the best approach would have been to do a grid search across several possibilities for these parameters as well as some others in the ExtraTreesRegressor that were ultimately kept at their defaults), this epsilon value performed the best over several trials of the code below. Though more estimators would very likely have improved the results further, the runtime tradeoff was simply not worth it beyond this point, so 50 was selected. Similarly, 31 episodes proved to be most effective, as additional episodes beyond this point did not improve results significantly. For more information on the selection of and experimentation with these parameters, see Section A.


In [117]:
# Epsilon
e = .15

# Episodes, patients, days, Q iterations
eps = 31
num_patients = 1
days = 200
iters = 1

# Regressor and params
regressor = ExtraTreesRegressor
n_estimators = 10

# Batch and policy to evolve over time
policy = hiv_treatment.random_policy
batch = None

# Iterate through Qs and run the alg
for x in range(iters):
    Q, policy, batch, num_patients = \
        run_fittedq(e, eps, days, num_patients, regressor, n_estimators, policy=policy, itr = x, batch = batch)

# batch final histories and plot
sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)

final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0), label="Final Iteration")
plt.legend()
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot Over Single Fitted Q")
# See baselines above. Omitted here to avoid clutter
# baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Just finished episode 15
Just finished episode 20
Just finished episode 25
Just finished episode 30
Q 0
Q 0 reward mean:  45285.0446738
Fitted Q reward mean:  290025.05507
Out[117]:
<matplotlib.text.Text at 0x7f82eb67c650>

In [127]:
# Epsilon
e = .15

# Episodes, patients, days, Q iterations
eps = 31
num_patients = 1
days = 200
iters = 1

# Regressor and params
regressor = ExtraTreesRegressor
n_estimators = 50

# Batch and policy to evolve over time
policy = hiv_treatment.random_policy
batch = None

# Iterate through Qs and run the alg
for x in range(iters):
    Q, policy, batch, num_patients = \
        run_fittedq(e, eps, days, num_patients, regressor, n_estimators, policy=policy, itr = x, batch = batch)

# batch final histories and plot
sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)

final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0), label="Final Iteration")
plt.legend()
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot Over Single Fitted Q (More Estimators)")
# See baselines above. Omitted here to avoid clutter
# baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Just finished episode 15
Just finished episode 20
Just finished episode 25
Just finished episode 30
Q 0
Q 0 reward mean:  36825.436068
Fitted Q reward mean:  524336.552988
Out[127]:
<matplotlib.text.Text at 0x7f82eb4d2fd0>

In [118]:
# Epsilon
e = .15

# Episodes, patients, days, Q iterations
eps = 31
num_patients = 5
days = 200
iters = 1

# Regressor and params
regressor = ExtraTreesRegressor
n_estimators = 10

# Batch and policy to evolve over time
policy = hiv_treatment.random_policy
batch = None

# Iterate through Qs and run the alg
for x in range(iters):
    Q, policy, batch, num_patients = \
        run_fittedq(e, eps, days, num_patients, regressor, n_estimators, policy=policy, itr = x, batch = batch)

# batch final histories and plot
sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)

final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0), label="Final Iteration")
plt.legend()
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot Over Single Fitted Q (5 Patients)")
# See baselines above. Omitted here to avoid clutter
# baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Just finished episode 15
Just finished episode 20
Just finished episode 25
Just finished episode 30
Q 0
Q 0 reward mean:  46215.7996374
Fitted Q reward mean:  251941.265607
Out[118]:
<matplotlib.text.Text at 0x7f82eb5e6e90>

Below is the setup for multiple (namely 5) applications of fitted Q. In the interest of saving time while verifying effectiveness, only 1 patient and 10 estimators were used. We would very likely get better results with a greater number of patients and estimators (as we did above with the single fitted Q application), but these results serve well as a prototype demonstrating the effectiveness of the approach. More on this in the discussion section below.


In [128]:
# Epsilon
e = .15

# Episodes, patients, days, Q iterations
eps = 11
num_patients = 1
days = 200
iters = 5

# Regressor and params
regressor = ExtraTreesRegressor
n_estimators = 10

# Batch and policy to evolve over time
policy = hiv_treatment.random_policy
batch = None

# Iterate through Qs and run the alg
for x in range(iters):
    Q, policy, batch, num_patients = \
        run_fittedq(e, eps, days, num_patients, regressor, n_estimators, policy=policy, itr = x, batch = batch)

# batch final histories and plot
sh, ah, rh = \
HIVTreatment.generate_batch(num_patients=num_patients, policy=policy)

final_rewards = np.array(rh)
print "Fitted Q reward mean: ", final_rewards.mean()
plt.plot(final_rewards.mean(axis=0), label="Final Iteration")
plt.legend()
plt.xlabel("Episodes")
plt.ylabel("Rewards")
plt.title("Q Iteration Reward Plot Over Multiple Applications of Fitted Q")
# See baselines above. Omitted here to avoid clutter
# baseline_rewards = plot_baselines(num_patients, baseline_rewards)


Just finished episode 0
Just finished episode 5
Just finished episode 10
Q 0
Q 0 reward mean:  55115.6517462
Just finished episode 0
Just finished episode 5
Just finished episode 10
Q 1
Q 1 reward mean:  1101474.18161
Just finished episode 0
Just finished episode 5
Just finished episode 10
Q 2
Q 2 reward mean:  23004.6203495
Just finished episode 0
Just finished episode 5
Just finished episode 10
Q 3
Q 3 reward mean:  86759.0504422
Just finished episode 0
Just finished episode 5
Just finished episode 10
Q 4
Q 4 reward mean:  159961.660306
Fitted Q reward mean:  1775543.13238
Out[128]:
<matplotlib.text.Text at 0x7f82eb3af390>

In [129]:
hiv_treatment.visualize_hiv_history(sh[0], ah[0])


Out[129]:
[<matplotlib.lines.Line2D at 0x7f82eaf0e850>,
 <matplotlib.lines.Line2D at 0x7f82eaf0e9d0>,
 <matplotlib.lines.Line2D at 0x7f82eaf17050>,
 <matplotlib.lines.Line2D at 0x7f82eaf17710>,
 <matplotlib.lines.Line2D at 0x7f82eaf07f10>,
 <matplotlib.lines.Line2D at 0x7f82eaf07410>,
 <matplotlib.lines.Line2D at 0x7f82eaf24190>]

Question C: Results Discussion

Much of the rationale presented in Part A (algorithm motivation) was supported by the results, some of which may be seen in the plots above. For this results section, I will first discuss the success of the single-iteration fitted Q algorithm, then delve into the even greater success of the multiple fitted Q algorithm, and finally discuss the results of the holistic parameter evaluation. I will then discuss future tests that can be done with these implementations and touch on what else can be done to improve the implementations themselves.

Single Fitted Q

The single fitted Q algorithm was a good approach to take for this domain, as anticipated. It handled the continuous space very well, and the implementation with 4 regressors, though detracting from runtime, performed well, especially compared with the baseline implementations. The evaluation of its performance may be conducted in two ways: quantitatively, using the mean reward, and qualitatively, by looking at the plots provided. Consider the first two plots above. We see that the mean reward for the fitted Q algorithm presented in this practical is substantially higher than any of the baseline averages. The "always" baselines, which always elect to perform one of the actions, do not attain very high rewards and thus have fairly low averages. "Always 3" spikes a bit at the beginning but then decreases and stays close to 0, meaning the effectiveness of taking both drugs all the time tapers off rather quickly. The random policy performs the best of the baselines, but is clearly outperformed by fitted Q, which not only does better than "Always 3" at the beginning, but consistently maintains high rewards over the entire 200-step run. Part of the reason for this success is the use of the holistically optimized parameters, described in more detail below: roughly 30 episodes and a gamma of .98 for the actual fitted Q algorithm. Note that since fitted Q outperforms the baselines by so much, it was necessary to plot the baselines separately rather than superimpose them on the other plots. It is important to mention, however, that some of the plots are clearly not monotonically increasing and exhibit considerable fluctuation in their rewards over the days. This seems to indicate that fitted Q is not very good at finding good actions at all possible states; even though it is better overall than any of the baselines, its fluctuations are not ideal, and it would be interesting to see what exactly causes them and how we might make the rewards more consistent over time. This is discussed in more detail in the section below. The other plots presented here simply show the greater effectiveness of using more estimators in the ensemble method, which follows our intuition. However, since more estimators take up so much computation time, 50 estimators are only used in 2 of the plots, to demonstrate their efficacy, while the other plots use 10 to save time and to effectively prototype the potential results of each approach.

Multiple Fitted Q

This algorithm only has a single plot to evaluate its success, but this plot and the different reward averages reveal much about the general success of this approach and the potential for even greater gains, as outlined in the future work section. First, note that with only 5 overall applications of fitted Q, this algorithm attained a monumental 1.77 million average reward, far outstripping any of the other results in this practical. On top of this, though it is difficult to see, notice that although the early iterations exhibited much of the fluctuation we saw in the single fitted Q plots above, and even the penultimate Q application took a steep dive near the end of the 200 steps, the final application, indicated by the yellow line, steadily increases to higher and higher rewards over time. This, as discussed above, is a very desirable quality, as we would like reward to increase over time, indicating that the actions (drugs) being taken are working well. This seems to be achievable only through the multiple application of fitted Q and is a testament to its success. It follows my intuition that building on prior batches and creating new and more powerful regressions with each fitted Q application, in tandem with an epsilon-greedy approach to avoid convergence to an undesirable baseline behavior, is very effective in learning a good policy. I am confident that with even more applications of fitted Q, more episodes per application, and more patients, we could attain very large reward gains using this approach.

Holistic parameter evaluation

Throughout the process of designing the above code and constructing the plots, I made several tweaks to the parameters. Here, I present the results of these evaluations, which can also be seen in the code and plots above. For the number of episodes of a single fitted Q application, I began with 200, but when this took several hours to complete, I tried again with 50 and found that the results were very close. After this, I experimented with a low number of 10, and even though this worked well (and was ultimately used for the multiple fitted Q evaluation), it was not sufficient for my purposes. I ultimately settled on about 30, as this was very close to 50 in quality but about 4 times quicker in runtime and made more sense for my prototypes. For the discount factor, I experimented with very low and very high values, but the low values unfortunately gave me results that were not much better than the "random" baseline. This makes sense, since heavily discounted rewards would have made distinctions between good and bad long-term policies much more difficult to spot and would be much more prone to randomness, resulting in sub-optimal policies being generated. However, it was still important to settle on some discount factor, since having no discount would have increased the risk of the action values diverging. Thus, after experimenting in the .7, .8, and .9 range, I found .98 to be a good discount, seemingly because its closeness to 1 allows the algorithm to look far ahead at potential rewards. Aside from these two, I experimented with different numbers of estimators, though this parameter was very straightforward: more estimators yielded better results, though around 150 the gains began to taper off. 50 was a very good number (though not always consistently better; 10 sometimes beat it in average reward as a result of luck). We can see from the above plots that 50 estimators was better than 10, but since 10 performed reasonably well, it was used for the majority of the experiments. Finally, for the policy in the multiple fitted Q application, it was necessary to apply an epsilon randomness factor, which I initially set to .05 but found the .15 suggested by Ernst to be much more effective, ultimately yielding the strong result we see in the second-to-last plot displaying the multiple fitted Q results. I also experimented with different regressors, but it became apparent rather quickly that Extra-Trees was the best regressor for this domain, both because of its nature and because it was unnecessary for us to freeze the tree structure after the first iteration, as something like totally randomized trees would have allowed.

Future Work

Two major components missing from this practical are extensive, precise parameter optimization and deeper evaluation of the algorithms by means of more iterations. Both were avoided to save computation time, but in the future it would be interesting to see whether running the code for several days with several hundred estimators and tens of Q iterations would yield tremendous results. I postulate that it would, basing my prediction on the basic results shown in this practical, wherein using numerous estimators over many cumulative runs of fitted Q drastically increases the average rewards and makes the reward graph approach a more steadily increasing line, losing much of the fluctuation as a result of the smoothing that occurs over the exponentially increasing amount of optimized batch data. Overall, however, I believe this practical shows a very solid and robust implementation of fitted Q which not only outperforms the baselines, but also outperforms itself when applied multiple times. The main remaining issue is the somewhat sporadic fluctuation we observe (notice how one of the early iterations of the multiple fitted Q approach, Q 2 in the output above, yielded a very low average reward of roughly 23,000). I believe one way to combat this would be to run fitted Q many times not on its own output but on the randomized baseline with different randomization seeds each time, select the best result from these, and repeat this process through the multiple fitted Q approach. Though this would involve a lot of computation, I predict that it would smooth out many of the inconsistencies we see in the results above.