hyper-parameter_optimization


Hyper-parameter optimization in Concise

Hyper-parameter optimization comes in quite handy in deep learning: there are different architectures to try out, different parameters, and so on. On the other hand, we only have a limited amount of time to inspect the results of each modeling choice and decide on the next set of hyper-parameters to try out.

By hyper-parameters I refer to any parameter that is not optimized by the learning algorithm (i.e. gradient descent). Some examples are: the learning rate of gradient descent, batch size, number of hidden layers, whether to use skip-connections, the particular way of pre-processing the data (say the amount of data augmentation), etc.

In genomics, being able to explore a vast number of modeling choices automatically is important. Only a handful of papers (if any) have been written for your particular task at hand, hence the space of different modeling choices has not been explored as extensively as for, say, image classification on ImageNet.

I will guide you through the typical process of hyper-parameter optimization, using the example from PWM_initialization.ipynb.

Prerequisite

Steps in hyper-parameter optimization

  1. Parametrize your preprocessing, model and training procedure
  2. Define a range of values (and their distribution) for all the parameters
  3. Run the optimization in a distributed fashion (on a cluster/multiple machines in parallel)
  4. Inspect the results
    • Choose the best model
    • Inspect the relation between the hyper-parameters and the evaluation loss

1. Parametrize your preprocessing, model and training procedure

Hyper-parameter optimization in CONCISE assumes you have defined two functions: data(...) and model(...).

data() function

First, define data(...): a function returning the whole training, validation and test dataset.

  • data() - returns a tuple: (train, test) or (train, valid, test), where:

    • train = (X_train, y_train, other1, other2, ...)
    • valid = (X_valid, y_valid)
    • test = (X_test, y_test)
  • X_* is a numpy array, a list of numpy arrays or a dictionary of numpy arrays

  • y_* is the response variable

X_train and y_train are directly fed to the .fit method of a Keras model: model.fit(x=X_train, y=y_train, ...)

Example


In [1]:
from concise.preprocessing import encodeDNA
import pandas as pd
import numpy as np


Using TensorFlow backend.

In [2]:
def data(seq_length=101):
      
    def load(split="train"):
        dt = pd.read_csv("../data/RBP/PUM2_{0}.csv".format(split))
        # DNA/RNA sequence
        xseq = encodeDNA(dt.seq, maxlen=seq_length, seq_align='center')
        # response variable
        y = dt.binding_site.as_matrix().reshape((-1, 1)).astype("float")
        if split=="train":
            from concise.data import attract
            # add also the pwm_list
            pwm_list = attract.get_pwm_list(["129"])
            return {"seq": xseq}, y, pwm_list
        else:
            return {"seq": xseq}, y

    return load("train"), load("valid"), load("test")

In [3]:
train, valid, test = data()
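
To make the returned structure concrete, here is a small sketch (the fit call is only indicated in a comment, since a compiled Keras model m is defined in the next section):

# train is (X_train, y_train, pwm_list); valid and test are (X, y) pairs
X_train, y_train = train[0], train[1]
X_train["seq"].shape  # one-hot encoded sequences, e.g. (17713, 101, 4)
y_train.shape         # e.g. (17713, 1)

# with a compiled Keras model `m` (see the next section) one would call:
# m.fit(x=X_train, y=y_train, validation_data=valid, batch_size=128, epochs=1)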

model() function

model() returns a compiled Keras model. The only restriction on the parameters is that the function needs a train_data parameter. This allows you to extract the shape of your dataset.


In [101]:
import concise.layers as cl
import keras.layers as kl
import concise.initializers as ci
import concise.regularizers as cr
from keras.callbacks import EarlyStopping
from keras.models import Model, load_model
from keras.optimizers import Adam

def model(train_data, filters=1, kernel_size=9, motif_init=None, lr=0.001):
    seq_length = train_data[0]["seq"].shape[1]
    pwm_list = train_data[2]
    
    if motif_init is not None:
        # Motif init is a dictionary with fields: "stddev" 
        kinit = ci.PSSMKernelInitializer(pwm_list, 
                                         stddev=motif_init.get("stddev", 0.05),  # if not specified, use 0.05
                                         add_noise_before_Pwm2Pssm=True)
        binit = "zeros"
    else:
        kinit = "glorot_uniform"
        binit = "zeros"
        
        
    # sequence
    in_dna = cl.InputDNA(seq_length=seq_length, name="seq")
    x = cl.ConvDNA(filters=filters, 
                   kernel_size=kernel_size, 
                   activation="relu",
                   kernel_initializer=kinit,
                   bias_initializer=binit,
                   name="conv1")(in_dna)
    x = kl.AveragePooling1D(pool_size=4)(x)
    x = kl.Flatten()(x)
    
    x = kl.Dense(units=1, activation="sigmoid")(x)  # sigmoid output to match the binary cross-entropy loss
    m = Model(in_dna, x)
    m.compile(Adam(lr=lr), loss="binary_crossentropy", metrics=["acc"])
    return m
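
As a quick sanity check, we can call model() directly on the data loaded above. This is only a hedged usage sketch of the two functions together; during the hyper-parameter search, CompileFN (introduced below) will do this for us:

# build a model for the training data loaded above
m = model(train_data=train, filters=16, kernel_size=9,
          motif_init={"stddev": 0.05}, lr=0.001)
m.summary()

# a single manual training run (the hyper-parameter search automates this):
# m.fit(x=train[0], y=train[1], validation_data=valid, batch_size=128, epochs=1)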

2. Define a search space for each hyper-parameter

Current hyper-parameters are the arguments of data and model functions:

  • data:
    • seq_length
  • model:
    • filters
    • kernel_size
    • motif_init (yes or no)
      • if yes: stddev
    • lr

Let's use a hyperopt pyll graph to define the hyper-parameter search space:


In [18]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

hyper_params = {
    "data": {
        "seq_length": hp.choice("d_seq_length", (101, 51, 21))
    },
    "model": {
        "filters": hp.choice("m_filters", (1, 16)),
        "kernel_size": 15,
        "motif_init": hp.choice("m_motif_init", (
                None,
                {"stddev": hp.uniform("m_stddev", 0, 0.2)}
            )),
        "lr": hp.loguniform("m_lr", np.log(1e-4), np.log(1e-2)) # 0.0001 - 0.01
        },
    "fit": {
        "epochs": 1,
        "patience": 5,
        "batch_size": 128,
    }
}

We are saying that

  • seq_length in the data function should be 21, 51 or 101
  • we should use 1 or 16 filters
  • the learning rate should be from the range [1e-4, 1e-2], sampled on a log-scale
  • ...

Hyperopt also allows parameter nesting. This is demonstrated with the ["model"]["motif_init"] parameter: we want to try the model both with and without motif initialization, and if we use motif initialization, we want to explore different values of its "stddev" parameter.

Read the hyperopt documentation about defining the search space for more information.
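
To get a feeling for the search space (and to check the nesting), you can draw a random configuration from the pyll graph with hyperopt's sampler; the values shown in the comment below are just an example and will differ between runs:

from hyperopt.pyll.stochastic import sample

# draw one random hyper-parameter configuration from the search space
sample(hyper_params)
# e.g. {'data': {'seq_length': 51},
#       'fit': {'batch_size': 128, 'epochs': 1, 'patience': 5},
#       'model': {'filters': 16, 'kernel_size': 15,
#                 'lr': 0.00073, 'motif_init': None}}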

Note that in addition to the data() and model() function arguments, we can also specify the hyper-parameters related to the training procedure (say batch_size or early-stopping patience).
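
The "fit" entries are not arguments of data() or model(); judging from the trial output shown further below ('early_stop_monitor': 'val_loss', 'patience': 5, ...), they control the Keras training call roughly as sketched here, using the model m and the data loaded above (this is only an illustration of their meaning, not Concise's actual source code):

from keras.callbacks import EarlyStopping

# rough equivalent of what the "fit" hyper-parameters control
m.fit(x=train[0], y=train[1],
      validation_data=valid,
      batch_size=128,  # fit.batch_size
      epochs=1,        # fit.epochs
      callbacks=[EarlyStopping(monitor="val_loss",
                               patience=5)])  # fit.patience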

3. Run the optimization


In [13]:
from concise.hyopt import CompileFN, CMongoTrials, test_fn

To run hyperopt, you need to define an objective function to minimize and the space of parameters you want to explore. Here is an example from the hyperopt tutorial:


In [16]:
import logging
import pickle
import time
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

logger = logging.getLogger()
logger.setLevel(logging.WARN)

def objective(x):
    return {'loss': x ** 2, 'status': STATUS_OK}

best = fmin(objective,
            space=hp.uniform('x', -10, 10),
            algo=tpe.suggest,
            max_evals=100)

print(best)


{'x': -0.0037076787079246554}

For supervised learning, the objective function is the evaluation metric on the validation dataset. Concise provides a class to compile the objective function from the data() and model() functions:


In [32]:
import hyopt_example # module with data.py and model.py
# We are using the data() and model() functions defined in a file
# The reason is that the serialization doesn't work with functions
# defined in the notebook
objective = CompileFN(db_name="mydb", exp_name="motif_initialization",  # experiment name
                      data_fn=hyopt_example.data.data,
                      model_fn=hyopt_example.model.model, 
                      add_eval_metrics=["auprc", "auc"], # metrics from concise.eval_metrics, you can also use your own
                      optim_metric="auprc", # which metric to optimize for
                      optim_metric_mode="max", # maximum should be searched for
                      valid_split=None, # use valid from the data function
                      save_model='best', # checkpoint the best model
                      save_results=True, # save the results as .json (in addition to mongoDB)
                      save_dir="./saved_models/")  # place to store the models

Now, we can use hyperopt to search the hyper-parameters:

Test if everything works

Before starting many workers, it's good to try running the objective function once; it's much easier to debug it that way.


In [31]:
# Test if the objective function works on a subset of data for one random hyper-parameter set
test_fn(objective, hyper_params)


['loss', 'acc']
Train on 1000 samples, validate on 4881 samples
Epoch 1/1
0s - loss: 3.3390 - acc: 0.7110 - val_loss: 3.0101 - val_acc: 0.7630
Returned value:
{'eval': {'acc': 0.76295840740203857,
          'auc': 0.52598029221882436,
          'auprc': 0.26792649640861294,
          'loss': 3.0101253986358643},
 'history': {'loss': {'acc': [0.71100000238418581],
                      'epoch': [0],
                      'loss': [3.3390192050933836],
                      'val_acc': [0.76295841021069832],
                      'val_loss': [3.0101239079458013]},
             'params': {'batch_size': 128,
                        'do_validation': True,
                        'epochs': 1,
                        'metrics': ['loss', 'acc', 'val_loss', 'val_acc'],
                        'samples': 1000,
                        'verbose': 2}},
 'loss': -0.26792649640861294,
 'name': {'data': 'data',
          'optim_metric': 'auprc',
          'optim_metric_mode': 'auprc',
          'model': 'model'},
 'param': {'data': {'seq_length': 21},
           'fit': {'batch_size': 128,
                   'early_stop_monitor': 'val_loss',
                   'epochs': 1,
                   'patience': 5},
           'model': {'filters': 1,
                     'kernel_size': 15,
                     'lr': 0.003247782705479872,
                     'motif_init': {'stddev': 0.14415181695216578}}},
 'path': {'model': '/tmp/concise_hyopt_test//mydb/motif_initialization//train_models/87d14216-6556-434b-8ccd-fe220468e28b.h5',
          'results': '/tmp/concise_hyopt_test//mydb/motif_initialization//train_models/87d14216-6556-434b-8ccd-fe220468e28b.json'},
 'status': 'ok',
 'time': {'duration': {'dataload': 10.259849,
                       'total': 11.342775,
                       'training': 1.082926},
          'end': '2017-08-16 15:00:05.971063',
          'start': '2017-08-16 14:59:54.628288'}}

Run locally and sequentially

Trials() from hyperopt runs the jobs sequentially on a single machine.


In [ ]:
cd hyopt_example

In [28]:
trials = Trials("hyopt_example")
best = fmin(objective,
            space=hyper_params,
            algo=tpe.suggest,
            trials=trials,
            max_evals=3)


['loss', 'acc']
Train on 17713 samples, validate on 4881 samples
Epoch 1/1
0s - loss: 0.6748 - acc: 0.7959 - val_loss: 0.5141 - val_acc: 0.7990
['loss', 'acc']
Train on 17713 samples, validate on 4881 samples
Epoch 1/1
0s - loss: 0.5069 - acc: 0.7957 - val_loss: 0.4666 - val_acc: 0.8007
['loss', 'acc']
Train on 17713 samples, validate on 4881 samples
Epoch 1/1
1s - loss: 0.5480 - acc: 0.7957 - val_loss: 0.4886 - val_acc: 0.7990

Running the optimization in parallel on multiple workers

MongoTrials will defer the work to multiple workers running in parallel. All the results are stored in the MongoDB database.


In [30]:
# just replaced Trials with CMongoTrials
trials = CMongoTrials(db_name="hyopt_example", exp_name="motif_initialization")
best = fmin(objective,
            space=hyper_params,
            algo=tpe.suggest,
            trials=trials,
            max_evals=300)


2017-08-16 14:59:26,520 [WARNING] over-writing old domain trials attachment

This script waits for workers to complete the work. Start the workers by running the following bash script many times in parallel:

#!/bin/bash
#
# Slurm parameters
#SBATCH --mem=16G
#SBATCH -c 4
# ------------------

DB=hyopt_example
# path to the repository containing the script
export PYTHONPATH="."

hyperopt-mongo-worker \
    --mongo=ouga03:1234/$DB \
    --poll-interval=1 \
    --reserve-timeout=3600 \
    --workdir="."

Running the workers on SLURM

Run:

$ for i in {1..20}; do sbatch start_worker.bash; done

Sidenote: Code organization

I typically organize the code in the following files:

  • data.py
    • contains data() function
  • model.py
    • contains model() function
  • train.py
    • Defines the hyper_parameter ranges
    • Runs hyperopt.fmin()

See the nbs/hyopt_example folder.
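
For completeness, here is a minimal sketch of what train.py could look like for this example (a hedged outline; it reuses the search space and CompileFN settings from above, and the MongoDB connection details have to be adapted to your setup):

# train.py -- minimal sketch tying data.py and model.py together
import numpy as np
from hyperopt import fmin, tpe, hp
from concise.hyopt import CompileFN, CMongoTrials
import data    # data.py with the data() function
import model   # model.py with the model() function

hyper_params = {
    "data": {"seq_length": hp.choice("d_seq_length", (101, 51, 21))},
    "model": {"filters": hp.choice("m_filters", (1, 16)),
              "kernel_size": 15,
              "motif_init": hp.choice("m_motif_init",
                                      (None, {"stddev": hp.uniform("m_stddev", 0, 0.2)})),
              "lr": hp.loguniform("m_lr", np.log(1e-4), np.log(1e-2))},
    "fit": {"epochs": 50, "patience": 5, "batch_size": 128},
}

objective = CompileFN(db_name="hyopt_example", exp_name="motif_initialization",
                      data_fn=data.data, model_fn=model.model,
                      add_eval_metrics=["auprc", "auc"],
                      optim_metric="auprc", optim_metric_mode="max",
                      save_dir="./saved_models/")

if __name__ == "__main__":
    trials = CMongoTrials(db_name="hyopt_example", exp_name="motif_initialization")
    best = fmin(objective, space=hyper_params, algo=tpe.suggest,
                trials=trials, max_evals=300)
    print(best)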

4. Inspect the results

Hyperopt keeps track of all the trials (parameters + resulting objective loss) in a MongoDB database. The object hyperopt.MongoTrials provides a handle to the database. Concise extends the class hyperopt.MongoTrials with some additional methods for easier hyper-parameter inspection.

Since the results are stored in a database, you only need to know the database and the experiment name in order to access the database from any folder:


In [173]:
trials = CMongoTrials(db_name="hyopt_example", exp_name="motif_initialization", ip="ouga03")


2017-08-16 11:54:49,098 [INFO] PROTOCOL mongo
2017-08-16 11:54:49,098 [INFO] USERNAME None
2017-08-16 11:54:49,099 [INFO] HOSTNAME ouga03
2017-08-16 11:54:49,100 [INFO] PORT 1234
2017-08-16 11:54:49,101 [INFO] PATH /hyopt_example/jobs
2017-08-16 11:54:49,101 [INFO] DB hyopt_example
2017-08-16 11:54:49,102 [INFO] COLLECTION jobs
2017-08-16 11:54:49,103 [INFO] PASS None

In [174]:
# Number of completed trials
len(trials)


Out[174]:
300

Tidy table of hyper-parameters


In [175]:
df = trials.as_df().sort_values("eval.auprc", ascending=False)
df.head()


Out[175]:
tid loss status eval.acc eval.auc eval.auprc eval.loss name.data name.optim_metric name.optim_metric_mode name.model param.data.seq_length param.fit.batch_size param.fit.early_stop_monitor param.fit.epochs param.fit.patience param.model.filters param.model.kernel_size param.model.lr param.model.motif_init param.model.motif_init.stddev path.model path.results time.duration.dataload time.duration.total time.duration.training time.end time.start n_epoch
257 277 -0.7221 ok 0.8701 0.8693 0.7221 0.3543 data auprc auprc model 101 128 val_loss 50 5 16 15 0.0054 NaN 0.0252 ./saved_models/hyopt_... ./saved_models/hyopt_... 23.1172 52.1519 29.0347 2017-08-16 11:52:06.8... 2017-08-16 11:51:14.6... 14
137 157 -0.7218 ok 0.8689 0.8725 0.7218 0.3417 data auprc auprc model 101 128 val_loss 50 5 16 15 0.0050 NaN 0.0102 ./saved_models/hyopt_... ./saved_models/hyopt_... 28.3887 93.0859 64.6972 2017-08-16 11:46:30.3... 2017-08-16 11:44:57.2... 34
275 295 -0.7191 ok 0.8697 0.8708 0.7191 0.3504 data auprc auprc model 101 128 val_loss 50 5 16 15 0.0055 NaN 0.0253 ./saved_models/hyopt_... ./saved_models/hyopt_... 28.5676 66.6112 38.0437 2017-08-16 11:53:20.5... 2017-08-16 11:52:13.9... 21
294 314 -0.7188 ok 0.8728 0.8660 0.7188 0.3597 data auprc auprc model 101 128 val_loss 50 5 16 15 0.0090 NaN 0.0363 ./saved_models/hyopt_... ./saved_models/hyopt_... 28.3512 50.3183 21.9670 2017-08-16 11:54:08.6... 2017-08-16 11:53:18.3... 10
270 290 -0.7148 ok 0.8687 0.8686 0.7148 0.3518 data auprc auprc model 101 128 val_loss 50 5 16 15 0.0067 NaN 0.0463 ./saved_models/hyopt_... ./saved_models/hyopt_... 23.6120 58.8660 35.2540 2017-08-16 11:53:04.3... 2017-08-16 11:52:05.4... 17

We can see that the best trial achieved auPRC=0.72 on the validation set.

Was the number of evaluated trials enough?


In [176]:
best_auprc = np.maximum.accumulate(df.sort_values("tid")["eval.auprc"])  # running best validation auPRC

In [178]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, figsize=(7, 7))
fig.tight_layout()
ax[0].plot(df.sort_values("tid")["eval.auprc"], linestyle="", marker="o")
ax[0].set(ylabel="Validation auPRC")
ax[1].plot(best_auprc)
ax[1].set(xlabel="Iteration", ylabel="Best validation auPRC");


We can see that over time, hyperopt finds more and more good hyper-parameter configurations.

Load best model, evaluate it on the test set


In [181]:
cd hyopt_example


/data/nasif12/home_if12/avsec/projects-work/concise/nbs/hyopt_example

In [186]:
best_tid = trials.best_trial_tid()
params = trials.get_param(best_tid).to_dict()
model = trials.load_model(best_tid)

In [188]:
# Best hyper-parameters were:
params


Out[188]:
{'data': {'seq_length': 101},
 'fit': {'batch_size': 128,
  'early_stop_monitor': 'val_loss',
  'epochs': 50,
  'patience': 5},
 'model': {'filters': 16,
  'kernel_size': 15,
  'lr': 0.005370016473899783,
  'motif_init': {'stddev': 0.02517910263989817}}}

In [189]:
import data

In [191]:
# Load the train, valid, test
train, valid, test = data.data(**params["data"])

In [192]:
y_pred_test = model.predict(test[0])

In [193]:
import concise.eval_metrics as cem

In [200]:
# Performance on the test-set:
cem.auprc(test[1], y_pred_test)


Out[200]:
0.73280952522362308

Training curve of the best model


In [250]:
train_hist = trials.train_history(best_tid)

In [251]:
train_hist.head()


Out[251]:
tid epoch acc loss val_acc val_loss
0 277 0 0.8209 0.4828 0.8408 0.3954
1 277 1 0.8466 0.4031 0.8506 0.3867
2 277 2 0.8574 0.3788 0.8556 0.3778
3 277 3 0.8647 0.3633 0.8621 0.3552
4 277 4 0.8682 0.3521 0.8664 0.3770

In [261]:
plt.plot(train_hist.epoch + 1, train_hist.loss, label="Training")
plt.plot(train_hist.epoch + 1, train_hist.val_loss, label="Validation")
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.legend();


We can see that the optimal number of epochs is 10.

Which hyper-parameters are better than others?

To draw conclusions and guide future design, it's useful to inspect which hyper-parameters lead to good results. Typically, this means plotting the evaluation loss or metric against the hyper-parameter we are interested in. Since we already have a tidy table with the hyper-parameter experiment results, this is pretty straightforward.

Learning rate


In [218]:
plt.scatter(df["param.model.lr"], df["eval.auprc"])
plt.xlim([1e-4, 1e-2])
plt.xscale("log")
plt.xlabel("Learning rate")
plt.ylabel("Validation auPR");


Motif initialization


In [219]:
import seaborn as sns

In [267]:
df["motif_init_used"] = ~df["param.model.motif_init.stddev"].isnull()
sns.swarmplot("motif_init_used", "eval.auprc", data=df)
plt.ylabel("validation auPR");


Motif initialization helped achieve better performance. We can also see that the hyper-parameter optimization algorithm drew more samples with motif initialization enabled, since it figured out that this setting yields better performance.

Additional variance on the motif initialization

In [245]:
plt.scatter(df["param.model.motif_init.stddev"], df["eval.auprc"])
plt.ylabel("validation auPR");
plt.xlabel("Added random noise to the motifs (stddev)");


Number of filters


In [247]:
sns.swarmplot("param.model.filters", "eval.auprc", data=df)
plt.ylabel("validation auPR")
plt.xlabel("Number of filters");


Sequence length


In [249]:
sns.swarmplot("param.data.seq_length", "eval.auprc", data=df)
plt.ylabel("validation auPR")
plt.xlabel("Input sequence length");