Purpose: Run the simulation with varying parameters and characterize the effects of those parameters
Parameter exploration can either be done as a grid search, where a multidimensional "regular" grid of parameter values is explored, or as a random search, where random parameter values are picked.
In [1]:
from __future__ import print_function
from pprint import pprint

import numpy as np
This tutorial will introduce you, step by step, to the "psyrun" tool for parameter space exploration and serial farming. It also integrates well with "ctn_benchmarks". We will use the CommunicationChannel benchmark as an example.
In [2]:
from ctn_benchmark.nengo.communication import CommunicationChannel
In [3]:
rmses = []
for D in np.arange(2, 5):
    for N in [10, 50, 100]:
        rmses.append(CommunicationChannel().run(D=D, N=N)['rmse'])
In [4]:
pprint(rmses)
However, there are a few annoyances with this approach: we have to add another for loop for each dimension of the parameter space, and we have to handle storing the results ourselves. In particular, the association of results with input parameters is only implicit.
With psyrun, we can define a parameter space in a more natural way:
In [5]:
from psyrun import Param
pspace = Param(D=np.arange(2, 5)) * Param(N=[10, 50, 100])
print(pspace)
We use the Param class to define sets of parameters via keyword arguments to the constructor. It is possible to give multiple arguments to the constructor as long as all sequences have the same length:
In [6]:
print(Param(a=[1, 2], b=[3, 4]))
If at least one argument is a sequence, non-sequence arguments will generate lists of the required length:
In [7]:
print(Param(a=[1, 2], b=3))
Such parameter definitions support basic operations: the Cartesian product (as shown above), concatenation with +, and two subtraction-like operations with - and the psyrun.pspace.missing function.
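As a quick sketch of the syntax (the values below are made up, and the exact semantics of the two subtraction-like operations are best looked up in the psyrun documentation):
In [ ]:
from psyrun.pspace import missing

# A small parameter set that overlaps with pspace from above (values made up).
done = Param(D=[2, 2], N=[10, 50])

print(pspace + done)          # concatenation of both sets of assignments
print(pspace - done)          # one of the two subtraction-like operations
print(missing(pspace, done))  # the other one; gives the assignments of pspace not present in done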
Once we have defined a parameter space we want to explore, it is easy to apply a function (i.e., run our model) to each set of parameters.
In [8]:
import psyrun
result = psyrun.map_pspace(CommunicationChannel().run, pspace)
In [9]:
pprint(result)
The map_pspace function conveniently returns a dictionary with all the input parameters and output values. This makes it easy to convert the result to other data structures, like a Pandas data frame, for further analysis:
In [10]:
import pandas as pd
pd.DataFrame(result)
Out[10]:
It would be nice to parallelize all these simulations to make the best use of multicore processors (though your BLAS library might already be parallelized). This is easy with psyrun, as we just have to use a different map function. The only problem is that the function we want to apply needs to be pickleable, which requires us to put it into an importable Python module. (It is also possible to switch the parallelization backend from multiprocessing to threading, but that can introduce other issues.)
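For instance, if your psyrun version exposes a backend argument on map_pspace_parallel (an assumption on my part; check the documentation for the exact signature), switching to threads could look like this sketch, which avoids the pickling requirement:
In [ ]:
# Sketch only: assumes map_pspace_parallel passes a ``backend`` argument on to
# its parallelization machinery. With threads, the mapped function does not
# need to be pickleable.
result_threaded = psyrun.map_pspace_parallel(
    CommunicationChannel().run, pspace, backend='threading')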
In [11]:
%%sh
cat << EOF > model.py
from ctn_benchmark.nengo.communication import CommunicationChannel
def evaluate(**kwargs):
    return CommunicationChannel().run(**kwargs)
EOF
In [12]:
from model import evaluate
result2 = psyrun.map_pspace_parallel(evaluate, pspace)
In [13]:
pprint(result2)
After doing some initial experiments, you might want to run a larger number of simulations. Because all the simulations are independent from one another, they could run one after another on your computer, but that would take a very long time. It would be faster to distribute the individual simulations to a high performance cluster (like Sharcnet) with many CPU cores that can run the simulations in parallel. This is called serial farming.
To do serial farming, first make sure that your model is part of a Python module that can be installed, is installed, and can be imported. (It is possible to change sys.path if you don't want to install it as a Python module, but this approach is more complicated and error-prone.) Then you need to define a "task" for what you intend to run. To define a task, create a file task_<name>.py in a directory called psy-tasks. Here is an example task for our communication channel:
In [14]:
# task_cc1.py
from ctn_benchmark.nengo.communication import CommunicationChannel
import numpy as np
import psyrun
from psyrun import Param
pspace = Param(D=np.arange(2, 5)) * Param(N=[10, 50, 100])
min_items = 1
max_jobs = 6
def execute(**kwargs):
    return CommunicationChannel().run(**kwargs)
There are a number of variables with a special meaning in this kind of task file. pspace defines the parameter space to explore, which should be familiar by now. execute is the function that gets called for each parameter assignment and returns a dictionary with the results. max_jobs says that a maximum of 6 processing jobs should be created (a few additional jobs that do not do the core processing might be created as well).
When running this task (we will get to that in a moment), the parameter space will be split into parts of the same size and assigned to the processing jobs. At the end, the results from all processing jobs will be merged into a single file. min_items defines how many parameter assignments each job should process at least.
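To make this concrete, here is one way to satisfy these constraints for the parameter space above (this only illustrates what min_items and max_jobs express, not necessarily the exact splitting scheme psyrun uses internally):
In [ ]:
import math

n_assignments = 3 * 3  # the D x N grid above has 9 assignments
min_items = 1
max_jobs = 6

# At most max_jobs parts, each with at least min_items assignments,
# all of roughly the same size.
n_jobs = min(max_jobs, int(math.ceil(n_assignments / float(min_items))))
items_per_job = int(math.ceil(n_assignments / float(n_jobs)))
print(n_jobs, items_per_job)  # 6 jobs with at most 2 assignments each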
To run tasks defined in this way, we use the psy-doit command. As the name suggests, this tool is based on the excellent doit automation tool. Be sure that your working directory contains the psy-tasks directory when invoking this command.
First we can list the available tasks:
In [20]:
%%sh
psy-doit list
Let us test our task first:
In [16]:
%%sh
psy-doit test cc1
This immediately runs the first (and only the first) parameter assignment as a way to check that nothing crashes. Now let us run the full task.
In [17]:
%%sh
psy-doit cc1
This produces standard doit output with dots (.) indicating that a task was executed. As you can see, the first task splits the parameter space, then several processing jobs are started, and in the end the results are merged.
All intermediary files will be written to the psy-work/<task name> directory by default. This is also where we find the result file. psyrun provides some helper functions to load this file:
In [18]:
data = psyrun.NpzStore().load('psy-work/cc1/result.npz')
pprint(data)
By default, NumPy .npz files are used to store this data. However, this format has a few disadvantages. For example, to append to such a file (which happens in the merge step), the whole file has to be loaded into memory. Thus, it is possible to switch to HDF5, which is better in this regard. Just add the line store = psyrun.H5Store() to your task file.
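For example, the task file from above could be turned into an HDF5-backed variant like this (the file name task_cc1_h5.py, and thus the task name cc1_h5, are made up for illustration):
In [ ]:
# task_cc1_h5.py -- same task as cc1, but storing results in HDF5
from ctn_benchmark.nengo.communication import CommunicationChannel
import numpy as np
import psyrun
from psyrun import Param

pspace = Param(D=np.arange(2, 5)) * Param(N=[10, 50, 100])
min_items = 1
max_jobs = 6
store = psyrun.H5Store()  # requires pytables; results are written in HDF5 format

def execute(**kwargs):
    return CommunicationChannel().run(**kwargs)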
Note that this requires pytables to be installed, and that the pytables installation on Sharcnet seems to be broken at the moment.
Speaking of Sharcnet: currently, psy-doit runs the jobs sequentially and immediately. To actually get serial farming, we have to tell psyrun to invoke the right job scheduler. For Sharcnet this would be sqsub. Here is the updated task file:
In [19]:
# task_cc2.py
import platform
from ctn_benchmark.nengo.communication import CommunicationChannel
import numpy as np
from psyrun import Param, Sqsub
pspace = Param(D=np.arange(2, 5)) * Param(N=[10, 50, 100])
min_items = 1
max_jobs = 6
sharcnet_nodes = ['narwhal', 'bul', 'kraken', 'saw']
if any(platform.node().startswith(x) for x in sharcnet_nodes):
    workdir = '/work/jgosmann/tcm'
    scheduler = Sqsub(workdir)
    scheduler_args = {
        'timelimit': '10m',
        'memory': '512M'
    }

def execute(**kwargs):
    return CommunicationChannel().run(**kwargs)
First of all, I check whether I am actually on a Sharcnet node before applying the Sharcnet-specific settings. This is not required, but it allows me to run the same file on my local computer. (Note that the list of Sharcnet nodes contains only the ones I typically use; you might want to use a different list.)
workdir is a special variable to set the location for intermediary and result files. I'm changing the location because on Sharcnet /work offers more storage space and better performance. The essential line for switching to the sqsub scheduler is scheduler = Sqsub(workdir); finally, we add some scheduler-specific arguments. For sqsub you have to specify a runtime limit and a memory usage limit.
Now you can run psy-doit cc2 on Sharcnet and it will submit a number of jobs and return. Use sqjobs to check the status of your jobs and see when they are finished.
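On a Sharcnet login node, the workflow therefore boils down to these two shell commands (to be run in the directory containing psy-tasks, not as notebook cells):
# submit the processing jobs for task cc2 to the scheduler
psy-doit cc2
# check the status of the submitted jobs; repeat until they are finished
sqjobs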
Sometimes jobs crash. If that is the case, run psy-doit cc2 again and it should figure out by itself which jobs have to be resubmitted.
Sometimes the parameters can significantly influence the runtime of your models. That can be problematic on Sharcnet because if you estimate the time limit too low your jobs will get killed before completion, but a high time limit will push your job back in the queue. In those cases it might be good to do some simple load balancing.
If you insert backend = psyrun.LoadBalancingBackend, psyrun will submit max_jobs jobs and each job will process as many parameter assignments as it can, always grabbing the next one in the queue, until it gets killed. Unfortunately, the parameter assignment of a killed job will not be re-entered into the queue. Thus, after all jobs have completed, you have to check the result file for completeness. If results are missing, re-execute psy-doit <task name>; this will only process the parameter assignments with missing results.
It is best to use the load balancing backend with the H5Store, because it appends to files much more often, which is inefficient with the NpzStore. Also note that the load balancing backend is newer and therefore less well tested.
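Putting the last two recommendations together, a load-balanced variant of the task file might look like the following sketch (the file name task_cc3.py is made up; the backend and store lines are the ones described above):
In [ ]:
# task_cc3.py -- hypothetical load-balanced variant of task_cc2.py
import platform

from ctn_benchmark.nengo.communication import CommunicationChannel
import numpy as np
import psyrun
from psyrun import Param, Sqsub

pspace = Param(D=np.arange(2, 5)) * Param(N=[10, 50, 100])
min_items = 1
max_jobs = 6
backend = psyrun.LoadBalancingBackend  # jobs grab the next available parameter assignment
store = psyrun.H5Store()               # frequent appends are much cheaper with HDF5

sharcnet_nodes = ['narwhal', 'bul', 'kraken', 'saw']
if any(platform.node().startswith(x) for x in sharcnet_nodes):
    workdir = '/work/jgosmann/tcm'
    scheduler = Sqsub(workdir)
    scheduler_args = {
        'timelimit': '10m',
        'memory': '512M'
    }

def execute(**kwargs):
    return CommunicationChannel().run(**kwargs)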