Experiment Manager: Using Job Queues


In [1]:
pip_arg_xp_man = '-e git+https://github.com/wschuell/experiment_manager.git@origin/master#egg=experiment_manager'
#ssh: pip_arg_xp_man = '-e git+ssh://git@github.com/wschuell/experiment_manager.git@master#egg=experiment_manager'

In [2]:
try:
    import experiment_manager as xp_man
except ImportError:
    print('experiment_manager is not installed; you can install it with:\n pip install ' + pip_arg_xp_man)

Job Queues

Job queues are one of the key classes of the library. You place jobs in them; the queue runs them and retrieves the data. You do not have to worry about where exactly things run or how results are retrieved: everything is abstracted away and already adapted to the specific clusters that we are using. Changing clusters, or executing locally instead, is a one-line change. Adapting the library to a new cluster should take very little time (~10 lines of code).

Defining several job queue configs: local, multiprocess local, and several clusters.

NB: SSH usage: To use plafrim, you must have a working entry 'plafrim-ext' in your .ssh/config. For the other clusters (avakas or anyone), if you don't have a corresponding entry, you should provide your username. You will then be asked for your password, and whether you want to create a key and export it to the cluster to automate future connections.
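For reference, a minimal entry in ~/.ssh/config could look like the sketch below. The HostName and User values are placeholders; replace them with the actual cluster address and your account name.

```
Host plafrim-ext
    HostName plafrim-ext.example.org
    User your_username
```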


In [3]:
jq_cfg_local = {'jq_type':'local'}

virtualenv = 'test_py3' # None (the default) uses the root python. Ex: virtualenv = 'test_xp_man' for a venv in ~/virtualenvs/test_xp_man

jq_cfg_plafrim = {'jq_type':'plafrim',
    'modules':['slurm','language/python/3.5.2'],
    'virtual_env': virtualenv,
    'requirements': [pip_arg_xp_man],
    #'username':'schuelle',
                 }

jq_cfg_avakas = {'jq_type':'avakas',
    'modules':['torque','maui','python3/3.6.0'],
    'without_epilogue':True,
    #'username':'wschueller',
    'virtual_env':virtualenv,
    #'requirements': [pip_arg_xp_man], # IMPORTANT: installing from GitHub over https is broken on avakas (its git version is too old). Install manually, over SSH.
                }

jq_cfg_anyone = {'jq_type':'anyone',
    'modules':[],
    'virtual_env':'test_279',
    #'requirements': [pip_arg_xp_man],
    "hostname":"cluster_roma"
                }

jq_cfg_docker = {'jq_type':'slurm',
    'modules':[],
    #'virtual_env':virtualenv,
    #'requirements': [pip_arg_xp_man],
    'ssh_cfg':{'username':'root',
               'hostname':'172.19.0.2',
               'password':'dockerslurm',},
                }

jq_cfg_local_multiprocess =  {'jq_type':'local_multiprocess',
                              #'nb_process':4, #default value: number of CPUs on the local machine
                             }
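As a rough sketch of what a local multiprocess queue does, here is a version built on the standard library. This is an illustration only, not the library's implementation; run_job is a hypothetical stand-in for a job's workload, and threads stand in for worker processes to keep the sketch self-contained.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_job(job_id):
    """Stand-in for one job's workload; the real queue runs Job objects."""
    return job_id * job_id

# Mirrors the queue's default worker count: the number of CPUs on the machine.
nb_process = os.cpu_count()

# Dispatch jobs to a pool of workers and collect results in order.
with ThreadPoolExecutor(max_workers=nb_process) as pool:
    results = list(pool.map(run_job, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```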

The requirements entry tells the job queue to install a version of the library on the cluster if it is not already there. You can add other libraries, or add requirements for specific jobs. By default, virtual_env is set to None, meaning that everything runs, and requirements are installed, in the root python interpreter. If you provide a <name> as the value of the virtual_env attribute, the queue will look for a virtualenv in ~/virtualenvs/<name>.

Here, pip_arg_xp_man is the pip argument needed to install the library on the clusters, which is required to run the jobs. The same syntax lets you automatically update your own software to a given commit or branch of your own git repository.
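For example, the same editable-git pip syntax can pin a repository to a branch, a tag, or a specific commit. The repository URL below is the one from the top of this notebook; the tag and commit hash are placeholders.

```python
# Pin to a branch (as at the top of this notebook):
req_branch = '-e git+https://github.com/wschuell/experiment_manager.git@origin/master#egg=experiment_manager'
# Pin to a (hypothetical) tag or commit hash instead:
req_tag = '-e git+https://github.com/wschuell/experiment_manager.git@v0.1#egg=experiment_manager'
req_commit = '-e git+https://github.com/wschuell/experiment_manager.git@0123abc#egg=experiment_manager'

for req in (req_branch, req_tag, req_commit):
    assert req.startswith('-e git+') and '#egg=' in req
```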

You can choose below which job queue configuration to use. The job queue object is initialized under the variable name jq.


In [4]:
jq_cfg = jq_cfg_local_multiprocess
jq = xp_man.job_queue.get_jobqueue(**jq_cfg)

In [5]:
print(jq.get_status_string())


[2018 08 21 16:10:31]: Queue updated
    total: 0
    

    execution time: 0 s
    jobs done: 0
    jobs restarted: 0
    jobs extended: 0

    completion level of running jobs: 0.0%
    minimum completion level: 0.0%

Jobs

Jobs are the objects that need to be executed. Here we use a simple type of job, ExampleJob. It goes through a loop of 24 steps, prints the value of its counter variable, waits a random time between 1 and 2 seconds between steps, and at the end saves the value in a file <job.descr>data.dat

Other types of jobs, and how to define your own job classes as subclasses of the root Job class, will be explained in another notebook. In the meantime, a documented template is provided with the library at experiment_manager/job/template_job.py
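The behavior described above can be sketched in plain Python. This is an illustration only, not the library's code; the steps and wait bounds are made parameters so the sketch runs quickly.

```python
import os
import random
import time

def run_example_job(descr='', steps=24, min_wait=1.0, max_wait=2.0, out_dir='.'):
    """Loop over `steps`, printing a counter and waiting a random time
    between min_wait and max_wait seconds per step, then save the final
    value to <descr>data.dat in out_dir."""
    counter = 0
    for _ in range(steps):
        counter += 1
        print(counter)
        time.sleep(random.uniform(min_wait, max_wait))
    path = os.path.join(out_dir, descr + 'data.dat')
    with open(path, 'w') as f:
        f.write(str(counter))
    return path

# Quick run with no waiting, just to show the output file:
path = run_example_job(descr='demo_', steps=3, min_wait=0, max_wait=0)
print(open(path).read())  # 3
```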

We define job configurations, create the job objects, and add them to the job queue jq.


In [6]:
job_cfg = {
    'estimated_time':120,#in seconds
    #'virtual_env':'test',
    #'requirements':[],
    #...,
    
}

In [7]:
job = xp_man.job.ExampleJob(**job_cfg)

In [8]:
jq.add_job(job) # of course, you can add as many jobs as you want, like in next cell
print(jq.get_status_string())


[2018 08 21 16:10:34]: Queue updated
    total: 1
    pending: 1

    execution time: 0 s
    jobs done: 0
    jobs restarted: 0
    jobs extended: 0

    completion level of running jobs: 0.0%
    minimum completion level: 0.0%


In [9]:
for i in range(20):
    job_cfg_2 = {'descr': str(i),  # a description for the example job
                 'estimated_time': 120,
                }
    job = xp_man.job.ExampleJob(**job_cfg_2)
    jq.add_job(job)
print(jq.get_status_string())


[2018 08 21 16:10:35]: Queue updated
    total: 21
    pending: 21

    execution time: 0 s
    jobs done: 0
    jobs restarted: 0
    jobs extended: 0

    completion level of running jobs: 0.0%
    minimum completion level: 0.0%

The last step is to update the queue. One update checks the current status of each job attached to jq and processes its next step: sending it to the cluster, retrieving it, unpacking it, etc.


In [10]:
#jq.ssh_session.reconnect()

In [11]:
jq.update_queue()


[2018 08 21 16:10:39]: Queue updated
    total: 21
    running: 4
    pending: 17

    execution time: 0 s
    jobs done: 0
    jobs restarted: 0
    jobs extended: 0

    completion level of running jobs: 0.0%
    minimum completion level: 0.0%

You can tell jq to keep updating automatically until all jobs are done or in error status:


In [12]:
jq.auto_finish_queue()


[2018 08 21 16:14:44]: Queue updated
    total: 0
    

    execution time: 12 min 42 s
    jobs done: 21
    jobs restarted: 0
    jobs extended: 0

    completion level of running jobs: 0.0%
    minimum completion level: 0.0%


In [ ]: