Managing jobs

In this section, we'll be taking a closer look at submitting jobs and how you can manage them once they've been submitted.


Writing job outputs to file

In the previous section, we submitted a job with bsub using the default options.


In [ ]:
bsub "sleep 60"

That submission returned a message:

Returning output by mail is not supported on this cluster.
Please use the -o option to write output to disk.

This message is telling us that, whether the job succeeds or fails, we won't know what resources it used or whether there were any errors, because that information isn't stored anywhere. Without a record of a job's output and errors, it is difficult to troubleshoot problems with its execution.

What we can do instead is supply an output file using the -o option and an error file using the -e option.

bsub -o <output_file> -e <error_file> "command"

Let's give this a try. We'll call our output and error files myjob.o and myjob.e.


In [ ]:
bsub -o myjob.o -e myjob.e "sleep 60"

You can check the progress of your job using bjobs.


In [ ]:
bjobs

When the job has finished, print the contents of the output file to the terminal using cat.


In [ ]:
cat myjob.o
------------------------------------------------------------
Sender: LSF System <lsfadmin@pcs5a>
Subject: Job 4018040: <sleep 60> in cluster <pcs5> Done

Job <sleep 60> was submitted from host <pcs5a> by user <userA> in cluster <pcs5>.
Job was executed on host(s) <pcs5a>, queue <normal>, user <userA> cluster <pcs5>.
</nfs/users/nfs_u/userA> was used as the home directory.
</nfs/users/nfs_u/userA> was used as the working directory.
Started at Thu Jan 15 12:26:45 2019
Results reported on Thu Jan 15 12:27:45 2019

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
sleep 60
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   0.82 sec.
    Max Memory :                                 5 MB
    Average Memory :                             5.00 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   44 MB
    Max Processes :                              3
    Max Threads :                                4

The output (if any) is above this job summary.



PS:

Read file <myjob.e> for stderr output of this job.

Now, look at your error file using cat.


In [ ]:
cat myjob.e

Printing the error file returns nothing because the file is empty: our job didn't produce any errors. If it had, they would have been logged in the error file and we could have used them to trace what went wrong.
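
The same two checks can be scripted. The sketch below assumes the output file contains the "Successfully completed." line shown in the job summary above. The helper name check_job_files is hypothetical, not an LSF command. So that it can be tried without a cluster, it writes small stand-in files first.

```shell
# check_job_files is a hypothetical helper, not part of LSF.
# It reports whether the LSF summary marks the job as successful
# and whether anything was written to the error file.
check_job_files() {
    out_file=$1
    err_file=$2
    # -F treats the pattern as a fixed string, -q suppresses output
    if grep -qF "Successfully completed." "$out_file"; then
        echo "job succeeded"
    else
        echo "job did not complete successfully"
    fi
    # -s is true when the file exists and is not empty
    if [ -s "$err_file" ]; then
        echo "errors logged in $err_file"
    else
        echo "no errors logged"
    fi
}

# Stand-in files so the sketch runs without a cluster
tmp_dir=$(mktemp -d)
printf 'Successfully completed.\n' > "$tmp_dir/myjob.o"
: > "$tmp_dir/myjob.e"

check_job_files "$tmp_dir/myjob.o" "$tmp_dir/myjob.e"
rm -r "$tmp_dir"
```

On a real cluster you would skip the stand-in files and call check_job_files myjob.o myjob.e once the job has finished.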

You can also incorporate your JOBID into the filename using a special variable %J.

bsub -o %J.o -e %J.e "sleep 60"    

Let's say that the JOBID returned when you submitted the job was 4018041. Your output and error files would be called 4018041.o and 4018041.e.

You should always try to use different output and error files for each job you're submitting. If you submit two jobs that both write to myjob.o and myjob.e, it becomes hard to tell which job produced which output.
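
One way to keep the files for each job separate is to combine %J with a dedicated log directory. This is a minimal sketch: the directory name logs is an assumption, and the command is printed rather than submitted so that it can be inspected first.

```shell
# Assumption for this sketch: log files are kept in ./logs
log_dir=logs
mkdir -p "$log_dir"

# %J has no special meaning to the shell, so it is passed through
# unchanged and LSF replaces it with the job ID.
submit_cmd="bsub -o $log_dir/%J.o -e $log_dir/%J.e \"sleep 60\""

# Print the command instead of running it
echo "$submit_cmd"
```

Every job submitted with this command gets its own pair of files, named after its JOBID.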


Giving your job a name

You may have noticed that when you submitted your job earlier and checked its progress with bjobs, the JOB_NAME was sleep 60. This is because, if no job name is given at submission, the job name defaults to the command that you submitted.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018040 UserA   PEND  normal     pcs5a       pcs5b       sleep 60   Jan 15 13:26

You can give your job a different name using the -J option with bsub. This can be useful when you're running multiple jobs and want to be able to tell them apart in the queue with bjobs.

Let's try submitting a job called "newjob" which writes outputs and errors to newjob.o and newjob.e.


In [ ]:
bsub -o newjob.o -e newjob.e -J newjob "sleep 60"

Let's look at the progress of our job using bjobs.


In [ ]:
bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018077 UserA   RUN   normal     pcs5a       pcs5b       newjob     Jan 15 13:34

Notice that the JOB_NAME is now newjob.
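
A job name also makes it easier to pick a job out of the bjobs output in a script. The sketch below reads the sample output shown above from a variable instead of calling bjobs, so it runs without a cluster. The helper jobid_for_name is hypothetical, and the column position only holds when EXEC_HOST is filled in and the job name contains no spaces.

```shell
# Sample bjobs output copied from above, used in place of a live cluster
bjobs_output='JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018077 UserA   RUN   normal     pcs5a       pcs5b       newjob     Jan 15 13:34'

# jobid_for_name is a hypothetical helper: it prints the JOBID of every
# row whose 7th column (JOB_NAME) equals the given name.
# NR > 1 skips the header line.
jobid_for_name() {
    awk -v name="$1" 'NR > 1 && $7 == name { print $1 }'
}

printf '%s\n' "$bjobs_output" | jobid_for_name newjob
```

On a cluster, bjobs can also filter by name itself with its -J option, for example bjobs -J newjob.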


Submitting your job to a particular queue

If we don't specify a queue when we submit a job, the job will be submitted to the default queue. To find out which one of the queues is used by default, we can use the command bparams.


In [ ]:
bparams
Default Queues:  normal
Default Host Specification:  BL465c_G8
MBD_SLEEP_TIME used for calculations:  10 seconds
Job Checking Interval:  15 seconds
Job Accepting Interval:  0 seconds

Running bparams will display information about the system parameters, such as the default queue name. In this example, the default queue is the normal queue. To find out which other queues are available, we can use bqueues.


In [ ]:
bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
system          1000 Open:Active       -    -    -    -     0     0     0     0
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
small            31  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -   103    43    60     0
long              3  Open:Active      50    -    -    -   241   224    15     0
basement          1  Open:Active      20   10    -    -   182   164    18     0
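
When choosing a queue, it can help to see how many jobs are already waiting in each one. The sketch below pulls the queue name and the PEND column out of the sample bqueues output above. The helper pending_per_queue is hypothetical and assumes the 11-column layout shown here.

```shell
# Sample bqueues output copied from above, used in place of a live cluster
bqueues_output='QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
system          1000 Open:Active       -    -    -    -     0     0     0     0
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
small            31  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -   103    43    60     0
long              3  Open:Active      50    -    -    -   241   224    15     0
basement          1  Open:Active      20   10    -    -   182   164    18     0'

# pending_per_queue is a hypothetical helper: it prints the queue name
# (column 1) and the number of pending jobs (column 9) for each queue.
pending_per_queue() {
    awk 'NR > 1 { print $1, $9 }'
}

printf '%s\n' "$bqueues_output" | pending_per_queue
```

On a cluster, the same helper could be fed directly with bqueues | pending_per_queue.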

Now, let's say that we want to submit a job into the yesterday queue because it's fairly urgent. To do this, we can use the -q option with bsub followed by the name of the queue which we want to use (e.g. yesterday).

Let's try submitting a job into the yesterday queue called "newjob1" which writes outputs and errors to newjob1.o and newjob1.e.


In [ ]:
bsub -o newjob1.o -e newjob1.e -J newjob1 -q yesterday "sleep 60"

When you check on the progress with bjobs, you will see that the job is in the yesterday queue.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018119 UserA   RUN   yesterday  pcs5a       pcs5c       newjob1    Jan 15 13:51

Job resources

When we want to reserve more resources for our jobs, we can use the -R, -M and -n options with the bsub command. The -M option sets the memory limit, -n sets the number of threads (processors) requested, and -R tells LSF that the job needs to run on a host which matches the requirements that follow it.

Let's try submitting a job which has a limit of 2GB memory and 4 threads.


In [ ]:
bsub -n 4 -R "span[hosts=1] select[mem>2000] rusage[mem=2000]" \
-M 2000 "sleep 60"

Here we can see that the 4 threads are reserved using the -n option. Typically when we ask for multiple threads with -n, we also add span[hosts=1] to the -R option. This indicates that all the processors which are allocated to this job must be on the same host.

We reserve our 2GB of memory using both the -M option and the -R option. Notice that the memory requirement is given in MB (2GB ~ 2000MB). With the -R option, we use a selection string, select[mem>2000], and a usage string, rusage[mem=2000]. The selection string specifies the characteristics that a host must have to match the resource requirement, in this case more than 2GB of memory. The usage string defines the expected resource usage of the job. Without it, no resources are reserved by default.

For more information on resource requirements, please see the resource requirement section of the LSF user manual.
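
Because the same memory value appears three times in the command above, it is easy to change one and forget the others. The sketch below builds the options from a single pair of values. The helper resource_args is hypothetical, and it assumes, as on this cluster, that memory values are given in MB.

```shell
# resource_args is a hypothetical helper: given a thread count and a
# memory amount in MB, it prints the matching bsub options so that the
# -M limit, the select string and the rusage string always agree.
resource_args() {
    cores=$1
    mem_mb=$2
    printf '%s\n' "-n $cores -R \"span[hosts=1] select[mem>$mem_mb] rusage[mem=$mem_mb]\" -M $mem_mb"
}

# Options for the 4-thread, 2GB example above
resource_args 4 2000
```

The printed options can be pasted into a bsub command line in front of the quoted command.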


Job workflow

Once you have submitted your job, there are several different ways in which you can manage it. Below is a diagram which shows the typical job workflows and related commands.

Along the top of the diagram is the simplest job workflow. Here, a job is submitted using bsub and will have the status PEND until it gets dispatched.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06

Once the job is dispatched, it will start running and get the status RUN.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   RUN   normal     pcs5b       pcs5c       job1       Jan 15 14:06

If all goes well and there are no errors (normal completion) then the job finishes and has the status DONE.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   DONE  normal     pcs5b       pcs5c       job1       Jan 15 14:06

If there is a problem with a running job, this will trigger an abnormal exit and the status will become EXIT.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   EXIT  normal     pcs5b       pcs5c       job1       Jan 15 14:06
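
The four statuses in this workflow can be summarised in a small lookup. This is only a sketch of the meanings described above: the helper describe_status is hypothetical, and it does not cover every status that bjobs can report.

```shell
# describe_status is a hypothetical helper that maps the STAT values
# from the simple workflow above to a short description.
describe_status() {
    case $1 in
        PEND) echo "waiting to be dispatched" ;;
        RUN)  echo "dispatched and running" ;;
        DONE) echo "finished normally" ;;
        EXIT) echo "exited abnormally" ;;
        *)    echo "status not covered here" ;;
    esac
}

for status in PEND RUN DONE EXIT; do
    printf '%-5s %s\n' "$status" "$(describe_status "$status")"
done
```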

Cancelling jobs

Now, let's consider some deviations from this workflow. First, how do we cancel a job once it's been submitted?

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06    

We can cancel (kill) job 1000 using the bkill command followed by the JOBID of the job that we want to kill.

bkill 1000

If you have used a valid JOBID, the bkill command should return a message that tells you the job is being terminated.

Job <1000> is being terminated

Your job status will now get updated from RUN or PEND to EXIT.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   EXIT  normal     pcs5b                   job1       Jan 15 14:06 

Suspending and resuming jobs

Let's say you are running a series of commands and you realise there's an error in the input file for one of those commands. When you're running a long job, you probably don't want to have to cancel it and start all over again. If the job hasn't reached that command, you can pause or suspend the job while you fix the input file and resume the job once you're done.

To suspend and resume a job you can use the bstop and bresume commands. First let's look at suspending a pending job (PEND).

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06    

To suspend this job, we use bstop followed by the JOBID.

bstop 1000 

The job status will now become PSUSP as the job was suspended by a user while it was pending.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PSUSP normal     pcs5b                   job1       Jan 15 14:06 

To allow the job to be dispatched again, we can use the command bresume followed by the JOBID.

bresume 1000

The job status will now return to PEND while the job waits to be dispatched.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06 

You can also suspend a running job using bstop. In this case, the status will be updated to USUSP instead of PSUSP.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   USUSP normal     pcs5b                   job1       Jan 15 14:06  

When you use bresume to resume a job that was suspended while it was running, there may be an interim status of SSUSP before the job starts running again (RUN).

For more information on suspending, resuming and cancelling jobs, please see the controlling job execution section in the LSF user guide.
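
The status changes described in this section and the previous one can be written down as a small transition table. The sketch below is a model of the text, not a query against LSF. The helper next_status is hypothetical, and any combination not described above is simply left unchanged.

```shell
# next_status is a hypothetical helper: given a current status and a
# command, it prints the status described in this section.
next_status() {
    status=$1
    action=$2
    case "$status:$action" in
        PEND:bstop)           echo PSUSP ;;  # suspended while pending
        RUN:bstop)            echo USUSP ;;  # suspended while running
        PSUSP:bresume)        echo PEND ;;   # waits to be dispatched again
        USUSP:bresume)        echo RUN ;;    # possibly via an interim SSUSP
        PEND:bkill|RUN:bkill) echo EXIT ;;   # cancelled job
        *)                    echo "$status" ;;  # not covered in this sketch
    esac
}

next_status PEND bstop
next_status PSUSP bresume
next_status RUN bstop
next_status USUSP bresume
```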

Moving a job to a different queue

As an example, let's take a job which has been submitted to the normal queue and is waiting to be dispatched (PEND).

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06

Now, let's say that we've made a mistake and that we know this submitted job will run for longer than the normal queue will allow. What can we do?

Well, you could kill the job with bkill followed by the JOBID of the job you want to cancel. You can then submit the job again specifying a different queue using the bsub option -q and the name of the queue you want to use. This would create a new job in a different queue (e.g. long).

bkill <JOBID>
bsub -q <queue_name> <command>

Alternatively, you can move the pending job to a different queue using bswitch.

bswitch <destination_queue> <JOBID>

So, to move our job (JOBID = 1000) from the normal queue (jobs killed after 12 hours) to the long queue (jobs killed after 48 hours) we would run:

bswitch long 1000

If we look again using bjobs, we can see that the job has been moved into the long queue.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  long       pcs5b                   job1       Jan 15 14:06

For more information on moving jobs between queues, please see the switching queues section of the LSF user guide.
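
The choice of destination queue can be made from the expected run time. The sketch below uses only the two limits quoted in this example (12 hours for normal, 48 hours for long); limits on your own cluster may differ, and bqueues -l shows them in full. The helper queue_for_hours is hypothetical, and the bswitch command is printed rather than run.

```shell
# queue_for_hours is a hypothetical helper that picks a queue from the
# run-time limits quoted in this example: normal (12h) and long (48h).
queue_for_hours() {
    hours=$1
    if [ "$hours" -le 12 ]; then
        echo normal
    elif [ "$hours" -le 48 ]; then
        echo long
    else
        echo "no queue in this example allows $hours hours" >&2
        return 1
    fi
}

# Print the bswitch command for job 1000, expected to run for 30 hours
job_id=1000
target_queue=$(queue_for_hours 30) && echo "bswitch $target_queue $job_id"
```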


What's next?

For an overview of basic job submission, you can go back to job submission. Otherwise, let's take a look at job arrays and dependencies.