In this section, we'll be taking a closer look at submitting jobs and how you can manage them once they've been submitted.
In [ ]:
bsub "sleep 60"
That submission returned a message:
Returning output by mail is not supported on this cluster.
Please use the -o option to write output to disk.
This message is telling us that, whether the job succeeds or fails, we won't know what resources it used or whether there were any errors, because we didn't store that information anywhere. Without a record of the output and errors a job produces, it is difficult to troubleshoot any issues with its execution.
What we can do instead is supply an output file using the -o option and an error file using the -e option.
bsub -o <output_file> -e <error_file> "command"
Let's give this a try. We'll call our output and error files myjob.o and myjob.e.
In [ ]:
bsub -o myjob.o -e myjob.e "sleep 60"
You can check the progress of your job using bjobs.
In [ ]:
bjobs
When the job has finished, print the contents of the output file to the terminal using cat.
In [ ]:
cat myjob.o
------------------------------------------------------------
Sender: LSF System <lsfadmin@pcs5a>
Subject: Job 4018040: <sleep 60> in cluster <pcs5> Done
Job <sleep 60> was submitted from host <pcs5a> by user <userA> in cluster <pcs5>.
Job was executed on host(s) <pcs5a>, queue <normal>, user <userA> cluster <pcs5>.
</nfs/users/nfs_u/userA> was used as the home directory.
</nfs/users/nfs_u/userA> was used as the working directory.
Started at Thu Jan 15 12:26:45 2019
Results reported on Thu Jan 15 12:27:45 2019
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
sleep 60
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 0.82 sec.
Max Memory : 5 MB
Average Memory : 5.00 MB
Total Requested Memory : -
Delta Memory : -
Max Swap : 44 MB
Max Processes : 3
Max Threads : 4
The output (if any) is above this job summary.
PS:
Read file <myjob.e> for stderr output of this job.
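The job report is plain text, so individual values can be pulled out of it with standard tools. The following is a minimal sketch, assuming the report keeps the "label : value" layout shown above; the helper name report_field is hypothetical and is not part of LSF.

```shell
# Print the value of one field from an LSF job report read on standard input.
# report_field is a hypothetical helper, not an LSF command.
report_field() {
    # $1 is the field label, e.g. "Max Memory" or "CPU time".
    sed -n "s/^ *$1 *: *//p"
}

# Sample lines copied from the report above. With a real output file you
# would run, for example: report_field "Max Memory" < myjob.o
sample_report='Successfully completed.
Resource usage summary:
CPU time : 0.82 sec.
Max Memory : 5 MB
Average Memory : 5.00 MB'

printf '%s\n' "$sample_report" | report_field "Max Memory"
printf '%s\n' "$sample_report" | report_field "CPU time"
```

Because the pattern is anchored to the start of the line, asking for "Max Memory" does not also match "Average Memory".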
Now, look at your error file using cat.
In [ ]:
cat myjob.e
Printing our error file won't return anything as the error file was empty. This is because our job didn't have any errors. If it did, they would be logged in the error file and we could use it to try and trace what went wrong.
You can also incorporate your JOBID into the filename using a special variable %J.
bsub -o %J.o -e %J.e "sleep 60"
Let's say that the JOBID returned when you submitted the job was 4018041. Your output and error files would be called 4018041.o and 4018041.e.
You should always try to have different output and error files for each job you're submitting. If you submit two jobs that both write to myjob.o and myjob.e, it can get confusing because their output and errors end up in the same files.
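If you want to know in advance which files a job will write to, you can capture the JOBID at submission time. The sketch below assumes that bsub prints a message of the form "Job <JOBID> is submitted to ..." when a job is accepted; that message is not shown in this section, so treat the exact wording as an assumption. The helper name extract_job_id is hypothetical.

```shell
# Extract the numeric JOBID from a bsub submission message on standard input.
# extract_job_id is a hypothetical helper; the message format is an assumption.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Assumed example message. On a cluster you would instead run:
#   bsub -o %J.o -e %J.e "sleep 60" | extract_job_id
message='Job <4018041> is submitted to default queue <normal>.'
job_id=$(printf '%s\n' "$message" | extract_job_id)

# With -o %J.o and -e %J.e, these are the files the job would write to.
echo "output file: ${job_id}.o"
echo "error file:  ${job_id}.e"
```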
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018040 UserA   PEND  normal     pcs5a       pcs5b       sleep 60   Jan 15 13:26
As the bjobs output above shows, the job name defaults to the command that was submitted (sleep 60). You can give your job a different name using the -J option with bsub. This can be useful when you're running multiple jobs and want to be able to tell them apart in the queue with bjobs.
Let's try submitting a job called "newjob" which writes outputs and errors to newjob.o and newjob.e.
In [ ]:
bsub -o newjob.o -e newjob.e -J newjob "sleep 60"
Let's look at the progress of our job using bjobs.
In [ ]:
bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018077 UserA   RUN   normal     pcs5a       pcs5b       newjob     Jan 15 13:34
Notice that the JOB_NAME is now newjob.
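When you have several jobs in the queue, it can help to pull the status of a single job out of the bjobs listing. This is a minimal sketch that reads bjobs output on standard input and assumes the column order shown above (JOBID first, STAT third); the helper name job_status is hypothetical.

```shell
# Print the STAT column for one JOBID from bjobs output on standard input.
# job_status is a hypothetical helper, not an LSF command.
job_status() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample copied from the listing above. On a cluster you would run:
#   bjobs | job_status 4018077
bjobs_output='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
4018077 UserA RUN normal pcs5a pcs5b newjob Jan 15 13:34'

printf '%s\n' "$bjobs_output" | job_status 4018077
```

The status is read from the third column, which comes before JOB_NAME, so job names containing spaces (such as sleep 60) do not affect the result.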
In [ ]:
bparams
Default Queues: normal
Default Host Specification: BL465c_G8
MBD_SLEEP_TIME used for calculations: 10 seconds
Job Checking Interval: 15 seconds
Job Accepting Interval: 0 seconds
Running bparams will display information about the system parameters, such as the default queue name. In this example, the default queue is the normal queue. To find out which other queues are available, we can use bqueues.
In [ ]:
bqueues
QUEUE_NAME  PRIO  STATUS        MAX  JL/U JL/P JL/H NJOBS PEND RUN SUSP
system      1000  Open:Active   -    -    -    -    0     0    0   0
yesterday   500   Open:Active   20   8    -    -    0     0    0   0
small       31    Open:Active   -    -    -    -    0     0    0   0
normal      30    Open:Active   -    -    -    -    103   43   60  0
long        3     Open:Active   50   -    -    -    241   224  15  0
basement    1     Open:Active   20   10   -    -    182   164  18  0
Now, let's say that we want to submit a job into the yesterday queue because it's fairly urgent. To do this, we can use the -q option with bsub followed by the name of the queue which we want to use (e.g. yesterday).
Let's try submitting a job called "newjob1" into the yesterday queue, writing outputs and errors to newjob1.o and newjob1.e.
In [ ]:
bsub -o newjob1.o -e newjob1.e -J newjob1 -q yesterday "sleep 60"
When you check on the progress with bjobs, you will see that the job is in the yesterday queue.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
4018119 UserA   RUN   yesterday  pcs5a       pcs5c       newjob1    Jan 15 13:51
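Before submitting to a particular queue, you may want to confirm that it exists and is accepting jobs. The sketch below reads bqueues output on standard input and treats a queue as usable when its STATUS column reads Open:Active, as in the listing earlier in this section. The helper name queue_is_open is hypothetical.

```shell
# Succeed (exit status 0) if the named queue appears in bqueues output on
# standard input with STATUS Open:Active; fail otherwise.
# queue_is_open is a hypothetical helper, not an LSF command.
queue_is_open() {
    awk -v q="$1" '$1 == q && $3 == "Open:Active" { found = 1 } END { exit !found }'
}

# Sample rows copied from the bqueues listing in this section. On a cluster
# you would run: bqueues | queue_is_open yesterday
bqueues_output='QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
yesterday 500 Open:Active 20 8 - - 0 0 0 0
normal 30 Open:Active - - - - 103 43 60 0'

if printf '%s\n' "$bqueues_output" | queue_is_open yesterday; then
    echo "yesterday is open"
else
    echo "yesterday is not available"
fi
```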
When we want to reserve more resources for our jobs, we can use the -R, -M and -n options with the bsub command. The -M option sets the memory limit for the job, -n sets the number of threads (job slots) to reserve, and the -R option tells LSF that the job needs to run on a host which matches the requirements that follow it.
Let's try submitting a job which has a limit of 2GB memory and 4 threads.
In [ ]:
bsub -n 4 -R "span[hosts=1] select[mem>2000] rusage[mem=2000]" \
-M 2000 "sleep 60"
Here we can see that the 4 threads are reserved using the -n option. Typically when we ask for multiple threads with -n, we also add span[hosts=1] to the -R option. This indicates that all the processors which are allocated to this job must be on the same host.
We reserve our 2GB of memory using both the -M option and the -R option. Notice that the memory requirement is given in MB (2GB ~ 2000MB). With the -R option, we use a selection string, select[mem>2000], and a usage string, rusage[mem=2000]. The selection string specifies the characteristics that a host must have to match the resource requirement; in this case, more than 2GB of memory. The usage string defines the expected resource usage of the job so that LSF can reserve it. By default, no resources are reserved.
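Since the same memory value appears three times in the command (in select, in rusage and after -M), it is easy for them to drift apart when editing by hand. The following sketch builds the command from a single thread count and a single memory value and prints it without running it. The helper name resource_request is hypothetical, and the memory value is assumed to be in MB as in the example above.

```shell
# Print (but do not run) a bsub command that requests a number of threads
# and an amount of memory in MB, keeping -M, select and rusage consistent.
# resource_request is a hypothetical helper, not an LSF command.
resource_request() {
    threads=$1
    mem_mb=$2
    shift 2   # the remaining arguments form the command to submit
    printf 'bsub -n %s -R "span[hosts=1] select[mem>%s] rusage[mem=%s]" -M %s "%s"\n' \
        "$threads" "$mem_mb" "$mem_mb" "$mem_mb" "$*"
}

# Reproduces the submission above: 4 threads and 2000 MB for "sleep 60".
resource_request 4 2000 sleep 60
```

Printing the command first gives you a chance to check it before pasting it into the terminal.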
For more information on resource requirements, please see the resource requirement section of the LSF user manual.
Along the top of the diagram is the simplest job workflow. Here, a job is submitted using bsub and will have the status PEND until it gets dispatched.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06
Once the job is dispatched, it will start running and get the status RUN.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   RUN   normal     pcs5b       pcs5c       job1       Jan 15 14:06
If all goes well and there are no errors (normal completion) then the job finishes and has the status DONE.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   DONE  normal     pcs5b       pcs5c       job1       Jan 15 14:06
If there is a problem with a running job, this will trigger an abnormal exit and the status will become EXIT.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   EXIT  normal     pcs5b       pcs5c       job1       Jan 15 14:06
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06
We can cancel or kill a job (here, job 1000) using the bkill command followed by the JOBID of the job that we want to kill.
bkill 1000
If you have used a valid JOBID, the bkill command should return a message that tells you the job is being terminated.
Job <1000> is being terminated
Your job status will now get updated from RUN or PEND to EXIT.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   EXIT  normal     pcs5b                   job1       Jan 15 14:06
Let's say you are running a series of commands and you realise there's an error in the input file for one of those commands. When you're running a long job, you probably don't want to have to cancel it and start all over again. If the job hasn't reached that command, you can pause or suspend the job while you fix the input file and resume the job once you're done.
To suspend and resume a job you can use the bstop and bresume commands. First let's look at suspending a pending job (PEND).
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06
To suspend this job, we use bstop followed by the JOBID.
bstop 1000
The job status will now become PSUSP as the job was suspended by a user while it was pending.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PSUSP normal     pcs5b                   job1       Jan 15 14:06
To allow the job to be dispatched again, we can use the command bresume followed by the JOBID.
bresume 1000
The job status will now return to PEND while the job waits to be dispatched.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06
You can also suspend a running job using bstop. In this case, the status will be updated to USUSP instead of PSUSP.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   USUSP normal     pcs5b                   job1       Jan 15 14:06
When you use bresume to resume a previously running job that has been suspended, there may be an interim status of SSUSP before the job starts running again (RUN).
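The statuses introduced so far can be summarised in a small lookup. This is a sketch only: the descriptions paraphrase the text above, the SSUSP description ("suspended by the system") is an assumption based on general LSF usage rather than on anything stated in this section, and the helper name describe_status is hypothetical.

```shell
# Print a short description of a bjobs STAT value.
# describe_status is a hypothetical helper, not an LSF command.
describe_status() {
    case "$1" in
        PEND)  echo "waiting to be dispatched" ;;
        RUN)   echo "running" ;;
        DONE)  echo "finished normally" ;;
        EXIT)  echo "exited abnormally or was killed" ;;
        PSUSP) echo "suspended by a user while pending" ;;
        USUSP) echo "suspended by a user while running" ;;
        SSUSP) echo "suspended by the system" ;;   # assumption, see lead-in
        *)     echo "unknown status: $1"; return 1 ;;
    esac
}

# Describe each status that bstop and bresume can move a job between.
for status in PEND PSUSP RUN USUSP; do
    printf '%s: %s\n' "$status" "$(describe_status "$status")"
done
```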
For more information on suspending, resuming and cancelling jobs, please see the controlling job execution section in the LSF user guide.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06
Now, let's say that we've made a mistake and that we know this submitted job will run for longer than the normal queue will allow. What can we do?
Well, you could kill the job with bkill followed by the JOBID of the job you want to cancel. You can then submit the job again specifying a different queue using the bsub option -q and the name of the queue you want to use. This would create a new job in a different queue (e.g. long).
bkill <JOBID>
bsub -q <queue_name> <command>
Alternatively, you can move the pending job to a different queue using bswitch.
bswitch <destination_queue> <JOBID>
So, to move our job (JOBID = 1000) from the normal queue (jobs killed after 12 hours) to the long queue (jobs killed after 48 hours) we would run:
bswitch long 1000
If we look again using bjobs, we can see that the job has been moved into the long queue.
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  long       pcs5b                   job1       Jan 15 14:06
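The choice of destination queue can be written down as a simple rule. The sketch below uses only the two limits quoted above for this example cluster (12 hours for normal, 48 hours for long); other clusters will have different queues and limits, and the helper name pick_queue is hypothetical. It prints the bswitch command rather than running it.

```shell
# Print the queue whose run limit fits an expected runtime in whole hours.
# Limits are the ones quoted in the text for this example cluster only.
# pick_queue is a hypothetical helper, not an LSF command.
pick_queue() {
    hours=$1
    if [ "$hours" -le 12 ]; then
        echo normal
    elif [ "$hours" -le 48 ]; then
        echo long
    else
        echo "no queue in this example allows ${hours} hours" >&2
        return 1
    fi
}

# A job expected to run for 30 hours needs the long queue.
queue=$(pick_queue 30) && echo "bswitch $queue 1000"
```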
For more information on moving jobs between queues, please see the switching queues section of the LSF user guide.
For an overview of basic job submission, you can go back to job submission. Otherwise, let's take a look at job arrays and dependencies.