In [ ]:
bsub -o bad_cmd.o -e bad_cmd.e "slep 10"
Use bjobs
to see when your job has finished running. The job's status (STAT) will be EXIT. This means that your job had an abnormal exit and there was an issue. Below is an example.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA EXIT normal pcs5a pcs5c slep 10 Jan 15 10:48
Now, print the contents of the output file to the terminal using cat
(probably best to use less
if the file is larger).
In [ ]:
cat bad_cmd.o
------------------------------------------------------------
Sender: LSF System <lsfadmin@pcs5c>
Subject: Job 4017581: <slep 10> in cluster <pcs5> Exited
Job <slep 10> was submitted from host <pcs5a> by user <userA> in cluster <pcs5>.
Job was executed on host(s) <pcs5c>, queue <normal>, user <userA> cluster <pcs5>.
</nfs/users/nfs_u/userA> was used as the home directory.
</nfs/users/nfs_u/userA> was used as the working directory.
Started at Thu Jan 15 10:48:46 2019
Results reported on Thu Jan 15 10:48:47 2019
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
slep 10
------------------------------------------------------------
Exited with exit code 127.
Resource usage summary:
CPU time : 0.09 sec.
Total Requested Memory : -
Delta Memory : -
The output (if any) is above this job summary.
PS:
Read file <bad_cmd.e> for stderr output of this job.
This tells us that the job exited with exit code 127. Any exit code greater than 0 means there was an error. In this case, exit code 127 means that the command we tried to run, slep
, could not be found.
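You can reproduce this exit code in any shell, without LSF: when the shell cannot find a command, it returns 127. Here we deliberately run the same hypothetical typo, slep, and inspect the exit code with `$?`.

```shell
# Running a command the shell cannot find returns exit code 127.
slep 10 2>/dev/null     # typo of "sleep"; error message suppressed
echo "exit code: $?"    # prints "exit code: 127"
```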
Here is an overview of the possible exit codes:
When a job gets killed by a signal (exit code > 128), you can subtract 128 from the exit code to get the signal number. You can then run man 7 signal
to find the meaning of that signal. Exit codes 130 and 140 will typically mean your job exceeded the resources you requested. Try submitting the job again, requesting more memory or submitting to a queue with a longer time limit.
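The subtraction can be checked in any shell. As an illustrative example (exit code 137 is assumed here, not taken from the output above), a job killed with SIGKILL would report 137:

```shell
exit_code=137                  # assumed example: job killed by a signal
signal=$((exit_code - 128))    # 137 - 128 = 9
echo "signal number: $signal"  # prints "signal number: 9"
kill -l "$signal"              # prints the signal name: KILL
```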
For more information about exit codes, please see the job exit codes, job exception and exit information sections in the LSF user guide.
We can see the error that was generated by printing the contents of the error file to the terminal with cat
.
In [ ]:
cat bad_cmd.e
/tmp/1547722126.4017581: line 8: slep: command not found
This confirms that the system couldn't find the command slep
. Whenever you have an issue with an LSF job and need to contact your support team for help, it is always a good idea to give them the JOBID and/or the location of the output and error files. This makes tracking down the issue much quicker!
In [ ]:
bqueues
Each queue has a maximum number of job slots which can be used by scheduled jobs in the queue. Job slots are reserved by jobs which are pending and used by those which have already started running but are not yet finished. In this example we can see the maximum number of job slots for each queue by looking at the MAX column.
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
system 1000 Open:Active - - - - 0 0 0 0
yesterday 500 Open:Active 20 8 - - 0 0 0 0
small 31 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 35 13 1 0
long 3 Open:Active 50 - - - 31686 31636 46 0
basement 1 Open:Active 20 10 - - 180 170 10 0
Sometimes there are also limits on the job slots that are available to users. You can check this by looking at the JL/U column. In this example, the yesterday queue is limited to a maximum of 8 job slots per user and the basement to 10 job slots per user.
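As a quick sketch, you could pull the per-user limits out of bqueues output with awk. Here the sample output from above is fed in via a here-document; on a real cluster you would pipe bqueues straight into awk instead.

```shell
# Print each queue that has a per-user slot limit (JL/U, column 5);
# a "-" in that column means no limit.
awk 'NR > 1 && $5 != "-" { print $1 " is limited to " $5 " job slots per user" }' <<'EOF'
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
yesterday 500 Open:Active 20 8 - - 0 0 0 0
normal 30 Open:Active - - - - 35 13 1 0
basement 1 Open:Active 20 10 - - 180 170 10 0
EOF
```

This prints a line for the yesterday (8 slots) and basement (10 slots) queues, and skips normal, which has no per-user limit.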
For more information on queues, please see queues.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA PEND normal pcs5b job1 Jan 15 14:06
We can take a look at which queues are available using bqueues
.
QUEUE_NAME PRIO STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SUSP
system 1000 Open:Active - - - - 0 0 0 0
yesterday 500 Open:Active 20 8 - - 0 0 0 0
small 31 Open:Active - - - - 0 0 0 0
normal 30 Open:Active - - - - 35 13 1 0
long 3 Open:Active 50 - - - 31686 31636 46 0
basement 1 Open:Active 20 10 - - 180 170 10 0
Look at the priorities of the queues (PRIO) to try and find a queue with a higher priority. Here, the yesterday queue has a higher priority (500) compared to the normal queue (30). So, we can try moving our job from the normal queue into the yesterday queue.
bswitch 1000 yesterday
If we then ran bjobs
we would see our job has been moved to the yesterday queue.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA PEND yesterday pcs5b job1 Jan 15 14:06
Alternatively, you can use btop
to move a pending job to the top of your job list. In the example below, we have used bjobs
which showed us that we have 4 jobs scheduled: 1 running and 3 pending.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA RUN normal pcs5b pcs5c job1 Jan 15 14:06
1001 userA PEND normal pcs5b job2 Jan 15 14:07
1002 userA PEND normal pcs5b job3 Jan 15 14:08
1003 userA PEND normal pcs5b job4 Jan 15 14:09
All of these jobs are identical, so job2 will be the first of our pending jobs to be executed when the necessary resources become available. But, what if we needed job4 to be executed first?
btop 1003
Using btop
followed by the JOBID will move the job (e.g. job4) to the top of the list of jobs waiting to be executed.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA RUN normal pcs5b pcs5c job1 Jan 15 14:06
1003 userA PEND normal pcs5b job4 Jan 15 14:09
1001 userA PEND normal pcs5b job2 Jan 15 14:07
1002 userA PEND normal pcs5b job3 Jan 15 14:08
For more information on job management, please see job management.
In [ ]:
bqueues
Most commonly, your jobs will be pending because the cluster is busy and it may just take some time for things to get running. Once other people's jobs start to finish, resources will become available and your jobs should start running.
If this keeps happening, there are two things to consider. The first is whether you've requested more resources than your job is likely to require. Let's use the sleep
command as an example. We'll reserve 2GB (2000MB) of memory for this job.
In [ ]:
bsub -o sleep.o -e sleep.e -R 'select[mem>2000] rusage[mem=2000]' \
-M 2000 "sleep 10"
Once that has finished, take a look at the output file (sleep.o).
In [ ]:
cat sleep.o
We want to look at the amount of resources our job used.
Resource usage summary:
CPU time : 0.23 sec.
Max Memory : 6 MB
Average Memory : 6.00 MB
Total Requested Memory : 2000.00 MB
Delta Memory : 1994.00 MB
Max Swap : 44 MB
Max Processes : 3
Max Threads : 4
Here we can see that this job used only 6MB of memory (Max Memory). We requested 2GB (2000MB), which was 1994MB more than our job required (Delta Memory).
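The Delta Memory line is simply the requested amount minus the peak usage, which you can verify yourself:

```shell
requested_mb=2000   # what we asked for with -M 2000
max_used_mb=6       # Max Memory reported in sleep.o
echo "Delta Memory: $((requested_mb - max_used_mb)) MB"   # prints "Delta Memory: 1994 MB"
```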
You should always try to request only the resources that you need. If you're running a large analysis, try running the analysis on a small subset of the data and scale up to estimate the resources you'll require.
Alternatively, it could be because your priority is low. Perhaps you or other members of your group have been running a lot of jobs or using a lot of resources in the last 48 hours. This can decrease your priority and chances of getting jobs running. For more information, please see priority and fairshare.
If you realise you've made a mistake with a job that has already started running, you can kill it with bkill and resubmit it:
bkill <JOBID>
However, if your job is still pending, you can use bmod
to update your job.
bmod <options_or_command_to_update> <JOBID>
To update the command, you can use the -Z
option. For example, say we made a mistake and spelt the sleep
command wrong, running slep
instead:
bsub "slep 60"
Now let's say the job was submitted: we have a JOBID of 1000 and the job is pending.
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1000 userA PEND normal pcs5b slep 60 Jan 15 14:06
We can update the command using bmod
with the -Z
option, followed by the JOBID (1000), to update the job to run the correct command.
bmod -Z "sleep 60" 1000
When the job is dispatched and executed, it will now run sleep 60
instead of slep 60
. You can change many other
job parameters using bmod
such as the job name.
Let's call our job something more useful like "sleepyjob" using the -J
option.
bmod -J sleepyjob 1000
We can add an output and error file too using the -o
and -e
options.
bmod -o sleepyjob.o -e sleepyjob.e 1000
Or, we can update the resources it's requesting using the -M
, -R
and -n
options.
bmod -n 2 -R "span[hosts=1] select[mem>2000] rusage[mem=2000]" -M 2000 1000
This updates job 1000, asking LSF to reserve 2 cores/threads and 2GB (2000MB) of memory.
For an overview of priority and fairshare, you can go back to the priority and fairshare section. Otherwise, you can take a look at our LSF cheat sheet.