Troubleshooting

My job failed. How do I find out what went wrong?

There are many reasons why a job might fail. The first places to check are the output and error files.

Let's say we submitted the wrong command, writing slep 10 instead of sleep 10.



bsub -o bad_cmd.o -e bad_cmd.e "slep 10"

Use bjobs to see when your job has finished running. The job's status (STAT) will be EXIT, which means that the job exited abnormally. Below is an example.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   EXIT  normal     pcs5a       pcs5c       slep 10    Jan 15 10:48

Now, print the contents of the output file to the terminal using cat (for a larger file, less is a better choice).



cat bad_cmd.o

------------------------------------------------------------
Sender: LSF System <lsfadmin@pcs5c>
Subject: Job 4017581: <slep 10> in cluster <pcs5> Exited

Job <slep 10> was submitted from host <pcs5a> by user <userA> in cluster <pcs5>.
Job was executed on host(s) <pcs5c>, queue <normal>, user <userA> cluster <pcs5>.
</nfs/users/nfs_u/userA> was used as the home directory.
</nfs/users/nfs_u/userA> was used as the working directory.
Started at Thu Jan 15 10:48:46 2019
Results reported on Thu Jan 15 10:48:47 2019

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
slep 10
------------------------------------------------------------

Exited with exit code 127.

Resource usage summary:

    CPU time :                                   0.09 sec.
    Total Requested Memory :                     -
    Delta Memory :                               -

The output (if any) is above this job summary.



PS:

Read file <bad_cmd.e> for stderr output of this job.

This tells us that the job exited with exit code 127. Any exit code greater than 0 means there was an error. In this case, exit code 127 means that the command we tried to run, slep, could not be found.
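To pull the exit code out of an output file without reading the whole report, something like the following could be used. This is a minimal sketch: the function name job_exit_code is hypothetical, and it assumes the report contains a line of the form "Exited with exit code N." as in the example above.

```shell
# Hypothetical helper: print the exit code recorded in an LSF output file.
# Assumes the report contains a line like "Exited with exit code 127."
job_exit_code() {
    sed -n 's/^Exited with exit code \([0-9][0-9]*\)\.$/\1/p' "$1"
}

# Demonstration on a cut-down copy of the report shown above.
report=$(mktemp)
printf '%s\n' '# LSBATCH: User input' 'slep 10' '' 'Exited with exit code 127.' > "$report"
job_exit_code "$report"    # prints 127
rm -f "$report"
```

With the files from this example, the call would be job_exit_code bad_cmd.o. The function prints nothing if the report has no such line.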

Here is an overview of the possible exit codes:

1 to 126 - the exit code came from your script or command
127 - the command was not found / doesn't exist
more than 128 - the job was killed by a signal

When a job gets killed by a signal (exit code > 128), you can subtract 128 from the exit code to get the signal number. For example, exit code 130 corresponds to signal 2 (130 - 128). You can then run man 7 signal to find the meaning of that signal. Exit codes 130 and 140 typically mean that your job exceeded the resources you requested. Try submitting the job again, requesting more memory or using a queue with a longer time limit.
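The ranges above can be turned into a small lookup. The following is a sketch only: explain_exit_code is a hypothetical name, and it relies on the shell's built-in kill -l to translate a signal number into a name, so the name printed can differ between systems. An exit code of exactly 128 is left unclassified because 128 - 128 is not a valid signal number.

```shell
# Hypothetical helper: describe an exit code reported in an LSF job summary.
explain_exit_code() {
    local code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -lt 127 ]; then
        echo "error returned by the script or command"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -gt 128 ]; then
        # Signal number is the exit code minus 128.
        local signum=$((code - 128))
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "unclassified"
    fi
}

explain_exit_code 127    # prints: command not found
explain_exit_code 130    # prints: killed by signal 2 (SIGINT)
```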

For more information about exit codes, please see the job exit codes, job exception and exit information sections in the LSF user guide.

We can see the error that was generated by printing the contents of the error file to the terminal with cat.



cat bad_cmd.e

/tmp/1547722126.4017581: line 8: slep: command not found

This confirms that the system couldn't find the command slep. Whenever you have an issue with an LSF job and need to contact your support team for help, give them the JOBID and, if possible, the location of the output and error files. This makes tracking down the issue much quicker!

Why can't I submit any more jobs to a particular queue?

Some queues limit the number of job slots that can be in use, such as the yesterday queue, while others, like the normal queue, have no limit. You can check the queue limits using bqueues.



bqueues

Each queue has a maximum number of job slots which can be used by scheduled jobs in the queue. Job slots are reserved by jobs which are pending and used by those which have already started running but are not yet finished. In the example output below, the MAX column shows the maximum number of job slots for each queue. A dash (-) means that no limit is set.

QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
system          1000 Open:Active       -    -    -    -     0     0     0     0
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
small            31  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -    35    13     1     0
long              3  Open:Active      50    -    -    - 31686 31636    46     0
basement          1  Open:Active      20   10    -    -   180   170    10     0

Sometimes there are also limits on the job slots that are available to users. You can check this by looking at the JL/U column. In this example, the yesterday queue is limited to a maximum of 8 job slots per user and the basement to 10 job slots per user.
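To see only the queues that have a limit, the bqueues output can be filtered on the MAX and JL/U columns. This is a sketch under the assumption that the columns appear in the order shown above (MAX fourth, JL/U fifth); limited_queues is a hypothetical name.

```shell
# Hypothetical helper: list queues that have a slot limit.
# Reads bqueues-style output on stdin; column 4 is MAX, column 5 is JL/U.
limited_queues() {
    awk 'NR > 1 && ($4 != "-" || $5 != "-") { print $1, "MAX=" $4, "JL/U=" $5 }'
}

# Demonstration using rows from the example output above.
limited_queues <<'EOF'
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -    35    13     1     0
long              3  Open:Active      50    -    -    - 31686 31636    46     0
EOF
# prints:
# yesterday MAX=20 JL/U=8
# long MAX=50 JL/U=-
```

On a cluster, the same filter would be used as bqueues | limited_queues.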

For more information on queues, please see queues.

I've lots of submitted jobs, how can I get my high priority job running first?

There are two things that you can do to influence the order in which your pending jobs will be run. First, you can move your job to a higher-priority queue using bswitch. Suppose bjobs shows that we have a job pending in the normal queue.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   job1       Jan 15 14:06

We can take a look at which queues are available using bqueues.

QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
system          1000 Open:Active       -    -    -    -     0     0     0     0
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
small            31  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -    35    13     1     0
long              3  Open:Active      50    -    -    - 31686 31636    46     0
basement          1  Open:Active      20   10    -    -   180   170    10     0

Look at the priorities of the queues (PRIO) to try to find a queue with a higher priority. Here, the yesterday queue has a higher priority (500) than the normal queue (30). So, we can try moving our job from the normal queue into the yesterday queue.

bswitch 1000 yesterday

If we then ran bjobs we would see our job has been moved to the yesterday queue.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  yesterday  pcs5b                   job1       Jan 15 14:06
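The PRIO comparison used to pick a target queue can also be scripted. The sketch below is illustrative only: higher_priority_queues is a hypothetical name, and it assumes QUEUE_NAME and PRIO are the first two columns, as in the output above. A higher PRIO does not mean you are allowed to submit to that queue, which depends on how your cluster is configured.

```shell
# Hypothetical helper: list queues whose PRIO is higher than a given queue's.
# Reads bqueues-style output on stdin; column 1 is QUEUE_NAME, column 2 is PRIO.
higher_priority_queues() {
    awk -v current="$1" '
        NR > 1 { prio[$1] = $2 + 0; names[++n] = $1 }
        END {
            if (!(current in prio)) exit 1      # unknown queue name
            for (i = 1; i <= n; i++)
                if (prio[names[i]] > prio[current])
                    print names[i], prio[names[i]]
        }'
}

# Demonstration using rows from the example output above.
higher_priority_queues normal <<'EOF'
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
yesterday       500  Open:Active      20    8    -    -     0     0     0     0
small            31  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -    35    13     1     0
long              3  Open:Active      50    -    -    - 31686 31636    46     0
EOF
# prints:
# yesterday 500
# small 31
```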

Alternatively, you can use btop to move a pending job to the top of your job list. In the example below, we have used bjobs which showed us that we have 4 jobs scheduled: 1 running and 3 pending.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   RUN   normal     pcs5b       pcs5c       job1       Jan 15 14:06
1001    userA   PEND  normal     pcs5b                   job2       Jan 15 14:07
1002    userA   PEND  normal     pcs5b                   job3       Jan 15 14:08
1003    userA   PEND  normal     pcs5b                   job4       Jan 15 14:09

All of these jobs are identical, so job2 will be the first of our pending jobs to be executed when the necessary resources become available. But what if we needed job4 to be executed first?

btop 1003

Using btop followed by the JOBID will move the job (here, job4) to the top of the list of jobs waiting to be executed. Running bjobs again shows the new order.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   RUN   normal     pcs5b       pcs5c       job1       Jan 15 14:06
1003    userA   PEND  normal     pcs5b                   job4       Jan 15 14:09
1001    userA   PEND  normal     pcs5b                   job2       Jan 15 14:07
1002    userA   PEND  normal     pcs5b                   job3       Jan 15 14:08

For more information on job management, please see job management.

All of my jobs are pending. Why can't I get anything running?

The first thing to do here is check how busy the cluster is using bqueues.



bqueues

Most commonly, your jobs will be pending because the cluster is busy and it may just take some time for things to get running. Once other people's jobs start to finish, resources will become available and your jobs should start running.
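For a quick sense of how busy the cluster is, the PEND and RUN columns can be totalled across all queues. This is a rough sketch, assuming the column order shown in the earlier bqueues examples (PEND ninth, RUN tenth); queue_totals is a hypothetical name.

```shell
# Hypothetical helper: total the PEND and RUN columns across all queues.
# Reads bqueues-style output on stdin; column 9 is PEND, column 10 is RUN.
queue_totals() {
    awk 'NR > 1 { pend += $9; run += $10 }
         END { printf "pending=%d running=%d\n", pend, run }'
}

# Demonstration using rows from the earlier example output.
queue_totals <<'EOF'
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP
normal           30  Open:Active       -    -    -    -    35    13     1     0
long              3  Open:Active      50    -    -    - 31686 31636    46     0
basement          1  Open:Active      20   10    -    -   180   170    10     0
EOF
# prints: pending=31819 running=57
```

A large pending total compared with the running total suggests that the cluster is simply busy.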

If this keeps happening, there are two things to consider. The first is whether you've requested more resources than your job is likely to require. Let's use the sleep command as an example. We'll reserve 2GB (2000MB) of memory for this job.



bsub -o sleep.o -e sleep.e -R 'select[mem>2000] rusage[mem=2000]' \
-M 2000 "sleep 10"

Once that has finished, take a look at the output file (sleep.o).



cat sleep.o

We want to look at the amount of resources our job used.

Resource usage summary:

    CPU time :                                   0.23 sec.
    Max Memory :                                 6 MB
    Average Memory :                             6.00 MB
    Total Requested Memory :                     2000.00 MB
    Delta Memory :                               1994.00 MB
    Max Swap :                                   44 MB
    Max Processes :                              3
    Max Threads :                                4

Here we can see that this job used only 6MB of memory (Max Memory). We requested 2GB (2000MB), which was 1994MB more than our job required (Delta Memory).
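The same comparison can be made automatically for any finished job. The following is a minimal sketch that assumes the "Resource usage summary" layout shown above with values in MB; memory_report is a hypothetical name.

```shell
# Hypothetical helper: compare the memory a job used with what it requested.
# Assumes the "Resource usage summary" layout shown above, with values in MB.
memory_report() {
    awk -F':' '
        /Max Memory/             { split($2, f, " "); used = f[1] }
        /Total Requested Memory/ { split($2, f, " "); req = f[1] }
        END {
            if (req + 0 == 0) exit 1    # no request recorded (shown as "-")
            printf "used=%dMB requested=%dMB (%.1f%% of request)\n",
                   used, req, 100 * used / req
        }' "$1"
}

# Demonstration on the two relevant lines from the summary above.
summary=$(mktemp)
printf '%s\n' \
    '    Max Memory :                                 6 MB' \
    '    Total Requested Memory :                     2000.00 MB' > "$summary"
memory_report "$summary"    # prints: used=6MB requested=2000MB (0.3% of request)
rm -f "$summary"
```

With the file from this example, the call would be memory_report sleep.o.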

You should always try to request only the resources that you need. If you're running a large analysis, try running the analysis on a small subset of the data and scale up to estimate the resources you'll require.

Alternatively, it could be because your priority is low. Perhaps you or other members of your group have been running a lot of jobs or using a lot of resources in the last 48 hours. This can decrease your priority and chances of getting jobs running. For more information, please see priority and fairshare.

I made a mistake when I submitted my job, can I update it?

If the job is already running, it's probably best to cancel it with bkill, update the command or script and then submit it again as a new job.

bkill <JOBID>

However, if your job is pending, you can use bmod to update your job.

bmod <options_or_command_to_update> <JOBID>
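The choice between bkill and bmod depends on the job's STAT value, which can be read from bjobs output. The sketch below is illustrative: fix_hint is a hypothetical name, and it assumes JOBID and STAT are the first and third columns, as in the bjobs examples above.

```shell
# Hypothetical helper: suggest how to fix a job based on its STAT column.
# Reads bjobs-style output on stdin; column 1 is JOBID, column 3 is STAT.
fix_hint() {
    awk -v id="$1" '
        NR > 1 && $1 == id {
            found = 1
            if ($3 == "PEND")     print "pending: update it with bmod"
            else if ($3 == "RUN") print "running: bkill it and resubmit"
            else                  print "status " $3 ": no suggestion"
        }
        END { if (!found) exit 1 }'      # job not listed
}

# Demonstration using rows from an earlier bjobs example.
fix_hint 1001 <<'EOF'
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   RUN   normal     pcs5b       pcs5c       job1       Jan 15 14:06
1001    userA   PEND  normal     pcs5b                   job2       Jan 15 14:07
EOF
# prints: pending: update it with bmod
```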

To update the command, you can use the -Z option. For example, suppose we made a mistake and misspelt the sleep command as slep:

bsub "slep 60"

Let's say the job was submitted with a JOBID of 1000 and is still pending.

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1000    userA   PEND  normal     pcs5b                   slep 60    Jan 15 14:06

We can update the command using bmod with the -Z option, followed by the JOBID (1000), to update the job to run the correct command.

bmod -Z "sleep 60" 1000

When the job is dispatched and executed, it will now run sleep 60 instead of slep 60. You can change many other job parameters using bmod such as the job name.

Let's call our job something more useful like "sleepyjob" using the -J option.

bmod -J sleepyjob 1000

We can add an output and error file too using the -o and -e options.

bmod -o sleepyjob.o -e sleepyjob.e 1000

Or, we can update the resources it's requesting using the -M, -R and -n options.

bmod -n 2 -R "span[hosts=1] select[mem>2000] rusage[mem=2000]" -M 2000 1000

This updates job 1000 so that it asks LSF to reserve 2 cores/threads and 2GB (2000MB) of memory.

What's next?

For an overview of priority and fairshare, you can go back to the priority and fairshare section. Otherwise, you can take a look at our LSF cheat sheet.