Introduction

This part of the tutorial describes the basic structure of a compute cluster and introduces the key concepts behind the workload manager, LSF. After completing this section, you should be able to describe the basic structure of a cluster, define key terms, and find information about cluster structure using LSF.

What is a cluster?

A cluster is a set of connected computers (nodes or hosts) which work together. When you log in to the cluster from your local machine, you will most likely be connecting to the head node. The head node handles the submission of the computational tasks you want to perform. These tasks are then passed on to the compute nodes where they will subsequently be run.

You don't have to log in to the head node; you can also log in to a compute node. From the head node, you can submit jobs, move data between file systems and carry out general housekeeping. However, you should not run computationally intensive jobs on the head node outside of LSF.

What is LSF?

When analysing data on a single machine, such as a laptop, commands or scripts are run in the terminal and the results are given back via the terminal. On a cluster, we need to run these commands or scripts as jobs.

The resources required to run the jobs may not always be available straight away, so the jobs get submitted into a queue. A queue is a list of jobs which are waiting for resources (pending) or being executed (running). As jobs in the queue finish executing, the resources they were using become available again and the next job in the queue will start running.

Job scheduling and execution are controlled by the Platform Load Sharing Facility (LSF), which manages the workload.

For more information, please see the "About LSF" section of the LSF user guide.


This tutorial assumes that you are currently connected to a cluster which has LSF installed (e.g. pcs or farm for Sanger users).

Let's start by getting some general information about the cluster.


In [ ]:
lsid
IBM Platform LSF Standard 9.1.3.0, Jul 04 2014
Copyright IBM Corp. 1992, 2014. All rights reserved.
US Government Users Restricted Rights...

My cluster name is pcs5
My master name is pcs5a

This should tell you the name of the cluster you're connected to (e.g. pcs5) and the version of LSF it's using (e.g. 9.1.3.0).

Next, let's take a look at how the cluster is structured.


In [ ]:
lshosts
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
pcs5a        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (mg linux lustre)
pcs5b        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (mg linux lustre)
pcs5c        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)
pcs5d        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)
pcs5e        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)

This should tell you which hosts (nodes) are part of the cluster. In this example, there are five hosts called pcs5a-e.
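If you want a quick summary rather than the full table, the lshosts output can be piped through awk. The sketch below is only an illustration: count_cpus is a hypothetical helper name, and it assumes the default column layout shown above, where ncpus is the fifth field. The sample input is copied from the output above so that the snippet runs without a cluster.

```shell
# Hypothetical helper: count hosts and total the ncpus column of lshosts output.
count_cpus() {
    # NR > 1 skips the header row; $5 is ncpus in the default lshosts layout.
    # Rows whose ncpus is not a number (e.g. "-") are ignored.
    awk 'NR > 1 && $5 ~ /^[0-9]+$/ { hosts++; cpus += $5 }
         END { printf "%d hosts, %d cpus\n", hosts, cpus }'
}

# Sample input copied from the lshosts output above.
# On a live cluster you would run: lshosts | count_cpus
count_cpus <<'EOF'
HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
pcs5a        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (mg linux lustre)
pcs5b        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (mg linux lustre)
pcs5c        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)
pcs5d        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)
pcs5e        X86_64 BL465c_G   8.0    32 255.9G  31.9G    Yes (linux lustre avx)
EOF
```

For the sample above this prints "5 hosts, 160 cpus".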

Finally, let's get some information about the hosts.


In [ ]:
bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
pcs5a              ok              -     16      8      8      0      0      0
pcs5b              ok              -     16      0      0      0      0      0
pcs5c              closed          -     32     32     11      0      0     21
pcs5d              ok              -     32     26     26      0      0      0
pcs5e              ok              -     32     24     20      0      0      4

For each host, this gives us the host name, host status, job state statistics, and job slot limits. The host status tells us whether the host is available and ready to accept new jobs.

The host status can take one of four values:

  • ok - host is available to accept and run new batch jobs
  • unavail - host is down, or load and job management controls are unreachable
  • unreach - load management controls are running but job management controls are unreachable
  • closed - host not accepting new jobs
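On a larger cluster it can be handy to list only the hosts that are not ready to accept jobs. The following is a minimal sketch, assuming the default bhosts layout shown above, where STATUS is the second column. not_ok_hosts is a hypothetical helper name, and the sample input is copied from the bhosts output above.

```shell
# Hypothetical helper: print the name and status of every host that is not "ok".
not_ok_hosts() {
    # NR > 1 skips the header row; $1 is HOST_NAME and $2 is STATUS.
    awk 'NR > 1 && $2 != "ok" { print $1, $2 }'
}

# Sample input copied from the bhosts output above.
# On a live cluster you would run: bhosts | not_ok_hosts
not_ok_hosts <<'EOF'
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
pcs5a              ok              -     16      8      8      0      0      0
pcs5b              ok              -     16      0      0      0      0      0
pcs5c              closed          -     32     32     11      0      0     21
pcs5d              ok              -     32     26     26      0      0      0
pcs5e              ok              -     32     24     20      0      0      4
EOF
```

For the sample above this prints "pcs5c closed".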

To find out why a host is closed, you can run bhosts again with the -w option, which returns more detailed information about the host.


In [ ]:
bhosts -w
HOST_NAME          STATUS          JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
pcs5a              ok              -     16      0      0      0      0      0
pcs5b              ok              -     16      0      0      0      0      0
pcs5c              closed_Full     -     32     32     11      0      0     21
pcs5d              ok              -     32     30     26      0      0      4
pcs5e              ok              -     32     28     28      0      0      0

The closed_Full status tells us that pcs5c has reached the maximum number of jobs it can run: its NJOBS value equals its MAX value. As those jobs complete, the host will become ready to accept new jobs again.
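To see how close each host is to being full, you can subtract NJOBS from MAX. The sketch below assumes the default bhosts column order, with MAX as the fourth field and NJOBS as the fifth. free_slots is a hypothetical helper name, and the sample input is copied from the bhosts -w output above.

```shell
# Hypothetical helper: print each host with its remaining job slots (MAX - NJOBS).
free_slots() {
    # NR > 1 skips the header row. Hosts whose MAX is not a number
    # (e.g. "-" when no limit is set) are skipped.
    awk 'NR > 1 && $4 ~ /^[0-9]+$/ { print $1, $4 - $5 }'
}

# Sample input copied from the bhosts -w output above.
# On a live cluster you would run: bhosts -w | free_slots
free_slots <<'EOF'
HOST_NAME          STATUS          JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
pcs5a              ok              -     16      0      0      0      0      0
pcs5b              ok              -     16      0      0      0      0      0
pcs5c              closed_Full     -     32     32     11      0      0     21
pcs5d              ok              -     32     30     26      0      0      4
pcs5e              ok              -     32     28     28      0      0      0
EOF
```

In the sample, pcs5c has 0 slots left, which matches its closed_Full status, while pcs5d and pcs5e have 2 and 4 slots left.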


What's next?

For an overview of what this tutorial covers, head to the index. Otherwise, let's take a closer look at queues.