This part of the tutorial describes the basic structure of a compute cluster and introduces the key concepts behind the workload manager, LSF. After completing this section, you should be able to: describe the basic structure of a cluster, define key terms and find information about cluster structure using LSF.
A cluster is a set of connected computers (nodes or hosts) which work together. When you log in to the cluster from your local machine, you will most likely be connecting to the head node. The head node handles the submission of the computational tasks you want to perform. These tasks are then passed on to the compute nodes, where they are run.
You don't have to log in to the head node; you can also log in to a compute node. The head node can be used to submit your jobs, move data between file systems and carry out general housekeeping tasks. However, you should not run computationally intensive jobs on the head node outside of LSF.
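As a minimal sketch of what this looks like in practice, rather than running a script directly on the head node you would hand it to LSF with the bsub command, something like the cell below (my_script.sh and my_script.log are hypothetical names used only for illustration):
In [ ]:
bsub -o my_script.log ./my_script.sh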
When analysing data on a single machine, such as a laptop, commands or scripts are run in the terminal and the results are given back via the terminal. On a cluster, we need to run these commands or scripts as jobs.
The resources required to run the jobs may not always be available straight away, so the jobs get submitted into a queue. A queue is a list of jobs which are waiting for resources (pending) or being executed (running). As jobs in the queue finish executing, the resources they were using become available again and the next job in the queue will start running.
Job scheduling and execution are controlled by the Platform Load Sharing Facility (LSF), which manages the cluster's workload.
For more information, please see the about LSF section of the LSF user guide.
This tutorial assumes that you are currently connected to a cluster which has LSF installed (e.g. pcs or farm for Sanger users).
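Once you are connected, a quick way to see the queues described above is the bqueues command, which lists each queue on the cluster along with the number of jobs pending and running in it (the queue names and numbers you see will depend on how your cluster is configured):
In [ ]:
bqueues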
Let's start by getting some general information about the cluster.
In [ ]:
lsid
IBM Platform LSF Standard 9.1.3.0, Jul 04 2014
Copyright IBM Corp. 1992, 2014. All rights reserved.
US Government Users Restricted Rights...
My cluster name is pcs5
My master name is pcs5a
This should tell you the name of the cluster you're connected to (e.g. pcs5) and the version of LSF it's using (e.g. 9.1.3.0).
Next, let's take a look at how the cluster is structured.
In [ ]:
lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
pcs5a X86_64 BL465c_G 8.0 32 255.9G 31.9G Yes (mg linux lustre)
pcs5b X86_64 BL465c_G 8.0 32 255.9G 31.9G Yes (mg linux lustre)
pcs5c X86_64 BL465c_G 8.0 32 255.9G 31.9G Yes (linux lustre avx)
pcs5d X86_64 BL465c_G 8.0 32 255.9G 31.9G Yes (linux lustre avx)
pcs5e X86_64 BL465c_G 8.0 32 255.9G 31.9G Yes (linux lustre avx)
This should tell you which hosts (nodes) are part of the cluster. In this example, there are five hosts called pcs5a-e.
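If you want more detail about a single host, you can pass its name to lshosts with the -l option, which reports that host's configuration and resources in long format (pcs5a is taken from the example output above; substitute a host name from your own cluster):
In [ ]:
lshosts -l pcs5a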
Finally, let's get some information about the hosts.
In [ ]:
bhosts
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
pcs5a ok - 16 8 8 0 0 0
pcs5b ok - 16 0 0 0 0 0
pcs5c closed - 32 32 11 0 0 21
pcs5d ok - 32 26 26 0 0 0
pcs5e ok - 32 24 20 0 0 4
For each host, this gives us the host name, host status, job state statistics, and job slot limits. The host status tells us whether the host is available and ready to accept new jobs.
There are four possible host status states: ok (the host is available to accept new jobs), unavail (the host is down or unreachable), unreach (the host can be reached but its batch daemon is not responding) and closed (the host is not accepting new jobs, for example because all of its job slots are in use).
To find out why a host is closed, you can run bhosts again with the -w option, which returns more detailed information about the host status.
In [ ]:
bhosts -w
HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV
pcs5a ok - 16 0 0 0 0 0
pcs5b ok - 16 0 0 0 0 0
pcs5c closed_Full - 32 32 11 0 0 21
pcs5d ok - 32 30 26 0 0 4
pcs5e ok - 32 28 28 0 0 0
Here, the status closed_Full tells us that pcs5c has reached the maximum number of jobs it can run (compare the MAX and NJOBS values). Once some of those jobs finish, the host will be ready to accept new jobs again.
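For an even more detailed view of a single host, including its current load, you can pass the host name to bhosts with the -l option (pcs5c is taken from the example above; substitute a host name from your own cluster):
In [ ]:
bhosts -l pcs5c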