notebook.community

SLURM

Simple Linux Utility for Resource Management
Development began in 2002 at Lawrence Livermore National Laboratory
Entered production at LLNL in 2003
Now widely used on Linux clusters worldwide

Why not another resource management system (RMS)?

Quadrics RMS: proprietary, limited platform support
PBS: portable, not scalable
IBM LoadLeveler: proprietary, not scalable
LSF: proprietary, scalable but expensive

Design Goals

Simplicity: users can add custom functionality via plugins
Open source: GNU GPL
Portability: written in C; developed for Linux but portable to other UNIX-like operating systems
Interconnect independence: customizable via GNU autoconf and plugins
Scalability: thousands of nodes
Fault Tolerant
Secure
System administrator friendly: few simple configuration files and minimized distributed state

Key functions of SLURM

Allocates exclusive and/or non-exclusive access to resources to users for some duration of time so that they can perform work.
Provides a framework for starting, executing, and monitoring work.
Arbitrates conflicting requests for resources by managing a queue of pending work.
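
To make the allocation and queueing roles above concrete, here is a minimal toy simulation of strict first-in, first-out scheduling with exclusive node allocation. It is only a sketch for illustration, not SLURM's actual scheduler: the `Job` class, the `simulate_fifo` function, and the job names are all hypothetical, and time is measured in abstract steps.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes_needed: int  # exclusive nodes requested
    duration: int      # time steps the job runs once started


def simulate_fifo(total_nodes, jobs):
    """Return {job name: start time} under strict FIFO with exclusive nodes."""
    too_big = [job.name for job in jobs if job.nodes_needed > total_nodes]
    if too_big:
        raise ValueError(f"jobs can never fit on {total_nodes} nodes: {too_big}")

    pending = deque(jobs)  # queue of pending work, in submission order
    running = []           # (end_time, nodes_held) for each started job
    free_nodes = total_nodes
    clock = 0
    start_times = {}

    while pending or running:
        # Release the nodes of every job that has finished by now.
        for end_time, nodes_held in running:
            if end_time <= clock:
                free_nodes += nodes_held
        running = [(end, held) for end, held in running if end > clock]

        # Strict FIFO: only the job at the head of the queue may start.
        while pending and pending[0].nodes_needed <= free_nodes:
            job = pending.popleft()
            free_nodes -= job.nodes_needed
            running.append((clock + job.duration, job.nodes_needed))
            start_times[job.name] = clock

        # Jump to the next completion, when nodes are released again.
        if running:
            clock = min(end for end, _ in running)

    return start_times


if __name__ == "__main__":
    jobs = [Job("A", 2, 3), Job("B", 4, 2), Job("C", 1, 1)]
    print(simulate_fifo(4, jobs))  # {'A': 0, 'B': 3, 'C': 5}
```

In this run job C needs only one node and two are free at time 0, yet it waits behind job B until time 5. That head-of-line blocking is the price of a plain FIFO queue.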

Command line utilities of SLURM

srun: submitting a job for execution (batch mode or interactive mode)
scancel: early termination of pending or running jobs
squeue: monitoring job queue
sinfo: monitoring partition and overall system state
scontrol: administrative tool for privileged operations
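
As a rough illustration of how these utilities are invoked, the sketch below builds argument lists for each command from Python and runs them only if the tool is installed. The helper function names and the `alice` user are hypothetical. The options used (`--nodes`, `--ntasks`, `--user`, and `scontrol show job`) are standard ones, but the exact set available depends on the installed SLURM version.

```python
import shutil
import subprocess


def srun_cmd(program, *args, nodes=1, ntasks=1):
    """Argument list for launching `program` through srun."""
    return ["srun", f"--nodes={nodes}", f"--ntasks={ntasks}", program, *args]


def squeue_cmd(user=None):
    """Argument list for listing the job queue, optionally for one user."""
    cmd = ["squeue"]
    if user is not None:
        cmd.append(f"--user={user}")
    return cmd


def sinfo_cmd():
    """Argument list for reporting partition and node state."""
    return ["sinfo"]


def scancel_cmd(job_id):
    """Argument list for cancelling a pending or running job."""
    return ["scancel", str(job_id)]


def scontrol_show_job_cmd(job_id):
    """Argument list for showing the detailed record of one job."""
    return ["scontrol", "show", "job", str(job_id)]


def run_if_available(cmd):
    """Run cmd and return its stdout, or None when the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Print the command lines without executing anything.
    print(" ".join(srun_cmd("hostname", nodes=2, ntasks=4)))
    print(" ".join(squeue_cmd(user="alice")))  # "alice" is a made-up user
```

On a login node with SLURM installed, `run_if_available(sinfo_cmd())` would return the partition table as text. Elsewhere it returns `None` instead of failing.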

Revisit Bridges

https://ondemand.bridges.psc.edu/

SLURM processes

slurmctld: central controller daemon to maintain global state and direct operations
slurmd: a daemon on each compute node, similar to a remote shell daemon, that exports control of the node to SLURM
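
The split between the two daemons can be pictured with a small toy model: one controller object holds all global state, while per-node objects simply run what they are told and report back. This is an illustrative sketch only. The `Controller` and `NodeDaemon` classes, their methods, and the node names are hypothetical and do not reflect how slurmctld and slurmd actually communicate.

```python
class NodeDaemon:
    """Stand-in for a per-node daemon: runs what it is told and reports back."""

    def __init__(self, hostname):
        self.hostname = hostname

    def launch(self, command):
        # A real daemon would start a process on its node; this only echoes.
        return f"{self.hostname}: ran {command!r}"


class Controller:
    """Stand-in for the central controller: the only holder of global state."""

    def __init__(self, daemons):
        self.daemons = {d.hostname: d for d in daemons}
        self.node_state = {d.hostname: "idle" for d in daemons}

    def dispatch(self, command, node_count):
        """Allocate `node_count` idle nodes and ask their daemons to run `command`."""
        idle = [host for host, state in self.node_state.items() if state == "idle"]
        if len(idle) < node_count:
            raise RuntimeError("not enough idle nodes")
        chosen = idle[:node_count]
        for host in chosen:
            self.node_state[host] = "allocated"
        return [self.daemons[host].launch(command) for host in chosen]

    def release(self, hostnames):
        """Mark nodes idle again once their work is done."""
        for host in hostnames:
            self.node_state[host] = "idle"


if __name__ == "__main__":
    controller = Controller([NodeDaemon("n1"), NodeDaemon("n2"), NodeDaemon("n3")])
    print(controller.dispatch("hostname", 2))
    print(controller.node_state)
```

Because the node objects keep no record of the cluster, the controller is the only component that must stay consistent. This mirrors the "minimized distributed state" design goal.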

What SLURM is NOT

A sophisticated batch system (only FIFO scheduling by default)
A time-sharing WMS (only space-sharing is supported)
A meta-batch system (only a single cluster is supported)
A comprehensive cluster administration or monitoring package

https://slurm.schedmd.com/quickstart.html

References