SLURM

  • Simple Linux Utility for Resource Management
  • Development began in 2002 at Lawrence Livermore National Laboratory
  • In production at LLNL in 2003
  • Now widely used on Linux clusters worldwide

Why not other resource management systems (RMSs)?

  • Quadrics RMS: proprietary, limited platform support
  • PBS: portable, not scalable
  • IBM LoadLeveler: proprietary, not scalable
  • LSF: proprietary, scalable but expensive

Design Goals

  • Simplicity: users can add custom functionality via plugins
  • Open source: GNU GPL
  • Portability: written in C; built for Linux, with other UNIX-like systems as porting targets
  • Interconnect independence: customizable via GNU autoconf and plugins
  • Scalability: thousands of nodes
  • Fault tolerant
  • Secure
  • System administrator friendly: few simple configuration files and minimized distributed state

Key functions of SLURM

  • Allocates exclusive and/or non-exclusive access to resources to users for some duration of time so that they can perform work.
  • Provides a framework for starting, executing, and monitoring work.
  • Arbitrates conflicting requests for resources by managing a queue of pending work.
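
The three functions above can be illustrated with a toy model. This is a minimal sketch, not SLURM's actual implementation: the class `ToyResourceManager`, the `Job` record, and all method names are hypothetical, and the model assumes exclusive node allocation with a strict first-in, first-out queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    nodes_needed: int


class ToyResourceManager:
    """Toy model only: exclusive node allocation with a FIFO pending queue."""

    def __init__(self, total_nodes):
        self.free_nodes = set(range(total_nodes))
        self.running = {}        # job_id -> set of allocated node ids
        self.pending = deque()   # jobs waiting for resources, oldest first

    def submit(self, job):
        """Queue the job, then start whatever the queue head allows."""
        self.pending.append(job)
        self._schedule()

    def finish(self, job_id):
        """Release a finished job's nodes and run the scheduler again."""
        self.free_nodes |= self.running.pop(job_id)
        self._schedule()

    def _schedule(self):
        # Strict FIFO: only the head of the queue may start, so a large job
        # at the head holds back smaller jobs queued behind it.
        while self.pending and self.pending[0].nodes_needed <= len(self.free_nodes):
            job = self.pending.popleft()
            allocated = {self.free_nodes.pop() for _ in range(job.nodes_needed)}
            self.running[job.job_id] = allocated


rm = ToyResourceManager(total_nodes=4)
rm.submit(Job(1, 3))   # starts at once on 3 of the 4 nodes
rm.submit(Job(2, 2))   # conflicts with job 1, so it waits in the queue
rm.submit(Job(3, 1))   # would fit, but strict FIFO keeps it behind job 2
rm.finish(1)           # frees 3 nodes; jobs 2 and 3 can now start
```

After the last call, jobs 2 and 3 are both running and one node is left free. A real resource manager also tracks time limits, partitions, and node failures, none of which this sketch models.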

Command line utilities of SLURM

  • srun: submitting a job for execution (batch mode or interactive mode)
  • scancel: early termination of pending or running jobs
  • squeue: monitoring job queue
  • sinfo: monitoring partition and overall system state
  • scontrol: administrative tool for privileged operations
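
As a rough illustration of how these utilities are invoked, the sketch below builds one typical command line per tool and runs it only if the tool is installed. The helper names `slurm_command` and `run_if_available`, the job ID 1234, and the partition name `debug` are assumptions made for the example. The options shown (`srun -N`, `sinfo -p`, `scontrol show job`) are standard, but check the man pages of your installed version.

```python
import shutil
import subprocess


def slurm_command(tool, *args):
    """Return the argv list for one SLURM command-line utility."""
    return [tool, *map(str, args)]


# Hypothetical values, for illustration only.
JOB_ID = 1234
PARTITION = "debug"

commands = {
    # Run `hostname` on two nodes.
    "run": slurm_command("srun", "-N", 2, "hostname"),
    # Terminate a pending or running job.
    "cancel": slurm_command("scancel", JOB_ID),
    # List the job queue.
    "queue": slurm_command("squeue"),
    # Show the state of one partition.
    "info": slurm_command("sinfo", "-p", PARTITION),
    # Show the detailed record of one job.
    "admin": slurm_command("scontrol", "show", "job", JOB_ID),
}


def run_if_available(argv):
    """Run argv only if the tool is on PATH; return its stdout, else None."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, `run_if_available(commands["info"])` returns the output of `sinfo -p debug` on a machine with SLURM installed, and `None` elsewhere. Nothing is executed when the block is merely loaded.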

SLURM processes

  • slurmctld: central controller daemon to maintain global state and direct operations
  • slurmd: daemon on each compute node; acts like a remote shell daemon, exporting control of that node to SLURM

What SLURM is NOT

  • A sophisticated batch system (FIFO scheduling only, by default)
  • A time-sharing WMS (space-sharing only)
  • A meta-batch system (supports only a single cluster)
  • A comprehensive cluster administration or monitoring package
*https://slurm.schedmd.com/quickstart.html*