
Slurm Workload Manager

The task of a workload manager on a compute system is to control access to the compute resources and to distribute work across these resources.

Basic Workflow

After logging in, a user submits jobs from the login node to the workload manager. A job is an application together with a resource description for this particular run of the application. The resource description tells the workload manager which resources the run requires, e.g. the number of compute nodes, the number of cores per compute node, the amount of memory, or the time limit. After submission, the job is placed in the scheduling queue and waits for the specified resources to become available. Once the resources are available, the job is placed on the allocated compute nodes and starts running. After job completion, the allocated resources become available again for the next job.
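
As a quick preview (each command is explained in detail below), a typical session could look like this; the script name and job id are placeholders:

sbatch job-script.sh       # submit a job script to the queue
squeue -u $USER            # check whether your jobs are pending or running
scancel <job_id>           # cancel a job that is no longer needed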

Interacting with the Slurm Workload Manager

A user can interact with the workload manager using several commands:

  • sinfo : view information about nodes and partitions
  • squeue : view information about jobs in the scheduling queue
  • sbatch : submit a batch script
  • srun : submit a job directly from the command line
  • scontrol : view (and modify) configurations and workload manager state
  • scancel : cancel jobs based on id

All these commands have a plethora of options. Information about these options can be obtained either by invoking the command with the --help option (e.g. sinfo --help) or by consulting the official Slurm documentation.

sinfo - Viewing information about nodes and partitions

The compute nodes of a cluster are grouped into logical collections, called partitions.
sinfo can be used to view information about these partitions:

$ sinfo

PARTITION AVAIL   TIMELIMIT      NODES  STATE   NODELIST
batch*       up   2-00:00:00      2     down*   dlc-drowzee,dlc-electabuzz
batch*       up   infinite        1     drain   dlc-magmar
batch*       up   7-00:00:00      5     mix     dlc-articuno,dlc-groudon,dlc-jynx,dlc-tornadus,dlc-tyranitar
batch*       up   7-00:00:00      2     idle    dlc-scyther,dlc-togepi

It prints the following columns:

name       description
PARTITION  partition name
AVAIL      availability of a partition; it can either be up or down
TIMELIMIT  maximum time limit for any user job, in days-hours:minutes:seconds
NODES      number of nodes with the respective state in this partition
STATE      state of the respective nodes (see more below)
NODELIST   names of the nodes that are part of this partition

Here's a description of the most common node states:

state  description
idle   available for use (not running any jobs)
alloc  all of the node's resources are allocated to jobs
mix    some of the node's resources are allocated to jobs, while others are idle
resv   part of a reservation (not available for general use)
drain  unavailable for new jobs (taken out of service by an administrator)
down   unavailable for use (e.g. the node failed or is unreachable)
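
If you only want to see nodes in a particular state, sinfo can filter on state, and the -N flag gives a node-oriented view, for example:

sinfo --states=idle    # list only the nodes that are currently idle
sinfo -N -l            # long, node-oriented listing with one line per node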

squeue - Viewing information about jobs in the scheduling queue

When you run squeue, you will get a list of all jobs currently running or waiting to start:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 8240     batch     bash daangeij  R 12-12:35:23     1 dlc-groudon
 8242     batch     bash daangeij  R 12-12:16:27     1 dlc-articuno
 8243     batch     bash daangeij  R 12-12:05:08     1 dlc-articuno
 8370     batch     bash daangeij  R  6-18:54:38     1 dlc-tyranitar
 8520     batch     bash nadiehkh  R  2-15:07:48     1 dlc-jynx
 8546     batch     bash steffanb  R  1-21:06:05     1 dlc-drowzee
 8609     batch     bash steffanb  R    19:44:56     1 dlc-articuno
 8639     batch     bash clementg  R    14:14:34     1 dlc-tornadus
 8659     batch convnext eliasbau  R       42:11     1 dlc-tornadus
 8662     batch test_gpu lenaphil  R        0:03     1 dlc-tornadus
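
On a busy cluster this list can get long; to show only your own jobs, filter on your username ($USER expands to it automatically):

squeue -u $USER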

Most of the columns should be self-explanatory, but the ST and NODELIST (REASON) columns can be confusing.

ST stands for state. The most important states are listed below.

state  description
R      the job is running
PD     the job is pending (i.e. waiting to run)
CG     the job is completing, meaning that it will be finished soon

The NODELIST (REASON) column shows the list of compute nodes the job is running on if the job is actually running. If the job is pending, the column gives the reason why it is still pending. The most important reasons are listed below.

reason                       description
Priority                     there is another pending job with higher priority
Resources                    the job has the highest priority, but is waiting for a running job to finish
launch failed requeued held  the job launch failed, normally due to a faulty node; please contact us
Dependency                   the job cannot start before some other job is finished; this should only happen if you started the job with --dependency=...
DependencyNeverSatisfied     same as Dependency, but that other job failed; you must cancel the job with scancel <job_id>
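
For illustration, the Dependency reason only appears for jobs that were submitted with a dependency on another job, for example (the job id and script name below are placeholders):

sbatch --dependency=afterok:<job_id> postprocess.sh    # start only after job <job_id> has finished successfully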

sbatch - Submitting a non-interactive job

We strongly recommend using sbatch to submit non-interactive jobs to the cluster. All you need is a bash script that contains the instructions, configuration, and parameters for the job submission. An example is provided below:

#!/bin/bash
#SBATCH --qos=low
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/data/bodyct:/data/bodyct
#SBATCH --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1"

python3 /data/bodyct/experiments/lena_t10027/test_file.py

It starts with #!/bin/bash and is followed by various directives and commands. The directives specify parameters such as the job name, resource requirements, time limits, and output files. Set the directives to configure the job. Common directives include:

flag                        description
--ntasks=<num_tasks>        specifies the number of tasks or processes to be launched
--gpus-per-task=<num_gpus>  sets the number of GPUs allocated per task
--gres=gpumem:10g           VRAM required per allocated GPU (here 10 GB)
--mem=<mem>                 memory required per node; default units are megabytes
--mem-per-cpu=<mem>         memory required per allocated CPU; default units are megabytes
--mem-per-gpu=<mem>         memory required per allocated GPU; default units are megabytes
--cpus-per-task=<num_cpus>  sets the number of CPUs allocated per task
--time=<time>               sets the maximum runtime for the job
--job-name=<name>           sets the name of the job
--output=<output_file>      specifies the file to which the job's standard output is written
--error=<error_file>        defines the file to which the job's standard error is written
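
As an illustration of the output directives, the pattern %j in a filename is replaced by Slurm with the job id, which keeps the logs of different runs apart (the log directory below is a placeholder):

#SBATCH --job-name=test_gpu
#SBATCH --output=/path/to/logs/slurm-%j.out
#SBATCH --error=/path/to/logs/slurm-%j.err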

The SLURM directives are extended by the Pyxis plugin. It provides additional options like:

  • --container-image="<container-image>" : specifies the container image to use for the job. Enter a # between the domain name and the image location instead of the usual /.
  • --container-mounts=<host_path>:<container_path> : defines the mount points between the host system and the container. The <host_path> parameter represents the directory or file path on the host system, and the <container_path> parameter indicates the corresponding mount location within the container. This directive enables sharing or accessing data and files between the host and the container during job execution. Unlike the old cluster, the new cluster does not automatically mount the Blissey/Chansey shares in the container. In the new cluster, all shares are under the /data/ location. If you previously used /mnt/netcache, you will have to change this. An example of mounting multiple shares is shown after this list.
  • --container-entrypoint : specifies the entry point for the container. This directive is optional and is only required if the container image does not have an entry point defined. The entry point is the first command executed when the container starts.
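
If a job needs access to more than one share, --container-mounts accepts several <host_path>:<container_path> pairs separated by commas (a small sketch, assuming the Pyxis version on the cluster supports comma-separated mounts, as current releases do):

--container-mounts=/data/bodyct:/data/bodyct,/data/pathology:/data/pathology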

After the directives, the commands necessary to execute the job follow. This may involve running an executable, invoking a script, or performing any other task relevant to your job.

Once you have defined the script structure, directives, and job commands, save the sbatch script as <filename>.sh. To submit the job, execute:

sbatch path/to/file/<filename>.sh
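
If the submission is accepted, sbatch prints the id that was assigned to the job (the number below is just an example):

$ sbatch path/to/file/<filename>.sh
Submitted batch job 8663

You can use this id with squeue, scontrol, and scancel as described below.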

srun - Running an interactive job

To run an interactive job, the srun command is used along with specific options. These options convey the resource requirements, such as the desired number of nodes, memory allocation, and time limit for the job (similar to directives described above).

For example:

srun \
     --qos=low \
     --ntasks=1 \
     --gpus-per-task=1 \
     --cpus-per-task=4 \
     --mem=5G \
     --time=4:00:00 \
     --container-mounts=/data/pathology:/data/pathology \
     --container-image=doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1 \
     --pty bash

--pty bash gives you a pseudo-terminal that runs bash. The interactive job will shut down once the terminal connection is closed. To keep a job alive you can use a terminal multiplexer like tmux or screen.
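
A minimal tmux workflow could look like this (the session name is arbitrary):

tmux new -s interactive      # start a named tmux session on the login node
srun ... --pty bash          # inside tmux, start the interactive job as shown above
# detach with Ctrl-b d; the job keeps running, and you can reconnect later with:
tmux attach -t interactive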

Moreover, srun allows you to launch parallel tasks and control their execution. It provides options to specify the number of tasks (--ntasks), CPU cores per task (--cpus-per-task), task dependencies, and other parameters for efficient parallel execution.

NB: if you have a job running on a GPU node that is expected to use a GPU on that node, you can check the GPU usage by running the following command from the login node:

srun -s --jobid <job_id> --pty nvidia-smi

scontrol - Monitoring jobs

To monitor your job, you can use the following command. It lists detailed information that can be useful for troubleshooting:

scontrol show jobid -dd <jobid>
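
scontrol can show other entities as well; for example, to inspect the state and resources of a compute node (node name taken from the sinfo output above):

scontrol show node dlc-tornadus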

scancel - Cancelling jobs

To cancel a job, run scancel with the job id (which you can get from the squeue overview) as argument.

scancel <job-id>
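
To cancel all of your own jobs at once, you can filter on your username instead of passing a single job id:

scancel -u $USER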