
Slurm Workload Manager

The task of a workload manager on a compute system is to control access to the compute resources and to distribute work across these resources.

Basic Workflow

After logging in, a user submits jobs from the login node to the workload manager. A job is an application together with a resource description for this particular run of the application. The resource description tells the workload manager which resources the run requires, e.g. the number of compute nodes, the number of cores per compute node, the amount of memory, or the time limit. After submission, the job is placed in the scheduling queue and waits for the specified resources to become available. Once the resources are available, the job is placed on the allocated compute nodes and starts running. After job completion, the allocated resources become available again for the next job.
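
As a quick preview (each command is explained in detail below), a typical session could look like this; the script name and job id are placeholders:

sbatch job-script.sh       # submit a job script to the queue
squeue -u $USER            # check whether your jobs are pending or running
scancel <job_id>           # cancel a job that is no longer needed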

Interacting with the Slurm Workload Manager

A user can interact with the workload manager using several commands:

  • sinfo : view information about nodes and partitions
  • squeue : view information about jobs in the scheduling queue
  • sbatch : submit a batch script
  • srun : submit a job directly from the command line
  • scontrol : view (and modify) configurations and workload manager state
  • scancel : cancel jobs based on id

All these commands have a plethora of options. Information about these options can be obtained either by invoking the command with the --help option (e.g. sinfo --help) or by consulting the official Slurm documentation.

sinfo - Viewing information about nodes and partitions

The compute nodes of a cluster are grouped into logical collections, called partitions.
sinfo can be used to view information about these partitions:

$ sinfo

PARTITION AVAIL   TIMELIMIT      NODES  STATE   NODELIST
batch*       up   2-00:00:00      2     down*   dlc-drowzee,dlc-electabuzz
batch*       up   infinite        1     drain   dlc-magmar
batch*       up   7-00:00:00      5     mix     dlc-articuno,dlc-groudon,dlc-jynx,dlc-tornadus,dlc-tyranitar
batch*       up   7-00:00:00      2     idle    dlc-scyther,dlc-togepi

It prints the following columns:

name       description
PARTITION  partition name
AVAIL      availability of a partition; it can either be up or down
TIMELIMIT  maximum time limit for any user job, in days-hours:minutes:seconds
NODES      number of nodes with the respective state in this partition
STATE      state of the respective nodes (see more below)
NODELIST   names of the nodes that are part of this partition

Here's a description of the most common node states:

state  description
idle   available for use (not running any jobs)
alloc  all of the node's resources are allocated to jobs
mix    some of the node's resources are allocated to jobs, while others are idle
resv   part of a reservation (not available for general use)
drain  unavailable for new jobs (taken out of service by an administrator)
down   unavailable for use (e.g. the node failed or is unreachable)
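
If you only want to see nodes in a particular state, sinfo can filter on state, and the -N flag gives a node-oriented view, for example:

sinfo --states=idle    # list only the nodes that are currently idle
sinfo -N -l            # long, node-oriented listing with one line per node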

squeue - Viewing information about jobs in the scheduling queue

When you run squeue, you will get a list of all jobs currently running or waiting to start:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 8240     batch     bash daangeij  R 12-12:35:23     1 dlc-groudon
 8242     batch     bash daangeij  R 12-12:16:27     1 dlc-articuno
 8243     batch     bash daangeij  R 12-12:05:08     1 dlc-articuno
 8370     batch     bash daangeij  R  6-18:54:38     1 dlc-tyranitar
 8520     batch     bash nadiehkh  R  2-15:07:48     1 dlc-jynx
 8546     batch     bash steffanb  R  1-21:06:05     1 dlc-drowzee
 8609     batch     bash steffanb  R    19:44:56     1 dlc-articuno
 8639     batch     bash clementg  R    14:14:34     1 dlc-tornadus
 8659     batch convnext eliasbau  R       42:11     1 dlc-tornadus
 8662     batch test_gpu lenaphil  R        0:03     1 dlc-tornadus
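
On a busy cluster this list can get long; to show only your own jobs, filter on your username ($USER expands to it automatically):

squeue -u $USER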

Most of the columns should be self-explanatory, but the ST and NODELIST (REASON) columns can be confusing.

ST stands for state. The most important states are listed below.

state  description
R      the job is running
PD     the job is pending (i.e. waiting to run)
CG     the job is completing, meaning that it will be finished soon

The NODELIST (REASON) column shows the list of compute nodes the job is running on if the job is actually running. If the job is pending, the column gives the reason why it is still pending. The most important reasons are listed below.

reason                       description
Priority                     there is another pending job with higher priority
Resources                    the job has the highest priority, but is waiting for a running job to finish
launch failed requeued held  the job launch failed, normally due to a faulty node; please contact us
Dependency                   the job cannot start before some other job is finished; this should only happen if you started the job with --dependency=...
DependencyNeverSatisfied     same as Dependency, but that other job failed; you must cancel the job with scancel <job_id>
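
For illustration, the Dependency reason only appears for jobs that were submitted with a dependency on another job, for example (the job id and script name below are placeholders):

sbatch --dependency=afterok:<job_id> postprocess.sh    # start only after job <job_id> has finished successfully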

sbatch - Submitting a non-interactive job

We strongly recommend using sbatch to submit non-interactive jobs to the cluster. All you need is a bash script that contains the instructions, configuration, and parameters for the job submission. An example is provided below:

#!/bin/bash
#SBATCH --qos=low
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/data/bodyct:/data/bodyct
#SBATCH --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1"

python3 /data/bodyct/experiments/lena_t10027/test_file.py

It starts with #!/bin/bash and is followed by various directives and commands. The directives specify parameters such as the job name, resource requirements, time limits, and output files. Set the directives to configure the job. Common directives include:

flag                        description
--ntasks=<num_tasks>        specifies the number of tasks or processes to be launched
--gpus-per-task=<num_gpus>  sets the number of GPUs allocated per task
--gres=gpumem:10g           VRAM required per allocated GPU (here 10 GB)
--mem=<mem>                 memory required per node; default units are megabytes
--mem-per-cpu=<mem>         memory required per allocated CPU; default units are megabytes
--mem-per-gpu=<mem>         memory required per allocated GPU; default units are megabytes
--cpus-per-task=<num_cpus>  sets the number of CPUs allocated per task
--time=<time>               sets the maximum runtime for the job
--job-name=<name>           sets the name of the job
--output=<output_file>      specifies the file to which the job's standard output is written
--error=<error_file>        defines the file to which the job's standard error is written
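
As an illustration of the output directives, the pattern %j in a filename is replaced by Slurm with the job id, which keeps the logs of different runs apart (the log directory below is a placeholder):

#SBATCH --job-name=test_gpu
#SBATCH --output=/path/to/logs/slurm-%j.out
#SBATCH --error=/path/to/logs/slurm-%j.err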

The SLURM directives are extended by the Pyxis plugin. It provides additional options like:

  • --container-image="<container-image>" : specifies the container image to use for the job. Enter a # between the domain name and the image location instead of the usual /.
  • --container-mounts=<host_path>:<container_path> : defines the mount points between the host system and the container. The <host_path> parameter represents the directory or file path on the host system, and the <container_path> parameter indicates the corresponding mount location within the container. This directive enables sharing or accessing data and files between the host and the container during job execution. Unlike the old cluster, the new cluster does not automatically mount the Blissey/Chansey shares in the container. In the new cluster, all shares are under the /data/ location. If you previously used /mnt/netcache, you will have to change this. An example of mounting multiple shares is shown after this list.
  • --container-entrypoint : specifies the entry point for the container. This directive is optional and is only required if the container image does not have an entry point defined. The entry point is the first command executed when the container starts.
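
If a job needs access to more than one share, --container-mounts accepts several <host_path>:<container_path> pairs separated by commas (a small sketch, assuming the Pyxis version on the cluster supports comma-separated mounts, as current releases do):

--container-mounts=/data/bodyct:/data/bodyct,/data/pathology:/data/pathology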

After the directives, the commands necessary to execute the job follow. This may involve running an executable, invoking a script, or performing any other task relevant to your job.

Once you have defined the script structure, directives, and job commands, save the sbatch script as <filename>.sh. To submit the job, execute:

sbatch path/to/file/<filename>.sh
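
If the submission is accepted, sbatch prints the id that was assigned to the job (the number below is just an example):

$ sbatch path/to/file/<filename>.sh
Submitted batch job 8663

You can use this id with squeue, scontrol, and scancel as described below.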

srun - Running an interactive job

To run an interactive job, the srun command is used along with specific options. These options convey the resource requirements, such as the desired number of nodes, memory allocation, and time limit for the job (similar to directives described above).

For example:

srun \
     --qos=low \
     --ntasks=1 \
     --gpus-per-task=1 \
     --cpus-per-task=4 \
     --mem=5G \
     --time=4:00:00 \
     --container-mounts=/data/pathology:/data/pathology \
     --container-image=doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1 \
     --pty bash

--pty bash gives you a pseudo-terminal that runs bash. The interactive job will shut down once the terminal connection is closed. To keep a job alive you can use a terminal multiplexer like tmux or screen.
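
A minimal tmux workflow could look like this (the session name is arbitrary):

tmux new -s interactive      # start a named tmux session on the login node
srun ... --pty bash          # inside tmux, start the interactive job as shown above
# detach with Ctrl-b d; the job keeps running, and you can reconnect later with:
tmux attach -t interactive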

Moreover, srun allows you to launch parallel tasks and control their execution. It provides options to specify the number of tasks (--ntasks), CPU cores per task (--cpus-per-task), task dependencies, and other parameters for efficient parallel execution.

NB: if you have a job running on a GPU node that is expected to use a GPU on that node, you can check the GPU usage by running the following command from the login node:

srun -s --jobid <job_id> --pty nvidia-smi

scontrol - Monitoring jobs

To monitor your job, you can use the following command. It lists detailed information that can be useful for troubleshooting:

scontrol show jobid -dd <jobid>
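
scontrol can show other entities as well; for example, to inspect the state and resources of a compute node (node name taken from the sinfo output above):

scontrol show node dlc-tornadus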

scancel - Cancelling jobs

To cancel a job, run scancel with the job id (which you can get from the squeue overview) as argument.

scancel <job-id>
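
To cancel all of your own jobs at once, you can filter on your username instead of passing a single job id:

scancel -u $USER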