Slurm Workload Manager
The task of a workload manager on a compute system is to control access to compute resources and to distribute work across them.
Basic Workflow
After logging in, a user submits jobs from the login node to the workload manager. A job is an application together with a resource description for one particular run of that application. The resource description tells the workload manager which resources the run requires, e.g., the number of compute nodes, the number of cores per compute node, the amount of memory, and the time limit. After submission, the job is placed in the scheduling queue, where it waits for the requested resources to become available. Once they are available, the job is placed on the allocated compute nodes and starts running. When the job completes, the allocated resources are released for the next job.
Interacting with the Slurm Workload Manager
A user can interact with the workload manager using several commands:
- sinfo: view information about nodes and partitions
- squeue: view information about jobs in the scheduling queue
- sbatch: submit a batch script
- srun: submit a job directly from the command line
- scontrol: view (and modify) configurations and workload manager state
- scancel: cancel jobs based on their ID
Each of these commands accepts many options. To learn about them, invoke the command with the --help option (e.g. sinfo --help) or consult the official Slurm documentation.
sinfo - Viewing information about nodes and partitions
The compute nodes of a cluster are grouped into logical collections, called partitions.
sinfo can be used to view information about these partitions:
>$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 2-00:00:00 2 down* dlc-drowzee,dlc-electabuzz
batch* up infinite 1 drain dlc-magmar
batch* up 7-00:00:00 5 mix dlc-articuno,dlc-groudon,dlc-jynx,dlc-tornadus,dlc-tyranitar
batch* up 7-00:00:00 2 idle dlc-scyther,dlc-togepi
It prints the following columns:
name | description |
---|---|
PARTITION | partition name; a trailing * marks the default partition |
AVAIL | availability of the partition; a partition is either up or down |
TIMELIMIT | maximum time limit for any user job, in days-hours:minutes:seconds |
NODES | number of nodes in the partition that are in the given state |
STATE | state of these nodes (see below) |
NODELIST | names of the nodes in the partition that are in the given state |
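The days-hours:minutes:seconds notation is awkward to compare or sum in scripts. The following is a minimal sketch of a converter to seconds. The function name slurm_time_to_seconds is hypothetical, and the sketch only handles the forms that appear in the sample outputs on this page ([days-]hours:minutes:seconds and minutes:seconds); values such as infinite are rejected.

```shell
# Convert a Slurm duration to seconds.
# Supported forms: [days-]hours:minutes:seconds and minutes:seconds.
slurm_time_to_seconds() {
    local t=$1 days=0 h=0 m=0 s=0
    local -a parts
    if [[ $t == *-* ]]; then
        days=${t%%-*}   # text before the first "-"
        t=${t#*-}       # text after the first "-"
    fi
    IFS=: read -r -a parts <<< "$t"
    case ${#parts[@]} in
        3) h=${parts[0]}; m=${parts[1]}; s=${parts[2]} ;;
        2) m=${parts[0]}; s=${parts[1]} ;;
        *) return 1 ;;  # e.g. "infinite" or an unsupported form
    esac
    # 10# forces base 10 so that values like "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 2-00:00:00   # prints 172800
slurm_time_to_seconds 42:11        # prints 2531
```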
The most common node states are:
state | description |
---|---|
idle | available for use (not running any jobs) |
alloc | all of the node's resources are allocated to jobs |
mix | some of the node's resources are allocated to jobs, while others are idle |
resv | part of a reservation (not available for use) |
drain | unavailable for new jobs at the request of an administrator |
down | unavailable for use, typically because of a failure |
A * appended to a state (as in down* above) means that the node is not responding.
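To get a quick number out of this listing, for example how many nodes are currently idle, the default sinfo output can be piped through awk. This is a sketch that assumes the default six-column layout shown above; the function name count_nodes_in_state is hypothetical, and the sample text is copied from the sinfo output on this page so the snippet can be tried without a cluster.

```shell
# Sum the NODES column of sinfo-style output (read from stdin) for one state.
# A trailing "*" on the state (e.g. "down*") is ignored when comparing.
count_nodes_in_state() {
    awk -v want="$1" '
        NR > 1 {                      # skip the header line
            state = $5
            sub(/\*$/, "", state)
            if (state == want) total += $4
        }
        END { print total + 0 }
    '
}

sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up 2-00:00:00 2 down* dlc-drowzee,dlc-electabuzz
batch* up infinite 1 drain dlc-magmar
batch* up 7-00:00:00 5 mix dlc-articuno,dlc-groudon,dlc-jynx,dlc-tornadus,dlc-tyranitar
batch* up 7-00:00:00 2 idle dlc-scyther,dlc-togepi'

count_nodes_in_state idle <<< "$sample"   # prints 2
```

On the cluster the same function would be fed live data, e.g. `sinfo | count_nodes_in_state idle`.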
squeue - Viewing information about jobs in the scheduling queue
When you run squeue, you will get a list of all jobs currently running or waiting to start:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8240 batch bash daangeij R 12-12:35:23 1 dlc-groudon
8242 batch bash daangeij R 12-12:16:27 1 dlc-articuno
8243 batch bash daangeij R 12-12:05:08 1 dlc-articuno
8370 batch bash daangeij R 6-18:54:38 1 dlc-tyranitar
8520 batch bash nadiehkh R 2-15:07:48 1 dlc-jynx
8546 batch bash steffanb R 1-21:06:05 1 dlc-drowzee
8609 batch bash steffanb R 19:44:56 1 dlc-articuno
8639 batch bash clementg R 14:14:34 1 dlc-tornadus
8659 batch convnext eliasbau R 42:11 1 dlc-tornadus
8662 batch test_gpu lenaphil R 0:03 1 dlc-tornadus
Most of the columns are self-explanatory, but the ST and NODELIST(REASON) columns can be confusing.
ST stands for state. The most important states are listed below.
state | description |
---|---|
R | job is running |
PD | job is pending (i.e. waiting to run) |
CG | job is completing: it has ended, but some of its processes may still be shutting down |
For a running job, the NODELIST(REASON) column lists the compute nodes the job is running on. For a pending job, it gives the reason why the job is still pending. The most important reasons are listed below.
reason | description |
---|---|
Priority | there is another pending job with higher priority |
Resources | the job has the highest priority, but is waiting for running jobs to free the resources it needs |
launch failed requeued held | job launch failed for some reason. This is normally due to a faulty node. Please contact us. |
Dependency | job cannot start before some other job is finished. This should only happen if you started the job with --dependency=... |
DependencyNeverSatisfied | same as Dependency, but that other job failed. You must cancel the job with scancel <job_id>. |
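When the queue is long, it can help to reduce the listing to the pending jobs and their reasons. The sketch below filters squeue-style text with awk; it assumes the default column layout shown above and job names without spaces. The function name pending_jobs, the user alice, and the two pending jobs in the sample are made up for illustration.

```shell
# Print "JOBID REASON" for every pending job in squeue-style output on stdin.
pending_jobs() {
    awk 'NR > 1 && $5 == "PD" {
        reason = $8
        # a reason may contain spaces, so join the remaining fields
        for (i = 9; i <= NF; i++) reason = reason " " $i
        print $1, reason
    }'
}

# Illustrative sample; jobs 8663 and 8664 are hypothetical.
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8662 batch test_gpu lenaphil R 0:03 1 dlc-tornadus
8663 batch train alice PD 0:00 1 (Priority)
8664 batch eval alice PD 0:00 1 (Dependency)'

pending_jobs <<< "$sample"
# prints:
# 8663 (Priority)
# 8664 (Dependency)
```

On the cluster, `squeue --user="$USER" --states=PD` applies a similar filter directly in squeue.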
sbatch - Submitting a non-interactive job
We strongly recommend using sbatch to submit non-interactive jobs to the cluster. All you need is a bash script that contains the job's configuration (as directives) and the commands to run. An example is provided below:
#!/bin/bash
#SBATCH --qos=low
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/data/bodyct:/data/bodyct
#SBATCH --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1"
python3 /data/bodyct/experiments/lena_t10027/test_file.py
The script starts with #!/bin/bash, followed by #SBATCH directives and then the commands to run. The directives specify parameters such as the job name, resource requirements, time limit, and output files. Common directives include:
flag | description |
---|---|
--ntasks=<num_tasks> | specifies the number of tasks (processes) to be launched |
--gpus-per-task=<num_gpus> | sets the number of GPUs allocated per task |
--gres=gpumem:10g | VRAM required per allocated GPU |
--mem=<mem> | memory required per node; the default unit is megabytes |
--mem-per-cpu=<mem> | memory required per allocated CPU; the default unit is megabytes |
--mem-per-gpu=<mem> | memory required per allocated GPU; the default unit is megabytes |
--cpus-per-task=<num_cpus> | sets the number of CPUs allocated per task |
--time=<time> | sets the maximum runtime for the job, e.g. hours:minutes:seconds or days-hours:minutes:seconds |
--job-name=<name> | sets the name of the job |
--output=<output_file> | specifies the file to which the job's standard output is written |
--error=<error_file> | specifies the file to which the job's standard error is written |
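As a sketch of how the naming and output directives fit together, the script below gives the job a name and writes its standard output and standard error to separate files. The job name demo and the file names are hypothetical; %j is the Slurm filename pattern that expands to the job ID. SLURM_JOB_ID and SLURM_CPUS_PER_TASK are environment variables that Slurm sets inside a job; the fallbacks are only there so the script can also be dry-run with plain bash. Cluster-specific options such as --qos or the container options are left out here.

```shell
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --output=demo-%j.out
#SBATCH --error=demo-%j.err
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5G
#SBATCH --time=0:10:00

# Set by Slurm inside a job; defaults apply when run outside Slurm.
job_id=${SLURM_JOB_ID:-none}
cpus=${SLURM_CPUS_PER_TASK:-1}

summary="job ${job_id} using ${cpus} CPU(s)"
echo "$summary"                      # goes to the --output file
echo "warnings would go here" >&2    # goes to the --error file
```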
The Slurm options are extended by the Pyxis plugin, which provides additional options such as:
- --container-image="<container-image>": specifies the container image to use for the job. Use a # between the registry domain name and the image location instead of the usual /.
- --container-mounts=<host_path>:<container_path>: defines the mount points between the host system and the container. <host_path> is the directory or file path on the host system, and <container_path> is the corresponding mount location inside the container. This makes data and files on the host accessible to the job running in the container. Unlike the old cluster, the new cluster does not automatically mount the Blissey/Chansey shares in the container. In the new cluster, all shares are under the /data/ location. If you previously used /mnt/netcache, you will have to change this.
- --container-entrypoint: runs the entry point defined in the container image, i.e. the first command executed when the container starts. This option is optional.
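Because the # separator is easy to forget when copying an image reference from a docker command, a small helper can do the conversion. This is a minimal sketch: the function name to_pyxis_image is hypothetical, and it assumes the reference starts with a registry domain followed by a /.

```shell
# Convert "registry/path/image:tag" to the "registry#path/image:tag"
# form expected by --container-image.
to_pyxis_image() {
    local ref=$1
    # Replace only the first "/" with "#"; later slashes stay untouched.
    echo "${ref/\//#}"
}

to_pyxis_image "doduo1.umcn.nl/uokbaseimage/diag:tf2.10-pt1.12-v1"
# prints doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1
```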
After the directives, the commands necessary to execute the job follow. This may involve running an executable, invoking a script, or performing any other task relevant to your job.
Once you have defined the script structure, directives, and job commands, save the sbatch script as <filename>.sh. To submit the job, execute:
sbatch path/to/file/<filename>.sh
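Submission can also be scripted, for instance to chain two jobs with the --dependency option that appears in the squeue reason table. The sketch below relies on sbatch --parsable, which prints only the job ID (optionally followed by ;cluster), and on the afterok dependency type, which starts the second job only if the first one finishes successfully. The function name submit_chain and the script names in the usage comment are hypothetical.

```shell
# Submit two batch scripts so that the second one starts only after the
# first one has completed successfully. Prints both job IDs.
submit_chain() {
    local first_id second_id
    first_id=$(sbatch --parsable "$1") || return 1
    first_id=${first_id%%;*}      # drop an optional ";cluster" suffix
    second_id=$(sbatch --parsable --dependency="afterok:${first_id}" "$2") || return 1
    echo "${first_id} ${second_id%%;*}"
}

# Usage on the cluster (script names are placeholders):
#   submit_chain preprocess.sh train.sh
```

If the first job fails, the second one stays pending with the reason DependencyNeverSatisfied and has to be cancelled manually.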
srun - Running an interactive job
To run an interactive job, use the srun command. Its options describe the resource requirements, such as the number of tasks, the amount of memory, and the time limit for the job (similar to the directives described above).
For example:
srun \
--qos=low \
--ntasks=1 \
--gpus-per-task=1 \
--cpus-per-task=4 \
--mem=5G \
--time=4:00:00 \
--container-mounts=/data/pathology:/data/pathology \
--container-image=doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1 \
--pty bash
--pty bash gives you a pseudo terminal that runs bash. The interactive job will shut down once the terminal connection is closed.
To keep the job alive after you disconnect, start it inside a terminal multiplexer such as tmux or screen.
Moreover, srun allows you to launch parallel tasks and control their execution. It provides options to specify the number of tasks (--ntasks), CPU cores per task (--cpus-per-task), task dependencies, and other parameters for efficient parallel execution.
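When srun starts several tasks, each task runs the same command, so the command usually has to work out which share of the work belongs to it. A common approach, sketched below, uses the SLURM_PROCID (index of the current task) and SLURM_NTASKS (total number of tasks) environment variables that Slurm sets for each task. The function name items_for_this_task and the script name in the usage comment are hypothetical; the defaults let the function run outside Slurm as a single task.

```shell
# Print the item indices (0 .. total-1) that the current task should
# process, dealing items out round-robin over all tasks.
items_for_this_task() {
    local total=$1
    local rank=${SLURM_PROCID:-0}     # index of this task
    local ntasks=${SLURM_NTASKS:-1}   # number of tasks in total
    local i
    for (( i = rank; i < total; i += ntasks )); do
        echo "$i"
    done
}

# Inside a job, each of the 4 tasks would get its own share, e.g.:
#   srun --ntasks=4 bash worker.sh    # worker.sh calls items_for_this_task
```

With 3 tasks and 10 items, task 1 would process items 1, 4 and 7.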
NB: if you have a job that is running on a GPU node and that is expected to use a GPU on that node, you can check the GPU usage by running the following command from the login node:
srun -s --jobid <job_id> --pty nvidia-smi
scontrol - Monitoring jobs
To monitor your job, you can use the following command. It lists detailed information that can be useful for troubleshooting:
scontrol show jobid -dd <jobid>
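The scontrol output consists of many Key=Value pairs, which makes it easy to pull out a single field in a script. The sketch below does this with awk. The function name job_field is hypothetical, the sample is an abridged illustration of the output format rather than real output from this cluster, and values that contain spaces are cut at the first space.

```shell
# Print the value of one Key=Value field from `scontrol show job` output
# read from stdin.
job_field() {
    awk -v key="$1" '{
        for (i = 1; i <= NF; i++) {
            eq = index($i, "=")
            if (eq > 0 && substr($i, 1, eq - 1) == key) {
                print substr($i, eq + 1)
                exit
            }
        }
    }'
}

# Abridged, illustrative sample of the output format.
sample='JobId=8662 JobName=test_gpu
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:00:03 TimeLimit=04:00:00 TimeMin=N/A'

job_field JobState <<< "$sample"    # prints RUNNING
job_field TimeLimit <<< "$sample"   # prints 04:00:00
```

On the cluster this could be used as `scontrol show jobid <jobid> | job_field JobState`.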
scancel - Cancelling jobs
To cancel a job, run scancel with the job ID as argument. You can find the job ID in the squeue overview.
scancel <job-id>
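scancel can also select jobs by properties other than the job ID. As a sketch, the wrapper below cancels all of your own jobs that are still pending, using the --user and --state filters of scancel. The function name cancel_my_pending_jobs is hypothetical, and it assumes that the USER environment variable holds your cluster user name.

```shell
# Cancel all pending jobs of the current user; running jobs are untouched.
cancel_my_pending_jobs() {
    scancel --user="${USER}" --state=PENDING
}

# Usage on the cluster:
#   cancel_my_pending_jobs
```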