Running jobs with Slurm

This page shows how to run the most common jobs on the cluster; please read the Slurm documentation page for further information. Slurm supports two types of jobs. Regular jobs run predetermined scripts that require no user input after they start. Interactive jobs let you interact with your job while it is running, for example to run a Jupyter Lab session or to connect your Python IDE (Integrated Development Environment) for remote debugging of your code. Interactive jobs have a 4-hour time limit to prevent crowding of the cluster.

Regular jobs (using sbatch)

We strongly recommend using sbatch to submit non-interactive jobs to the cluster. All you need is a bash script that contains the instructions, configuration, and parameters for the job submission process. An example is provided below:

#!/bin/bash
#SBATCH --qos=low
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5G
#SBATCH --time=4:00:00
#SBATCH --container-mounts=/data/bodyct:/data/bodyct
#SBATCH --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1"

python3 /data/bodyct/experiments/lena_t10027/test_file.py

The script starts with #!/bin/bash and is followed by various directives and commands. The directives specify parameters such as the job name, resource requirements, time limits, and output files; set them to configure the job (a combined example follows the table below). Common directives include:

flag                          description
--qos=<qos_name>              specifies the job priority (high, low, idle)
--ntasks=<num_tasks>          specifies the number of tasks or processes to be launched
--cpus-per-task=<num_cpus>    sets the number of CPUs allocated per task
--gpus-per-task=<num_gpus>    sets the number of GPUs allocated per task
--gres=gpumem:10g             VRAM required per allocated GPU
--mem=<mem>                   memory required; default units are megabytes
--mem-per-cpu=<mem>           memory required per allocated CPU; default units are megabytes
--mem-per-gpu=<mem>           memory required per allocated GPU; default units are megabytes
--nodelist                    subselection of nodes to run on (--nodelist=dlc-jynx,dlc-scyther)
--time=<time>                 sets the maximum runtime for the job
--job-name=<name>             sets the name of the job
--output=<output_file>        specifies the file to which the job's standard output is written
--error=<error_file>          defines the file to which the job's standard error is written
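As a rough illustration of how these directives combine, the sketch below adds a job name and dedicated output/error files to the example script above. The paths, resource numbers, and script name are placeholders rather than recommendations; %j is Slurm's placeholder for the job ID in output file names.

#!/bin/bash
#SBATCH --qos=low
#SBATCH --job-name=my_experiment                 # name shown for the job in the queue
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G                         # memory per allocated CPU
#SBATCH --time=2:00:00                           # maximum runtime
#SBATCH --output=/data/bodyct/logs/%j_out.txt    # standard output of the job
#SBATCH --error=/data/bodyct/logs/%j_err.txt     # standard error of the job
# (add the --container-image and --container-mounts directives from the example above as needed)

python3 your_script.py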

The Slurm directives are extended by the Pyxis plugin, which provides additional options such as:

  • --container-image="<container-image>" : specifies the container image to use for the job. Enter a # between the domain name and the image location instead of the usual /.
  • --container-mounts=<host_path>:<container_path> : defines the mount points between the host system and the container. The <host_path> parameter is the directory or file path on the host system, and the <container_path> parameter is the corresponding mount location within the container. This directive enables sharing or accessing data and files between the host and the container during job execution. Unlike the old cluster, the new cluster does not automatically mount the Blissey/Chansey shares in the container. On the new cluster, all shares are under the /data/ location; if you previously used /mnt/netcache, you will have to update your paths (see the sketch after this list).
  • --container-entrypoint : specifies the entry point for the container. This directive is optional and is only required if the container image does not have an entry point defined. The entry point is the first command executed when the container starts. NB: if you cannot get your entrypoint to work with this flag, check this workaround.
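To make the image naming and mount conventions above concrete, here is a minimal sketch. The image tag is taken from the example script at the top of this page; everything else is illustrative only.

# Docker-style image reference:       doduo1.umcn.nl/uokbaseimage/diag:tf2.10-pt1.12-v1
# Pyxis-style reference (note the #): doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1
#SBATCH --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1"

# Mount a share into the container; on the new cluster all shares live under /data/
#SBATCH --container-mounts=/data/bodyct:/data/bodyct

# Old-cluster paths under /mnt/netcache must be rewritten to the corresponding /data/ location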

After the directives, the commands necessary to execute the job follow. This may involve running an executable, invoking a script, or performing any other task relevant to your job.

Once you have defined the script structure, directives, and job commands, save the sbatch script as <filename>.sh. Note that you can also save the sbatch script as <filename>.slurm; this does not affect the functionality, but makes it clear to other users that the file is intended for Slurm. To submit the job, execute:

sbatch /data/bodyct/experiments/path/to/file/<filename>.sh

Interactive jobs

Jupyter Lab session

If you want to work with Jupyter Notebooks, you can start a Jupyter Lab session. This allows for easy visualization of data and demonstrations of your code. See the page: Start a Jupyter Lab Session.

Remote debugging

You can also use your own Python IDE (see the IDE page) to debug your code running on SOL. To accomplish this, you first have to make sure you have set up your SSH-key authentication. Please read the SSH Keys documentation page.
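For reference, generating a key pair with a standard OpenSSH client looks roughly like the sketch below. This is a generic example only; the key type, file name, and registration procedure are assumptions, and the SSH Keys documentation page remains the authoritative guide for cluster-specific steps.

# on your local machine: create a key pair (file name and key type are just examples)
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_sol

# print the public key so you can register it as described on the SSH Keys page
cat ~/.ssh/id_ed25519_sol.pub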

The next step is to start an interactive job on SOL. Interactive jobs are started with the srun command, whose options convey the resource requirements, such as the desired number of nodes, memory allocation, and time limit for the job. Make sure to also add the --no-container-remap-root flag.

srun \
  --qos=low \
  --ntasks=1 \
  --gpus-per-task=1 \
  --cpus-per-task=4 \
  --mem=5G \
  --time=4:00:00 \
  --no-container-remap-root \
  --container-mounts=/data/bodyct:/data/bodyct \
  --container-image="doduo1.umcn.nl#uokbaseimage/diag:tf2.10-pt1.12-v1" \
  --pty bash

This command will open a bash terminal inside your running container. Pick an access port (e.g. 5544) and start the sshd service by typing /usr/sbin/sshd -p <port> in this terminal, as shown below. This exposes a port in your container that allows communication with your IDE, so you can run debug sessions directly from your IDE inside your running docker container. You will need to set up your IDE properly, as described below.
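For example, assuming you picked port 5544 (an arbitrary number), starting the SSH daemon inside the container terminal looks like this:

# run inside the bash terminal of the interactive job
/usr/sbin/sshd -p 5544    # start sshd on the chosen port; use the same port when connecting your IDE or ssh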

VS Code

Once you have completed the steps above, you can take a look at the VS Code debug page on how to connect your running container to VS Code.

SSH'ing into a running docker

Alternatively, once you have completed the steps above, you can SSH from a terminal into your docker container for a more fully featured terminal-based experience. This is done via:

ssh <username>@dlc-<nodename> -p <port> # E.g. ssh leandervaneekelen@dlc-tornadus -p 40398

PyCharm

Once you have completed the previous steps, follow the PyCharm tutorial.