Getting Started with SOL Cluster

This section is intended for both new and experienced users who want to learn more about the cluster. The page has been designed so that an inexperienced user should be able to find their way around the cluster with some basic help from Google.

We have recorded video tutorials to help you get started. You can watch them here.

SOL

SOL is the name of the dedicated computing cluster managed by DIAG and the Radboud Technology Center (RTC) Deep Learning. Its primary use is to train and apply deep learning neural networks.

When fully operational, SOL consists of more than 20 individual machines, with an aggregate of over 80 GPUs of varying capacities. This setup is complemented by two storage servers, Chansey and Blissey, which are designed to store transient experimental data up to a total capacity of 800 TB. The entire cluster is housed within the DIAG Data Center.

SOL Team

The SOL cluster is overseen by the DIAG SOL Team, dedicated to maintaining the hardware and software of the cluster. An overview of the DIAG SOL Team and its members can be found in the about section. If you have any questions, visit the SOL Teams channel or contact one of the team members.

Getting access

DIAG members are granted access to the SOL cluster through their respective supervisors, who will arrange for an account to be created by the DIAG SOL Team (check the Guidelines for New Researchers on Teams). For external parties, access is arranged through the RTC Deep Learning.

Connecting to the SOL Cluster

Once you have received a user account, you must log in to the controller node (through SSH) in order to submit jobs to the cluster. The controller node assigns jobs to the compute nodes, where your job will actually run. Ensure that you are connected to the hospital network, either directly or via a Virtual Private Network (VPN), before attempting to SSH into the system.

For Windows users, SSH is an available feature but doesn't come pre-installed. To install SSH, navigate to Settings > Apps, then click on Manage optional features under Apps and features. Click Add features, look for OpenSSH client, and install it. A system restart might be required. Once installed, SSH can be used via the Command Prompt.
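To verify that the installation succeeded, open a Command Prompt and run:

ssh -V

If the OpenSSH client is available, this prints its version number.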

Run the following command:

ssh <username>@oaks-lab

You will then be prompted to enter your password. If entered correctly, you will be directed to your home folder on the controller node.

Cluster overview

To get an overview of available compute nodes and running jobs, we have created an overview page (the equivalent of Metapod on SOL 1). This overview is called Pokedex and can be reached when you are connected to the hospital network (either directly or through VPN).

Data storage

In order to run a job, you need to store your scripts and data on the Chansey or Blissey drives. Information on how to mount these drives on your local PC can be found on the Storage Systems page. On the cluster, these drives are available under paths such as /data/bodyct and /data/pathology.
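As a quick check (using /data/bodyct purely as an example; use the share that applies to your group), you can list the contents of a drive once you are logged in to the controller node:

ls /data/bodyct

If the command lists the expected folders, the drive is reachable from the cluster.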

Docker

On our cluster, we run experiments within Docker containers to ensure they run with the correct versions of all the libraries and dependencies they require. When submitting a job to the cluster, you therefore need to specify which Docker image you want to use to run your experiment. The Docker image you use determines which packages are available in the jobs that you are running. There are base images containing common requirements, but you can also add your own. For more information, see the Docker section.
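As a rough sketch of adding your own image (the image name and tag below are hypothetical; the available base images and the workflow for making images usable on the cluster are described in the Docker section), a custom image is typically built from a Dockerfile with:

docker build -t my-experiment-image:latest .

The resulting image then needs to be made available to the cluster as described in the Docker section.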

Setting up SSH key authentication

Setting up SSH key authentication allows you to securely connect to the controller node and to running Docker containers. Once set up, you can also access running jobs from VS Code or PyCharm to debug your code more efficiently. This is not required for all types of jobs, but we recommend setting it up once, as it will likely be very helpful later. To set up an SSH key configuration, take a look at SSH Key Authentication.
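As a minimal sketch (the exact, cluster-specific steps are on the SSH Key Authentication page), generating a key pair on your local machine and copying the public key to the controller node typically looks like this on Linux or macOS:

ssh-keygen -t ed25519
ssh-copy-id <username>@oaks-lab

On Windows, where ssh-copy-id is not available by default, you can instead append the contents of the generated .pub file to ~/.ssh/authorized_keys on the controller node.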

Monitoring Resource Utilization

There are several ways to monitor resource utilization:

  1. through Grafana
  2. by logging in to the node where your job is running (see the next section and the example after this list)
  3. when using wandb, through the "System" button in the left panel of your run
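For option 2, once you are on the node where your job is running, standard command-line tools can be used to inspect usage, for example:

nvidia-smi    # GPU utilization and VRAM usage
htop          # CPU and RAM usage, if installed on the node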

Running jobs using Slurm

Experiments running on the SOL cluster are managed by a workload manager known as Slurm, in combination with Docker containers. This ensures efficient and fair utilization of the cluster's resources by all users. Using Slurm commands you can submit different types of jobs, cancel your jobs, monitor the status of your jobs, and more. You can find detailed documentation of useful Slurm commands on the dedicated Slurm documentation page.
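A few standard Slurm commands you will use frequently (cluster-specific options and submission examples are on the Slurm documentation page):

squeue -u $USER    # list your pending and running jobs
scancel <jobid>    # cancel one of your jobs
sinfo              # show the state of the partitions and nodes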

To run an application (e.g. a Python script to train your algorithm), you will need to submit a job to the cluster. This job will be scheduled by Slurm and run on one of the available nodes. Two types of jobs can be submitted to the cluster. The first performs one or more predetermined tasks. The second type creates an interactive session on a node, which you can use for debugging or for running a Jupyter Notebook. The Running Jobs page provides instructions on how to run the different types of jobs.
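As a generic sketch of the first type of job (a plain Slurm batch script; on SOL you additionally specify the Docker image and resources as described on the Running Jobs page, and the job and script names below are hypothetical):

#!/bin/bash
#SBATCH --job-name=my-experiment    # hypothetical job name
#SBATCH --gres=gpu:1                # request one GPU
#SBATCH --time=01:00:00             # one hour of wall time
python train.py

Such a script is submitted with sbatch, after which Slurm queues it and runs it on a suitable node.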

Monitoring Resource Utilization

Monitoring resource utilization helps to ensure that computational resources like GPUs and CPUs are used efficiently. It allows for early detection of bottlenecks, helps in optimizing model performance, and can prevent overuse or underuse of resources. Please use the provided tools to determine the requirements (VRAM, RAM, number of CPUs) of your jobs, so that you do not claim excessive or unnecessary resources.

Have a look at the Monitoring Resource Utilization page for an overview of the different available options.

Interpreter sessions via SSH

To run interpreter sessions via SSH in your running container, please follow the instructions provided on the Running Jobs page, under Interactive Jobs, Remote Debugging.