Monitoring Resource Utilization

Monitoring resource utilization helps ensure that computational resources such as GPUs and CPUs are used efficiently. It allows early detection of bottlenecks, helps optimize model performance, and can prevent both overuse and underuse of resources. Please use the tools described here to determine the requirements of your jobs (VRAM, RAM, number of CPUs) so that you do not request excessive or unnecessary resources.

There are many ways to monitor resource utilization. We will discuss three:

1. Through Grafana
2. By logging in to the node where your job is running
3. Using Weights and Biases (wandb.ai)

1. Grafana

Grafana is a popular open-source data visualization and monitoring tool commonly used to track and analyze resource utilization in various systems, including deep learning clusters managed via Slurm.

Grafana integrates with Slurm to fetch data from the cluster's monitoring system, which can include CPU and GPU utilization, memory usage, network traffic, job status, and other relevant data points.

It helps users and administrators gain valuable insights into the resource utilization and performance of the cluster, enabling better monitoring, troubleshooting, and optimization of the cluster's resources.

You can access Grafana's web UI at tracey:3000.

2. Connecting to a node

You can only log in to a compute node if you have a job running there. Connecting to a node via SSH is useful if, for example, you want to track the GPU usage of your job. You do not need SSH keys configured to log in to the node; simply run:

ssh {nodename}

You should now be logged in to the node your job is running on. From there you can run basic monitoring commands such as htop (CPU and RAM usage) or nvidia-smi (GPU utilization and VRAM usage).
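
If you want to record GPU usage over time rather than read it off the screen, nvidia-smi can also print machine-readable output. The following is a minimal sketch, assuming Python 3 is available on the compute node; the names GpuSample, parse_gpu_csv and query_gpus are hypothetical helpers introduced for this example and are not part of any cluster tooling.

```python
import subprocess
from dataclasses import dataclass
from typing import List

# Fields requested from nvidia-smi. Run `nvidia-smi --help-query-gpu`
# on the node to see every field that is available.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


@dataclass
class GpuSample:
    index: int             # GPU index on the node
    utilization_pct: int   # GPU utilization in percent
    memory_used_mib: int   # VRAM currently in use, in MiB
    memory_total_mib: int  # total VRAM of the GPU, in MiB


def parse_gpu_csv(text: str) -> List[GpuSample]:
    """Parse output produced with --format=csv,noheader,nounits.

    Assumes every field is numeric; some GPUs report a placeholder
    such as [N/A] for unsupported fields, which would raise ValueError here.
    """
    samples = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, util, used, total = (int(field.strip()) for field in line.split(","))
        samples.append(GpuSample(index, util, used, total))
    return samples


def query_gpus() -> List[GpuSample]:
    """Run nvidia-smi once on the current node and return one sample per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling query_gpus() at regular intervals while your job runs and keeping the largest memory_used_mib value gives a rough estimate of the VRAM the job actually needs.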

3. Weights and Biases (wandb.ai)

Wandb is a free-to-use tool that helps you track your machine learning experiments. Besides logging model performance, it also records information about the resources that a run uses. These metrics are shown under system in each of your runs.

Running wandb on the cluster should be relatively straightforward with the documentation provided on the wandb website. The first time you use wandb, please run your experiment in an interactive job to perform the initial login. Your credentials should be stored automatically in your home folder, which is available on all nodes. After this initial login, subsequent runs should therefore be logged in automatically.
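
As an illustration, a training script that reports to wandb might look like the sketch below. It assumes the wandb Python package is installed in your environment and that the initial login has already been done. The project name and the toy_loss function are placeholders for this example, not part of your actual setup.

```python
def toy_loss(step: int) -> float:
    """Stand-in for a real training loss; it shrinks as the step count grows."""
    return 1.0 / (step + 1)


def train_with_wandb(num_steps: int = 100) -> None:
    # Imported inside the function so this file can still be loaded
    # in environments where wandb is not installed.
    import wandb

    # "resource-monitoring-demo" is a hypothetical project name.
    run = wandb.init(project="resource-monitoring-demo")
    for step in range(num_steps):
        # Only the model metric is logged explicitly here; the resource
        # metrics shown under system are gathered by wandb itself.
        wandb.log({"loss": toy_loss(step)})
    run.finish()
```

Calling train_with_wandb() inside a job creates a run. Very short runs may finish before many resource samples have been collected, so use a job that runs for at least a few minutes when estimating requirements.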