
Adding support for HPC clusters #232

Open
myrelle22 opened this issue Sep 13, 2024 · 2 comments

Comments

@myrelle22

I have been searching, without success, for a solution that supports high-performance computing clusters. These consist of anywhere from two to many computing clusters, each with multiple GPUs. I would need GPU metrics per server as well as combined per cluster. There are numerous people looking for a solution to this.

I love this tool for a single server, but it won't work for clusters, and I have yet to find a solution that does.

Thank you
Andrea

@utkuozdemir
Owner

AFAIK there are places doing exactly that, using this tool in production. Are we talking about Kubernetes clusters? If so, there's a Helm chart you can use: https://github.com/utkuozdemir/helm-charts/tree/master/nvidia-gpu-exporter

A regular setup of kube-prometheus-stack plus this chart should work just fine: enable serviceMonitor.enabled=true in the values and you should be good to go.
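For reference, a minimal values override might look like this (just a sketch; `serviceMonitor.enabled` is the setting mentioned above, everything else is left at the chart defaults):

```yaml
# values.yaml override for the nvidia-gpu-exporter chart (sketch)
# Enables the ServiceMonitor so kube-prometheus-stack discovers the exporter.
serviceMonitor:
  enabled: true
```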

Even if we are not talking about Kubernetes clusters, you can simply install the exporter on each machine and configure Prometheus to scrape all of them, as sketched below. The exporter exposes metrics for all GPUs on the machine, distinguished by their UUID. What is the issue you are facing when using this tool with multiple GPUs and multiple machines?
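For the non-Kubernetes case, a Prometheus scrape config along these lines should work (a sketch, assuming the exporter's default port 9835; the hostnames and the `cluster` label are placeholders you would adapt to your environment):

```yaml
# prometheus.yml (sketch) - scrape the exporter on every node in the cluster.
# Hostnames below are placeholders for your own machines.
scrape_configs:
  - job_name: nvidia_gpu_exporter
    static_configs:
      - targets:
          - node01.example.com:9835
          - node02.example.com:9835
          - node03.example.com:9835
        labels:
          cluster: hpc-cluster-1  # example label for per-cluster aggregation
```

Per-server metrics are then distinguished by the `instance` label, and combined per-cluster views can be built in PromQL/Grafana by aggregating over the `cluster` label.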

@myrelle22
Author

These are not Kubernetes clusters. They are something similar to Beowulf clusters: multiple servers with GPUs, a common file system, and a head node for submitting jobs with a tool like the Slurm scheduler.
These types of clusters run generative AI and other applications requiring large amounts of computing power.
