Hello,
There have been reports about the performance of multi-node GPU VASP calculations (https://p.vasp.at/forum/viewtopic.php?t=19178, https://ww.vasp.at/forum/viewtopic.php?p=20145), but they did not actually resolve the issue of VASP performance on multiple nodes.
As my experiments confirmed, the drop in performance is very significant (in fact, calculations are even 10-20 times slower than on a single node!).
When run on one node, VASP works perfectly and each process is bound to one GPU:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1264221 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 37926MiB |
| 1 N/A N/A 1264222 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 38084MiB |
| 2 N/A N/A 1264223 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 38084MiB |
| 3 N/A N/A 1264224 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 37876MiB |
+-----------------------------------------------------------------------------------------+
But when run on two nodes, all four processes on a node appear to be bound to GPU 0, and three of them are additionally bound to the consecutive GPUs (the same situation occurs on the other node):
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 976211 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 9070MiB |
| 0 N/A N/A 976212 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 0 N/A N/A 976213 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 0 N/A N/A 976214 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 1 N/A N/A 976212 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8966MiB |
| 2 N/A N/A 976213 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8974MiB |
| 3 N/A N/A 976214 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8846MiB |
+-----------------------------------------------------------------------------------------+
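A workaround I am considering is a per-rank wrapper script that restricts each MPI process to a single GPU via CUDA_VISIBLE_DEVICES. This is only a sketch: the script name is mine, and it assumes OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK (with SLURM_LOCALID as a fallback under srun):

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): restrict each MPI rank to one GPU.
# OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK; srun exports SLURM_LOCALID.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
# Each rank then sees only the GPU whose index matches its node-local rank.
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
# Launch the real program with its original arguments.
exec "$@"
```

Launched as, e.g., mpirun -np 8 ./bind_gpu.sh vasp_std, so that local rank N on each node sees only GPU N.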
Additionally, even when run on one node, VASP spawns some additional threads (despite OMP_NUM_THREADS=1):
htop
CPU NLWP PID USER PRI NI VIRT RES SHR S CPU%▽MEM% TIME+ Command
0 13 1265538 plglnowako 20 0 45.9G 3975M 582M R 99.8 0.5 1:15.57 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
72 13 1265539 plglnowako 20 0 46.4G 3968M 576M R 99.8 0.5 1:14.28 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
216 13 1265541 plglnowako 20 0 45.9G 3733M 576M R 96.7 0.4 1:14.35 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
144 13 1265540 plglnowako 20 0 46.4G 3987M 576M R 96.0 0.5 1:14.17 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
Is this a result of a specific task-to-GPU binding done by OpenMPI, or perhaps wrong SLURM parameters? Can you reproduce this situation? What performance do you see in your multi-node GPU VASP calculations?
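For completeness, the SLURM binding I would have expected to enforce one GPU per task looks roughly like this (a sketch of the batch-script fragment, assuming a SLURM build with GPU-binding support; option spellings may differ between versions):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU
#SBATCH --gpus-per-node=4
# --gpus-per-task=1 gives each rank its own GPU; --gpu-bind=closest
# additionally picks the GPU nearest to the rank's CPU/NUMA domain.
srun --gpus-per-task=1 --gpu-bind=closest ./vasp_std
```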
Cluster: 4 NVIDIA GH200 GPUs per node
Toolchains and libraries: NVHPC/24.5, CUDA 12.4.0, OpenMPI 5.0.3
VASP version: 6.4.3
Relevant files are attached.
Best Regards,
Leszek