Hello,
There have been reports about the performance of multi-node GPU VASP calculations (https://p.vasp.at/forum/viewtopic.php?t=19178, https://ww.vasp.at/forum/viewtopic.php?p=20145), but they did not actually resolve the issue of VASP performance on multiple nodes.
As my experiments confirmed, the drop in performance is very significant (in fact, calculations are even 10-20 times slower than on a single node!).
When run on one node, VASP works perfectly and each process is bound to one GPU:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1264221 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 37926MiB |
| 1 N/A N/A 1264222 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 38084MiB |
| 2 N/A N/A 1264223 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 38084MiB |
| 3 N/A N/A 1264224 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 37876MiB |
+-----------------------------------------------------------------------------------------+
But when run on two nodes, all four processes on a node appear to be bound to GPU 0, and three of them are additionally bound to the consecutive GPUs (the same situation occurs on the other node):
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 976211 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 9070MiB |
| 0 N/A N/A 976212 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 0 N/A N/A 976213 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 0 N/A N/A 976214 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 556MiB |
| 1 N/A N/A 976212 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8966MiB |
| 2 N/A N/A 976213 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8974MiB |
| 3 N/A N/A 976214 C ...ftware/VASP/vasp.6.4.3/bin/vasp_std 8846MiB |
+-----------------------------------------------------------------------------------------+
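A workaround I am considering is a per-rank wrapper script that restricts each MPI process to a single GPU via CUDA_VISIBLE_DEVICES. This is only a sketch: the script name is mine, and it assumes OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK (with SLURM_LOCALID as a fallback under srun):

```shell
#!/bin/bash
# bind_gpu.sh (hypothetical name): restrict each MPI rank to one GPU.
# OpenMPI exports OMPI_COMM_WORLD_LOCAL_RANK; srun exports SLURM_LOCALID.
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
# Each rank then sees only the GPU whose index matches its node-local rank.
export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
# Launch the real program with its original arguments.
exec "$@"
```

Launched as, e.g., mpirun -np 8 ./bind_gpu.sh vasp_std, so that local rank N on each node sees only GPU N.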
Additionally, even when run on one node, VASP spawns some additional threads (despite OMP_NUM_THREADS=1):
htop
CPU NLWP PID USER PRI NI VIRT RES SHR S CPU%▽MEM% TIME+ Command
0 13 1265538 plglnowako 20 0 45.9G 3975M 582M R 99.8 0.5 1:15.57 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
72 13 1265539 plglnowako 20 0 46.4G 3968M 576M R 99.8 0.5 1:14.28 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
216 13 1265541 plglnowako 20 0 45.9G 3733M 576M R 96.7 0.4 1:14.35 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
144 13 1265540 plglnowako 20 0 46.4G 3987M 576M R 96.0 0.5 1:14.17 /net/home/plgrid/plglnowakowski/software/VASP/vasp.6.4.3/bin/vasp_std
Is this a result of a specific task-to-GPU binding done by OpenMPI, or perhaps wrong SLURM parameters? Can you reproduce this situation? What performance do you see in your multi-node GPU VASP calculations?
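For completeness, the SLURM binding I would have expected to enforce one GPU per task looks roughly like this (a sketch of the batch-script fragment, assuming a SLURM build with GPU-binding support; option spellings may differ between versions):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4     # one MPI rank per GPU
#SBATCH --gpus-per-node=4
# --gpus-per-task=1 gives each rank its own GPU; --gpu-bind=closest
# additionally picks the GPU nearest to the rank's CPU/NUMA domain.
srun --gpus-per-task=1 --gpu-bind=closest ./vasp_std
```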
Cluster: 4 NVIDIA GH200 GPUs per node
Toolchains and libraries: NVHPC/24.5, CUDA 12.4.0, OpenMPI 5.0.3
VASP version: 6.4.3
Relevant files are attached.
Best Regards,
Leszek