Hi,
Does anyone have a SLURM job script to run VASP with GPUs on multiple nodes?
With my script, which has no particular settings related to CUDA, NVHPC, OpenACC, or NCCL,
I get good scaling for VASP from 1 up to 8 GPUs within one node. Running on two nodes (i.e. 16 GPUs) is slower than running on one node (8 GPUs). However, the VASP GPU benchmarks on the NVIDIA page show scaling up to many nodes for a system of about 700 atoms.
My system has about 500 atoms, so I would expect a speedup up to at least a few nodes.
The HPC cluster has InfiniBand.
I have also compared running on two GPUs in two ways: (i) both GPUs on one node, and (ii) two nodes with one GPU each. The latter is about 20% slower.
Do I need a particular setting to run on more than one node?
Thank you in advance!
Alireza
VASP on multiple nodes each with multiple GPUs
- Newbie
- Posts: 5
- Joined: Thu Feb 02, 2023 11:27 am
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: VASP on multiple nodes each with multiple GPUs
Dear Alireza,
To better understand what types of calculations and tests you have done, I would need more information. Could you please provide the input and output files for your calculations (see guidelines)? It would also be helpful if you could attach your makefile.include, so that I can see which toolchains and libraries you are using.
When setting up a SLURM job to run your calculation on GPUs, it is important to set the number of tasks per node equal to the number of GPUs per node. This way, you can benefit from the asynchronous communication enabled by the NCCL library.
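As a rough illustration of the one-task-per-GPU layout described above, a job script might look like the sketch below. It is only a starting point under stated assumptions: 2 nodes with 8 GPUs each (as in the original question), 4 CPU cores per task, and an executable reachable as `vasp_std` (the `VASP_BIN` variable is a hypothetical placeholder). Module loads, partition names, and any MPI or NCCL environment settings are site-specific and deliberately left out.

```shell
#!/bin/bash
#SBATCH --job-name=vasp-gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8      # one MPI rank per GPU (assumes 8 GPUs per node)
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=4        # assumption: adjust to the cores available per GPU
#SBATCH --time=01:00:00

# Fall back to the values requested above when the script is run outside
# SLURM, so that it can be used as a dry run that only prints the command.
NODES="${SLURM_JOB_NUM_NODES:-2}"
GPUS_PER_NODE=8
TOTAL_RANKS=$(( NODES * GPUS_PER_NODE ))

# OpenMP threads per MPI rank; matches --cpus-per-task.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"

# Hypothetical placeholder: point this at your own VASP build.
VASP_BIN="${VASP_BIN:-vasp_std}"

# Tasks per node equals GPUs per node, as recommended in the reply above.
LAUNCH="srun --ntasks=${TOTAL_RANKS} --ntasks-per-node=${GPUS_PER_NODE} ${VASP_BIN}"
echo "Launch command: ${LAUNCH}"

# Only launch inside a real SLURM allocation.
if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    ${LAUNCH}
fi
```

Submitted with `sbatch`, this requests 16 ranks in total, one per GPU. Run directly with `bash`, it only prints the launch command, which is a quick way to check the rank layout before queuing a job.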