Re: Linking error compiling GPU version of Vasp 6.3.0
Posted: Fri Mar 25, 2022 9:19 am
by marie-therese.huebsch
Ok, no worries. We can make this work.
1. Did you set LD_LIBRARY_PATH? I cannot see that you added the system library path to your .bashrc. So, did my previous suggestion help to solve the following error:
Code: Select all
/data/Software/Vasp/vasp.6.3.0/bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
2. Regarding the new error:
Code: Select all
/proj/nv/libraries/Linux_x86_64/22.2/openmpi/209518-rel-1/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!
Did you install OpenMPI yourself? And did you add openmpi/lib to your LD_LIBRARY_PATH?
As a sanity check, you can look where help-mpi-runtime.txt is located on your system. The one that comes with NV 22.2 is actually expected to be at
Code: Select all
$NVROOT/comm_libs/openmpi/openmpi-3.1.5/share/openmpi/help-mpi-runtime.txt
That is why I assume you installed OpenMPI yourself.
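The two checks above could be sketched in the shell roughly as follows. This is only an illustration under stated assumptions: the helper names `prepend_ld_path` and `find_mpi_help` are hypothetical, the default `NVROOT` is a placeholder for wherever the NVIDIA HPC SDK is installed, and the `compilers/extras/qd/lib` subdirectory is an assumption about where libqdmod.so.0 lives in that SDK.

```shell
# Prepend a directory to LD_LIBRARY_PATH unless it is already listed.
prepend_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Search a directory tree for Open MPI's runtime help file.
find_mpi_help() {
    find "$1" -name help-mpi-runtime.txt 2>/dev/null
}

# Placeholder install prefix (assumption); adjust to your system.
NVROOT="${NVROOT:-/opt/nvidia/hpc_sdk/Linux_x86_64/22.2}"

# Assumed location of libqdmod.so.0 inside the SDK.
prepend_ld_path "$NVROOT/compilers/extras/qd/lib"

# Checks to run by hand (binary path taken from the thread):
#   ldd /data/Software/Vasp/vasp.6.3.0/bin/vasp_std | grep "not found"
#   find_mpi_help "$NVROOT"
```

Putting the `prepend_ld_path` call into .bashrc makes the setting persistent. If `ldd` still reports "not found" entries afterwards, the remaining library directories can be added in the same way.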
Re: Linking error compiling GPU version of Vasp 6.3.0
Posted: Sat Mar 26, 2022 8:55 am
by paulfons
I have successfully run the simpleMPI example from cuda-samples using "mpirun -n 32 simpleMPI". I assume I am making a (fundamental) mistake in how I invoke Vasp with the GPU card. I now realize that I have to use mpirun (from the NVIDIA OpenMPI installation). Running with 32 ranks results in many copies of the error in the second inset below: "In most cases this means several MPI-ranks want to share a GPU, which is not supported by NCCL". Running vasp_ncl on a single core with "mpirun -n 1 vasp_ncl" runs correctly, but doesn't seem particularly fast compared to the CPU version. I assume that to get a speedup I need to set KPAR in the INCAR file. Is this correct? Setting KPAR to 56 seems to start correctly, but I encounter numerical errors (see third inset).
The last inset shows the INCAR file I used. The sample input (with the appropriate KPAR setting) works fine for the CPU version. Any suggestions as to how to get this test run to work correctly? Thanks!
Code: Select all
at Mar 26 17:42:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   29C    P0    31W / 165W |      0MiB / 24576MiB |      3%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Code: Select all
Vasp/GaAs>mpirun -n 32 /data/Software/Vasp/vasp.6.3.0/bin/vasp_ncl
running on 32 total cores
distrk: each k-point on 2 cores, 16 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 1 GPUs detected
-----------------------------------------------------------------------------
| |
| EEEEEEE RRRRRR RRRRRR OOOOOOO RRRRRR ### ### ### |
| E R R R R O O R R ### ### ### |
| E R R R R O O R R ### ### ### |
| EEEEE RRRRRR RRRRRR O O RRRRRR # # # |
| E R R R R O O R R |
| E R R R R O O R R ### ### ### |
| EEEEEEE R R R R OOOOOOO R R ### ### ### |
| |
| M_init_nccl: failed to initialize a NCCL communicator. |
| In most cases this means several MPI-ranks want to share a GPU, |
| which is not supported by NCCL. If this is the case, either reduce |
| the number of MPI-ranks (#-of-ranks <= #-of-GPUs) or run with |
| LUSENCCL = .FALSE. |
| |
| ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- |
| |
-----------------------------------------------------------------------------
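The error text above asks for no more MPI ranks than GPUs. One way to follow that is to derive the rank count from the number of visible devices. The sketch below assumes `nvidia-smi -L` prints one line per device starting with "GPU <index>:"; the helper names `count_gpus` and `run_one_rank_per_gpu` are hypothetical.

```shell
# Count GPUs from `nvidia-smi -L` style output read on stdin.
# `|| true` keeps the exit status zero when no device line matches.
count_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Launch the given VASP binary with one MPI rank per detected GPU.
run_one_rank_per_gpu() {
    vasp_bin="$1"
    ngpu=$(nvidia-smi -L | count_gpus)
    if [ "$ngpu" -lt 1 ]; then
        echo "no GPU detected" >&2
        return 1
    fi
    mpirun -n "$ngpu" "$vasp_bin"
}

# Example call (binary path taken from the thread):
#   run_one_rank_per_gpu /data/Software/Vasp/vasp.6.3.0/bin/vasp_ncl
```

On the single-A30 machine shown in the first inset this reduces to `mpirun -n 1`. The alternative named in the error message is to keep more ranks and set `LUSENCCL = .FALSE.` in the INCAR file.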
Code: Select all
Vasp/GaAs>mpirun -n 1 /data/Software/Vasp/vasp.6.3.0/bin/vasp_ncl
running on 1 total cores
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected
vasp.6.3.0 20Jan22 (build Mar 19 2022 10:49:18) complex
POSCAR found type information on POSCAR GaAs
POSCAR found : 2 types and 2 ions
-----------------------------------------------------------------------------
| |
| W W AA RRRRR N N II N N GGGG !!! |
| W W A A R R NN N II NN N G G !!! |
| W W A A R R N N N II N N N G !!! |
| W WW W AAAAAA RRRRR N N N II N N N G GGG ! |
| WW WW A A R R N NN II N NN G G |
| W W A A R R N N II N N GGGG !!! |
| |
| You use a magnetic or noncollinear calculation, but did not specify |
| the initial magnetic moment with the MAGMOM tag. Note that a |
| default of 1 will be used for all atoms. This ferromagnetic setup |
| may break the symmetry of the crystal, in particular it may rule |
| out finding an antiferromagnetic solution. Thence, we recommend |
| setting the initial magnetic moment manually or verifying carefully |
| that this magnetic setup is desired. |
| |
-----------------------------------------------------------------------------
scaLAPACK will be used selectively (only on CPU)
-----------------------------------------------------------------------------
| |
| ----> ADVICE to this user running VASP <---- |
| |
| You enforced a specific xc type in the INCAR file but a different |
| type was found in the POTCAR file. |
| I HOPE YOU KNOW WHAT YOU ARE DOING! |
| |
-----------------------------------------------------------------------------
LDA part: xc-table for Pade appr. of Perdew
found WAVECAR, reading the header
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
reading WAVECAR
the WAVECAR file was read successfully
augmentation electrons 18.10200827988438
soft electrons 0.000000000000000
total electrons 18.10200827988438
augmentation electrons 1.5747565785228942E-002
soft electrons 0.000000000000000
total electrons 1.5747565785228942E-002
augmentation electrons 1.5747565785228942E-002
soft electrons 0.000000000000000
total electrons 1.5747565785228942E-002
augmentation electrons 1.5747565785228942E-002
soft electrons 0.000000000000000
total electrons 1.5747565785228942E-002
augmentation electrons 134.9612722547786
soft electrons 0.000000000000000
total electrons 134.9612722547786
augmentation electrons 131.7019698547786
soft electrons 0.000000000000000
total electrons 131.7019698547786
augmentation electrons 131.7019698547786
soft electrons 0.000000000000000
total electrons 131.7019698547786
augmentation electrons 131.7019698547786
soft electrons 0.000000000000000
total electrons 131.7019698547786
reading imaginary part of occupancies ...
charge-density read from file: unknown
reading imaginary part of occupancies ...
magnetization density read from file 1
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 -0.906486783406E+01 -0.90649E+01 -0.14944E-04 15296 0.114E-01 0.233E-02
WARNING in EDDRMM: call to ZHEGV failed, returncode = 6 3 21
RMM: 2 -0.906374086794E+01 0.11270E-02 -0.36541E-05 16276 0.619E-02 0.124E-02
RMM: 3 -0.906374171468E+01 -0.84673E-06 -0.50900E-06 18352 0.237E-02 0.161E-03
WARNING in EDDRMM: call to ZHEGV failed, returncode = 6 3 24
WARNING in EDDRMM: call to ZHEGV failed, returncode = 8 4 24
RMM: 4 -0.906374176996E+01 -0.55288E-07 -0.64887E-07 16161 0.825E-03
1 F= -.90637418E+01 E0= -.90637418E+01 d E =0.000000E+00 mag= -0.0004 -0.0027 0.0000
writing wavefunctions
augmentation electrons 7.718732353504382
soft electrons 10.36378357893159
total electrons 18.08251593243597
augmentation electrons 2.0020454222713033E-005
soft electrons 10.36378357893159
total electrons -3.8122818632383288E-004
augmentation electrons 1.4820234856620306E-004
soft electrons 10.36378357893159
total electrons -2.8444639199147587E-003
augmentation electrons -1.2472124912546340E-006
soft electrons 10.36378357893159
total electrons 2.4033158347232809E-005
Warning: ieee_invalid is signaling
Warning: ieee_divide_by_zero is signaling
Warning: ieee_underflow is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP
Code: Select all
Vasp/GaAs>cat INCAR
ALGO = Fast
EDIFF = 1E-7
ENCUT = 520
IBRION = 2
ICHARG = 1
ISIF = 3
ISMEAR = -5
LORBIT = 11
LSORBIT = True
LREAL = False
LWAVE = True
NELM = 100
NSW = 0
PREC = Accurate
SIGMA = 0.05
LAECHG = True
GGA = PS
KPAR = 56
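KPAR sets how many groups of MPI ranks work on different k-points in parallel, so it is normally chosen as a divisor of the rank count (the 32-rank run above reports 16 groups of 2 cores). A small pre-flight check along those lines might look like the sketch below; the helper name `check_kpar` is hypothetical, and the thread does not establish whether the KPAR value is related to the ZHEGV warnings in the third inset.

```shell
# Report whether KPAR is compatible with the number of MPI ranks:
# KPAR must not exceed the rank count and should divide it evenly.
check_kpar() {
    ranks="$1"
    kpar="$2"
    if [ "$kpar" -ge 1 ] && [ "$kpar" -le "$ranks" ] && [ $((ranks % kpar)) -eq 0 ]; then
        echo "ok"
    else
        echo "mismatch"
    fi
}

check_kpar 32 16   # 32 ranks split into 16 k-point groups
check_kpar 1 56    # a single rank cannot form 56 groups
```

For a run with `mpirun -n 1`, the only value that passes this check is KPAR = 1.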
Re: Linking error compiling GPU version of Vasp 6.3.0
Posted: Tue Mar 29, 2022 3:37 pm
by marie-therese.huebsch
Hi paulfons,
it seems you are sorting out how to submit a job now. This thread has become quite long already and you have not answered the questions I have asked before:
1. Did you set LD_LIBRARY_PATH? I cannot see that you added the system library path to your .bashrc. So, did my previous suggestion help to solve the following error:
Code: Select all
/data/Software/Vasp/vasp.6.3.0/bin/vasp_std: error while loading shared libraries: libqdmod.so.0: cannot open shared object file: No such file or directory
2. Regarding the new error:
Code: Select all
/proj/nv/libraries/Linux_x86_64/22.2/openmpi/209518-rel-1/share/openmpi/help-mpi-runtime.txt: No such file or directory. Sorry!
Did you install OpenMPI yourself? And did you add openmpi/lib to your LD_LIBRARY_PATH?
Could you please respond so I can follow what your current status is?
For the other issues, I suggest that you open a new thread. I will be happy to help also in understanding KPAR etc.
Thank you for understanding.
Kind regards,
Marie-Therese
Re: Linking error compiling GPU version of Vasp 6.3.0
Posted: Wed Mar 30, 2022 4:56 am
by paulfons
Hi,
I am sorry for the delay in posting a response. I did enter an update, but I must not have posted it correctly. In any case, the GPU version of Vasp works. I learned (from the wiki) that only a single MPI process per GPU is allowed due to the NCCL libraries. This seems like a significant limitation for smaller systems, but hopefully it will be addressed in the future. In the meantime, I have been trying to learn how to optimize the throughput on my Ampere 100 card. I tried a few runs with vasp_gam for an MD simulation with a few hundred atoms. I tried varying NSIM, and a larger value than I typically use for a CPU calculation seems to be better: NSIM=40 gave the shortest time in my limited testing. Can you offer any insight into what the best parameters are (NSIM, others?) for optimizing a GPU-based calculation? I assume that for a system with a larger number of k-points, KPAR would be another parameter to vary. Is there any sort of "rulebook" for getting a handle on GPU calculation optimization?
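A timing scan over NSIM like the one described could be scripted roughly as follows. This is a sketch, not a recommended procedure: the helper names `set_incar_tag` and `scan_nsim` are hypothetical, the list of NSIM values is arbitrary, and it assumes that `vasp_gam` is on the PATH and that OUTCAR contains an "Elapsed time (sec)" line at the end of a completed run.

```shell
# Set TAG = VALUE in an INCAR file, replacing an existing entry or appending one.
set_incar_tag() {
    file="$1"
    tag="$2"
    value="$3"
    if grep -q "^[[:space:]]*$tag[[:space:]]*=" "$file"; then
        sed "s/^[[:space:]]*$tag[[:space:]]*=.*/$tag = $value/" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s = %s\n' "$tag" "$value" >> "$file"
    fi
}

# Run one single-rank GPU job per NSIM value and print the elapsed time.
scan_nsim() {
    for nsim in 4 8 16 32 40 64; do   # arbitrary example values
        set_incar_tag INCAR NSIM "$nsim"
        mpirun -n 1 vasp_gam > "stdout.nsim$nsim"
        printf 'NSIM=%s ' "$nsim"
        grep "Elapsed time" OUTCAR
    done
}

# Run from a directory holding INCAR, POSCAR, POTCAR and KPOINTS:
#   scan_nsim
```

Each run overwrites OUTCAR, so the elapsed time is read immediately after the job finishes. For MD runs, keeping NSW small during the scan shortens each test.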
Re: Linking error compiling GPU version of Vasp 6.3.0
Posted: Wed Mar 30, 2022 6:28 am
by marie-therese.huebsch
Hi paulfons,
Thank you for confirming that the GPU version for VASP works on your machine!
Therefore, I will close this topic now, as your follow-up questions do not fit the title "Linking error compiling GPU version of Vasp 6.3.0". Could you please ask your questions about optimization in a new thread with an appropriate title? I am very sorry for the inconvenience and I hope for your understanding.
Kind regards,
Marie-Therese