VASP 6 GPU version: ab initio MD crashes
Posted: Tue Dec 19, 2023 5:04 pm
Dear developers,
I'm running Langevin MD for 128 tungsten + 1 Re atoms on NERSC Perlmutter.
With the same input (INCAR, POSCAR, KPOINTS, POTCAR), vasp.6.4.1 fails to reach self-consistency after 200 SCF iterations and fails at the second time step, while the VASP 5 CPU version runs without any problem.
Below I have copied the e-mail I received from a NERSC software engineer. I would appreciate any help. Please let me know what information you need.
"2023-12-15 13:14:48 PST - Phillip Thomas - Additional comments
Hi German,
Thank you for your patience! I tested your job with several versions of VASP:
5.4.4-cpu
6.3.2-cpu
6.4.1-cpu
6.2.1-gpu
6.3.2-gpu
6.4.1-gpu
6.4.2-gpu (not yet public on Perlmutter, new build)
I can reproduce the error that you experienced in 6.2.1-gpu, but I found that this error appears in *all* VASP-6 builds at NERSC; it is not specific to the GPU builds. Looking at the output files I noticed that the SCF iterations begin to differ between VASP 5.4.4 and the VASP 6.x.y runs very early in the calculation, with the SCF energies diverging within the first few SCF cycles (sometimes even in the very first step). In 5.4.4 the free energy always converges to a value around -1645 eV for all SCF cycles in the job, but all of the VASP 6.x.y builds show SCF divergence, so I believe the values from VASP 5.4.4 to be correct.
I notice that the number of "eigenvalue-minimisations" in VASP 6.X begins to differ from VASP 5.4.4 at the point of divergence, so I suspect the issue lies in the eigensolver routine.
At this point I recommend that you file a bug report with the VASP developers. Some issues that the VASP developers might check include:
1) Were there any changes in the eigensolver routine between VASP 5 and VASP 6 which may have introduced a bug?
2) Were any default parameters changed between VASP 5 and VASP 6 which might affect SCF convergence for certain types of systems? If so, then you may be able to restore convergence by setting some parameter in your INCAR in the VASP 6.x.y runs.
3) Is there a possibility of a bug either in the compiler or in the linked libraries which may affect the VASP 6.x.y versions but not VASP 5.4.4? All versions of VASP at NERSC were built using NVIDIA SDK 22.7 and use Cray-MPICH, if that helps.
If you decide to file a bug report with VASP, we would be grateful if you reference the thread in this ticket so that we can track it and patch our VASP builds if the developers suggest a patch!
Best,
Phillip
"
The reference for the thread in this ticket is Ref:MSG3501497
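For anyone who wants to repeat the comparison described in the quoted e-mail, here is a minimal Python sketch of how the point of divergence between two runs could be located. It is not part of the original exchange. It assumes that each electronic step in OSZICAR is written as a line starting with DAV:, RMM: or CG:, followed by the step index and the energy in eV; the function names and the tolerance are hypothetical choices made for illustration.

```python
import re

# Assumed layout of an electronic-step line in OSZICAR:
#   "DAV:   2    -0.164500E+04   -0.2E+04 ..."  (label, step index, energy, ...)
SCF_LINE = re.compile(r"^\s*(DAV|RMM|CG):\s+(\d+)\s+([-+]?\d*\.\d+[Ee][-+]?\d+)")


def scf_energies(oszicar_text):
    """Return (label, step index, energy in eV) for every electronic step."""
    steps = []
    for line in oszicar_text.splitlines():
        match = SCF_LINE.match(line)
        if match:
            steps.append((match.group(1), int(match.group(2)), float(match.group(3))))
    return steps


def first_divergence(ref_text, test_text, tol=1e-3):
    """Return the 0-based position of the first electronic step whose energies
    differ by more than tol (eV) between the two runs, or None if they agree
    over the steps that both runs have in common."""
    ref = scf_energies(ref_text)
    test = scf_energies(test_text)
    for position, (a, b) in enumerate(zip(ref, test)):
        if abs(a[2] - b[2]) > tol:
            return position
    return None


def compare_files(ref_path, test_path, tol=1e-3):
    """Convenience wrapper: read two OSZICAR files and compare them."""
    with open(ref_path) as ref_file, open(test_path) as test_file:
        return first_divergence(ref_file.read(), test_file.read(), tol)
```

Calling compare_files with the OSZICAR from the VASP 5.4.4 run and the OSZICAR from a VASP 6 run would give the position of the first electronic step at which the energies disagree. The tolerance is arbitrary and may need adjusting, since small differences between versions are expected even in a healthy run.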