PLUGINS_STRUCTURE_errors

Posted: Thu Feb 20, 2025 3:34 pm
by thomas_pigeon

I compiled VASP 6.5.0 with the Python plugins option using two different compilers (gcc and fpp); see the attached makefile.include.
I run VASP on a node with two AMD EPYC™ Milan 7763 processors (64 cores, 2.45 GHz, 256 MB cache).
The plugin is only used to change the atomic positions at every step through Python code that runs Langevin dynamics using an integrator from ASE adapted for the plugin.
Depending on the ML_MODE and ML_LMLFF tags in the INCAR, I obtain two types of errors, with both the gcc and the fpp builds.
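
For readers unfamiliar with this kind of setup, here is a rough sketch of what a single Langevin position update looks like. It is not the ASE integrator and not the VASP plugin interface; the function name `langevin_step`, its arguments, and the simple kick/friction/drift splitting are all assumptions chosen for illustration, and every quantity is taken to be in one consistent unit system.

```python
import math
import random


def langevin_step(positions, velocities, forces, masses, dt, gamma, kT, rng):
    """Advance positions and velocities by one Langevin step (illustrative only).

    positions, velocities, forces: lists of [x, y, z] per atom.
    masses: one mass per atom. gamma: friction rate. kT: thermal energy.
    """
    damping = math.exp(-gamma * dt)            # fraction of velocity kept after friction
    noise_scale = math.sqrt(1.0 - damping**2)  # matches the noise to the friction
    new_positions, new_velocities = [], []
    for pos, vel, force, mass in zip(positions, velocities, forces, masses):
        sigma = math.sqrt(kT / mass)           # thermal velocity spread per component
        new_vel = []
        for v, f in zip(vel, force):
            v = v + dt * f / mass              # deterministic kick from the force
            v = damping * v + noise_scale * sigma * rng.gauss(0.0, 1.0)  # friction + noise
            new_vel.append(v)
        new_velocities.append(new_vel)
        new_positions.append([x + dt * v for x, v in zip(pos, new_vel)])  # drift
    return new_positions, new_velocities


if __name__ == "__main__":
    rng = random.Random(0)
    pos, vel = langevin_step(
        positions=[[0.0, 0.0, 0.0]], velocities=[[0.0, 0.0, 0.0]],
        forces=[[0.1, 0.0, 0.0]], masses=[1.0],
        dt=0.5, gamma=0.01, kT=0.025, rng=rng,
    )
    print(pos, vel)
```

In the setup described here, the forces would come from VASP at each step and the updated positions would be handed back through the plugin; that wiring is left out because it is not shown in this thread.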

With ML_LMLFF=.FALSE., the dynamics (through the plugin) run for 4500 steps (out of 10 000) and then the job fails with the following error:


slurmstepd-topaze1701: error: Detected 1 oom_kill event in StepId=7485027.0. Some of the step tasks have been OOM Killed.
srun: error: topaze1701: task 64: Out Of Memory
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:64]
slurmstepd-topaze1701: error: *** STEP 7485027.0 ON topaze1701 CANCELLED AT 2025-02-19T23:19:21 ***
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
slurmstepd-topaze1701: error:  mpi/pmix_v4: _errhandler: topaze1701 [0]: pmixp_client_v2.c:212: Error handler invoked: status = -61, source = [slurm.pmix.7485027.0:0]
+ exit 0

With ML_LMLFF=.TRUE. and ML_MODE = train, I do not obtain any error and can run dynamics (through the plugin) for 10 000 steps.
In that particular case, ML_CTIFOR was set to a high value so that there are no DFT calls, only force-field evaluations.

With ML_LMLFF=.TRUE. and ML_MODE = run, the VASP execution stops before calling the Python interface but after writing the first energy and forces to the OUTCAR.
I obtain the following error (many times):


[topaze1150:3973629:0:3973629] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:3973629) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000005aa353 rc_add_()  ???:0
 2 0x00000000004cd81b plugins_mp_plugins_structure_()  ???:0
 3 0x0000000001eff5f1 MAIN__()  ???:0
 4 0x000000000041fba2 main()  ???:0
 5 0x000000000003ad85 __libc_start_main()  ???:0
 6 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973585:0:3973585] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:3973585) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================
[topaze1150:3973645:0:3973645] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1268000007f)
==== backtrace (tid:3973645) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000584695 map_forward_()  ???:0
 2 0x000000000058921d fftbrc_plan_mpi_()  ???:0
 3 0x000000000058d33b fft3d_mpi_()  ???:0
 4 0x0000000000590e98 fft3d_()  ???:0
 5 0x00000000004cd838 plugins_mp_plugins_structure_()  ???:0
 6 0x0000000001eff5f1 MAIN__()  ???:0
 7 0x000000000041fba2 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041faae _start()  ???:0
=================================

Re: PLUGINS_STRUCTURE_errors

Posted: Fri Feb 21, 2025 1:26 pm
by manuel_engel1

Hello,

Thank you kindly for the report. After talking with our ML and plugin experts, I am able to come back with a partial answer.

In the case where ML_LMLFF=True and ML_MODE=run, there is indeed a problem as some of the DFT quantities are not allocated. When running with the VASP plugin, these non-allocated quantities are accessed, causing the segmentation fault you see. We are already working on a fix for this issue.

Why the first case is running out of memory is still a bit unclear to me. It might be due to an unrelated bug, or it might be something more benign. This still needs to be investigated.
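
One way to narrow this down is to log the resident memory of the process at every step from inside the Python plugin code and check whether it grows steadily. The sketch below is only a suggestion: the helper names `current_rss_kib` and `growth_per_step` are made up for this example, it relies on the Linux `/proc` filesystem, and it assumes the Python code runs inside the same process as VASP so that the number covers both.

```python
import resource


def current_rss_kib():
    """Return the resident set size of this process in KiB (Linux)."""
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # the kernel reports this value in kB
    except OSError:
        pass
    # Fallback: peak RSS so far (KiB on Linux; other platforms may use other units).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def growth_per_step(samples):
    """Least-squares slope of the memory samples versus step index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    history = [current_rss_kib() for _ in range(5)]  # in practice: one sample per MD step
    print(f"RSS now: {history[-1]} KiB, growth: {growth_per_step(history):.1f} KiB/step")
```

A slope that stays clearly positive over thousands of steps would point towards a leak rather than a one-off allocation.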

Kind regards


Re: PLUGINS_STRUCTURE_errors

Posted: Fri Feb 21, 2025 2:16 pm
by manuel_engel1

We have now started to investigate the issue with ML_LMLFF=False that you described first. We suspect that it could be caused by a memory leak. Could you please tell us exactly what compiler and library versions you used to build VASP?

In particular, we are interested in the exact version numbers of

  • the Fortran compiler

  • the MPI library

  • the HDF5 library (if used)

  • ScaLAPACK/LAPACK

This information would be greatly appreciated.
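
If it helps, most of these version strings can be collected in one go on the build machine. The script below is a sketch under assumptions: the commands `gfortran`, `mpirun` and `h5cc` are placeholders that may not match your toolchain or module environment, and ScaLAPACK/LAPACK are not covered because they usually have no version command (the module or package name is the usual source).

```python
import subprocess
import sys


def first_line(command, keyword=None):
    """Run a command and return its first output line (or the first containing keyword).

    Returns None if the command is missing, times out, or prints nothing suitable.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    output = result.stdout or result.stderr
    for line in output.splitlines():
        line = line.strip()
        if line and (keyword is None or keyword in line):
            return line
    return None


# Assumed commands; adjust to the compiler and MPI wrappers actually used for the build.
COMMANDS = {
    "Fortran compiler": (["gfortran", "--version"], None),
    "MPI library": (["mpirun", "--version"], None),
    "HDF5 library": (["h5cc", "-showconfig"], "HDF5 Version"),
}


def collect_versions(commands=COMMANDS):
    """Map each label to the version line found, or None if unavailable."""
    return {label: first_line(cmd, kw) for label, (cmd, kw) in commands.items()}


if __name__ == "__main__":
    for label, version in collect_versions().items():
        print(f"{label}: {version or 'not found'}")
```

Running it inside the same module environment that was used to compile VASP gives the most reliable answer.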