ML_FF continuation run memory issues only for some phases

akretschmer
Newbie
Posts: 16
Joined: Wed Nov 13, 2019 8:14 am

ML_FF continuation run memory issues only for some phases

#1 Post by akretschmer » Tue Feb 25, 2025 10:25 am

I am trying to simulate solid/liquid interfaces of water with different solids like Zn or graphite. I have trained the force field for water and now want to continue training it on the solid phases.

For graphite, this works well: I have trained the bulk and the surface with 96 C atoms, following the procedure described on the VASP wiki.
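
For reference, the continuation runs use INCAR tags along these lines. This is only a minimal sketch with the ML_MODE syntax of VASP >= 6.4; the thermostat, temperature, and step counts below are illustrative placeholders, not my exact settings:

Code: Select all

ML_LMLFF = .TRUE.   ! enable machine-learned force fields
ML_MODE  = train    ! on-the-fly training; continues from an existing ML_AB
IBRION   = 0        ! molecular dynamics
MDALGO   = 2        ! Nose-Hoover thermostat (placeholder)
SMASS    = 0        ! Nose mass (placeholder)
TEBEG    = 300      ! temperature in K (placeholder)
NSW      = 10000    ! number of MD steps (placeholder)
POTIM    = 1.5      ! time step in fs (placeholder)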

MLFF_graphite.zip

The ML_ABN file has this header (the whole file is too big to upload):

Code: Select all

 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
       1313
**************************************************
     The maximum number of atom type
--------------------------------------------------
       3
**************************************************
     The atom types in the data file
--------------------------------------------------
     H  O  C 
**************************************************
     The maximum number of atoms per system
--------------------------------------------------
            192
**************************************************
     The maximum number of atoms per atom type
--------------------------------------------------
            128
**************************************************
     Reference atomic energy (eV)
--------------------------------------------------
  0.000000000000000E+000  0.000000000000000E+000  0.000000000000000E+000
**************************************************
     Atomic mass
--------------------------------------------------
   8.00000000000000        16.0000000000000        12.0110000000000     
**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
      4713  2258   196
**************************************************
     Basis set for H 
--------------------------------------------------
          1      1
          1      2
          1      3

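As a side note, the header fields above can be read with a small script of my own (not part of VASP); a minimal sketch, assuming the section layout shown in the excerpt:

Code: Select all

# Minimal sketch: read the header sections of a VASP ML_AB/ML_ABN file.
# Assumes the layout shown above: a '***' line, a title line, a '---'
# separator, then the value lines. Stops at the first stored configuration.

def read_mlab_header(path):
    """Return a dict mapping section titles to their raw value lines."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    sections = {}
    i = 0
    while i < len(lines):
        if lines[i].startswith("**") and i + 2 < len(lines):
            title = lines[i + 1].strip()  # e.g. "The number of configurations"
            if title.startswith("Configuration"):
                break                     # header is over; structures follow
            i += 3                        # skip the title and '---' separator
            values = []
            while i < len(lines) and not lines[i].startswith("**"):
                values.append(lines[i].strip())
                i += 1
            sections[title] = values
        else:
            i += 1
    return sections

hdr = read_mlab_header("ML_AB")
print(hdr["The number of configurations"][0])             # e.g. 1313
print(hdr["The numbers of basis sets per atom type"][0])  # e.g. 4713  2258   196
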
Now, with Zn I have problems. I cannot get the continuation training to run with Zn, despite using exactly the same ML_AB as in the graphite case and (almost) the same settings. I adjusted ISMEAR, but ENCUT, PREC, and so on should be identical.

When I remove the ML_AB file with the stored H and O positions, the training proceeds; when I add the file, the run always gets stuck at initialization:

Code: Select all

Loading vasp6/6.4.3-intel-2021.9.0-ayhsd2e
  Loading requirement: intel-oneapi-mpi/2021.9.0-intel-2021.9.0-35bd42h
    pkgconf/1.8.0-intel-2021.9.0-hgzow6c hdf5/1.12.2-intel-2021.9.0-tiadmsp
    intel-oneapi-tbb/2021.9.0-intel-2021.9.0-xyehvoh
    intel-oneapi-mkl/2023.1.0-intel-2021.9.0-7aranik
 running  128 mpi-ranks, on    1 nodes
 distrk:  each k-point on  128 cores,    1 groups
 distr:  one band on   64 cores,    2 groups
 vasp.6.4.3 19Mar24 (build Jul 17 2024 16:12:06) complex                        
  
 POSCAR found type information on POSCAR Zn
 POSCAR found :  1 types and      64 ions
 Reading from existing POTCAR
 scaLAPACK will be used
 -----------------------------------------------------------------------------
|                                                                             |
|           W    W    AA    RRRRR   N    N  II  N    N   GGGG   !!!           |
|           W    W   A  A   R    R  NN   N  II  NN   N  G    G  !!!           |
|           W    W  A    A  R    R  N N  N  II  N N  N  G       !!!           |
|           W WW W  AAAAAA  RRRRR   N  N N  II  N  N N  G  GGG   !            |
|           WW  WW  A    A  R   R   N   NN  II  N   NN  G    G                |
|           W    W  A    A  R    R  N    N  II  N    N   GGGG   !!!           |
|                                                                             |
|     For optimal performance we recommend to set                             |
|       NCORE = 2 up to number-of-cores-per-socket                            |
|     NCORE specifies how many cores store one orbital (NPAR=cpu/NCORE).      |
|     This setting can greatly improve the performance of VASP for DFT.       |
|     The default, NCORE=1 might be grossly inefficient on modern             |
|     multi-core architectures or massively parallel machines. Do your        |
|     own testing! More info at https://www.vasp.at/wiki/index.php/NCORE      |
|     Unfortunately you need to use the default for GW and RPA                |
|     calculations (for HF NCORE is supported but not extensively tested      |
|     yet).                                                                   |
|                                                                             |
 -----------------------------------------------------------------------------

 Reading from existing POTCAR
 -----------------------------------------------------------------------------
|                                                                             |
|               ----> ADVICE to this user running VASP <----                  |
|                                                                             |
|     You enforced a specific xc type in the INCAR file but a different       |
|     type was found in the POTCAR file.                                      |
|     I HOPE YOU KNOW WHAT YOU ARE DOING!                                     |
|                                                                             |
 -----------------------------------------------------------------------------

 LDA part: xc-table for (Slater+PW92), standard interpolation
 POSCAR found type information on POSCAR Zn
 POSCAR found :  1 types and      64 ions
 Machine learning selected
 Setting communicators for machine learning
 Initializing machine learning

The memory used by VASP for Zn without the continuation training is 135087 kBytes, and I tested this on nodes with up to 2048 GB of memory. Meanwhile, the graphite continuation runs fine on a node with 384 GB of memory; according to OUTCAR, that calculation uses 256992 kBytes. So the issue seems to be the memory needed for the already stored basis set, yet the same basis set causes no problem in the graphite case.
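
For what it's worth, the ML arrays whose size scales with the stored training data can be dimensioned explicitly in the INCAR. A minimal sketch, assuming the tag names of VASP 6.4/6.5, with values chosen only to exceed the counts in my ML_AB header:

Code: Select all

ML_LMLFF = .TRUE.
ML_MODE  = train
ML_MB    = 8000   ! max. local reference configurations (basis sets) per atom type
ML_MCONF = 2000   ! max. training structures held in memory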

Here are the input and output files for Zn without the ML_AB file present:

MLFF_Zn.zip

Please note that the high cut-off energy of 700 eV comes from the preceding water training; lowering it to 400 eV did not help at all. When I do continuation training with Zn, the calculation gets stuck every time.

This is the header of the ML_AB file which I use for Zn:

Code: Select all

 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
       1123
**************************************************
     The maximum number of atom type
--------------------------------------------------
       2
**************************************************
     The atom types in the data file
--------------------------------------------------
     H  O 
**************************************************
     The maximum number of atoms per system
--------------------------------------------------
            192
**************************************************
     The maximum number of atoms per atom type
--------------------------------------------------
            128
**************************************************
     Reference atomic energy (eV)
--------------------------------------------------
   0.0000000000000000        0.0000000000000000     
**************************************************
     Atomic mass
--------------------------------------------------
   8.0000000000000000        16.000000000000000     
**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
      5043  2309
**************************************************
     Basis set for H 
--------------------------------------------------
          1      1
          1      2
          1      3
          1      4
          1      5
          1      6
        218     70
         28    128
        743     56

The basis set is slightly smaller after the C training because of sparsification, but initially the ML_AB file was identical for both cases.
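
Side note: this reselection can also be run as a standalone step. A minimal sketch, assuming VASP >= 6.4, where ML_MODE = select re-picks the local reference configurations from an existing ML_AB without new ab-initio calculations:

Code: Select all

ML_LMLFF = .TRUE.
ML_MODE  = select   ! reselect local reference configurations from ML_AB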

Why does the continuation training work for graphite but not for Zn? The memory requirements should be identical in both cases. And how do I proceed with Zn to get a working force field?

ferenc_karsai
Global Moderator
Posts: 530
Joined: Mon Nov 04, 2019 12:44 pm

Re: ML_FF continuation run memory issues only for some phases

#2 Post by ferenc_karsai » Tue Feb 25, 2025 10:58 am

I will not be able to debug this problem without your ML_AB file, so please upload it to a remote repository and share the link here.


akretschmer
Newbie
Posts: 16
Joined: Wed Nov 13, 2019 8:14 am

Re: ML_FF continuation run memory issues only for some phases

#3 Post by akretschmer » Tue Feb 25, 2025 11:15 am

Here is the link: https://owncloud.tuwien.ac.at/index.php ... A4yjexCOlN

There are two ML_AB files for two different XC functionals that I have tried; both seem to suffer from the same problem.


ferenc_karsai
Global Moderator
Posts: 530
Joined: Mon Nov 04, 2019 12:44 pm

Re: ML_FF continuation run memory issues only for some phases

#4 Post by ferenc_karsai » Tue Feb 25, 2025 3:59 pm

I tried both continuation runs (Zn and graphite after H2O). Both work for me. I ran only three steps for each and used simplified ab-initio parameters to speed up the test.
The initialization takes around 30-60 minutes.

The total memory usage is around 120 GB for this job.

I have run with the developer's version for now, but I will also try VASP 6.5.0.


akretschmer
Newbie
Posts: 16
Joined: Wed Nov 13, 2019 8:14 am

Re: ML_FF continuation run memory issues only for some phases

#5 Post by akretschmer » Thu Feb 27, 2025 10:50 am

I have set up the continuation training again in a new folder and tried it with VASP 6.5.0, and now it works! Before, I used 6.4.3; maybe that caused the issue, but I'm happy I can progress now.


ferenc_karsai
Global Moderator
Posts: 530
Joined: Mon Nov 04, 2019 12:44 pm

Re: ML_FF continuation run memory issues only for some phases

#6 Post by ferenc_karsai » Thu Feb 27, 2025 12:14 pm

Good to hear it works for you now.

I will check 6.4.3 and see if I can reproduce the bug. Anyway, if there is a bug, it seems to be gone in the latest version.

