The MLFF of VASP is stuck during initialization


jun_yin2 · #1 Post by **jun_yin2** » Tue Jun 20, 2023 12:19 pm

Hello everyone. I have encountered some difficulties when using MLFF. When I use the input files in the following compressed package, MLFF gets stuck at the step shown below.


The job 26133963 is running on lm508-[16,18,20],lm602-[02,04,06,08,10,12,14,20,22]
 running  480 mpi-ranks, with    1 threads/rank, on   12 nodes
 distrk:  each k-point on   24 cores,   20 groups
 distr:  one band on    4 cores,    6 groups
 vasp.6.4.1 05Apr23 (build Jun 14 2023 15:49:07) complex                        
  
 POSCAR found type information on POSCAR O H1H2C N PbI 
 POSCAR found :  7 types and     129 ions
 scaLAPACK will be used
 WARNING: type information on POSCAR and POTCAR are incompatible
 POTCAR overwrites the type information in POSCAR
 typ   2 type information:  H1 H 
 WARNING: type information on POSCAR and POTCAR are incompatible
 POTCAR overwrites the type information in POSCAR
 typ   3 type information:  H2 H 
 LDA part: xc-table for Pade appr. of Perdew
 Machine learning selected
 Setting communicators for machine learning
 Initializing machine learning

However, if I delete the ML_AB file, it runs normally. I initially suspected a memory problem, but even after I increased the available memory to 800 GB on the cluster, the job still would not run. Could you please advise me on what might be causing this issue?

alex · #2 Post by **alex** » Wed Jun 21, 2023 5:57 am

Hello,

I'd guess you are running out of physical memory (and maybe entering swap?).

Check the forum for the memory requirement of your task. It'll be HUGE.
I typically run into severe trouble with more than 4 species in POTCAR (caution: two H entries count as two species) on a 4 GB/core machine.

Good luck!

alex

jun_yin2 · #3 Post by **jun_yin2** » Wed Jun 21, 2023 10:15 am

Hi Alex,

Thanks for your help. How can I check the memory requirement of my task? Do I need to check it through the VASP files or from the command line on the cluster?
Also, I would like to know whether you have solved this problem yourself.

Tieyuan Bian

#4 Post by **ferenc_karsai** » Wed Jun 21, 2023 10:56 am

I think Alex is right, because you have many types and the required computational resources grow quadratically with the number of types (thank you Alex for helping with the answers!). Could you please also upload the ML_LOGFILE? At the beginning you will see how much memory you need per core.
For the most important machine-learning arrays, the code checks the available memory before allocation and should exit with an error if there is not enough (this works, of course, only if you have swapping disabled). Unfortunately, the VASP part can also require a lot of memory, and those parts don't always check before allocation. So problems can arise if both together need too much.
What you can try to reduce the memory:
-) First of all, check whether you have compiled with shared-memory MPI ("-Duse_shmem" precompiler option). It makes a huge difference for memory.
-) Set ML_MB manually. If nothing is set, the array dimension for the local reference configurations is set to "largest number of local reference configurations from the ML_AB file" + MIN(NSW, 1500). You could limit ML_MB to, let's say, 1500-2000 per type.
-) Also set ML_MCONF. The dimension for the training data is set to "number of training data from the ML_AB file" + MIN(NSW, 1500). You can set this to a smaller value as well, because the on-the-fly learning may never reach +1500 entries. The behaviour here is different from that of ML_MB: when ML_MCONF is reached, the calculation stops and you need to restart with a higher value. So if you are willing to restart multiple times, you can always set it to a moderately higher value and restart whenever it is reached.
-) On the ab-initio side you could omit KPAR. This makes the calculations significantly slower but also reduces the memory cost. Also, possibly tune down the remaining ab-initio parameters as much as possible.
-) The largest array by far is the design matrix. This matrix is fully block-cyclically distributed via ScaLAPACK, so if you increase the number of cores, the required size of this array per core goes down linearly.
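
As a rough illustration of the two sizing rules above, here is a minimal Python sketch. It only restates the formulas quoted in this post ("count from ML_AB" + MIN(NSW, 1500), and linear scaling of a distributed array with the number of cores). The function names and all numbers are hypothetical; the authoritative values are the ones VASP prints in the ML_LOGFILE.

```python
def default_ml_dimension(n_from_ml_ab: int, nsw: int) -> int:
    """Default array dimension as described in the post:
    (count read from ML_AB) + MIN(NSW, 1500)."""
    return n_from_ml_ab + min(nsw, 1500)


def per_core_share_gb(total_gb: float, n_cores: int) -> float:
    """Per-core share of an array that is spread evenly over all cores,
    so doubling the cores halves the share."""
    return total_gb / n_cores


if __name__ == "__main__":
    # Hypothetical example: 2000 entries already in ML_AB, NSW = 50000.
    print(default_ml_dimension(2000, 50000))
    # Hypothetical 960 GB design matrix on 480 and then on 960 cores.
    print(per_core_share_gb(960.0, 480), per_core_share_gb(960.0, 960))
```

The first function shows why a long run (large NSW) always adds the full 1500 to the default dimension unless ML_MB or ML_MCONF is set by hand.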

jun_yin2 · #5 Post by **jun_yin2** » Wed Jun 21, 2023 11:57 am

Hi,

Thanks for your answer. I have uploaded ML_LOGFILE.zip; it seems that I need about 2.73 GB per core to do MLFF.
I have used the -Duse option to reduce the memory, but the memory requirement is still large. What is more, I added ML_MB=1500 and ML_MCONF=1500 to the INCAR. However, the memory requirement became larger, as you can see in ML_LOGFILE1.zip, which shows 3.54 GB. I have also tried this calculation with 960 cores, but it still failed. I think this system may be too complex to train with VASP.

#6 Post by **ferenc_karsai** » Wed Jun 21, 2023 1:43 pm

How much memory do you have per core?

If that should be enough, could you please try running only a one-step refit?
For that please just take a new INCAR where you set the following:
ML_LMLFF=.TRUE.
ML_MODE=REFITBAYESIAN

If that runs through, then it's the additional memory from the ab-initio part that's killing your calculation.
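
For convenience, the refit-only INCAR described above can be generated with a few lines of Python. This is just a sketch: the function name is hypothetical, and it emits only the two tags named in this post, with the mode spelled REFITBAYESIAN (REFIT, the value used later in the thread, is also accepted).

```python
def refit_incar_text(mode: str = "REFITBAYESIAN") -> str:
    """Return the text of a minimal refit-only INCAR (two tags only)."""
    if mode not in ("REFIT", "REFITBAYESIAN"):
        raise ValueError(f"unexpected refit mode: {mode!r}")
    return f"ML_LMLFF = .TRUE.\nML_MODE = {mode}\n"


if __name__ == "__main__":
    # Print to stdout; redirect into a file named INCAR in a fresh run directory.
    print(refit_incar_text(), end="")
```

The remaining input files (POSCAR, POTCAR, KPOINTS, ML_AB) stay as they are, so the test isolates the machine-learning memory from the ab-initio memory.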

jun_yin2 · #7 Post by **jun_yin2** » Mon Jun 26, 2023 5:51 am

Hi,

Sorry for my late response. I have about 9 GB of memory per core. I used ML_MODE=refit in my new INCAR and kept the other files the same, and it worked. So the reason my calculation was killed may be the ab-initio part. Is there any way to solve this problem?

#8 Post by **ferenc_karsai** » Mon Jun 26, 2023 12:05 pm

OK, I think what you should try is to skip learning on the large combined system. We saw, for example, in interface pinning calculations that it was enough to learn the liquid and the solid separately and then run them combined as the interfacial system, without learning on the interfacial system itself. We tested both ways and the results were identical.
I can't guarantee you that it will work but it's worth a try because you would save a lot of computational effort.

For you that means:
1) You already have the FAPbI surface learned.
2) Run liquid water simulations separately in a bulk box. Please read our best practices carefully, especially because liquids in NpT need constraints of the box:
wiki/index.php/Best_practices_for_machi ... rce_fields
3) Combine the ML_AB files. For that you will probably need a little bit of scripting.
4) Reselect local reference configurations for the combined ML_AB file by selecting ML_MODE=SELECT.
5) Refit ML_MODE=REFIT.
6) Run the force field on the interfacial system. Only at that point will you introduce the interface.

Be careful that all simulations have the same ab-initio parameters (functional, cut-off etc.).
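
One way to check that last point before combining the ML_AB files is to compare the relevant INCAR tags of the separate runs. The following Python sketch does a plain string comparison of a few tags; the function names and the default tag list are assumptions chosen for illustration, not an official tool, and the list should be extended to whatever parameters your runs actually set.

```python
def parse_incar(text: str) -> dict:
    """Parse 'TAG = value' statements from INCAR text.
    Comments starting with '#' or '!' are dropped; ';' separates statements."""
    tags = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].split("!", 1)[0]
        for stmt in line.split(";"):
            if "=" in stmt:
                key, value = stmt.split("=", 1)
                tags[key.strip().upper()] = value.strip().upper()
    return tags


def mismatched_tags(incar_a: str, incar_b: str,
                    keys=("ENCUT", "GGA", "PREC", "ISMEAR", "SIGMA")) -> dict:
    """Return {tag: (value_a, value_b)} for tags that differ.
    Values are compared as strings, so '520' and '520.0' count as different."""
    a, b = parse_incar(incar_a), parse_incar(incar_b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Hypothetical INCAR contents for the surface run and the water run.
    surface = "ENCUT = 520 # cutoff\nGGA = PE\n"
    water = "encut = 400\nGGA = PE ! functional\n"
    print(mismatched_tags(surface, water))
```

An empty result only means that the listed tags agree; settings that come from other files (POTCAR versions, k-point density) still need to be checked by hand.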

I hope this approach is accurate enough and saves you a lot of computational time.
