AIMD Calculations terminate after a short time
Moderators: Global Moderator, Moderator
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
AIMD Calculations terminate after a short time
Hello,
I have been using the AIMD module to simulate NaF with 64 atoms. However, the calculations terminate after some time (around 1-3 ps, with a timestep of 1 fs), and I have been unable to find a solution. I would therefore like to ask whether you could help with this.
The calculation terminates much more quickly when I do it on a larger supercell. Therefore, it might be a memory issue. Two types of error reports have been found:
---error due to memory
---error without specific instruction
I put the reports in the attachment.
I think the problem comes from using the GPU version of VASP. I used the CPU version of VASP before, and there was no such problem even with a larger system (96 atoms, many more electrons). Therefore, the INCAR, KPOINTS, and POTCAR files should be fine.
I have done the following tests but they have all failed. I will upload one example in the attachment.
1. Increase the number of nodes
2. Vary ntasks-per-node, since this affects the number of cores assigned to each k-point
3. Decrease KPAR. NCORE is not an option for me because I use OpenMP
4. Use ALGO = Fast or ALGO = VeryFast instead of ALGO = All
5. Use GPU nodes with 16 GB and 32 GB of memory
On the cluster I use, each GPU-accelerated node consists of 40 CPU cores (Intel Cascade Lake 6248 processors, 160 GB usable memory) and 4 Nvidia Tesla V100 SXM2 GPUs (either 16 GB or 32 GB; both have been tested). The VASP version is 6.3.2, the newest, and it was compiled by a supercomputer technician. In the attachment, you will also find one of the submission scripts I use.
I hope the information above and the attachment will help clarify my problem and give you sufficient information. If you need additional information, please let me know. Thanks a lot in advance.
Best regards,
Xiliang
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
Hi,
According to your job.sh file, it seems that you are combining the use of GPUs and MPI. The wiki page OpenACC_GPU_port_of_VASP states that "Due to the use of NCCL, the OpenACC version of VASP may only be executed using a single MPI-rank per available GPU". Could this be the cause of the crash? Could you please try without MPI (#SBATCH --ntasks-per-node=1)?
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hello,
Thanks a lot for your answer.
I also checked the page you shared and had a test with this. I set ntasks-per-node to 4 because each node has 4 GPUs available. The information for the calculation is this:
Code: Select all
running 12 mpi-ranks, with 10 threads/rank
distrk: each k-point on 3 cores, 4 groups
distr: one band on 1 cores, 3 groups
OpenACC runtime initialized ... 12 GPUs detected
vasp.6.3.2 27Jun22 (build Sep 29 2022 16:11:41) gamma-only
With this, I no longer get the warning that appears in the standard output I attached. However, it does not solve the problem and results in the same error. Do you have further suggestions?
Best wishes,
Xiliang
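The header above is consistent with one MPI rank per GPU. As a quick sanity check, the arithmetic can be sketched in shell. The node count of 3 is an assumption inferred from the 12 detected GPUs at 4 GPUs per node, and the variable names are purely illustrative.

```shell
#!/bin/bash
# Assumed layout: 3 nodes (inferred from "12 GPUs detected"), each with
# 4 GPUs and 40 CPU cores, as described for this cluster.
nodes=3
gpus_per_node=4
cores_per_node=40

# One MPI rank per GPU, as required by the OpenACC port of VASP.
ranks=$(( nodes * gpus_per_node ))
# The cores of a node are shared evenly as OpenMP threads among its ranks.
threads_per_rank=$(( cores_per_node / gpus_per_node ))

echo "running ${ranks} mpi-ranks, with ${threads_per_rank} threads/rank"
```

With these assumed values, the printed line matches the first line of the VASP header quoted in the post.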
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
OK, but have you nevertheless tried "#SBATCH --ntasks-per-node=1"? I think that "#SBATCH --gres=gpu:4" should be enough to have all 4 GPUs on each node used.
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hello,
Thanks for the explanation. Yes, in the meantime, I submitted another test. It got better but did not solve the problem. I was able to get a trajectory of more than 4 ps (timestep 1 fs), but memory issues terminated the calculation again. I attached the standard output file. Can you please suggest other solutions?
In addition, the reason I didn't do the test with ntasks-per-node=1 is that it is very inefficient. I did a comparison earlier and found that this setting is very slow compared with ntasks-per-node=40. The latter is almost three times faster. If you could also comment on this, it would be very helpful for me.
Thank you very much for taking the time to look into my problem.
Best wishes,
Xiliang
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
Hi,
For how many minutes (or hours) did the calculation with ntasks-per-node=40 run before it crashed? Did the calculation with ntasks-per-node=1 also crash, or did you stop it?
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hi,
Sorry, I didn't keep track of the time. With ntasks-per-node=40, it normally crashes after around 3 hours. The error file I sent in my last message comes from ntasks-per-node=1, which crashes after around 4-6 hours. I didn't stop it; it stopped because of memory problems.
Best,
Xiliang
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
I will run the calculation myself to have a closer look. Meanwhile, could you try one calculation with
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
as well as setting the OMP_NUM_THREADS environment variable to 10 and commenting out (or deleting) this line:
#SBATCH --hint=nomultithread
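Put together, these settings might look like the following job-script sketch. The --gres=gpu:4 line is taken from earlier in this thread; the fallback value of 10 and the commented launch line are assumptions, since the actual job.sh is not reproduced here.

```shell
#!/bin/bash
# One node, one MPI rank per GPU, 10 cores per rank (40 cores / 4 ranks).
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=10
# Request all 4 GPUs of the node.
#SBATCH --gres=gpu:4
# The "--hint=nomultithread" directive is intentionally left out here.

# Match the OpenMP thread count to --cpus-per-task; the fallback of 10
# only applies when the script is run outside SLURM.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-10}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"

# Site-specific launch line (assumption), for example:
# srun vasp_gam
```

Reading the thread count from SLURM_CPUS_PER_TASK keeps the two values in sync if --cpus-per-task is changed later.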
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Okay, thank you very much. I will post the result as soon as I have it.
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
Which version of the NVIDIA compiler was used?
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hi, I use the following dependencies:
Code: Select all
nvidia-compilers/21.9 cuda/11.2 openmpi/4.0.5-cuda intel-mkl/2020.4
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hello,
I have finished the test. Again memory problems led to the termination after around 4 ps. I modified the calculation by changing the pseudopotential of Na to include p electrons (ZVAL=7 instead of ZVAL=1) but it should not matter.
In case you have doubts about the setup, I have attached the results. Thanks again for your kind help.
Best regards,
Xiliang
-
- Global Moderator
- Posts: 419
- Joined: Mon Sep 13, 2021 11:02 am
Re: AIMD Calculations terminate after a short time
Hi,
My calculation using your input files finished properly (the output files are attached). I used vasp_gam of VASP-6.3.2, two GPUs (16 GB each) and OMP_NUM_THREADS=10. Thus, I could not reproduce your problem.
Could you upload the makefile.include that was used to compile the GPU version of VASP-6.3.2 that you used?
Another thing you could do, if possible, is to run the calculation again (just for a few minutes) and execute the command nvidia-smi once during the run to see the memory usage of the GPUs, and then show us what is displayed.
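Because the crash only appears after several hours, a single snapshot may miss a slow increase in GPU memory. One possible way to record the usage over time is sketched below. The function name, the log file name, and the 60-second interval are hypothetical choices, not something prescribed in this thread.

```shell
#!/bin/bash
# Hypothetical helper: append GPU memory usage to a CSV log at a fixed
# interval, running in the background while the calculation proceeds.
log_gpu_memory() {
    local logfile="${1:-gpu_mem.csv}"
    local interval="${2:-60}"
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; skipping GPU logging" >&2
        return 1
    fi
    # --query-gpu selects the columns, -l repeats the query every N seconds.
    nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total \
               --format=csv -l "${interval}" >> "${logfile}" &
    # Remember the PID so the logger can be stopped after the run.
    GPU_LOG_PID=$!
}

# Possible usage inside a job script (launch line is site-specific):
# log_gpu_memory gpu_mem.csv 60
# srun vasp_gam
# kill "${GPU_LOG_PID}"
```

If the logged memory.used values climb steadily over the trajectory, that would point toward a leak rather than a system that is simply too large for the GPU.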
Also, was the test suite run on the GPUs after the installation of VASP?
PS: In your last calculation the ordering of the atoms in the POTCAR is not ok.
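A mismatch like this can be spotted before submitting a job. The sketch below assumes that the issue is the element order in POTCAR differing from the species line of POSCAR, that POSCAR uses the VASP 5 format with element symbols on line 6, and that each POTCAR dataset contains a TITEL line. The function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: compare the element order in POSCAR (line 6,
# VASP 5 format) with the order of the TITEL entries in POTCAR.
check_potcar_order() {
    local poscar="${1:-POSCAR}"
    local potcar="${2:-POTCAR}"
    local in_poscar in_potcar
    # Species line of POSCAR, whitespace normalised to single spaces.
    in_poscar=$(sed -n '6p' "${poscar}" | xargs)
    # Fourth field of each TITEL line, e.g. "Na_pv"; drop suffixes like "_pv".
    in_potcar=$(grep 'TITEL' "${potcar}" | awk '{print $4}' | sed 's/_.*//' | xargs)
    if [ "${in_poscar}" = "${in_potcar}" ]; then
        echo "order matches: ${in_poscar}"
    else
        echo "order differs: POSCAR='${in_poscar}' POTCAR='${in_potcar}'"
        return 1
    fi
}

# Possible usage in the run directory:
# check_potcar_order POSCAR POTCAR
```

The check only compares element symbols, so it would not notice a wrong POTCAR variant for the right element.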
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hi,
Thanks a lot for your work and also for your comments on my last mistake.
I will ask the cluster technician to send me the makefile.include file. I will also check with our technician whether the test suite has been run. This might take some time, but I will come back to you ASAP. I will also check the memory usage and send everything together next time.
Thanks again.
Best,
Xiliang
-
- Newbie
- Posts: 37
- Joined: Thu May 19, 2022 8:44 am
Re: AIMD Calculations terminate after a short time
Hello,
I have received the "makefile.include" file, which you will find attached. The output of nvidia-smi is also included in the zip file. I was not able to confirm whether our technician has run the test suite because he is not available. Can you please have a look and let me know whether there are problems with how the code was compiled, or any other issues? Thanks in advance.
Best regards,
Xiliang