SiC8_GW0R test hangup VASP 6.3.2
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
SiC8_GW0R test hangup VASP 6.3.2
I have been having trouble with this test in particular (SiC8_GW0R) when it is run on our Zen 2 architecture nodes. It always gets stuck at the end and never finishes. This is the line it always gets stuck at in the OUTCAR:
" GAMMA: cpu time 0.0315: real time 0.0316"
I have attached the relevant inputs and outputs, in addition to my makefile, in a zip file. I have not included the WAVECAR or WAVEDER as they would make the zip file too large. Let me know if you would like them as well.
VASP was compiled on the AMD EPYC node with the Intel oneAPI 2022.0.1 compilers and MKL, Intel oneAPI MPI 2021.5.0, HDF5 1.12.1, and -march=core-avx2.
Additionally, I have tested this exact compilation on a Cascade Lake architecture node and the issue does NOT occur there. It seems to be specific to the Zen 2 architecture.
I have also tested it on a slightly different setup (a different number of cores on the node), but still in the AMD EPYC Rome processor family, and I run into the same issue.
All other fast tests finish successfully.
I have also tested a build without OpenMP, as well as VASP 6.2.0 (albeit with slightly older Intel compilers, MPI, and MKL, and without HDF5), and I still see exactly the same issue on the Zen 2 architecture.
Please let me know if you require any more information.
Thank you for your help.
Matt Matzelle
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
There was a similar issue reported for AMD (thread).
Such hangups on AMD architectures can be caused by the choice of communication fabric. Try setting the variable I_MPI_FABRICS=shm, or run the tests with
mpirun -np 4 -genv I_MPI_FABRICS=shm vasp_std
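For reference, the same thing can be done in a batch job by exporting the variable before the launch line; a minimal sketch (the vasp_std path is a placeholder, adapt it to your installation):
#!/bin/bash
# Sketch: force the shared-memory fabric for all Intel MPI communication
export I_MPI_FABRICS=shm        # shared memory only, appropriate for single-node runs
export I_MPI_DEBUG=4            # print which fabric and tuning file are actually used
mpirun -np 4 /path/to/vasp_std  # placeholder path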
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
That alleviated the issue.
However, I am wondering if this will lead to a slowdown for internode calculations. Do you think this setting will have any impact compared with using I_MPI_FABRICS=shm:ofi for internode calculations?
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
I believe the default on the AMD EPYC Rome nodes is shm:ofi, as you can see from the output file when setting I_MPI_DEBUG=4 and not setting I_MPI_FABRICS at all:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
Notice how it uses "tuning_generic_shm-ofi_mlx_hcoll.dat"
However, after explicitly setting I_MPI_FABRICS=shm, it says the following:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): File "" not found
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_generic_shm.dat"
The job without I_MPI_FABRICS set fails, while the job with I_MPI_FABRICS=shm works fine. So I am thinking the problem lies in this file:
"tuning_generic_shm-ofi_mlx_hcoll.dat"
Additionally, when running on the Cascade Lake nodes without setting I_MPI_FABRICS, the output file says the following:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): File "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_skx_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_skx_shm-ofi.dat"
and this job completes successfully.
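As a quick way of comparing the two machines, the relevant lines can be pulled straight out of the debug output; a minimal sketch, assuming the jobs write to output.out as in my batch scripts:
# Sketch: with I_MPI_DEBUG=4 set, show which provider and tuning file each run used
grep -E "libfabric provider|Load tuning file" output.out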
So, in summary, the GW issue is fixed by using I_MPI_FABRICS=shm. However, this fix won't transfer to multinode calculations, because the issue appears to be inherent in the shm:ofi fabric choice. Furthermore, the problem most likely lies in the "tuning_generic_shm-ofi_mlx_hcoll.dat" file, because the "tuning_skx_shm-ofi.dat" file shows no such problems.
I am wondering if it is possible to use "tuning_skx_shm-ofi.dat" on the AMD nodes by setting some variables, or if there is a more general fix that can be applied.
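One experiment that might be worth trying, if I am reading the Intel MPI reference correctly, is to point the library at a specific binary tuning file via I_MPI_TUNING_BIN. I have not verified this on our nodes, so treat it purely as a sketch:
# Sketch (unverified assumption): override the autodetected binary tuning file.
# Whether the Skylake tuning data is even valid on AMD hardware is an open question.
export I_MPI_TUNING_BIN=/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_skx_shm-ofi.dat
mpirun -np 4 -genv I_MPI_DEBUG=4 vasp_std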
Thank you,
Matt
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Did you run a calculation with I_MPI_FABRICS=shm:ofi on a single or multiple AMD nodes? Does it hang up?
So far we have not seen shm:ofi fail on multiple nodes.
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Here is the startup output from a run across two AMD nodes with I_MPI_FABRICS left unset (the job script is below):
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 48530 d3041 {0}
[0] MPI startup(): 1 48531 d3041 {8}
[0] MPI startup(): 2 48532 d3041 {16}
[0] MPI startup(): 3 37818 d3042 {0}
[0] MPI startup(): threading: mode: direct
[0] MPI startup(): threading: vcis: 1
[0] MPI startup(): threading: app_threads: 1
[0] MPI startup(): threading: runtime: generic
[0] MPI startup(): threading: is_threaded: 0
[0] MPI startup(): threading: async_progress: 0
[0] MPI startup(): threading: num_pools: 64
[0] MPI startup(): threading: lock_level: global
[0] MPI startup(): threading: enable_sep: 0
[0] MPI startup(): threading: direct_recv: 1
[0] MPI startup(): threading: zero_op_flags: 1
[0] MPI startup(): threading: num_am_buffers: 1
[0] MPI startup(): threading: library is built with per-vci thread granularity
Here was my job script:
#!/bin/bash
#SBATCH --job-name=TiSrO3
#SBATCH --output=output.out
#SBATCH --error=error.error
#SBATCH --time=6:00:00
#SBATCH --mem=0
#SBATCH -n 4
#SBATCH -p bansil
#SBATCH -N 2
#SBATCH --constraint=ib
module load intel/compilers-2022.0.1
module load intel/mpi-2021.5.0
module load hdf5/1.12.1-intel2022
source /shared/centos7/intel/oneapi/2022.1.0/setvars.sh
BIN=/work/bansil/programs/VASP632intel2022omp/vasp.6.3.2/bin/vasp_std
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN=yes
#export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=4
mpirun -np 4 $BIN
Interestingly, when I explicitly set "export I_MPI_FABRICS=shm:ofi", I get the following error:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 47352 RUNNING AT d3041
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 47354 RUNNING AT d3041
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
The job script for this run was:
#!/bin/bash
#SBATCH --job-name=TiSrO3
#SBATCH --output=output.out
#SBATCH --error=error.error
#SBATCH --time=6:00:00
#SBATCH --mem=0
#SBATCH -n 4
#SBATCH -p bansil
#SBATCH -N 2
#SBATCH --constraint=ib
module load intel/compilers-2022.0.1
module load intel/mpi-2021.5.0
module load hdf5/1.12.1-intel2022
source /shared/centos7/intel/oneapi/2022.1.0/setvars.sh
BIN=/work/bansil/programs/VASP632intel2022omp/vasp.6.3.2/bin/vasp_std
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN=yes
export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=4
mpirun -np 4 $BIN
This also occurs for calculations on a single node. Here is the error:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.5 Build 20211102 (id: 9279b7d62)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.2rc1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): Load tuning file: "/shared/centos7/intel/oneapi/2022.1.0/mpi/2021.5.0/etc/tuning_generic_shm-ofi_mlx_hcoll.dat"
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 63791 RUNNING AT d3057
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 63792 RUNNING AT d3057
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 63793 RUNNING AT d3057
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
and here is the job script:
#!/bin/bash
#SBATCH --job-name=TiSrO3
#SBATCH --output=output.out
#SBATCH --error=error.error
#SBATCH --time=6:00:00
#SBATCH --mem=0
#SBATCH -n 4
#SBATCH -p bansil
#SBATCH -N 1
#SBATCH --constraint=ib
module load intel/compilers-2022.0.1
module load intel/mpi-2021.5.0
module load hdf5/1.12.1-intel2022
source /shared/centos7/intel/oneapi/2022.1.0/setvars.sh
BIN=/work/bansil/programs/VASP632intel2022omp/vasp.6.3.2/bin/vasp_std
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m
export I_MPI_PIN_DOMAIN=omp
export I_MPI_PIN=yes
export I_MPI_FABRICS=shm:ofi
export I_MPI_DEBUG=4
mpirun -np 4 $BIN
This is very confusing. I hope this can help pinpoint the problem. Thank you for your continued help with this issue.
Matt
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Regarding your previous post, the choice of tuning file does not necessarily reflect the default communication fabric, but it would be helpful to know what that default is. Can you set I_MPI_DEBUG=16 and run the calculations 1) without defining the fabric, 2) with I_MPI_FABRICS=shm, 3) with I_MPI_FABRICS=ofi, and 4) with I_MPI_FABRICS=shm:ofi? Comparing the outputs should make it possible to determine what the default fabric is (a sketch of how to script these runs is given at the end of this post).
Also, it would be helpful if you could provide the full output and OUTCARs, so that we could see where exactly the calculations stop.
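A minimal sketch of how those four runs could be scripted, assuming the same kind of batch script as before (the binary path is a placeholder, and in practice each run should use its own copy of the test directory so the outputs do not overwrite each other):
#!/bin/bash
# Sketch: run the same case with four fabric settings and keep the debug logs
export I_MPI_DEBUG=16
BIN=/path/to/vasp_std                      # placeholder path
for fab in default shm ofi shm:ofi; do
    if [ "$fab" = "default" ]; then
        unset I_MPI_FABRICS                # 1) fabric left undefined
    else
        export I_MPI_FABRICS=$fab          # 2) shm, 3) ofi, 4) shm:ofi
    fi
    mpirun -np 4 "$BIN" > "output_${fab//:/_}.out" 2>&1
done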
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Sorry for my late reply.
I have done the tests as requested and included a zip file with them all. In addition to the error output, the output, the batch script, and the OUTCAR, I've also included the dmesg output for both the successful run and the run that was killed.
Thank you for your help,
Matthew Matzelle
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
I'm just wondering whether there has been any progress on this issue.
Thanks for your hard work,
Matt
-
- Global Moderator
- Posts: 314
- Joined: Mon Sep 13, 2021 12:45 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Thank you for doing all these tests.
In the shmofi directory there is an OUTCAR file, and it looks like the calculation got quite far before stopping, although output.out suggests the process was killed at MPI initialization, i.e., there is no stdout from VASP. Are you sure that it is the correct OUTCAR? Also, the provided tests were only run on a single node, but I assume that shm:ofi didn't work on multiple nodes either.
But otherwise the situation is clear. The ofi fabric causes hangups when it is used for intranode communication, which probably has to do with multiple-endpoint communication. So far we have seen that using shm for intranode and ofi for internode communication usually works. However, in your case the shm:ofi option fails too. From the provided log files, it looks like it doesn't hang but hits a segmentation fault somewhere during MPI initialization. So the issue is in Intel MPI, not VASP. I wasn't able to reproduce your issue with shm:ofi on any of our AMD machines, but we will post an update if we come up with a solution. If you manage to make it work, please let us know.
-
- Newbie
- Posts: 7
- Joined: Mon Jan 20, 2020 9:11 pm
Re: SiC8_GW0R test hangup VASP 6.3.2
Hi All,
I know it has been some time, but I figured I would post an update in case anyone else runs into this issue, and to give some closure to this thread. I recently needed to do some optical studies, so I had to revisit this issue myself.
The problem seems to have resolved itself going from oneAPI version 2022.1.0 to oneAPI version 2022.2.0, and is probably associated specifically with Intel MPI going from version 2021.5.0 to 2021.6.0.
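For anyone checking their own setup, the Intel MPI version that actually gets picked up at runtime is printed in the MPI startup banner when I_MPI_DEBUG is set, so a quick check looks roughly like this (the module name and binary path are placeholders for whatever your site provides; any small test case will do, since the banner is printed right at startup):
# Sketch: confirm which Intel MPI version the loaded module stack provides
module load intel/mpi-2021.6.0                 # placeholder module name
export I_MPI_DEBUG=4
mpirun -np 1 /path/to/vasp_std 2>&1 | grep "MPI Library"
# expect something like: [0] MPI startup(): Intel(R) MPI Library, Version 2021.6 ...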
While I imagine most users will have access to much more recent Intel packages and will not run into any problems, I think some may be stuck using resources at HPC centers like mine, where oneAPI version 2022.2.0 is the latest we have access to.
Thanks for your earlier help in trying to replicate and address this issue, even if it was in vain. It is still appreciated.
Matt