VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.



kyujungjun
Newbie
Posts: 4
Joined: Thu Nov 14, 2019 12:33 am

VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#1 Post by kyujungjun » Tue Aug 02, 2022 12:34 am

Hello,

I am trying to compile VASP 6.3.2 on our system (CPU, Intel MPI), and the testsuite initially runs fine. However, at the bulk_GaAs_ACFDT stage, specifically the ACFDT substep of this test, it hangs indefinitely. I am using the intel/2021.3 compilers and intelmpi/2021.3.0 for this.

The test job hangs at the following point in the output:

Code: Select all


 k-point  25 :   0.3333 0.6667 0.3333  plane waves:     101
 k-point  26 :  -0.6667-0.3333-0.3333  plane waves:     101
 k-point  27 :   0.6667 0.3333 0.3333  plane waves:     101

 maximum and minimum number of plane-waves per node :       113       98

 maximum number of plane-waves:       113
 maximum index in each direction:
   IXMAX=    3   IYMAX=    3   IZMAX=    3
   IXMIN=   -3   IYMIN=   -3   IZMIN=   -3

 exchange correlation table for  LEXCH =        8
   RHO(1)=    0.500       N(1)  =     2000
   RHO(2)=  100.500       N(2)  =     4000

 min. memory requirement per mpi rank     55.4 MB, per node    221.6 MB

 shmem allocating  16 responsefunctions rank=   114
 response function shared by NCSHMEM nodes    2
 all allocation done, memory is now:
 total amount of memory used by VASP MPI-rank0    57610. kBytes
=======================================================================

   base      :      30000. kBytes
   nonl-proj :       1705. kBytes
   fftplans  :        406. kBytes
   grid      :        758. kBytes
   one-center:         16. kBytes
   HF        :        135. kBytes
   nonlr-proj:       1125. kBytes
   wavefun   :      20971. kBytes
   response  :       2494. kBytes



--------------------------------------------------------------------------------------------------------


NQ=   1    0.0000    0.0000    0.0000,

It should proceed past NQ=1 to the next steps, and on other systems it does so within a few seconds. For comparison, on another HPC system where I compiled VASP (CPU, Intel MPI), the testsuite runs fine and the ACFDT test finishes within tens of seconds. For all tests before bulk_GaAs_ACFDT, both systems finish in a very similar time frame.

On the problematic system, if I kill this specific test job and let the suite proceed to the next one, it often gets stuck on the GW calculations as well. Here are the tests that failed:

Code: Select all

==================================================================
SUMMARY:
==================================================================
The following tests failed, please check the output file manually:
bulk_GaAs_ACFDT bulk_GaAs_ACFDT_RPR bulk_GaAs_G0W0_sym bulk_GaAs_G0W0_sym_RPR bulk_GaAs_scGW0_ALGO=D_sym bulk_GaAs_scGW0_ALGO=D_sym_RPR bulk_GaAs_scGW0_sym bulk_GaAs_scGW0_sym_RPR bulk_GaAs_scGW_ALGO=D_sym bulk_GaAs_scGW_ALGO=D_sym_RPR bulk_GaAs_scGW_sym bulk_GaAs_scGW_sym_RPR bulk_InP_SOC_G0W0_sym bulk_InP_SOC_G0W0_sym_RPR bulk_SiO2_elastic_properties_ibrion6_RPR bulk_SiO2_elastic_properties_ibrion8 HEG_333_LW SiC8_GW0R SiC_ACFDTR_T
Can anyone provide some insight into how I can solve this issue? These GW and RPR calculations are not what I need urgently, but I would still like to figure out why they are not working.

Finally, I am attaching my makefile.include:

Code: Select all

# Precompiler options
CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
             -DMPI -DMPI_BLOCK=32000 \
             -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=16000 \
             -Davoidalloc \
             -Duse_bse_te \
             -Dtbdyn \
             -Duse_shmem

CPP        = fpp -f_com=no -free -w0  $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC         = mpiifort
FCL        = mpiifort -mkl=cluster -lstdc++

FREE       = -free -names lowercase

FFLAGS     = -assume byterecl -w -heap-arrays 64
OFLAG      = -O2 -xCORE-AVX2
OFLAG_IN   = $(OFLAG)
DEBUG      = -O0

MKL_PATH   = $(MKLROOT)/lib/intel64
BLAS       =
LAPACK     =
BLACS      = 
SCALAPACK  = 

OBJECTS    = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o

INCS       =-I$(MKLROOT)/include/fftw

LLIBS      = $(SCALAPACK) $(LAPACK) $(BLAS)


OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB    = $(CPP)
FC_LIB     = $(FC)
CC_LIB     = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB   = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# For the parser library
CXX_PARS   = icpc

LIBS       += parser
LLIBS      += -Lparser -lparser -lstdc++

# Normally no need to change this
SRCDIR     = ../../src
BINDIR     = ../../bin

#================================================
# GPU Stuff

CPP_GPU    = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK

OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o

CC         = icc
CXX        = icpc
CFLAGS     = -fPIC -DADD_ -Wall -openmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS

CUDA_ROOT  ?= /usr/local/cuda/
NVCC       := $(CUDA_ROOT)/bin/nvcc -ccbin=icc
CUDA_LIB   := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas

GENCODE_ARCH    := -gencode=arch=compute_30,code=\"sm_30,compute_30\" \
                   -gencode=arch=compute_35,code=\"sm_35,compute_35\" \
                   -gencode=arch=compute_60,code=\"sm_60,compute_60\"

MPI_INC    = $(I_MPI_ROOT)/include64/



alexey.tal
Global Moderator
Posts: 314
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#2 Post by alexey.tal » Wed Aug 03, 2022 9:17 am

Hi,

Are you running these tests on AMD CPUs? Have you tried setting the communication fabrics? This issue was discussed here and here.
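
For Intel MPI, the fabrics can be selected with the I_MPI_FABRICS environment variable. As a sketch (the rank count is just a placeholder), restricting communication to the shared-memory fabric looks like this:

Code: Select all

# force Intel MPI to use the shared-memory fabric (single node)
export I_MPI_FABRICS=shm
mpirun -np 4 vasp_std

# or set it for a single run only
mpirun -genv I_MPI_FABRICS=shm -np 4 vasp_std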

kyujungjun
Newbie
Posts: 4
Joined: Thu Nov 14, 2019 12:33 am

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#3 Post by kyujungjun » Thu Aug 04, 2022 11:57 pm

Thank you! It was solved by setting the communication fabrics, and all of the tests now run fine on this system!

However, there is another issue. On a different cluster, with AMD EPYC 7742 nodes, I get a different error. On that system the testsuite does not get stuck on bulk_GaAs_ACFDT; in fact, most of the non-GW calculations run fine. Instead, I get an error message that the correlation energies in bulk_GaAs_ACFDT are wrong. This happens whether I run the tests with -genv I_MPI_FABRICS=shm or without it.

Code: Select all

exiting run_recipe bulk_GaAs_ACFDT
ERROR: test yields different ACFDT correlation energies correct, please check
---------------------------------------------------------------------
   100.000     80.000  -1.653095 -87.572135
    95.238     76.190  -1.653048 -87.572043
    90.703     72.562  -1.653000 -87.571974
    86.384     69.107  -1.652950 -87.571917
    82.270     65.816  -1.652893 -87.571852
    78.353     62.682  -1.652830 -87.571782
    74.622     59.697  -1.652766 -87.571714
    71.068     56.855  -1.652708 -87.571653
 ---------------------------------------------------------------------------
 Comparing files: acfdt and acfdt.ref
                            16  number(s) differ.
       Max diff.:    15.5318560000000
  (at row number:            1  column number:            4 )
       Tolerance:   1.000000000000000E-003
 ---------------------------------------------------------------------------
In addition to bulk_GaAs_ACFDT, here is the list of calculations that result in errors:

Code: Select all


==================================================================
SUMMARY:
==================================================================
The following tests failed, please check the output file manually:
bulk_GaAs_ACFDT bulk_GaAs_ACFDT_RPR bulk_GaAs_G0W0_sym bulk_GaAs_G0W0_sym_RPR bulk_InP_SOC_DFT_ISYM=2 bulk_InP_SOC_DFT_ISYM=2_RPR bulk_InP_SOC_DFT_ISYM=3 bulk_InP_SOC_DFT_ISYM=3_RPR bulk_InP_SOC_G0W0_sym bulk_InP_SOC_G0W0_sym_RPR bulk_InP_SOC_PBE0_nosym bulk_InP_SOC_PBE0_sym bulk_InP_SOC_PBE0_sym_RPR bulk_SiO2_elastic_properties_ibrion6 bulk_SiO2_elastic_properties_ibrion6_RPR bulk_SiO2_elastic_properties_ibrion8 bulk_SiO2_elastic_properties_ibrion8_RPR bulk_SiO2_HSE bulk_SiO2_HSE_RPR bulk_SiO2_LOPTICS bulk_SiO2_LOPTICS_RPR bulk_SiO2_LPEAD bulk_SiO2_LPEAD_RPR bulk_SiO2_PBE0 bulk_SiO2_PBE0_RPR CrS CrS_RPR HEG_333_LW NiOLDAU=1 NiOLDAU=1_RPR NiOLDAU=2 NiOLDAU=2_RPR NiOsLDAU=2_x NiOsLDAU=2_x_RPR NiOsLDAU=2_y NiOsLDAU=2_y_RPR NiOsLDAU=2_z NiOsLDAU=2_z_RPR SiC8_GW0R SiC_ACFDTR_T SiC_phon SiC_phon_RPR Tl_x Tl_x_RPR Tl_y Tl_y_RPR Tl_z Tl_z_RPR
What I notice is that in the Tl_z_RPR calculation (but not in bulk_GaAs_ACFDT) I see a bunch of lines like the following, so I assume this is related to why these tests fail, but I am quite lost. Could you help me identify how to fix this issue?

Thank you!

Code: Select all

WARNING: Sub-Space-Matrix is not hermitian in DAV            8
   4.44778457202058
 WARNING: Sub-Space-Matrix is not hermitian in DAV            5
   2.91536111870757
 WARNING: Sub-Space-Matrix is not hermitian in DAV            6
  0.721056515923352
 WARNING: Sub-Space-Matrix is not hermitian in DAV            7
  -2.21525533240472
 WARNING: Sub-Space-Matrix is not hermitian in DAV            8
   4.44778457202058
 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     Error EDDDAV: Call to ZHEGV failed. Returncode = 8 2 8                  |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

alexey.tal
Global Moderator
Posts: 314
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#4 Post by alexey.tal » Fri Aug 05, 2022 1:27 pm

Did you use the same makefile.include on both machines?

Looks like it could be a shared memory issue. Can you compile VASP without the -Duse_shmem flag and run the tests again?
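
For reference, the first block of the makefile.include you posted would then look like this (identical flags, minus the shared-memory define):

Code: Select all

CPP_OPTIONS= -DHOST=\"LinuxIFC\"\
             -DMPI -DMPI_BLOCK=32000 \
             -Duse_collective \
             -DscaLAPACK \
             -DCACHE_SIZE=16000 \
             -Davoidalloc \
             -Duse_bse_te \
             -Dtbdyn

Please make sure to rebuild from scratch afterwards, so that the changed preprocessor flags actually take effect.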

kyujungjun
Newbie
Posts: 4
Joined: Thu Nov 14, 2019 12:33 am

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#5 Post by kyujungjun » Fri Aug 05, 2022 6:53 pm

Thanks for the quick reply! Yes, I did try both with and without -Duse_shmem. In both cases, I get exactly the same set of failed tests.

FYI, here is the makefile.include I used that does not have -Duse_shmem:

Code: Select all

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxIFC\" \
              -DMPI -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dfock_dblbuf

CPP         = fpp -f_com=no -free -w0  $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC          = mpiifort
FCL         = mpiifort

FREE        = -free -names lowercase

FFLAGS      = -assume byterecl -w

OFLAG       = -O2 -xCORE-AVX2
OFLAG_IN    = $(OFLAG)
DEBUG       = -O0

OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = $(FC)
CC_LIB      = icc
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = icpc
LLIBS       = -lstdc++

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##

# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?=
FFLAGS     += $(VASP_TARGET_CPU)

# Intel MKL (FFTW, BLAS, LAPACK, and scaLAPACK)
# (Note: for Intel Parallel Studio's MKL use -mkl instead of -qmkl)
FCL        += -mkl=sequential
MKLROOT    ?= /path/to/your/mkl/installation

alexey.tal
Global Moderator
Posts: 314
Joined: Mon Sep 13, 2021 12:45 pm

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#6 Post by alexey.tal » Mon Aug 08, 2022 9:44 am

Thank you for sending your makefile.include. Just to be sure: did you use this makefile for both the first machine (where tests were getting stuck) and the second one (where tests fail)? Do both of these machines have the same architecture?

andreas.singraber
Global Moderator
Posts: 236
Joined: Mon Apr 26, 2021 7:40 am

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#7 Post by andreas.singraber » Tue Aug 09, 2022 10:14 am

Hello!

I tried to reproduce the test failures you described with a very similar setup: the Intel 21.3 compiler on an AMD EPYC 7402P processor (same architecture, but fewer cores). I also used the makefile.include you provided in your last post. With this setup and setting

Code: Select all

export I_MPI_FABRICS=shm
all the tests you mentioned (bulk_GaAs_ACFDT, ...) passed successfully. Since we now have two contradictory results, could you please check once again that you used the same makefile.include? Furthermore, please make sure that you ran

Code: Select all

make veryclean
before building the code with a changed makefile and execute

Code: Select all

make cleantest
in the testsuite subdirectory before running the tests. If the tests still fail after that, can you please upload the OUTCAR files and the testsuite.log of one of the failing tests? You will find the OUTCAR files in the testsuite/tests/ directory (e.g. in testsuite/tests/bulk_GaAs_ACFDT). Please note that there could be multiple files, because some tests involve multiple steps. Thank you!
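
For completeness, the full rebuild-and-retest sequence would look roughly like this (a sketch assuming the standard VASP 6 layout, where make test in the root directory drives the testsuite):

Code: Select all

make veryclean        # remove all previous objects and executables
make std gam ncl      # rebuild the executables
cd testsuite
make cleantest        # remove the results of previous test runs
cd ..
make test             # run the full testsuite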

All the best,

Andreas Singraber

kyujungjun
Newbie
Posts: 4
Joined: Thu Nov 14, 2019 12:33 am

Re: VASP6.3.2 testsuite stuck on bulk_GaAs_ACFDT

#8 Post by kyujungjun » Tue Aug 09, 2022 7:13 pm

I really appreciate your reply! I was actually able to find the solution.
Originally, I was using the following modules to compile VASP:
intel/19.1.1.217, intel-mpi/2019.8.254, intel-mkl/2019.8.254

If I replace intel-mkl with the older version intel-mkl/2018.1.163 and keep the remaining modules at their 2019 versions, then every single test passes. I had thought it best to stick to same-year Intel components (2019 in my case), but contrary to my expectation, the 2019 MKL seems to have been the problem, and the 2018 MKL solves it.
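
For reference, the module setup that works now (standard Environment Modules syntax, with the module names from our cluster):

Code: Select all

module load intel/19.1.1.217 intel-mpi/2019.8.254
module load intel-mkl/2018.1.163   # instead of intel-mkl/2019.8.254
module list                        # verify the loaded versions before recompiling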

I am not sure whether it is this specific MKL version (intel-mkl/2019.8.254) that was the problem, or whether the module on this cluster is just generated incorrectly. Anyway, thank you very much for your help in solving this issue!

Best regards,
KyuJung
