GPU error: cuCtxSynchronize returned error 214
Posted: Tue Jul 26, 2022 5:29 am
As part of a much larger Raman scattering calculation using phono3py, I have a large series of input files for different displacements that allow computation of derivatives of the dielectric tensor. The code runs fine in parallel with Intel MPI. When I try to run the same input files on my NVIDIA Ampere GPU, the code always crashes in the same place (where the CPU version, although slower, has no problems). The console output is included below; the input files and OUTCAR file are attached. Any suggestions as to what might be wrong? I have rerun the same job with different NSIM values to see if it is a memory issue, but the code always crashes at the same place (a sketch of how I can monitor GPU memory during the run follows the console output).
Code:
(base) paulfons@kaon:/data/Vasp/NICT/MnTe/MnTe_alpha/af/phonons/phon3py/disp-00559>!vi
vi INCAR
(base) paulfons@kaon:/data/Vasp/NICT/MnTe/MnTe_alpha/af/phonons/phon3py/disp-00559>!mpi
mpirun -n 1 /data/Software/Vasp/vasp.6.3.2/bin_gpu/vasp_std
running on 1 total cores
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
OpenACC runtime initialized ... 1 GPUs detected
vasp.6.3.2 27Jun22 (build Jul 19 2022 16:18:56) complex
POSCAR found type information on POSCAR MnTe
POSCAR found : 2 types and 32 ions
scaLAPACK will be used selectively (only on CPU)
-----------------------------------------------------------------------------
| |
| ----> ADVICE to this user running VASP <---- |
| |
| You have a (more or less) 'large supercell' and for larger cells it |
| might be more efficient to use real-space projection operators. |
| Therefore, try LREAL= Auto in the INCAR file. |
| Mind: For very accurate calculation, you might also keep the |
| reciprocal projection scheme (i.e. LREAL=.FALSE.). |
| |
-----------------------------------------------------------------------------
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read
WARNING: random wavefunctions but no delay for mixing, default for NELMDL
entering main loop
N E dE d eps ncg rms rms(c)
DAV: 1 0.175251143835E+04 0.17525E+04 -0.75528E+04 4960 0.152E+03
DAV: 2 0.113236267777E+03 -0.16393E+04 -0.16148E+04 4960 0.415E+02
DAV: 3 -0.156690869764E+03 -0.26993E+03 -0.26540E+03 7424 0.179E+02
DAV: 4 -0.174048430883E+03 -0.17358E+02 -0.17282E+02 8000 0.542E+01
DAV: 5 -0.174559906444E+03 -0.51148E+00 -0.51073E+00 7876 0.103E+01 0.383E+01
DAV: 6 -0.165910802496E+03 0.86491E+01 -0.12774E+02 8000 0.126E+02 0.425E+01
DAV: 7 -0.169713918389E+03 -0.38031E+01 -0.32403E+01 6080 0.169E+01 0.261E+01
DAV: 8 -0.169481482195E+03 0.23244E+00 -0.16604E+00 6976 0.840E+00 0.149E+01
DAV: 9 -0.169602802943E+03 -0.12132E+00 -0.78061E-01 6656 0.520E+00 0.330E+00
DAV: 10 -0.169565820351E+03 0.36983E-01 -0.22230E-01 7392 0.387E+00 0.233E+00
DAV: 11 -0.169567489048E+03 -0.16687E-02 -0.41704E-02 7392 0.773E-01 0.145E+00
DAV: 12 -0.169565077447E+03 0.24116E-02 -0.84935E-03 8544 0.294E-01 0.626E-01
DAV: 13 -0.169564303591E+03 0.77386E-03 -0.51175E-03 8768 0.337E-01 0.119E-01
DAV: 14 -0.169564422875E+03 -0.11928E-03 -0.44176E-04 8160 0.103E-01 0.122E-01
DAV: 15 -0.169564421433E+03 0.14420E-05 -0.77640E-05 7744 0.375E-02 0.361E-02
DAV: 16 -0.169564418908E+03 0.25256E-05 -0.92158E-06 6816 0.119E-02 0.120E-02
DAV: 17 -0.169564417794E+03 0.11140E-05 -0.26154E-06 8320 0.851E-03 0.570E-03
DAV: 18 -0.169564417940E+03 -0.14590E-06 -0.42598E-07 8416 0.324E-03 0.264E-03
DAV: 19 -0.169564417357E+03 0.58232E-06 -0.16410E-07 7552 0.216E-03 0.119E-03
DAV: 20 -0.169564417463E+03 -0.10538E-06 -0.24085E-08 5376 0.707E-04 0.700E-04
DAV: 21 -0.169564417484E+03 -0.21687E-07 -0.51509E-09 4160 0.337E-04 0.175E-04
DAV: 22 -0.169564417492E+03 -0.77941E-08 -0.48097E-09 4160 0.298E-04
1 F= -.16956442E+03 E0= -.16956463E+03 d E =0.644280E-03 mag= -0.0002
Linear response reoptimize wavefunctions to high precision
DAV: 1 -0.169564417491E+03 0.76398E-09 -0.15442E-09 4160 0.280E-04
DAV: 2 -0.169564417491E+03 -0.58890E-10 -0.59245E-10 4160 0.113E-04
Linear response G [H, r] |phi>, progress :
Direction: 1
N E dE d eps ncg rms
Failing in Thread:1
call to cuCtxSynchronize returned error 214: Other
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[33005,1],0]
Exit code: 1
--------------------------------------------------------------------------
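In case it helps to pin down whether this is memory related: beyond varying NSIM, one can log GPU memory while the job runs and check whether usage climbs just before the linear-response part starts. A minimal sketch, assuming a standard nvidia-smi installation; the log file name memlog.csv is just an example:

Code:
# Log GPU memory once per second to a CSV file while VASP runs
# (memlog.csv is an arbitrary example name).
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
           --format=csv -l 1 > memlog.csv &
LOGPID=$!

mpirun -n 1 /data/Software/Vasp/vasp.6.3.2/bin_gpu/vasp_std

# Stop the logger once VASP exits.
kill $LOGPID

If memory.used approaches memory.total right before the cuCtxSynchronize failure, that would point toward memory exhaustion; if it stays well below the total, the crash is more likely something else in the linear-response GPU path.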