Segmentation fault after 2 iter VASP 5.4.4 on GPU
Posted: Wed Jan 29, 2020 1:08 pm
Hi,
We get a segmentation fault every time after 2 iterations of an MD simulation when we try to run large supercells with VASP 5.4.4 built with the CUDA libraries. The energy optimization of the same system (IBRION = 2, ISIF = 3), on the other hand, works well.
Our box contains 243 atoms of alpha-quartz.
Our supercomputer consists of Intel Haswell-EP E5-2697v3 nodes with NVIDIA Tesla K40M GPU cards.
We use these modules when compiling and running VASP: intel/15.0.3, openmpi/2.1.1-icc, cuda/6.5, mkl/11.1.3.
The same task succeeds when we run the MD simulation with the CPU version of VASP 5.4.1 on another supercomputer with Intel Xeon X5570 nodes and the modules intel/15.0.090, impi/5.0.1, mkl/11.2.0.
Can anyone help me out? The input test files INCAR, KPOINTS and POSCAR are below, followed by the error output from a run on 1 GPU card and the output of the successful CPU run.
INCAR:
SYSTEM = SiO2: alfa-quartz, Si 81, O 162, 243 atoms, SuperCell 3x3x3
LWAVE = .FALSE.
LCHARG = .FALSE.
LREAL=A
ISYM = 0
ISMEAR = 0
SIGMA = 0.1
#ENCUT = 600.0
IBRION = 0
MDALGO = 3
ISIF = 3
#SMASS = -1
LANGEVIN_GAMMA = 30.0 30.0
LANGEVIN_GAMMA_L = 30.0
PMASS = 3840
ALGO = VeryFast
PREC = Normal
TEBEG = 300
NSW = 10
POTIM = 1.0
KPOINTS:
K-Points
0
Gamma
1 1 1
0 0 0
POSCAR:
SiO2_quartz
1.0
14.7390003204 0.0000000000 0.0000000000
-7.3695001602 12.7643487039 0.0000000000
0.0000000000 0.0000000000 16.2155990601
O Si
162 81
Cartesian
.....
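As a quick sanity check of the POSCAR header above, here is a minimal Python sketch that recomputes the cell metrics from the lattice vectors and species counts. The 3x3x3 repeat is taken from the SYSTEM tag in the INCAR; the helper names (`norm`, `angle_deg`, `cell_volume`) are only illustrative and are not part of VASP.

```python
import math

# Lattice vectors copied from the POSCAR above (Angstrom, scaling factor 1.0).
LATTICE = [
    (14.7390003204, 0.0, 0.0),
    (-7.3695001602, 12.7643487039, 0.0),
    (0.0, 0.0, 16.2155990601),
]
# Species counts from the POSCAR; the repeat comes from the INCAR SYSTEM tag.
COUNTS = {"O": 162, "Si": 81}
REPEAT = (3, 3, 3)


def norm(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(x * x for x in v))


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    cos_uv = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(cos_uv))


def cell_volume(a, b, c):
    """Cell volume as the absolute scalar triple product a . (b x c)."""
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return abs(sum(x * y for x, y in zip(a, cross)))


n_atoms = sum(COUNTS.values())
n_cells = REPEAT[0] * REPEAT[1] * REPEAT[2]
atoms_per_cell = n_atoms / n_cells
a_cell = norm(LATTICE[0]) / REPEAT[0]   # in-plane lattice constant per unit cell
c_cell = norm(LATTICE[2]) / REPEAT[2]   # c lattice constant per unit cell
gamma = angle_deg(LATTICE[0], LATTICE[1])
volume = cell_volume(*LATTICE)

print(f"atoms: {n_atoms} ({atoms_per_cell:g} per unit cell)")
print(f"a = {a_cell:.4f} A, c = {c_cell:.4f} A, gamma = {gamma:.3f} deg")
print(f"supercell volume = {volume:.2f} A^3")
```

The counts give 243 atoms, i.e. 9 atoms (3 SiO2 units) per unit cell, and the angle between the first two lattice vectors comes out as 120 degrees, consistent with a hexagonal 3x3x3 supercell.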
Error output from the run on 1 GPU card:
Using device 0 (rank 0, local rank 0, local size 1) : Tesla K40st
running on 1 total cores
distrk: each k-point on 1 cores, 1 groups
distr: one band on 1 cores, 1 groups
using from now: INCAR
*******************************************************************************
You are running the GPU port of VASP! When publishing results obtained with
this version, please cite:
- M. Hacene et al., http://dx.doi.org/10.1002/jcc.23096
- M. Hutchinson and M. Widom, http://dx.doi.org/10.1016/j.cpc.2012.02.017
in addition to the usual required citations (see manual).
GPU developers: A. Anciaux-Sedrakian, C. Angerer, and M. Hutchinson.
*******************************************************************************
vasp.5.4.4.18Apr17-6-g9f103f2a35 (build Jul 03 2017 17:00:58) complex
POSCAR found type information on POSCAR O Si
POSCAR found : 2 types and 243 ions
LDA part: xc-table for Pade appr. of Perdew
WARNING: The GPU port of VASP has been extensively
tested for: ALGO=Normal, Fast, and VeryFast.
Other algorithms may produce incorrect results or
yield suboptimal performance. Handle with care!
POSCAR, INCAR and KPOINTS ok, starting setup
creating 32 CUDA streams...
creating 32 CUFFT plans with grid size 72 x 70 x 70...
FFT: planning ...
WAVECAR not read
prediction of wavefunctions initialized - no I/O
######################################################################
entering main loop
N E dE d eps ncg rms rms(c)
RMM: 1 0.220862746483E+05 0.22086E+05 -0.36918E+05 777 0.988E+02
RMM: 2 0.150356269278E+05 -0.70506E+04 -0.66219E+04 777 0.461E+02
RMM: 3 0.947948630441E+04 -0.55561E+04 -0.40170E+04 777 0.352E+02
RMM: 4 0.633797469509E+04 -0.31415E+04 -0.24216E+04 777 0.279E+02
RMM: 5 0.441019924087E+04 -0.19278E+04 -0.15981E+04 777 0.236E+02
RMM: 6 0.315396338638E+04 -0.12562E+04 -0.11116E+04 777 0.208E+02
RMM: 7 0.223813602408E+04 -0.91583E+03 -0.85601E+03 777 0.189E+02
RMM: 8 0.151620648562E+04 -0.72193E+03 -0.69290E+03 777 0.174E+02
RMM: 9 -0.778374908781E+03 -0.22946E+04 -0.13638E+04 2292 0.163E+02
RMM: 10 -0.165099748848E+04 -0.87262E+03 -0.52765E+03 2276 0.487E+01
RMM: 11 -0.188654282149E+04 -0.23555E+03 -0.25003E+03 1716 0.600E+01
RMM: 12 -0.202262859422E+04 -0.13609E+03 -0.13550E+03 1996 0.186E+01 0.112E+02
RMM: 13 -0.181428587026E+04 0.20834E+03 -0.77143E+02 2134 0.404E+01 0.831E+01
RMM: 14 -0.183042132334E+04 -0.16135E+02 -0.33408E+02 2058 0.191E+01 0.646E+01
RMM: 15 -0.185533603019E+04 -0.24915E+02 -0.20262E+02 1945 0.164E+01 0.301E+01
RMM: 16 -0.184354463284E+04 0.11791E+02 -0.47213E+01 2040 0.104E+01 0.842E+00
RMM: 17 -0.184462327420E+04 -0.10786E+01 -0.26974E+01 1965 0.707E+00 0.931E+00
RMM: 18 -0.184396313235E+04 0.66014E+00 -0.40891E+00 1919 0.427E+00 0.115E+00
RMM: 19 -0.184419551414E+04 -0.23238E+00 -0.20106E+00 1916 0.188E+00 0.434E+00
RMM: 20 -0.184403837638E+04 0.15714E+00 -0.38680E-01 1885 0.132E+00 0.198E+00
RMM: 21 -0.184403372340E+04 0.46530E-02 -0.35328E-01 1888 0.720E-01 0.316E-01
RMM: 22 -0.184403669105E+04 -0.29676E-02 -0.39324E-02 1751 0.490E-01 0.336E-01
RMM: 23 -0.184403607685E+04 0.61420E-03 -0.53081E-03 1690 0.115E-01 0.177E-01
RMM: 24 -0.184403630741E+04 -0.23055E-03 -0.95524E-04 1328 0.788E-02 0.206E-01
RMM: 25 -0.184403598416E+04 0.32324E-03 -0.12132E-03 1610 0.513E-02 0.367E-02
RMM: 26 -0.184403601390E+04 -0.29732E-04 -0.45587E-04 1195 0.595E-02
1 T= 280. E= -.18351244E+04 F= -.18440360E+04 E0= -.18440360E+04 EK= 0.89116E+01 SP= 0.00E+00 SK= 0.00E+00
######################################################################
bond charge predicted
N E dE d eps ncg rms rms(c)
RMM: 1 -0.184272191805E+04 0.13141E+01 -0.43494E+01 2118 0.103E+01 0.114E+00
RMM: 2 -0.184344173004E+04 -0.71981E+00 -0.80263E+00 2056 0.423E+00 0.790E-01
RMM: 3 -0.184344384777E+04 -0.21177E-02 -0.10826E-01 2020 0.283E-01 0.520E-01
RMM: 4 -0.184344172215E+04 0.21256E-02 -0.25429E-02 1828 0.310E-01 0.164E-01
RMM: 5 -0.184344260129E+04 -0.87914E-03 -0.11951E-02 1682 0.163E-01 0.112E-01
RMM: 6 -0.184344251539E+04 0.85906E-04 -0.22211E-03 1679 0.924E-02 0.504E-02
RMM: 7 -0.184344254169E+04 -0.26308E-04 -0.73407E-04 1293 0.387E-02
2 T= 263. E= -.18350927E+04 F= -.18434425E+04 E0= -.18434425E+04 EK= 0.83498E+01 SP= 0.00E+00 SK= 0.00E+00
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 30925 on node n48618 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
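To see at which ionic step a run stops, the per-step summary lines (`T= ... E= ... F= ...`) can be pulled out of the standard output. The following is a minimal Python sketch under the assumption that the summary lines have exactly the layout shown in the log above; `parse_md_steps` and the regular expression are our own helpers, not part of VASP.

```python
import re

# Matches MD summary lines such as:
#   1 T= 280. E= -.18351244E+04 F= -.18440360E+04 E0= ... EK= 0.89116E+01 ...
# SCF lines start with "RMM:" and therefore do not match.
STEP_RE = re.compile(
    r"^\s*(\d+)\s+T=\s*(\S+)\s+E=\s*(\S+)\s+F=\s*(\S+)\s+E0=\s*(\S+)\s+EK=\s*(\S+)"
)


def parse_md_steps(text):
    """Return one dict per completed ionic step found in the log text."""
    steps = []
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match:
            step, temp, e_tot, free, e0, e_kin = match.groups()
            steps.append({
                "step": int(step),
                "T": float(temp),
                "E": float(e_tot),
                "F": float(free),
                "E0": float(e0),
                "EK": float(e_kin),
            })
    return steps


# The two summary lines from the GPU log above, with one SCF line for contrast.
sample = """\
RMM: 26 -0.184403601390E+04 -0.29732E-04 -0.45587E-04 1195 0.595E-02
1 T= 280. E= -.18351244E+04 F= -.18440360E+04 E0= -.18440360E+04 EK= 0.89116E+01 SP= 0.00E+00 SK= 0.00E+00
2 T= 263. E= -.18350927E+04 F= -.18434425E+04 E0= -.18434425E+04 EK= 0.83498E+01 SP= 0.00E+00 SK= 0.00E+00
"""

steps = parse_md_steps(sample)
last = steps[-1]
print(f"last completed ionic step: {last['step']} (T = {last['T']} K, F = {last['F']} eV)")
```

Applied to the full GPU output, this reports step 2 as the last completed ionic step, which is where the segmentation fault occurs.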
The output of the successful MD simulation with the CPU version is below:
running on 122 total cores
distrk: each k-point on 122 cores, 1 groups
distr: one band on 1 cores, 122 groups
using from now: INCAR
vasp.5.4.1 05Feb16 (build Feb 22 2016 23:54:54) complex
POSCAR found type information on POSCAR O Si
POSCAR found : 2 types and 243 ions
-----------------------------------------------------------------------------
| |
| W W AA RRRRR N N II N N GGGG !!! |
| W W A A R R NN N II NN N G G !!! |
| W W A A R R N N N II N N N G !!! |
| W WW W AAAAAA RRRRR N N N II N N N G GGG ! |
| WW WW A A R R N NN II N NN G G |
| W W A A R R N N II N N GGGG !!! |
| |
| For optimal performance we recommend to set |
| NCORE= 4 - approx SQRT( number of cores) |
| NCORE specifies how many cores store one orbital (NPAR=cpu/NCORE). |
| This setting can greatly improve the performance of VASP for DFT. |
| The default, NPAR=number of cores might be grossly inefficient |
| on modern multi-core architectures or massively parallel machines. |
| Do your own testing !!!! |
| Unfortunately you need to use the default for GW and RPA calculations. |
| (for HF NCORE is supported but not extensively tested yet) |
| |
-----------------------------------------------------------------------------
LDA part: xc-table for Pade appr. of Perdew
-----------------------------------------------------------------------------
| |
| W W AA RRRRR N N II N N GGGG !!! |
| W W A A R R NN N II NN N G G !!! |
| W W A A R R N N N II N N N G !!! |
| W WW W AAAAAA RRRRR N N N II N N N G GGG ! |
| WW WW A A R R N NN II N NN G G |
| W W A A R R N N II N N GGGG !!! |
| |
| VASP found 726 degrees of freedom |
| the temperature will equal 2*E(kin)/ (degrees of freedom) |
| this differs from previous releases, where T was 2*E(kin)/(3 NIONS). |
| The new definition is more consistent |
| |
-----------------------------------------------------------------------------
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: small aliasing (wrap around) errors must be expected
FFT: planning ...
WAVECAR not read
prediction of wavefunctions initialized - no I/O
entering main loop
N E dE d eps ncg rms rms(c)
RMM: 1 0.216431141959E+05 0.21643E+05 -0.42943E+05 854 0.110E+03
RMM: 2 0.148501855704E+05 -0.67929E+04 -0.68490E+04 854 0.470E+02
RMM: 3 0.878612807270E+04 -0.60641E+04 -0.41293E+04 854 0.342E+02
RMM: 4 0.548647806810E+04 -0.32997E+04 -0.24740E+04 854 0.265E+02
RMM: 5 0.352306914928E+04 -0.19634E+04 -0.15982E+04 854 0.223E+02
RMM: 6 0.225702374355E+04 -0.12660E+04 -0.11169E+04 854 0.197E+02
RMM: 7 0.133041003489E+04 -0.92661E+03 -0.86353E+03 854 0.181E+02
RMM: 8 0.594936726974E+03 -0.73547E+03 -0.70270E+03 854 0.167E+02
RMM: 9 -0.129355660194E+04 -0.18885E+04 -0.16595E+04 2320 0.156E+02
RMM: 10 -0.207907603242E+04 -0.78552E+03 -0.46532E+03 2295 0.444E+01
RMM: 11 -0.213418000652E+04 -0.55104E+02 -0.71835E+02 2103 0.411E+01
RMM: 12 -0.216872659493E+04 -0.34547E+02 -0.32902E+02 1923 0.963E+00 0.111E+02
RMM: 13 -0.190551200897E+04 0.26321E+03 -0.90315E+02 2308 0.466E+01 0.809E+01
RMM: 14 -0.191496197953E+04 -0.94500E+01 -0.23886E+02 2228 0.188E+01 0.661E+01
RMM: 15 -0.192814651192E+04 -0.13185E+02 -0.73264E+01 2040 0.152E+01 0.248E+01
RMM: 16 -0.191420569522E+04 0.13941E+02 -0.25195E+01 1892 0.114E+01 0.836E+00
RMM: 17 -0.191480751409E+04 -0.60182E+00 -0.20339E+01 2144 0.591E+00 0.642E+00
RMM: 18 -0.191480691969E+04 0.59440E-03 -0.19970E+00 1875 0.327E+00 0.345E+00
RMM: 19 -0.191493673947E+04 -0.12982E+00 -0.13952E+00 2010 0.197E+00 0.324E+00
RMM: 20 -0.191485577361E+04 0.80966E-01 -0.14144E-01 1872 0.948E-01 0.226E+00
RMM: 21 -0.191482345668E+04 0.32317E-01 -0.10659E-01 2020 0.468E-01 0.106E+00
RMM: 22 -0.191481861725E+04 0.48394E-02 -0.39937E-02 2208 0.319E-01 0.622E-01
RMM: 23 -0.191481766996E+04 0.94728E-03 -0.21691E-02 1996 0.215E-01 0.227E-01
RMM: 24 -0.191481811091E+04 -0.44095E-03 -0.65005E-03 1906 0.175E-01 0.113E-01
RMM: 25 -0.191481826537E+04 -0.15446E-03 -0.11978E-03 1617 0.676E-02 0.751E-02
RMM: 26 -0.191481832160E+04 -0.56229E-04 -0.25521E-04 1185 0.332E-02
1 T= 301. E= -.19054124E+04 F= -.19148183E+04 E0= -.19148183E+04 EK= 0.94059E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191380220994E+04 0.10161E+01 -0.49753E+01 2179 0.117E+01 0.126E+00
RMM: 2 -0.191471643144E+04 -0.91422E+00 -0.98691E+00 2115 0.600E+00 0.802E-01
RMM: 3 -0.191473294215E+04 -0.16511E-01 -0.29038E-01 2002 0.618E-01 0.629E-01
RMM: 4 -0.191472627645E+04 0.66657E-02 -0.36815E-02 1834 0.428E-01 0.348E-01
RMM: 5 -0.191472523053E+04 0.10459E-02 -0.32551E-02 1829 0.240E-01 0.165E-01
RMM: 6 -0.191472526388E+04 -0.33345E-04 -0.59988E-03 1685 0.193E-01 0.877E-02
RMM: 7 -0.191472542890E+04 -0.16502E-03 -0.38046E-03 1879 0.876E-02 0.898E-02
RMM: 8 -0.191472530362E+04 0.12528E-03 -0.89987E-04 1499 0.693E-02 0.319E-02
RMM: 9 -0.191472531426E+04 -0.10638E-04 -0.29434E-04 1213 0.241E-02
2 T= 298. E= -.19054113E+04 F= -.19147253E+04 E0= -.19147253E+04 EK= 0.93141E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191436534444E+04 0.35996E+00 -0.25506E-01 2002 0.955E-01 0.132E-01
RMM: 2 -0.191436957189E+04 -0.42274E-02 -0.43352E-02 1840 0.478E-01 0.522E-02
RMM: 3 -0.191437142206E+04 -0.18502E-02 -0.18556E-02 1708 0.102E-01 0.693E-02
RMM: 4 -0.191437141249E+04 0.95722E-05 -0.78607E-04 1509 0.717E-02
3 T= 287. E= -.19054083E+04 F= -.19143714E+04 E0= -.19143714E+04 EK= 0.89631E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191381705936E+04 0.55436E+00 -0.62560E-02 2346 0.357E-01 0.644E-02
RMM: 2 -0.191381792608E+04 -0.86672E-03 -0.90220E-03 2069 0.761E-02 0.505E-02
RMM: 3 -0.191381794926E+04 -0.23177E-04 -0.63733E-04 1392 0.549E-02
4 T= 269. E= -.19054039E+04 F= -.19138179E+04 E0= -.19138179E+04 EK= 0.84141E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191315355558E+04 0.66437E+00 -0.10716E-01 2397 0.339E-01 0.130E-01
RMM: 2 -0.191315456585E+04 -0.10103E-02 -0.12246E-02 2234 0.129E-01 0.420E-02
RMM: 3 -0.191315467423E+04 -0.10838E-03 -0.13635E-03 1542 0.464E-02 0.213E-02
RMM: 4 -0.191315468571E+04 -0.11478E-04 -0.27159E-04 1142 0.289E-02
5 T= 248. E= -.19053985E+04 F= -.19131547E+04 E0= -.19131547E+04 EK= 0.77561E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191248159339E+04 0.67308E+00 -0.11842E-01 2400 0.367E-01 0.708E-02
RMM: 2 -0.191248295088E+04 -0.13575E-02 -0.13906E-02 2229 0.124E-01 0.451E-02
RMM: 3 -0.191248307928E+04 -0.12841E-03 -0.16352E-03 1562 0.501E-02 0.190E-02
RMM: 4 -0.191248309655E+04 -0.17271E-04 -0.25035E-04 1157 0.374E-02
6 T= 227. E= -.19053927E+04 F= -.19124831E+04 E0= -.19124831E+04 EK= 0.70904E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191189705599E+04 0.58602E+00 -0.12630E-01 2488 0.315E-01 0.653E-02
RMM: 2 -0.191189786139E+04 -0.80540E-03 -0.85453E-03 2164 0.120E-01 0.278E-02
RMM: 3 -0.191189798238E+04 -0.12099E-03 -0.13235E-03 1594 0.368E-02 0.193E-02
RMM: 4 -0.191189799015E+04 -0.77675E-05 -0.16916E-04 1125 0.277E-02
7 T= 208. E= -.19053880E+04 F= -.19118980E+04 E0= -.19118980E+04 EK= 0.65100E+01 SP= 0.00E+00 SK= 0.00E+00
Information: wavefunction orthogonal band 838 0.8936
Information: wavefunction orthogonal band 840 0.8926
Information: wavefunction orthogonal band 841 0.8995
Information: wavefunction orthogonal band 842 0.8998
Information: wavefunction orthogonal band 843 0.8968
Information: wavefunction orthogonal band 845 0.8959
Information: wavefunction orthogonal band 846 0.8849
Information: wavefunction orthogonal band 847 0.8939
Information: wavefunction orthogonal band 848 0.8648
Information: wavefunction orthogonal band 849 0.8833
Information: wavefunction orthogonal band 850 0.8827
Information: wavefunction orthogonal band 851 0.8814
Information: wavefunction orthogonal band 852 0.8697
Information: wavefunction orthogonal band 853 0.8780
Information: wavefunction orthogonal band 854 0.8795
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191147067616E+04 0.42731E+00 -0.88569E-02 2478 0.268E-01 0.432E-02
RMM: 2 -0.191147130034E+04 -0.62418E-03 -0.64143E-03 2102 0.108E-01 0.276E-02
RMM: 3 -0.191147140183E+04 -0.10149E-03 -0.11003E-03 1559 0.353E-02 0.181E-02
RMM: 4 -0.191147141678E+04 -0.14946E-04 -0.19650E-04 1146 0.312E-02
8 T= 195. E= -.19053857E+04 F= -.19114714E+04 E0= -.19114714E+04 EK= 0.60857E+01 SP= 0.00E+00 SK= 0.00E+00
Information: wavefunction orthogonal band 853 0.8892
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191123955161E+04 0.23185E+00 -0.10302E-01 2426 0.251E-01 0.411E-02
RMM: 2 -0.191124014174E+04 -0.59013E-03 -0.61092E-03 1989 0.139E-01 0.223E-02
RMM: 3 -0.191124029109E+04 -0.14935E-03 -0.15365E-03 1724 0.345E-02 0.161E-02
RMM: 4 -0.191124030456E+04 -0.13473E-04 -0.19214E-04 1146 0.299E-02
9 T= 187. E= -.19053856E+04 F= -.19112403E+04 E0= -.19112403E+04 EK= 0.58547E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
N E dE d eps ncg rms rms(c)
RMM: 1 -0.191120247177E+04 0.37819E-01 -0.10602E-01 2401 0.255E-01 0.426E-02
RMM: 2 -0.191120312176E+04 -0.64999E-03 -0.67053E-03 2038 0.140E-01 0.223E-02
RMM: 3 -0.191120327196E+04 -0.15020E-03 -0.15437E-03 1718 0.337E-02 0.158E-02
RMM: 4 -0.191120328759E+04 -0.15632E-04 -0.19134E-04 1146 0.298E-02
10 T= 186. E= -.19053879E+04 F= -.19112033E+04 E0= -.19112033E+04 EK= 0.58154E+01 SP= 0.00E+00 SK= 0.00E+00
bond charge predicted
prediction of wavefunctions
wavefunctions rotated
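The warning in the CPU log says that the temperature is now defined as 2*E(kin)/(degrees of freedom) with 726 degrees of freedom. As a small worked check, the sketch below recomputes T from the EK values printed in the CPU log, assuming that EK is in eV and that the Boltzmann constant (which the warning text omits) enters as T = 2 EK / (k_B N_dof). The function name `md_temperature` is only illustrative.

```python
K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K (CODATA value)
N_DOF = 726                    # degrees of freedom reported in the CPU log


def md_temperature(e_kin_ev, n_dof=N_DOF):
    """Temperature in K from the kinetic energy in eV: T = 2 EK / (k_B N_dof)."""
    return 2.0 * e_kin_ev / (K_B_EV_PER_K * n_dof)


# (ionic step, EK in eV, T in K) copied from the CPU log above.
logged = [
    (1, 9.4059, 301),
    (2, 9.3141, 298),
    (7, 6.5100, 208),
    (10, 5.8154, 186),
]

for step, e_kin, t_logged in logged:
    t_calc = md_temperature(e_kin)
    print(f"step {step:2d}: computed T = {t_calc:6.1f} K, logged T = {t_logged} K")
```

For these steps of the CPU run the recomputed temperatures round to the logged values, so the T column there follows directly from EK and the 726 degrees of freedom.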
Thanks!