running VASP with openmpi in /state/partition1 (scratch)
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
To avoid writing files over NFS, it is customary to run code in a scratch directory that is visible to only one node. However, with VASP I have a problem doing this for a parallel job. At startup my SGE script creates
WORKDIR=/state/partition1/$USER/$JOB_NAME-${JOB_ID%%.*}
as the working directory and runs from there.
1) Many MPI versions require that you create this directory on ALL the other nodes as well, because MPI switches to this directory even when no files are read there. I use Open MPI 1.4.3.
2) Apparently, VASP also requires the same input files to be present on ALL nodes, not just the start-up node (see the sketch after this list).
3) Okay, with that it runs, but it hangs after a few hours.
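For illustration, a minimal sketch of the kind of SGE prologue that steps 1) and 2) imply. This is not my actual script: the $PE_HOSTFILE parsing, passwordless ssh between nodes, and the binary path are assumptions.
#!/bin/bash
#$ -cwd
WORKDIR=/state/partition1/$USER/$JOB_NAME-${JOB_ID%%.*}
# Create the scratch directory and replicate the input on EVERY node of the job
# (assumes passwordless ssh; $PE_HOSTFILE lists one host per line, hostname in column 1).
for host in $(awk '{print $1}' $PE_HOSTFILE | sort -u); do
  ssh $host "mkdir -p $WORKDIR"
  scp INCAR POSCAR POTCAR KPOINTS $host:$WORKDIR/
done
cd $WORKDIR
mpirun -np $NSLOTS /path/to/vasp
# Copy the results back from the master node's scratch to the NFS submit directory.
cp OUTCAR CONTCAR vasprun.xml $SGE_O_WORKDIR/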
If I run the same job from my home directory, which is visible via NFS, it works. The difference is that all nodes then see the _same_ (updated) files in the directory.
Question: is VASP designed such that all nodes need to see the same files (like OUTCAR, CHG, CONTCAR, IBZKPT), or should it be possible to run from a local scratch directory?
David
Last edited by dubbelda on Tue Aug 09, 2011 11:50 am, edited 1 time in total.
- alex
- Hero Member
- Posts: 583
- Joined: Tue Nov 16, 2004 2:21 pm
- License Nr.: 5-67
- Location: Germany
running VASP with openmpi in /state/partition1 (scratch)
Hi David,
To answer your question: try it! At the very least it needs the input files replicated.
Hint: VASP does not write (many) huge files, so you could easily run it via NFS. Just switch off WAVECAR and CHG* writing and you'll end up with only a bunch of mega(!)bytes per (optimisation) run.
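The INCAR tags that switch those files off are (standard tags; adjust to taste):
LWAVE  = .FALSE.   ! do not write WAVECAR
LCHARG = .FALSE.   ! do not write CHGCAR / CHG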
Cheers,
alex
Last edited by alex on Tue Aug 09, 2011 5:34 pm, edited 1 time in total.
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
Thanks Alex, I did try it, and it randomly hangs after a while. However, in longer tests my jobs run from the home directory eventually hang as well, so it is _not_ related to the scratch disk. That part works (as long as you copy the input to all the nodes). But as you say, if VASP does not write large temp files, there is little use in doing this.
I am computing elastic constants on a big system, but after a while (say around 200 of the total 648 steps needed to numerically compute the generalized Hessian matrix) the output stops and nothing happens. Has anybody got experience with this? I am now trying a build without any compiler optimization options. If that still does not work, I will also try MVAPICH. I am running over InfiniBand.
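For context, this kind of run is driven by finite differences in the INCAR; a sketch of the typical tags, which may differ from my exact input:
IBRION = 6   ! finite differences with symmetry, builds the Hessian numerically
ISIF = 3     ! also distort the lattice to obtain the elastic constants
NFREE = 4    ! displacements per degree of freedom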
David
Last edited by dubbelda on Wed Aug 10, 2011 12:17 pm, edited 1 time in total.
- alex
- Hero Member
- Posts: 583
- Joined: Tue Nov 16, 2004 2:21 pm
- License Nr.: 5-67
- Location: Germany
running VASP with openmpi in /state/partition1 (scratch)
Hi David,
This sounds weird. I'm still running my stone-age openmpi 1.2.6 over IB, because I had difficulties with 1.4.3. Which ones, I can't remember.
The randomness suggests, IMO, network problems. Is it a professionally set-up system? Are you sure you've installed the proper drivers and brought the network up accordingly?
Do long optimisations show similar misbehaviour, or is it just the frequency calculation?
Which VASP version are you using?
Cheers,
alex
Last edited by alex on Wed Aug 10, 2011 5:26 pm, edited 1 time in total.
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
Details: VASP 5.2.11, Rocks 5.3, Opteron 6164 HE (24 cores per node), Intel composerxe-2011.4.191 compiler, MKL, ScaLAPACK, and FFTW via MKL. InfiniBand using OFED-1.5.3.
Rocks ships an old gfortran (4.1), and I was unable to get a proper executable using gfortran and ACML. I then tried gfortran 4.4, installed it, same thing. Actually, most things built against ACML crash for me after a while (segmentation faults). MKL seems more stable for me (even though I have AMD Opterons).
Long geometry optimizations are fine. The elastic-constants run takes several days, and I have not yet found out what is causing the hangs. A 'top' shows every core running at 100% CPU, but there is just no output anymore.
Perhaps it is an MPI thing. I will try running over Ethernet instead of InfiniBand and see if that works.
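With Open MPI this can be forced from the mpirun command line, so nothing needs recabling; a sketch, assuming the btl component names of the 1.4 series:
# exclude the InfiniBand byte-transfer layer so Open MPI falls back to TCP:
mpirun --mca btl ^openib -np $NSLOTS /path/to/vasp
# or name the transports explicitly:
mpirun --mca btl tcp,sm,self -np $NSLOTS /path/to/vasp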
David
Last edited by dubbelda on Wed Aug 10, 2011 5:46 pm, edited 1 time in total.
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
I got no hangs when I use (in the INCAR):
LSCALAPACK = .FALSE.
so the problem seems to be related to the Intel ScaLAPACK. But it could still be a VASP or Open MPI issue.
In my makefile:
CPP = $(CPP_) -DMPI -DHOST=\"LinuxIFC\" -DIFC \
-Dkind8 -DNGZhalf -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=500 -DscaLAPACK
MKLINCLUDE=/share/apps/intel/mkl/include/fftw
MKLPATH=/share/apps/intel/mkl/lib/intel64
BLAS=-L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/em64t/lp64 -lmkl_blas95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_sequential.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -lpthread
LAPACK=-L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/em64t/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_sequential.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -lpthread
SCA=${MKLPATH}/libmkl_scalapack_lp64.a ${MKLPATH}/libmkl_solver_lp64_sequential.a -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_sequential.a ${MKLPATH}/libmkl_core.a ${MKLPATH}/libmkl_blacs_openmpi_lp64.a -Wl,--end-group -lpthread -lpthread -limf -lm
FFT3D = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o /share/apps/intel/mkl/interfaces/fftw3xf/libfftw3xf_intel.a fft3dlib.o
Would this be correct?
David
Last edited by dubbelda on Fri Aug 12, 2011 1:27 pm, edited 1 time in total.
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
D'oh, I think most of the problems are caused by a faulty InfiniBand card in one of the nodes... ibcheckerrors and ibchecknet now show errors for this node.
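For anyone else hunting this kind of fault: the OFED diagnostics I used, plus a third that is also worth running (all from the infiniband-diags package):
ibcheckerrors   # walk the fabric and report ports whose error counters exceed thresholds
ibchecknet      # check port state and connectivity across the whole subnet
ibstat          # show the local HCA's port state and link rate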
David
Last edited by dubbelda on Wed Aug 17, 2011 11:50 pm, edited 1 time in total.
- dubbelda
- Newbie
- Posts: 7
- Joined: Sat Jul 23, 2011 11:49 pm
running VASP with openmpi in /state/partition1 (scratch)
Just wanted to post an update: for me, upgrading from openmpi-1.4.3 (from OFED-1.5.4) to openmpi-1.4.5 solved all my problems:
1) segmentation faults when running on more than 1 node,
2) segmentation faults for certain NPAR/NSIM values,
3) random hangs,
4) empty output-files.
Since this update VASP 5.2.12 has been running great (I am running systems with 700 atoms on Intel Xeon X5675 nodes with lots of memory over InfiniBand, Mellanox MT26428 ConnectX, Linux kernel 2.6.32-220.13.1.el6, Rocks).
Last edited by dubbelda on Sat Apr 28, 2012 1:20 pm, edited 1 time in total.