A serious problem with PARALLEL RUNNING of VASP

Message

maryam · #1 Post by **maryam** » Fri Dec 11, 2009 10:34 am

Dear all,
We have a remote accessible to a server that is a cluster x86_64 whose each node has two quad core cpu's. The OS is RedHat Enterprise 5.1 (64bit). Our compiler and libraries are
Intel Fortran Compiler 11.1,
GotoBlas-1.26,
Intel MPI 3.2.2,
For clarifying of the problem, I explain steps which have been done.

Step 1. Making of the parallel executable file of VASP

The executable file of vasp has been made by making the following Makefile. As you can see, the section of MPI has been activated and BLAS library is addressed using the GotoBLAS (which has been made by setting SMP=1, MAX_THREADS=16, and Binary64=1)

*******
.SUFFIXES: .inc .f .f90 .F
#-----------------------------------------------------------------------
# Makefile for Intel Fortran compiler for P4 systems
#
# The makefile was tested only under Linux on Intel platforms
# (Suse 5.3- Suse 9.0)
# the followin compiler versions have been tested
# 5.0, 6.0, 7.0 and 7.1 (some 8.0 versions seem to fail compiling the code)
# presently we recommend version 7.1 or 7.0, since these
# releases have been used to compile the present code versions
#
# it might be required to change some of library pathes, since
# LINUX installation vary a lot
# Hence check ***ALL**** options in this makefile very carefully
#-----------------------------------------------------------------------
#
# BLAS must be installed on the machine
# there are several options:
# 1) very slow but works:
# retrieve the lapackage from ftp.netlib.org
# and compile the blas routines (BLAS/SRC directory)
# please use g77 or f77 for the compilation. When I tried to
# use pgf77 or pgf90 for BLAS, VASP hang up when calling
# ZHEEV (however this was with lapack 1.1 now I use lapack 2.0)
# 2) most desirable: get an optimized BLAS
#
# the two most reliable packages around are presently:
# 3a) Intels own optimised BLAS (PIII, P4, Itanium)
# http://developer.intel.com/software/products/mkl/
# this is really excellent when you use Intel CPU's
#
# 3b) or obtain the atlas based BLAS routines
# http://math-atlas.sourceforge.net/
# you certainly need atlas on the Athlon, since the mkl
# routines are not optimal on the Athlon.
# If you want to use atlas based BLAS, check the lines around LIB=
#
# 3c) mindblowing fast SSE2 (4 GFlops on P4, 2.53 GHz)
# Kazushige Goto's BLAS
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
#
#-----------------------------------------------------------------------

# all CPP processed fortran files have the extension .f90
SUFFIX=.f90

#-----------------------------------------------------------------------
# fortran compiler and linker
#-----------------------------------------------------------------------
#FC=ifort
# fortran linker
#FCL=$(FC)

#-----------------------------------------------------------------------
# whereis CPP ?? (I need CPP, can't use gcc with proper options)
# that's the location of gcc for SUSE 5.3
#
# CPP_ = /usr/lib/gcc-lib/i486-linux/2.7.2/cpp -P -C
#
# that's probably the right line for some Red Hat distribution:
#
# CPP_ = /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/cpp -P -C
#
# SUSE X.X, maybe some Red Hat distributions:

CPP_ = ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional >$*$(SUFFIX)

#-----------------------------------------------------------------------
# possible options for CPP:
# NGXhalf charge density reduced in X direction
# wNGXhalf gamma point only reduced in X direction
# avoidalloc avoid ALLOCATE if possible
# IFC work around some IFC bugs
# CACHE_SIZE 1000 for PII,PIII, 5000 for Athlon, 8000-12000 P4
# RPROMU_DGEMV use DGEMV instead of DGEMM in RPRO (depends on used BLAS)
# RACCMU_DGEMV use DGEMV instead of DGEMM in RACC (depends on used BLAS)
#-----------------------------------------------------------------------

#CPP = $(CPP_) -DHOST=\"LinuxIFC\" \
# -Dkind8 -DNGXhalf -DCACHE_SIZE=12000 -DPGF90 -Davoidalloc \
# -DRPROMU_DGEMV -DRACCMU_DGEMV

#-----------------------------------------------------------------------
# general fortran flags (there must a trailing blank on this line)
#-----------------------------------------------------------------------

FFLAGS = -FR -lowercase -assume byterecl

#-----------------------------------------------------------------------
# optimization
# we have tested whether higher optimisation improves performance
# -axK SSE1 optimization, but also generate code executable on all mach.
# xK improves performance somewhat on XP, and a is required in order
# to run the code on older Athlons as well
# -xW SSE2 optimization
# -axW SSE2 optimization, but also generate code executable on all mach.
# -tpp6 P3 optimization
# -tpp7 P4 optimization
#-----------------------------------------------------------------------

OFLAG=-O1 -xW -tpp7

OFLAG_HIGH = $(OFLAG)
OBJ_HIGH =

OBJ_NOOPT =
DEBUG = -FR -O0
INLINE = $(OFLAG)

#-----------------------------------------------------------------------
# the following lines specify the position of BLAS and LAPACK
# on P4, VASP works fastest with the libgoto library
# so that's what I recommend
#-----------------------------------------------------------------------

# Atlas based libraries
#ATLASHOME= $(HOME)/archives/BLAS_OPT/ATLAS/lib/Linux_P4SSE2/
#BLAS= -L$(ATLASHOME) -lf77blas -latlas

# use specific libraries (default library path might point to other libraries)
#BLAS= $(ATLASHOME)/libf77blas.a $(ATLASHOME)/libatlas.a

# use the mkl Intel libraries for p4 (www.intel.com)
# mkl.5.1
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines
# BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lpthread

# mkl.5.2 requires also to -lguide library
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/10.2.1.017/lib/em64t -lguide -lmkl_blacs_ilp64 -lmkl_intel_thread -lmkl_core -lmkl_intel_ilp64
#BLAS=-L/opt/intel/Compiler/11.1/056/mkl/lib/em64t -lmkl_solver_lp64 -lmkl_blacs_lp64
BLAS=/home/maryam/GotoBLAS/libgoto_core2p-r1.26.a
# /opt/intel/mkl/10.2.1.017/lib/em64t/libguie.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_blacs_lp64.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_intel_thread.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_core.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_intel_lp64.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_scalapack_lp64.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_sequential.a \
#BLAS =-L/opt/intel/mkl/10.2.1.017/lib/em64t -lmkl_core -lmkl_blacs_intelmpi_ilp64 -lguide
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_blacs_intelmpi_lp64.a

#BLACS=-L/opt/intel/mkl/10.2.1.017/lib/em64t -lmkl_blacs_lp64
# even faster Kazushige Goto's
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
#BLAS=/home/maryam/libgoto_northwoodp-r1.26.so

# LAPACK, simplest use vasp.4.lib/lapack_doublei
LAPACK= ../vasp.4.lib/lapack_double.o

# use atlas optimized part of lapack
#LAPACK= ../vasp.4.lib/lapack_atlas.o -llapack -lcblas

# use the mkl Intel lapack
#LAPACK=-L/opt/intel/mkl/10.2.1.017/lib/em64t -lmkl_scalapack_ilp64 -lmkl_lapack -lmkl_core -liomp5 -lmkl_blacs_intelmpi_ilp64 -lmkl_pgi_thread \
#/opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_blacs_openmpi_ilp64.a
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_lapack.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_em64t.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libmkl_core.a \
# /opt/intel/mkl/10.2.1.017/lib/em64t/libguide.a
#LAPACK= ../vasp.4.lib/lapack_double.o
#-----------------------------------------------------------------------

LIB = -L../vasp.4.lib/ -ldmy \
../vasp.4.lib/linpack_double.o $(LAPACK) \
$(BLAS)

# options for linking (for compiler version 6.X, 7.1) nothing is required
#LINK =
# compiler version 7.0 generates some vector statments which are located
# in the svml library, add the LIBPATH and the library (just in case)
# LINK =/opt/intel/Compiler/11.1/056/lib/intel64/for_main.o

#-----------------------------------------------------------------------
# fft libraries:
# VASP.4.6 can use fftw.3.0.X (http://www.fftw.org)
# since this version is faster on P4 machines, we recommend to use it
#-----------------------------------------------------------------------

#FFT3D = fft3dfurth.o fft3dlib.o
FFT3D = fftw3d.o fft3dlib.o /usr/local/lib/libfftw3.a

#=======================================================================
# MPI section, uncomment the following lines
#
# one comment for users of mpich or lam:
# You must *not* compile mpi with g77/f77, because f77/g77
# appends *two* underscores to symbols that contain already an
# underscore (i.e. MPI_SEND becomes mpi_send__). The pgf90/ifc
# compilers however append only one underscore.
# Precompiled mpi version will also not work !!!
#
# We found that mpich.1.2.1 and lam-6.5.X to lam-7.0.4 are stable
# mpich.1.2.1 was configured with
# ./configure -prefix=/usr/local/mpich_nodvdbg -fc="pgf77 -Mx,119,0x200000" \
# -f90="pgf90 " \
# --without-romio --without-mpe -opt=-O \
#
# lam was configured with the line
# ./configure -prefix /opt/libs/lam-7.0.4 --with-cflags=-O -with-fc=ifc \
# --with-f77flags=-O --without-romio
#
# please note that you might be able to use a lam or mpich version
# compiled with f77/g77, but then you need to add the following
# options: -Msecond_underscore (compilation) and -g77libs (linking)
#
# !!! Please do not send me any queries on how to install MPI, I will
# certainly not answer them !!!!
#=======================================================================
#-----------------------------------------------------------------------
# fortran linker for mpi: if you use LAM and compiled it with the options
# suggested above, you can use the following line
#-----------------------------------------------------------------------

FC=/home/maryam/intel/impi/3.2.2.006/bin64/mpiifort
FCL=$(FC)

#-----------------------------------------------------------------------
# additional options for CPP in parallel version (see also above):
# NGZhalf charge density reduced in Z direction
# wNGZhalf gamma point only reduced in Z direction
# scaLAPACK use scaLAPACK (usually slower on 100 Mbit Net)
#-----------------------------------------------------------------------

CPP = $(CPP_) -DMPI -DHOST=\"LinuxIFC\" -DIFC \
-Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=500 \
-DRPROMU_DGEMV -DRACCMU_DGEMV

#-----------------------------------------------------------------------
# location of SCALAPACK
# if you do not use SCALAPACK simply uncomment the line SCA
#-----------------------------------------------------------------------

BLACS=$(HOME)/archives/SCALAPACK/BLACS/
SCA_=$(HOME)/archives/SCALAPACK/SCALAPACK

SCA= $(SCA_)/libscalapack.a \
$(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a $(BLACS)/LIB/blacs_MPI-LINUX-0.a $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a

SCA=

#-----------------------------------------------------------------------
# libraries for mpi
#-----------------------------------------------------------------------

LIB = -L../vasp.4.lib -ldmy \
../vasp.4.lib/linpack_double.o $(LAPACK) \
$(SCA)$(BLAS)

# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller
#FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o

# fftw.3.0.1 is slighly faster and should be used if available
FFT3D = fftmpiw.o fftmpi_map.o fft3dlib.o /usr/local/lib/libfftw3.a

#-----------------------------------------------------------------------
# general rules and compile lines
#-----------------------------------------------------------------------
BASIC= symmetry.o symlib.o lattlib.o random.o

SOURCE= base.o mpi.o smart_allocate.o xml.o \
constant.o jacobi.o main_mpi.o scala.o \
asa.o lattice.o poscar.o ini.o setex.o radial.o \
pseudo.o mgrid.o mkpoints.o wave.o wave_mpi.o $(BASIC) \
nonl.o nonlr.o dfast.o choleski2.o \
mix.o charge.o xcgrad.o xcspin.o potex1.o potex2.o \
metagga.o constrmag.o pot.o cl_shift.o force.o dos.o elf.o \
tet.o hamil.o steep.o \
chain.o dyna.o relativistic.o LDApU.o sphpro.o paw.o us.o \
ebs.o wavpre.o wavpre_noio.o broyden.o \
dynbr.o rmm-diis.o reader.o writer.o tutor.o xml_writer.o \
brent.o stufak.o fileio.o opergrid.o stepver.o \
dipol.o xclib.o chgloc.o subrot.o optreal.o davidson.o \
edtest.o electron.o shm.o pardens.o paircorrection.o \
optics.o constr_cell_relax.o stm.o finite_diff.o \
elpol.o setlocalpp.o aedens.o

INC=

vasp: $(SOURCE) $(FFT3D) $(INC) main.o
rm -f vasp
$(FCL) -o vasp $(LINK) main.o $(SOURCE) $(FFT3D) $(LIB)
makeparam: $(SOURCE) $(FFT3D) makeparam.o main.F $(INC)
$(FCL) -o makeparam $(LINK) makeparam.o $(SOURCE) $(FFT3D) $(LIB)
zgemmtest: zgemmtest.o base.o random.o $(INC)
$(FCL) -o zgemmtest $(LINK) zgemmtest.o random.o base.o $(LIB)
dgemmtest: dgemmtest.o base.o random.o $(INC)
$(FCL) -o dgemmtest $(LINK) dgemmtest.o random.o base.o $(LIB)
ffttest: base.o smart_allocate.o mpi.o mgrid.o random.o ffttest.o $(FFT3D) $(INC)
$(FCL) -o ffttest $(LINK) ffttest.o mpi.o mgrid.o random.o smart_allocate.o base.o $(FFT3D) $(LIB)
kpoints: $(SOURCE) $(FFT3D) makekpoints.o main.F $(INC)
$(FCL) -o kpoints $(LINK) makekpoints.o $(SOURCE) $(FFT3D) $(LIB)

clean:
-rm -f *.g *.f *.o *.L *.mod ; touch *.F

main.o: main$(SUFFIX)
$(FC) $(FFLAGS)$(DEBUG) $(INCS) -c main$(SUFFIX)
xcgrad.o: xcgrad$(SUFFIX)
$(FC) $(FFLAGS) $(INLINE) $(INCS) -c xcgrad$(SUFFIX)
xcspin.o: xcspin$(SUFFIX)
$(FC) $(FFLAGS) $(INLINE) $(INCS) -c xcspin$(SUFFIX)

makeparam.o: makeparam$(SUFFIX)
$(FC) $(FFLAGS)$(DEBUG) $(INCS) -c makeparam$(SUFFIX)

makeparam$(SUFFIX): makeparam.F main.F
#
# MIND: I do not have a full dependency list for the include
# and MODULES: here are only the minimal basic dependencies
# if one strucuture is changed then touch_dep must be called
# with the corresponding name of the structure
#
base.o: base.inc base.F
mgrid.o: mgrid.inc mgrid.F
constant.o: constant.inc constant.F
lattice.o: lattice.inc lattice.F
setex.o: setexm.inc setex.F
pseudo.o: pseudo.inc pseudo.F
poscar.o: poscar.inc poscar.F
mkpoints.o: mkpoints.inc mkpoints.F
wave.o: wave.inc wave.F
nonl.o: nonl.inc nonl.F
nonlr.o: nonlr.inc nonlr.F

$(OBJ_HIGH):
$(CPP)
$(FC) $(FFLAGS) $(OFLAG_HIGH) $(INCS) -c $*$(SUFFIX)
$(OBJ_NOOPT):
$(CPP)
$(FC) $(FFLAGS) $(INCS) -c $*$(SUFFIX)

fft3dlib_f77.o: fft3dlib_f77.F
$(CPP)
$(F77) $(FFLAGS_F77) -c $*$(SUFFIX)

.F.o:
$(CPP)
$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)
.F$(SUFFIX):
$(CPP)
$(SUFFIX).o:
$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)

# special rules
#-----------------------------------------------------------------------
# these special rules are cummulative (that is once failed
# in one compiler version, stays in the list forever)
# -tpp5|6|7 P, PII-PIII, PIV
# -xW use SIMD (does not pay of on PII, since fft3d uses double prec)
# all other options do no affect the code performance since -O1 is used
#-----------------------------------------------------------------------

fft3dlib.o : fft3dlib.F
$(CPP)
$(FC) -FR -lowercase -O1 -tpp7 -xW -unroll0 -w95 -vec_report3 -c $*$(SUFFIX)
fft3dfurth.o : fft3dfurth.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

radial.o : radial.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

symlib.o : symlib.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

symmetry.o : symmetry.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

dynbr.o : dynbr.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

broyden.o : broyden.F
$(CPP)
$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)

us.o : us.F
$(CPP)
$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

wave.o : wave.F
$(CPP)
$(FC) -FR -lowercase -O0 -c $*$(SUFFIX)

LDApU.o : LDApU.F
$(CPP)
$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)
mpi.o : mpi.F
$(CPP)
$(FC) -FR -lowercase -O0 -c $*$(SUFFIX)
*******
Step 2. Settings in the INCAR file for parallelization

LPLANE=.TRUE.
NPAR=4 (# of our nodes is four)
NSIM=4

Step 3. Running of VASP in parallel mode

We want to run vasp by using 4 nodes of that server that each node has 2 quad-core CPU's. What we do in command line are as bellow:

3.1) making hostfile in /home/maryam path to specify nodes:
Compute-0-0
Compute-0-0
Compute-0-1
Compute-0-1
Compute-0-2
Compute-0-2
Compute-0-3
Compute-0-3

3.2) ssh to one node for example compute-0-3 then running vasp:
ssh compute-0-3
mpirun â€“hostfile /home/maryam/hostfile /home/maryam/vasp

3.3) Results of running:
Results written by vasp after running are as the following:
*********
running on 1 nodes
distr: one band on 1 nodes, 1 groups
running on 1 nodes
distr: one band on 1 nodes, 1 groups
vasp.4.6.36 17Feb09 complex
vasp.4.6.36 17Feb09 complex
POSCAR found : 1 types and 39 ions
POSCAR found : 1 types and 39 ions
LDA part: xc-table for Ceperly-Alder, standard interpolation
LDA part: xc-table for Ceperly-Alder, standard interpolation
running on 1 nodes
distr: one band on 1 nodes, 1 groups
running on 1 nodes
distr: one band on 1 nodes, 1 groups
vasp.4.6.36 17Feb09 complex
running on 1 nodes
distr: one band on 1 nodes, 1 groups
found WAVECAR, reading the header
found WAVECAR, reading the header
number of k-points has changed, file: 15 present: 25
trying to continue reading WAVECAR, but it might fail
WAVECAR: different cutoff or change in lattice found
number of k-points has changed, file: 15 present: 25
trying to continue reading WAVECAR, but it might fail
WAVECAR: different cutoff or change in lattice found
vasp.4.6.36 17Feb09 complex
running on 1 nodes
distr: one band on 1 nodes, 1 groups
vasp.4.6.36 17Feb09 complex
POSCAR, INCAR and KPOINTS ok, starting setup
POSCAR, INCAR and KPOINTS ok, starting setup
vasp.4.6.36 17Feb09 complex
WARNING: wrap around errors must be expected
WARNING: wrap around errors must be expected

..... each process is repeated 8 times!!!!!!!!!!!!!!
*********
It is found that one process is done simultaneously on all nodes. Perhaps, breaking of the process for submitting jobs to different nodes is not done!

Something is wrong, but we don't know its source.
I will be very grateful for any help, any idea.

#2 Post by **admin** » Fri Dec 11, 2009 3:59 pm

please check with your sys admin how to exactly start parallel jobs on your system,
possibly you have to
-- set the nr of nodes,... in the header of your job-script
-- and the number of nodes should be given when you start mpirun by eg
mpirun -np (# of nodes) vasp