We have just succeeded in compiling a parallel version of VASP and would like to share the experience. Suggestions on how to improve our installation scheme are of course welcome.
*system*
- 8 nodes, 2 x Intel(R) Xeon(R) quad-core (em64t) each, i.e. 64 cores overall
- OSCARized CentOS 5.2
- Intel Fortran and C/C++ compilers 10.1
- Intel Math Kernel Libraries 10.0
- OpenMPI 1.2.5, icc compiled
- VASP 4.6
Our starting point is, of course, the corresponding makefile provided by the VASP team together with the source code.
installation
Installation of the Intel compilers/libraries is straightforward and won't be detailed here. The same is true for OpenMPI, which we compiled with the Intel C/C++ compiler to avoid possible conflicts. The source comes with a configure script that makes it easy to choose the compilers to be used, e.g.
Code: Select all
./configure CC=icc CXX=icc F77=ifort F90=ifort
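A minimal sketch of the full build (the install prefix is just our choice, taken from the module definition shown below):
Code: Select all
./configure CC=icc CXX=icc F77=ifort F90=ifort \
            --prefix=/usr/lib64/openmpi/1.2.5-icc
make all install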
The resulting environment is handled through modules (OSCAR's env-switcher). The Intel suite:
Code: Select all
$module show compile/intel
-------------------------------------------------------------------
/opt/env-switcher/share/env-switcher/compile/intel:
module-whatis   Setup Intel-suite in your environment.
conflict             compilers
prepend-path     INCLUDE /prg/intel/mkl/10.0.4.023/include
prepend-path     CPATH /prg/intel/mkl/10.0.4.023/include
prepend-path     FPATH /prg/intel/mkl/10.0.4.023/include
prepend-path     PATH /prg/intel/fce/10.1.018/bin/
prepend-path     PATH /prg/intel/cce/10.1.018/bin/
prepend-path     PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     NLSPATH /prg/intel/fce/10.1.018/lib/locale/en_US/%N
prepend-path     LD_LIBRARY_PATH /prg/intel/fce/10.1.018/lib
prepend-path     LD_LIBRARY_PATH /prg/intel/cce/10.1.018/lib
prepend-path     LD_LIBRARY_PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     LIBRARY_PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     MANPATH /prg/intel/cce/10.1.018/man/
prepend-path     MANPATH /prg/intel/fce/10.1.018/man/
prepend-path     MANPATH /prg/intel/mkl/10.0.4.023/man/
-------------------------------------------------------------------
and the corresponding OpenMPI module:
Code: Select all
$module show mpi/openmpi-1.2.5-icc
-------------------------------------------------------------------
/opt/env-switcher/share/env-switcher/mpi/openmpi-1.2.5-icc:
module-whatis    Sets up the OpenMPI-icc environment for an OSCAR cluster.
conflict             mpi
prepend-path     PATH /usr/lib64/openmpi/1.2.5-icc/bin/
prepend-path     LD_LIBRARY_PATH /usr/lib64/openmpi/1.2.5-icc/lib
prepend-path     MANPATH /usr/lib64/openmpi/1.2.5-icc/share/man
-------------------------------------------------------------------
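Both modules have to be active in the shell used for the build. Assuming the standard module command (on an OSCAR cluster the switcher may already set them as defaults):
Code: Select all
module load compile/intel
module load mpi/openmpi-1.2.5-icc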
The underlying Fortran compiler is
Code: Select all
FC=ifort
For VASP 4.6, starting from the makefile.linux_ifc_ath shipped with the source, we made the following modifications.
compiler flags:
Code: Select all
FFLAGS =  -FR -lowercase -assu byterecl
OFLAG=-O3 -axW
OFLAG2=-O1 -axW    # used for the special cases below
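As a quick sanity check that ifort 10.1 accepts this flag combination, one can compile a throwaway test file (the file name is ours):
Code: Select all
cat > flagtest.f90 <<'EOF'
program flagtest
  print *, 'flags ok'
end program flagtest
EOF
ifort -FR -lowercase -assu byterecl -O3 -axW flagtest.f90 -o flagtest && ./flagtest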
BLAS and LAPACK:
Code: Select all
BLAS= -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lm -liomp5 -lpthread
LAPACK=
This uses the pure layered model of MKL 10 with the parallel MKL and the LP64 interface; -lm is needed for the FFT interface. The iomp5 runtime library is preferred over the traditional guide library since it fully supports threaded and non-threaded applications with both the Intel and GNU compilers. Just make sure that -lpthread appears at the end of the link line.
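Once the binary is linked, a quick way to check that the layered MKL libraries and the iomp5 runtime actually resolve (run from the build directory, where the makefile puts the vasp executable) is
Code: Select all
ldd ./vasp | grep -E 'mkl|iomp5|pthread'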
mpi wrapper
In the MPI section of the makefile set the MPI wrapper:
Code: Select all
FC=mpif90
FCL=$(FC)
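It is worth checking which compiler the wrapper really invokes; OpenMPI's wrappers accept --showme, and OMPI_FC (used again in the compile script below) selects the underlying Fortran compiler:
Code: Select all
export OMPI_FC=ifort
mpif90 --showme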
FFT
Use fftmpi.o together with the fft3dlib of Juergen Furthmueller:
Code: Select all
FFT3D   = fftmpi.o fftmpi_map.o fft3dlib.o
We also tried FFTW (both the original library and the MKL wrappers, see the commented lines in the makefile below), but those binaries crashed with segmentation faults such as:
Code: Select all
[nodo08:05023] Failing at address: (nil)
[nodo08:05023] [ 0] /lib64/libpthread.so.0 [0x2b7188e2ae80]
[nodo08:05023] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0x8) [0x6852b4]
[nodo08:05023] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x684f89]
[nodo08:05023] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x66ee13]
[nodo08:05023] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x66ec0c]
[nodo08:05023] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x42cbd5]
[nodo08:05023] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x415342]
At present we do not know where these crashes come from.
special cases
Two special rules have to be added at the end of the makefile so that the FFT routines are compiled at a lower optimization level (-O3 leads to trouble when running VASP):
Code: Select all
fftmpi.o : fftmpi.F
        $(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
fftmpi_map.o : fftmpi_map.F
        $(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
and the fft3dlib.o rule itself is compiled with
Code: Select all
$(FC) -FR -lowercase -O1       -xW -prefetch- -unroll0 -vec_report3 -c $*$(SUFFIX)
Before running make, tell the OpenMPI wrappers that you are using ifort, e.g. by using a compile script like
Code: Select all
#!/bin/bash
export OMPI_FC=ifort
export OMPI_F77=ifort
make
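Once the build finishes, the parallel binary is launched with OpenMPI's mpirun; the core count and hostfile below are only an illustration:
Code: Select all
# 16 cores spread over 2 of our 8-core nodes; ./hosts lists the node names
mpirun -np 16 -hostfile ./hosts ./vasp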
The makefile is the following:
Code: Select all
.SUFFIXES: .inc .f .f90 .F
#-----------------------------------------------------------------------
# Makefile for Intel Fortran compiler for Athlon XP systems
#
# The makefile was tested only under Linux on Intel platforms
# (Suse 5.3- Suse 9.0)
# the following compiler versions have been tested
# 5.0, 6.0, 7.0 and 7.1 (some 8.0 versions seem to fail compiling the code)
# presently we recommend version 7.1 or 7.0, since these
# releases have been used to compile the present code versions
#
# it might be required to change some of the library paths, since
# LINUX installations vary a lot
# Hence check ***ALL**** options in this makefile very carefully
#-----------------------------------------------------------------------
#
# BLAS must be installed on the machine
# there are several options:
# 1) very slow but works:
#   retrieve the lapackage from ftp.netlib.org
#   and compile the blas routines (BLAS/SRC directory)
#   please use g77 or f77 for the compilation. When I tried to
#   use pgf77 or pgf90 for BLAS, VASP hang up when calling
#   ZHEEV  (however this was with lapack 1.1 now I use lapack 2.0)
# 2) most desirable: get an optimized BLAS
#
# the two most reliable packages around are presently:
# 3a) Intels own optimised BLAS (PIII, P4, Itanium)
#     http://developer.intel.com/software/products/mkl/
#   this is really excellent when you use Intel CPU's
#
# 3b) or obtain the atlas based BLAS routines
#     http://math-atlas.sourceforge.net/
#   you certainly need atlas on the Athlon, since the  mkl
#   routines are not optimal on the Athlon.
#   If you want to use atlas based BLAS, check the lines around LIB=
#
# 3c) mindblowing fast SSE2 (4 GFlops on P4, 2.53 GHz)
#   Kazushige Goto's BLAS
#   http://www.cs.utexas.edu/users/kgoto/signup_first.html
#
#-----------------------------------------------------------------------
# all CPP processed fortran files have the extension .f90
SUFFIX=.f90
#-----------------------------------------------------------------------
# fortran compiler and linker
#-----------------------------------------------------------------------
#FC=ifc
# fortran linker
#FCL=$(FC)
#-----------------------------------------------------------------------
# whereis CPP ?? (I need CPP, can't use gcc with proper options)
# that's the location of gcc for SUSE 5.3
#
#  CPP_   =  /usr/lib/gcc-lib/i486-linux/2.7.2/cpp -P -C
#
# that's probably the right line for some Red Hat distribution:
#
#  CPP_   =  /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/cpp -P -C
#
#  SUSE X.X, maybe some Red Hat distributions:
CPP_ =  ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional >$*$(SUFFIX)
#-----------------------------------------------------------------------
# possible options for CPP:
# NGXhalf             charge density   reduced in X direction
# wNGXhalf            gamma point only reduced in X direction
# avoidalloc          avoid ALLOCATE if possible
# IFC                 work around some IFC bugs
# CACHE_SIZE          1000 for PII,PIII, 5000 for Athlon, 8000-12000 P4
# RPROMU_DGEMV        use DGEMV instead of DGEMM in RPRO (depends on used BLAS)
# RACCMU_DGEMV        use DGEMV instead of DGEMM in RACC (depends on used BLAS)
# for Atlas  -DRPROMU_DGEMV is recommended
#-----------------------------------------------------------------------
CPP     = $(CPP_)  -DHOST=\"LinuxIFC_ath\" \
          -Dkind8 -DNGXhalf -DCACHE_SIZE=5000 -DPGF90 -Davoidalloc \
          -DMPI
#          -DRPROMU_DGEMV \
#-----------------------------------------------------------------------
# general fortran flags  (there must be a trailing blank on this line)
#-----------------------------------------------------------------------
FFLAGS =  -FR -lowercase -assu byterecl 
#-----------------------------------------------------------------------
# optimization
# we have tested whether higher optimisation improves performance
# -axK  SSE1 optimization,  but also generate code executable on all mach.
#       xK improves performance somewhat on XP, and a is required in order
#       to run the code on older Athlons as well
# -xW   SSE2 optimization
# -axW  SSE2 optimization,  but also generate code executable on all mach.
# -tpp6 P3 optimization
# -tpp7 P4 optimization
#-----------------------------------------------------------------------
# -axW  Can  generate  specialized  code  paths for SSE2 and SSE instructions
#       for Intel processors, and it can optimize for Intel
#       Pentium(R) 4 processors and Intel(R) Xeon(R) processors with SSE2.
OFLAG=-O3 -axW -tpp6
OFLAG=-O3 -axW
# A lower optimization level seems to be necessary when compiling the parallel version,
# at least for the FFTs (otherwise VASP stops with "very serious problems the old and the new charge density differ")
OFLAG2=-O1 -axW
OFLAG_HIGH = $(OFLAG)
OBJ_HIGH =
OBJ_NOOPT =
DEBUG  = -FR -O0
INLINE = $(OFLAG)
#-----------------------------------------------------------------------
# the following lines specify the position of BLAS  and LAPACK
# on Athlon, VASP works fastest with the Atlas library
# so that's what I recommend
#-----------------------------------------------------------------------
# Atlas based libraries
ATLASHOME= $(HOME)/archives/BLAS_OPT/ATLAS/lib/Linux_ATHLONXP_SSE1/
BLAS=   -L$(ATLASHOME)  -lf77blas -latlas
# use the mkl Intel libraries for p4 (www.intel.com)
# mkl.5.1
# set -DRPROMU_DGEMV  -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4  -lpthread
# mkl.5.2 also requires the -lguide library
# set -DRPROMU_DGEMV  -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lguide -lpthread
# even faster Kazushige Goto's BLAS
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
#BLAS=  /opt/libs/libgoto/libgoto_p4_512-r0.6.so
# LAPACK, simplest use vasp.4.lib/lapack_double
#LAPACK= ../vasp.4.lib/lapack_double.o
# use atlas optimized part of lapack
LAPACK= ../vasp.4.lib/lapack_atlas.o -llapack -lcblas
# use the mkl Intel lapack
#LAPACK= -lmkl_lapack
#    MKL pure layered model (MKL 10 style) BLAS/LAPACK libraries,
#    with the parallel MKL supporting the LP64 interface. -lm is for the FFT interface.
#    The iomp5 RTL library is preferred over the traditional guide since it fully
#    supports threaded and non-threaded applications with both the
#    Intel and GNU compilers.
#    Make sure that -lpthread appears at the end of the link line.
BLAS= -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lm -liomp5 -lpthread
LAPACK=
#-----------------------------------------------------------------------
#LIB  = -L../vasp.4.lib -ldmy \
#     ../vasp.4.lib/linpack_double.o $(LAPACK) \
#     $(BLAS)
# options for linking (for compiler version 6.X, 7.1) nothing is required
LINK    =
# compiler version 7.0 generates some vector statements which are located
# in the svml library, add the LIBPATH and the library (just in case)
#LINK    =  -L/opt/intel/compiler70/ia32/lib/ -lsvml
#-----------------------------------------------------------------------
# fft libraries:
# VASP.4.6 can use fftw.3.0.X (http://www.fftw.org)
# since this version is faster on P4 machines, we recommend to use it
#-----------------------------------------------------------------------
#FFT3D   = fft3dfurth.o fft3dlib.o
#FFT3D   = fftw3d.o fft3dlib.o   /opt/libs/fftw-3.0.1/lib/libfftw3.a
#=======================================================================
# MPI section, uncomment the following lines
#
# one comment for users of mpich or lam:
# You must *not* compile mpi with g77/f77, because f77/g77
# appends *two* underscores to symbols that contain already an
# underscore (i.e. MPI_SEND becomes mpi_send__).  The pgf90/ifc
# compilers however append only one underscore.
# Precompiled mpi version will also not work !!!
#
# We found that mpich.1.2.1 and lam-6.5.X to lam-7.0.4 are stable
# mpich.1.2.1 was configured with
#  ./configure -prefix=/usr/local/mpich_nodvdbg -fc="pgf77 -Mx,119,0x200000"  \
# -f90="pgf90 -Mx,119,0x200000" \
# --without-romio --without-mpe -opt=-O \
#
# lam was configured with the line
#  ./configure  -prefix /opt/libs/lam-7.0.4 --with-cflags=-O -with-fc=ifc \
# --with-f77flags=-O --without-romio
#
# please note that you might be able to use a lam or mpich version
# compiled with f77/g77, but then you need to add the following
# options: -Msecond_underscore (compilation) and -g77libs (linking)
#
# !!! Please do not send me any queries on how to install MPI, I will
# certainly not answer them !!!!
#=======================================================================
#-----------------------------------------------------------------------
# fortran linker for mpi: if you use LAM and compiled it with the options
# suggested above,  you can use the following line
#-----------------------------------------------------------------------
FC=mpif90
FCL=$(FC)
#-----------------------------------------------------------------------
# additional options for CPP in parallel version (see also above):
# NGZhalf               charge density   reduced in Z direction
# wNGZhalf              gamma point only reduced in Z direction
# scaLAPACK             use scaLAPACK (usually slower on 100 Mbit Net)
# 1000 or 2000 are the optimal CACHE_SIZE for the parallel version
# and IFC on Athlon XP (gK)
#-----------------------------------------------------------------------
CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxIFC_ath\" -DIFC \
     -Dkind8 -DNGZhalf -DCACHE_SIZE=2000 -DPGF90 -Davoidalloc \
#     -DRPROMU_DGEMV
#-----------------------------------------------------------------------
# location of SCALAPACK
# if you do not use SCALAPACK simply uncomment the line SCA
#-----------------------------------------------------------------------
BLACS=$(HOME)/archives/SCALAPACK/BLACS/
SCA_=$(HOME)/archives/SCALAPACK/SCALAPACK
SCA= $(SCA_)/libscalapack.a  \
 $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a $(BLACS)/LIB/blacs_MPI-LINUX-0.a $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a
SCA=
#-----------------------------------------------------------------------
# libraries for mpi
#-----------------------------------------------------------------------
LIB     = -L../vasp.4.lib -ldmy  \
      ../vasp.4.lib/linpack_double.o $(LAPACK) \
      $(SCA) $(BLAS)
# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller
FFT3D   = fftmpi.o fftmpi_map.o fft3dlib.o
# fftw.3.0.1 is slightly faster and should be used if available
# No way to use FFTW (copy /prg/fftw/include/fftw3.f in the dir or use -I):
# FFT3D = fftmpiw.o fftmpi_map.o  fft3dlib.o   /prg/fftw/lib/libfftw3.a
# [nodo08:05111] [ 0] /lib64/libpthread.so.0 [0x2b47c7924e80]
# [nodo08:05111] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0xa) [0x68df72]
# [nodo08:05111] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x68c785]
# [nodo08:05111] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x674193]
# [nodo08:05111] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x673f8c]
# [nodo08:05111] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x431f55]
# [nodo08:05111] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x41a6c2]
# [nodo08:05111] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b47c8d558b4]
# [nodo08:05111] [ 8] /prg/Source/VASP/bin/vasp-orgw [0x41a5e9]
# No way to use FFTW with the MKL wrappers (using both fftw.f from MKL and the original one):
# FFT3D  = fftmpiw.o fftmpi_map.o fft3dlib.o  /prg/intel/mkl/10.0.4.023/lib/em64t/libfftw3xf_intel.a
# [nodo08:05023] Failing at address: (nil)
# [nodo08:05023] [ 0] /lib64/libpthread.so.0 [0x2b7188e2ae80]
# [nodo08:05023] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0x8) [0x6852b4]
# [nodo08:05023] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x684f89]
# [nodo08:05023] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x66ee13]
# [nodo08:05023] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x66ec0c]
# [nodo08:05023] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x42cbd5]
# [nodo08:05023] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x415342]
#-----------------------------------------------------------------------
# general rules and compile lines
#-----------------------------------------------------------------------
BASIC=   symmetry.o symlib.o   lattlib.o  random.o
SOURCE=  base.o     mpi.o      smart_allocate.o      xml.o  \
         constant.o jacobi.o   main_mpi.o  scala.o   \
         asa.o      lattice.o  poscar.o   ini.o      setex.o     radial.o  \
         pseudo.o   mgrid.o    mkpoints.o wave.o      wave_mpi.o  $(BASIC) \
         nonl.o     nonlr.o    dfast.o    choleski2.o    \
         mix.o      charge.o   xcgrad.o   xcspin.o    potex1.o   potex2.o  \
         metagga.o  constrmag.o pot.o      cl_shift.o force.o    dos.o      elf.o      \
         tet.o      hamil.o    steep.o    \
         chain.o    dyna.o     relativistic.o LDApU.o sphpro.o  paw.o   us.o \
         ebs.o      wavpre.o   wavpre_noio.o broyden.o \
         dynbr.o    rmm-diis.o reader.o   writer.o   tutor.o xml_writer.o \
         brent.o    stufak.o   fileio.o   opergrid.o stepver.o  \
         dipol.o    xclib.o    chgloc.o   subrot.o   optreal.o   davidson.o \
         edtest.o   electron.o shm.o      pardens.o  paircorrection.o \
         optics.o   constr_cell_relax.o   stm.o    finite_diff.o \
         elpol.o    setlocalpp.o aedens.o

INC=
vasp: $(SOURCE) $(FFT3D) $(INC) main.o
	rm -f vasp
	$(FCL) -o vasp $(LINK) main.o  $(SOURCE)   $(FFT3D) $(LIB)
makeparam: $(SOURCE) $(FFT3D) makeparam.o main.F $(INC)
	$(FCL) -o makeparam  $(LINK) makeparam.o $(SOURCE) $(FFT3D) $(LIB)
zgemmtest: zgemmtest.o base.o random.o $(INC)
	$(FCL) -o zgemmtest $(LINK) zgemmtest.o random.o base.o $(LIB)
dgemmtest: dgemmtest.o base.o random.o $(INC)
	$(FCL) -o dgemmtest $(LINK) dgemmtest.o random.o base.o $(LIB)
ffttest: base.o smart_allocate.o mpi.o mgrid.o random.o ffttest.o $(FFT3D) $(INC)
	$(FCL) -o ffttest $(LINK) ffttest.o mpi.o mgrid.o random.o smart_allocate.o base.o $(FFT3D) $(LIB)
kpoints: $(SOURCE) $(FFT3D) makekpoints.o main.F $(INC)
	$(FCL) -o kpoints $(LINK) makekpoints.o $(SOURCE) $(FFT3D) $(LIB)
clean:
	-rm -f *.g *.f *.o *.L *.mod ; touch *.F
main.o: main$(SUFFIX)
	$(FC) $(FFLAGS) $(DEBUG)  $(INCS) -c main$(SUFFIX)
xcgrad.o: xcgrad$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE)  $(INCS) -c xcgrad$(SUFFIX)
xcspin.o: xcspin$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE)  $(INCS) -c xcspin$(SUFFIX)
makeparam.o: makeparam$(SUFFIX)
	$(FC) $(FFLAGS) $(DEBUG)  $(INCS) -c makeparam$(SUFFIX)
makeparam$(SUFFIX): makeparam.F main.F
#
# MIND: I do not have a full dependency list for the include
# and MODULES: here are only the minimal basic dependencies
# if one structure is changed then touch_dep must be called
# with the corresponding name of the structure
#
base.o: base.inc base.F
mgrid.o: mgrid.inc mgrid.F
constant.o: constant.inc constant.F
lattice.o: lattice.inc lattice.F
setex.o: setexm.inc setex.F
pseudo.o: pseudo.inc pseudo.F
poscar.o: poscar.inc poscar.F
mkpoints.o: mkpoints.inc mkpoints.F
wave.o: wave.inc wave.F
nonl.o: nonl.inc nonl.F
nonlr.o: nonlr.inc nonlr.F
$(OBJ_HIGH):
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG_HIGH) $(INCS) -c $*$(SUFFIX)
$(OBJ_NOOPT):
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -c $*$(SUFFIX)
fft3dlib_f77.o: fft3dlib_f77.F
	$(CPP)
	$(F77) $(FFLAGS_F77) -c $*$(SUFFIX)
.F.o:
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)
.F$(SUFFIX):
	$(CPP)
$(SUFFIX).o:
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)
# special rules
#-----------------------------------------------------------------------
# -tpp5|6|7 P, PII-PIII, PIV
# -xW use SIMD (does not pay off on PII, since fft3d uses double prec)
# all other options do not affect the code performance since -O1 is used
#-----------------------------------------------------------------------
# Let's add these two special rules
fftmpi.o : fftmpi.F
	$(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
fftmpi_map.o : fftmpi_map.F
	$(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
fft3dlib.o : fft3dlib.F
	$(CPP)
	$(FC) -FR -lowercase -O1       -xW -prefetch- -unroll0 -vec_report3 -c $*$(SUFFIX)
#	$(FC) -FR -lowercase -e95 -vec_report3 -O1 -tpp6 -prefetch -unroll0 -c $*$(SUFFIX)
fft3dfurth.o : fft3dfurth.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
#	$(FC) -FR -lowercase -e95 -c $*$(SUFFIX)
lattlib.o: lattlib.F
	$(CPP)
	$(FC) -FR -lowercase      -c $*$(SUFFIX)
#	$(FC) -FR -lowercase -e95 -c $*$(SUFFIX)
radial.o : radial.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
symlib.o : symlib.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
symmetry.o : symmetry.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
dynbr.o : dynbr.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
us.o : us.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)
broyden.o : broyden.F
	$(CPP)
	$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)
wave.o : wave.F
	$(CPP)
	$(FC) -FR -lowercase -O0 -c $*$(SUFFIX)
LDApU.o : LDApU.F
	$(CPP)
	$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)
Best Regards
Rocco Martinazzo and Simone Casolo
Univ. Milan, Italy