
VASP on intel em64t and MKL

Posted: Mon Nov 03, 2008 9:54 am
by simoneca
Dear all,

we've just succeeded in compiling a parallel version of VASP and want to share the experience. Suggestions on how to improve our installation scheme are of course welcome. :)

*system*

- 8 nodes, 2 x Intel(R) Xeon(R) QuadCore (em64t) each, so 64 cores overall
- OSCARized CentOS 5.2
- Intel Fortran and C/C++ compilers 10.1
- Intel Math Kernel Libraries 10.0
- OpenMPI 1.2.5, icc compiled
- Vasp 4.6

Our starting point is, of course, the corresponding makefile provided by the VASP team together with the source code.

installation

Installation of the Intel compilers/libraries is straightforward and won't be detailed here. The same is true for OpenMPI, which we compiled with the Intel C/C++ compiler to avoid possible conflicts. The source comes with a configure script, which allows one to easily choose the compilers to be used, e.g.

Code: Select all

./configure CC=icc CXX=icc F77=ifort F90=ifort
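After configure, the usual build and install sequence applies. A minimal sketch (the --prefix is our example, chosen to match the module paths shown below):

Code: Select all

./configure CC=icc CXX=icc F77=ifort F90=ifort --prefix=/usr/lib64/openmpi/1.2.5-icc
make all install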
Environment variables are handled with the module/switcher tools available with OSCAR (alternatively, one can set them manually). For example, a module for the Intel environment has been created with the following settings:

Code: Select all

$module show compile/intel
-------------------------------------------------------------------
/opt/env-switcher/share/env-switcher/compile/intel:

module-whatis   Setup Intel-suite in your environment.
conflict             compilers
prepend-path     INCLUDE /prg/intel/mkl/10.0.4.023/include
prepend-path     CPATH /prg/intel/mkl/10.0.4.023/include
prepend-path     FPATH /prg/intel/mkl/10.0.4.023/include
prepend-path     PATH /prg/intel/fce/10.1.018/bin/
prepend-path     PATH /prg/intel/cce/10.1.018/bin/
prepend-path     PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     NLSPATH /prg/intel/fce/10.1.018/lib/locale/en_US/%N
prepend-path     LD_LIBRARY_PATH /prg/intel/fce/10.1.018/lib
prepend-path     LD_LIBRARY_PATH /prg/intel/cce/10.1.018/lib
prepend-path     LD_LIBRARY_PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     LIBRARY_PATH /prg/intel/mkl/10.0.4.023/lib/em64t/
prepend-path     MANPATH /prg/intel/cce/10.1.018/man/
prepend-path     MANPATH /prg/intel/fce/10.1.018/man/
prepend-path     MANPATH /prg/intel/mkl/10.0.4.023/man/
-------------------------------------------------------------------
and loaded for Intel compilations. The same holds for the OpenMPI module:

Code: Select all

$module show mpi/openmpi-1.2.5-icc
-------------------------------------------------------------------
/opt/env-switcher/share/env-switcher/mpi/openmpi-1.2.5-icc:

module-whatis    Sets up the OpenMPI-icc environment for an OSCAR cluster.
conflict             mpi
prepend-path     PATH /usr/lib64/openmpi/1.2.5-icc/bin/
prepend-path     LD_LIBRARY_PATH /usr/lib64/openmpi/1.2.5-icc/lib
prepend-path     MANPATH /usr/lib64/openmpi/1.2.5-icc/share/man
-------------------------------------------------------------------
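With both modules defined, a typical build session simply loads them before compiling (a minimal example using the standard module commands):

Code: Select all

$ module load compile/intel
$ module load mpi/openmpi-1.2.5-icc
$ which ifort mpif90     # both should now resolve to the Intel / OpenMPI-icc paths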
VASP.4.lib is easy: just set

Code: Select all

FC=ifort
in the makefile.
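
Building the library is then just a make in its directory (a sketch, assuming the standard VASP 4.6 source layout; this produces libdmy.a and the linpack objects referenced by the main makefile's LIB line):

Code: Select all

cd vasp.4.lib
make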

For VASP 4.6, starting from the makefile.linux_ifc_ath shipped with VASP, we made the following modifications.

compiler flags:

Code: Select all

FFLAGS =  -FR -lowercase -assu byterecl
OFLAG=-O3 -axW
OFLAG2=-O1 -axW   # used for the special cases below
BLAS/LAPACK libraries:

Code: Select all

BLAS= -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lm -liomp5 -lpthread
LAPACK=
This is the pure MKL layered model (MKL 10 style): BLAS/LAPACK libraries with parallel MKL supporting the LP64 interface. -lm is for the FFT interface. The RTL library iomp5 is preferred over the traditional guide since it fully supports both threaded and non-threaded user applications with both Intel and GNU compilers. Just make sure that -lpthread appears at the end of the link line.
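
As a quick sanity check (our suggestion, not part of the build itself), one can later verify that the binary resolves the threaded MKL layers and iomp5 at run time, assuming MKL is linked dynamically:

Code: Select all

$ ldd vasp | grep -E 'mkl|iomp5'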

mpi wrapper

In the MPI section set the mpi wrapper

Code: Select all

FC=mpif90
FCL=$(FC)
and make sure, if you have more than one MPI installation, that mpif90 corresponds to the desired one (we simply load the openmpi-icc module above).
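
A quick way to check which compiler the wrapper actually invokes (OpenMPI's wrappers support the --showme option):

Code: Select all

$ which mpif90
$ mpif90 --showme      # prints the underlying compiler and flags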

FFT

Use fftmpi.o with the fft3dlib of Juergen Furthmueller:

Code: Select all

FFT3D   = fftmpi.o fftmpi_map.o fft3dlib.o
FFTs represent a tricky point. Despite many efforts, we were able to use neither the FFTW from http://www.fftw.org nor the FFTW wrapper to the MKL library. (Note that in both cases fftw3.f has to be copied into the vasp directory, or -I/<include FFTW>/dir has to be added to the compile line; according to the MKL guide, the native fftw3.f has to be used when compiling user applications.) Compilation succeeded in both cases, but when running VASP the following errors typically appeared (in both cases):

Code: Select all

[nodo08:05023] Failing at address: (nil)
[nodo08:05023] [ 0] /lib64/libpthread.so.0 [0x2b7188e2ae80]
[nodo08:05023] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0x8) [0x6852b4]
[nodo08:05023] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x684f89]
[nodo08:05023] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x66ee13]
[nodo08:05023] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x66ec0c]
[nodo08:05023] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x42cbd5]
[nodo08:05023] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x415342]


At present we do not know where they come from.

special cases

Two special cases have to be added at the end of the makefile to compile the FFT routines at a lower optimization level (-O3 runs into trouble when running VASP):

Code: Select all

fftmpi.o : fftmpi.F
        $(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
fftmpi_map.o : fftmpi_map.F
        $(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)
In addition, the compile rule for fft3dlib is set to:

Code: Select all

$(FC) -FR -lowercase -O1       -xW -prefetch- -unroll0 -vec_report3 -c $*$(SUFFIX)
(Does the -e95 option set in the example makefiles work with other compilers? The point is that fft3dlib.F contains goto statements...)

Before running make, tell OpenMPI that you are using ifort, e.g. with a compile script like:

Code: Select all

#!/bin/bash

export OMPI_FC=ifort
export OMPI_F77=ifort

make
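For completeness, a hypothetical invocation (the script name compile.sh is our choice):

Code: Select all

$ chmod +x compile.sh
$ ./compile.sh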
We hope that our efforts will be useful to somebody during the installation process.
The makefile is the following:

Code: Select all

.SUFFIXES: .inc .f .f90 .F
#-----------------------------------------------------------------------
# Makefile for Intel Fortran compiler for Athlon XP systems
#
# The makefile was tested only under Linux on Intel platforms
# (Suse 5.3- Suse 9.0)
# the following compiler versions have been tested
# 5.0, 6.0, 7.0 and 7.1 (some 8.0 versions seem to fail compiling the code)
# presently we recommend version 7.1 or 7.0, since these
# releases have been used to compile the present code versions
#
# it might be required to change some of the library paths, since
# Linux installations vary a lot
# Hence check ***ALL**** options in this makefile very carefully
#-----------------------------------------------------------------------
#
# BLAS must be installed on the machine
# there are several options:
# 1) very slow but works:
#   retrieve the lapackage from ftp.netlib.org
#   and compile the blas routines (BLAS/SRC directory)
#   please use g77 or f77 for the compilation. When I tried to
#   use pgf77 or pgf90 for BLAS, VASP hung when calling
#   ZHEEV (however this was with lapack 1.1; now I use lapack 2.0)
# 2) most desirable: get an optimized BLAS 
#
# the two most reliable packages around are presently:
# 3a) Intels own optimised BLAS (PIII, P4, Itanium)
#     http://developer.intel.com/software/products/mkl/
#   this is really excellent when you use Intel CPU's
#
# 3b) or obtain the atlas based BLAS routines
#     http://math-atlas.sourceforge.net/
#   you certainly need atlas on the Athlon, since the  mkl
#   routines are not optimal on the Athlon.
#   If you want to use atlas based BLAS, check the lines around LIB=
#
# 3c) mindblowing fast SSE2 (4 GFlops on P4, 2.53 GHz)
#   Kazushige Goto's BLAS
#   http://www.cs.utexas.edu/users/kgoto/signup_first.html
# 
#-----------------------------------------------------------------------

# all CPP processed fortran files have the extension .f90
SUFFIX=.f90

#-----------------------------------------------------------------------
# fortran compiler and linker
#-----------------------------------------------------------------------
#FC=ifc 
# fortran linker
#FCL=$(FC)


#-----------------------------------------------------------------------
# whereis CPP ?? (I need CPP, can't use gcc with proper options)
# that's the location of gcc for SUSE 5.3
#
#  CPP_   =  /usr/lib/gcc-lib/i486-linux/2.7.2/cpp -P -C 
#
# that's probably the right line for some Red Hat distribution:
#
#  CPP_   =  /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/cpp -P -C
#
#  SUSE X.X, maybe some Red Hat distributions:

CPP_ =  ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional >$*$(SUFFIX)

#-----------------------------------------------------------------------
# possible options for CPP:
# NGXhalf             charge density   reduced in X direction
# wNGXhalf            gamma point only reduced in X direction
# avoidalloc          avoid ALLOCATE if possible
# IFC                 work around some IFC bugs
# CACHE_SIZE          1000 for PII,PIII, 5000 for Athlon, 8000-12000 P4
# RPROMU_DGEMV        use DGEMV instead of DGEMM in RPRO (depends on used BLAS)
# RACCMU_DGEMV        use DGEMV instead of DGEMM in RACC (depends on used BLAS)
# for Atlas  -DRPROMU_DGEMV is recommended
#-----------------------------------------------------------------------

CPP     = $(CPP_)  -DHOST=\"LinuxIFC_ath\" \
          -Dkind8 -DNGXhalf -DCACHE_SIZE=5000 -DPGF90 -Davoidalloc \
          -DMPI
#          -DRPROMU_DGEMV \

#-----------------------------------------------------------------------
# general fortran flags  (there must be a trailing blank on this line)
#-----------------------------------------------------------------------

FFLAGS =  -FR -lowercase -assu byterecl 

#-----------------------------------------------------------------------
# optimization
# we have tested whether higher optimisation improves performance
# -axK  SSE1 optimization, but also generates code executable on all machines.
#       xK improves performance somewhat on XP, and the 'a' is required in order
#       to run the code on older Athlons as well
# -xW   SSE2 optimization
# -axW  SSE2 optimization,  but also generate code executable on all mach.
# -tpp6 P3 optimization
# -tpp7 P4 optimization
#-----------------------------------------------------------------------
# -axW  Can  generate  specialized  code  paths for SSE2 and SSE instructions 
#       for Intel processors, and it can optimize for Intel
#       Pentium(R) 4 processors and Intel(R) Xeon(R) processors with SSE2.

OFLAG=-O3 -axW -tpp6
OFLAG=-O3 -axW      
# A lower optimization level seems to be necessary when compiling the parallel version,
# at least for the FFTs ("very serious problems the old and the new charge density differ")
OFLAG2=-O1 -axW      

OFLAG_HIGH = $(OFLAG)
OBJ_HIGH = 

OBJ_NOOPT = 
DEBUG  = -FR -O0
INLINE = $(OFLAG)


#-----------------------------------------------------------------------
# the following lines specify the position of BLAS  and LAPACK
# on Athlon, VASP works fastest with the Atlas library
# so that's what I recommend
#-----------------------------------------------------------------------

# Atlas based libraries
ATLASHOME= $(HOME)/archives/BLAS_OPT/ATLAS/lib/Linux_ATHLONXP_SSE1/
BLAS=   -L$(ATLASHOME)  -lf77blas -latlas

# use the mkl Intel libraries for p4 (www.intel.com)
# mkl.5.1
# set -DRPROMU_DGEMV  -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4  -lpthread

# mkl.5.2 requires also to -lguide library
# set -DRPROMU_DGEMV  -DRACCMU_DGEMV in the CPP lines
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lguide -lpthread

# even faster Kazushige Goto's BLAS
# http://www.cs.utexas.edu/users/kgoto/signup_first.html
#BLAS=  /opt/libs/libgoto/libgoto_p4_512-r0.6.so

# LAPACK, simplest use vasp.4.lib/lapack_double
#LAPACK= ../vasp.4.lib/lapack_double.o

# use atlas optimized part of lapack 
LAPACK= ../vasp.4.lib/lapack_atlas.o -llapack -lcblas

# use the mkl Intel lapack
#LAPACK= -lmkl_lapack

#    MKL pure layered model (10 style). BLAS/LAPACK libraries
#    with parallel MKL supporting LP64 interface. -lm if for FFT interface.
#    RTL library iomp5 is preferred over traditional guide since it fully
#    supports threaded and non-threaded user's applications with both
#    intel and gnu compilers
#    Make sure that -lpthread appears at the end of the link line.

BLAS= -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lm -liomp5 -lpthread
LAPACK= 

#-----------------------------------------------------------------------

#LIB  = -L../vasp.4.lib -ldmy \
#     ../vasp.4.lib/linpack_double.o $(LAPACK) \
#     $(BLAS)

# options for linking (for compiler version 6.X, 7.1) nothing is required
LINK    = 
# compiler version 7.0 generates some vector statements which are located
# in the svml library; add the LIBPATH and the library (just in case)
#LINK    =  -L/opt/intel/compiler70/ia32/lib/ -lsvml 

#-----------------------------------------------------------------------
# fft libraries:
# VASP.4.6 can use fftw.3.0.X (http://www.fftw.org)
# since this version is faster on P4 machines, we recommend to use it
#-----------------------------------------------------------------------

#FFT3D   = fft3dfurth.o fft3dlib.o
#FFT3D   = fftw3d.o fft3dlib.o   /opt/libs/fftw-3.0.1/lib/libfftw3.a


#=======================================================================
# MPI section, uncomment the following lines
# 
# one comment for users of mpich or lam:
# You must *not* compile mpi with g77/f77, because f77/g77
# appends *two* underscores to symbols that already contain an
# underscore (i.e. MPI_SEND becomes mpi_send__).  The pgf90/ifc
# compilers however append only one underscore.
# Precompiled mpi version will also not work !!!
#
# We found that mpich.1.2.1 and lam-6.5.X to lam-7.0.4 are stable
# mpich.1.2.1 was configured with 
#  ./configure -prefix=/usr/local/mpich_nodvdbg -fc="pgf77 -Mx,119,0x200000"  \
# -f90="pgf90 -Mx,119,0x200000" \
# --without-romio --without-mpe -opt=-O \
# 
# lam was configured with the line
#  ./configure  -prefix /opt/libs/lam-7.0.4 --with-cflags=-O -with-fc=ifc \
# --with-f77flags=-O --without-romio
# 
# please note that you might be able to use a lam or mpich version 
# compiled with f77/g77, but then you need to add the following
# options: -Msecond_underscore (compilation) and -g77libs (linking)
#
# !!! Please do not send me any queries on how to install MPI, I will
# certainly not answer them !!!!
#=======================================================================
#-----------------------------------------------------------------------
# fortran linker for mpi: if you use LAM and compiled it with the options
# suggested above,  you can use the following line
#-----------------------------------------------------------------------

FC=mpif90
FCL=$(FC)

#-----------------------------------------------------------------------
# additional options for CPP in parallel version (see also above):
# NGZhalf               charge density   reduced in Z direction
# wNGZhalf              gamma point only reduced in Z direction
# scaLAPACK             use scaLAPACK (usually slower on 100 Mbit Net)
# 1000 or 2000 are the optimal CACHE_SIZE for the parallel version
# and IFC on Athlon XP (gK)
#-----------------------------------------------------------------------

CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxIFC_ath\" -DIFC \
     -Dkind8 -DNGZhalf -DCACHE_SIZE=2000 -DPGF90 -Davoidalloc \
#     -DRPROMU_DGEMV

#-----------------------------------------------------------------------
# location of SCALAPACK
# if you do not use SCALAPACK simply uncomment the line SCA
#-----------------------------------------------------------------------

BLACS=$(HOME)/archives/SCALAPACK/BLACS/
SCA_=$(HOME)/archives/SCALAPACK/SCALAPACK

SCA= $(SCA_)/libscalapack.a  \
 $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a $(BLACS)/LIB/blacs_MPI-LINUX-0.a $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a

SCA=

#-----------------------------------------------------------------------
# libraries for mpi
#-----------------------------------------------------------------------

LIB     = -L../vasp.4.lib -ldmy  \
      ../vasp.4.lib/linpack_double.o $(LAPACK) \
      $(SCA) $(BLAS)

# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller
FFT3D   = fftmpi.o fftmpi_map.o fft3dlib.o 


# fftw.3.0.1 is slightly faster and should be used if available
# No way to use FFTW (copy /prg/fftw/include/fftw3.f in the dir or use -I)  
# FFT3D = fftmpiw.o fftmpi_map.o  fft3dlib.o   /prg/fftw/lib/libfftw3.a
#[nodo08:05111] [ 0] /lib64/libpthread.so.0 [0x2b47c7924e80]
#[nodo08:05111] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0xa) [0x68df72]
#[nodo08:05111] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x68c785]
#[nodo08:05111] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x674193]
#[nodo08:05111] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x673f8c]
#[nodo08:05111] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x431f55]
#[nodo08:05111] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x41a6c2]
#[nodo08:05111] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b47c8d558b4]
#[nodo08:05111] [ 8] /prg/Source/VASP/bin/vasp-orgw [0x41a5e9]


# No way to use FFTW with MKL wrappers (using both fftw3.f from MKL and the original one):
# FFT3D  = fftmpiw.o fftmpi_map.o fft3dlib.o  /prg/intel/mkl/10.0.4.023/lib/em64t/libfftw3xf_intel.a
# [nodo08:05023] Failing at address: (nil)
# [nodo08:05023] [ 0] /lib64/libpthread.so.0 [0x2b7188e2ae80]
# [nodo08:05023] [ 1] /prg/Source/VASP/bin/vasp-orgw(fftw_destroy_plan+0x8) [0x6852b4]
# [nodo08:05023] [ 2] /prg/Source/VASP/bin/vasp-orgw(dfftw_destroy_plan_+0x9) [0x684f89]
# [nodo08:05023] [ 3] /prg/Source/VASP/bin/vasp-orgw(fftbas_plan_+0x1ed) [0x66ee13]
# [nodo08:05023] [ 4] /prg/Source/VASP/bin/vasp-orgw(fftmakeplan_+0x1e) [0x66ec0c]
# [nodo08:05023] [ 5] /prg/Source/VASP/bin/vasp-orgw(MAIN__+0x17885) [0x42cbd5]
# [nodo08:05023] [ 6] /prg/Source/VASP/bin/vasp-orgw(main+0x2a) [0x415342]


#-----------------------------------------------------------------------
# general rules and compile lines
#-----------------------------------------------------------------------
BASIC=   symmetry.o symlib.o   lattlib.o  random.o   

SOURCE=  base.o     mpi.o      smart_allocate.o      xml.o  \
         constant.o jacobi.o   main_mpi.o  scala.o   \
         asa.o      lattice.o  poscar.o   ini.o      setex.o     radial.o  \
         pseudo.o   mgrid.o    mkpoints.o wave.o      wave_mpi.o  $(BASIC) \
         nonl.o     nonlr.o    dfast.o    choleski2.o    \
         mix.o      charge.o   xcgrad.o   xcspin.o    potex1.o   potex2.o  \
         metagga.o  constrmag.o pot.o      cl_shift.o force.o    dos.o      elf.o      \
         tet.o      hamil.o    steep.o    \
         chain.o    dyna.o     relativistic.o LDApU.o sphpro.o  paw.o   us.o \
         ebs.o      wavpre.o   wavpre_noio.o broyden.o \
         dynbr.o    rmm-diis.o reader.o   writer.o   tutor.o xml_writer.o \
         brent.o    stufak.o   fileio.o   opergrid.o stepver.o  \
         dipol.o    xclib.o    chgloc.o   subrot.o   optreal.o   davidson.o \
         edtest.o   electron.o shm.o      pardens.o  paircorrection.o \
         optics.o   constr_cell_relax.o   stm.o    finite_diff.o \
         elpol.o    setlocalpp.o aedens.o 
 
INC=

vasp: $(SOURCE) $(FFT3D) $(INC) main.o 
	rm -f vasp
	$(FCL) -o vasp $(LINK) main.o  $(SOURCE)   $(FFT3D) $(LIB) 
makeparam: $(SOURCE) $(FFT3D) makeparam.o main.F $(INC)
	$(FCL) -o makeparam  $(LINK) makeparam.o $(SOURCE) $(FFT3D) $(LIB)
zgemmtest: zgemmtest.o base.o random.o $(INC)
	$(FCL) -o zgemmtest $(LINK) zgemmtest.o random.o base.o $(LIB)
dgemmtest: dgemmtest.o base.o random.o $(INC)
	$(FCL) -o dgemmtest $(LINK) dgemmtest.o random.o base.o $(LIB) 
ffttest: base.o smart_allocate.o mpi.o mgrid.o random.o ffttest.o $(FFT3D) $(INC)
	$(FCL) -o ffttest $(LINK) ffttest.o mpi.o mgrid.o random.o smart_allocate.o base.o $(FFT3D) $(LIB)
kpoints: $(SOURCE) $(FFT3D) makekpoints.o main.F $(INC)
	$(FCL) -o kpoints $(LINK) makekpoints.o $(SOURCE) $(FFT3D) $(LIB)

clean:	
	-rm -f *.g *.f *.o *.L *.mod ; touch *.F

main.o: main$(SUFFIX)
	$(FC) $(FFLAGS)$(DEBUG)  $(INCS) -c main$(SUFFIX)
xcgrad.o: xcgrad$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE)  $(INCS) -c xcgrad$(SUFFIX)
xcspin.o: xcspin$(SUFFIX)
	$(FC) $(FFLAGS) $(INLINE)  $(INCS) -c xcspin$(SUFFIX)

makeparam.o: makeparam$(SUFFIX)
	$(FC) $(FFLAGS)$(DEBUG)  $(INCS) -c makeparam$(SUFFIX)

makeparam$(SUFFIX): makeparam.F main.F 
#
# MIND: I do not have a full dependency list for the include
# and MODULES: here are only the minimal basic dependencies
# if one structure is changed then touch_dep must be called
# with the corresponding name of the structure
#
base.o: base.inc base.F
mgrid.o: mgrid.inc mgrid.F
constant.o: constant.inc constant.F
lattice.o: lattice.inc lattice.F
setex.o: setexm.inc setex.F
pseudo.o: pseudo.inc pseudo.F
poscar.o: poscar.inc poscar.F
mkpoints.o: mkpoints.inc mkpoints.F
wave.o: wave.inc wave.F
nonl.o: nonl.inc nonl.F
nonlr.o: nonlr.inc nonlr.F

$(OBJ_HIGH):
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG_HIGH) $(INCS) -c $*$(SUFFIX)
$(OBJ_NOOPT):
	$(CPP)
	$(FC) $(FFLAGS) $(INCS) -c $*$(SUFFIX)

fft3dlib_f77.o: fft3dlib_f77.F
	$(CPP)
	$(F77) $(FFLAGS_F77) -c $*$(SUFFIX)

.F.o:
	$(CPP)
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)
.F$(SUFFIX):
	$(CPP)
$(SUFFIX).o:
	$(FC) $(FFLAGS) $(OFLAG) $(INCS) -c $*$(SUFFIX)

# special rules
#-----------------------------------------------------------------------
# -tpp5|6|7 P, PII-PIII, PIV
# -xW use SIMD (does not pay off on PII, since fft3d uses double prec)
# all other options do not affect the code performance since -O1 is used
#-----------------------------------------------------------------------

# Let's add these two special rules
fftmpi.o : fftmpi.F
	$(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)  
fftmpi_map.o : fftmpi_map.F
	$(FC) $(FFLAGS) $(OFLAG2) $(INCS) -c $*$(SUFFIX)  

fft3dlib.o : fft3dlib.F
	$(CPP)
	$(FC) -FR -lowercase -O1       -xW -prefetch- -unroll0 -vec_report3 -c $*$(SUFFIX)
#	$(FC) -FR -lowercase -e95 -vec_report3 -O1 -tpp6 -prefetch -unroll0 -c $*$(SUFFIX)
	$(CPP)
#	$(FC) -FR -lowercase -e95 -c $*$(SUFFIX)

lattlib.o: lattlib.F
	$(CPP)
	$(FC) -FR -lowercase      -c $*$(SUFFIX)
#	$(FC) -FR -lowercase -e95 -c $*$(SUFFIX)

radial.o : radial.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

symlib.o : symlib.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

symmetry.o : symmetry.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

dynbr.o : dynbr.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

us.o : us.F
	$(CPP)
	$(FC) -FR -lowercase -O1 -c $*$(SUFFIX)

broyden.o : broyden.F
	$(CPP)
	$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)

wave.o : wave.F
	$(CPP)
	$(FC) -FR -lowercase -O0 -c $*$(SUFFIX)

LDApU.o : LDApU.F
	$(CPP)
	$(FC) -FR -lowercase -O2 -c $*$(SUFFIX)
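
Once the binary is built, a parallel run goes through the MPI wrapper as usual (a sketch; adjust the process count and working directory to your setup):

Code: Select all

$ mpirun -np 8 ./vasp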
Good luck!

Best Regards

Rocco Martinazzo and Simone Casolo
Univ. Milan, Italy



VASP on intel em64t and MKL

Posted: Tue Jan 27, 2009 5:24 am
by jagladden
Dear Simoneca,

I am curious to know what kind of results your build yielded on your Intel dual quad-core nodes. I have invested some effort in benchmarking a Dell PE1950 box (dual Intel L5410 quad-cores) and have been disappointed with the scaling behavior.

For Hg.bench I get results roughly like this:

1 core = 45 Secs
2 core = 29 Secs
4 core = 23 Secs
8 core = 21 Secs

These are wall-clock times. As you can see, there is very little improvement when going from 4 to 8 cores on the same box.
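
For reference, those times work out to speedups of about 1.6 on 2 cores (45/29), 2.0 on 4 (45/23) and 2.1 on 8 (45/21), i.e. only about 27% parallel efficiency at 8 cores.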

Is this consistent with your results?

Interestingly, our existing cluster composed of four-year-old Dell SC1425s (dual single-core Xeons) actually scales better, even though communication is via Gig Ethernet rather than strictly shared memory.

Jim

VASP on intel em64t and MKL

Posted: Tue Feb 17, 2009 5:10 pm
by TMarques
Try running with 6 cores. For me it gives better performance than 8.
I also get worse performance with 4 cores than with 3. I haven't been able to figure out why yet; the code just seems to be bandwidth-starved, as another code runs better on 7 cores than on 8. These results are consistent, no matter how long the test is.

VASP on intel em64t and MKL

Posted: Wed Feb 18, 2009 11:55 pm
by pafell
This behavior is caused by the not-so-good Core 2 design. The main problem is the very slow memory-to-CPU connection.
On a single-CPU E5430 server we get the following improvement:
using 2 cores, we get 148% of single-core power,
using 3 cores, we get 208% of single-core power,
using 4 cores, we get 198% of single-core power.
So indeed the results are pretty much the same here.
Unfortunately those shiny quad-cores from AMD are much slower in single-CPU jobs, so here we find Core 2 better, although there are rumours that Intel's i7 is much faster.