Last week I compiled the GPU version of VASP with this Makefile:
# Precompiler options
CPP_OPTIONS= -DMPI -DHOST=\"Lichteb-5.41-gpu-half\" -DIFC \
-DNGXhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=8000 -Duse_collective \
-DnoAugXCmeta -Duse_bse_te \
-Duse_shmem -Dkind8
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = mpiifort
FCL = mpiifort -mkl -lstdc++
FREE = -free -names lowercase
FFLAGS = -assume byterecl
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
MKL_PATH = $(MKLROOT)/lib/intel64
BLAS =
LAPACK =
BLACS = -lmkl_blacs_intelmpi_lp64
SCALAPACK = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)
OBJECTS = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o \
/home/km87kymy/fft-intel/libfftw3xf_intel.a
INCS =-I$(MKLROOT)/include/fftw
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS)
#OBJECTS_O1 += fft3dfurth.o fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28
OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o \
/home/km87kymy/fft-intel/libfftw3xf_intel.a
CUDA_ROOT := /shared/apps/cuda/7.5
NVCC := $(CUDA_ROOT)/bin/nvcc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -L$(CUDA_ROOT)/lib64/stubs -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" -gencode=arch=compute_35,code=\"sm_35,compute_35\"
MPI_INC =/shared/apps/intel/2016u2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include
Output of the run:
creating 32 CUFFT plans with grid size 126 x 126 x 126...
CUFFT Error in cuda_fft.cu, line 98: CUFFT_ALLOC_FAILED
Failed to create CUFFT plan!
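As a rough sanity check on the plan request in the log above, the raw data size of the grids can be estimated with a few lines of Python. This is only a minimal sketch: the 16 bytes per grid point (double-precision complex) is an assumption that does not come from the log, and the helper names are hypothetical. cuFFT also allocates its own work area per plan, whose size is chosen by the library, so the real demand on GPU memory is likely higher than this figure.

```python
# Rough size of the FFT data behind the failing cuFFT plan request.
# Assumption (not from the log): each grid point is a double-precision
# complex number, i.e. 16 bytes.
BYTES_PER_POINT = 16


def grid_bytes(nx, ny, nz, bytes_per_point=BYTES_PER_POINT):
    """Bytes needed to hold one nx x ny x nz grid."""
    return nx * ny * nz * bytes_per_point


def to_mib(n_bytes):
    """Convert bytes to MiB (2**20 bytes)."""
    return n_bytes / 2**20


one_grid = grid_bytes(126, 126, 126)  # grid size taken from the log line
all_grids = 32 * one_grid             # 32 plans, also from the log line

print(f"one grid : {to_mib(one_grid):.1f} MiB")
print(f"32 grids : {to_mib(all_grids):.1f} MiB")
```

Comparing the second number with the free memory that `nvidia-smi` reports for the card gives a first hint whether the allocation failure is simply a matter of the GPU running out of memory, for example when several MPI ranks share one device.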
Job summary:
Exited with exit code 255.
Resource usage summary:
CPU time : 35.24 sec.
Max Memory : 195 MB
Average Memory : 195.00 MB
Total Requested Memory : 28016.00 MB
Delta Memory : 27821.00 MB
(Delta: the difference between total requested memory and actual max usage.)
Max Processes : 8
Max Threads : 9
If I reduce the system size (40 atoms, 2x2x2 k-points), it runs without errors, but very slowly: about 10 times slower than the CPU version. I even aborted the 4x4x4 KPOINTS run because it was just too slow. Playing around with NSIM doesn't seem to change much.
My guess is that it has something to do with the compilation. I would like to experiment with the values given in the first part of the Makefile, the CPP_OPTIONS (-DCACHE_SIZE=4000, -DMPI_BLOCK=8000), but I have no idea which values to plug in.
Help would be very much appreciated, thank you,
Kai Meyer