CUFFT_ALLOC_FAILED
Posted: Thu Feb 25, 2016 5:52 pm
Hello dear VASP team,
Last week I compiled the GPU version of VASP with the Makefile shown in the first code block below.
However, when I try to run a calculation on one node (16 cores, 32 GB main memory, 2x NVIDIA Tesla K20X), I get a CUFFT_ALLOC_FAILED error for a system of 318 atoms (full message in the second code block below).
The error points to some kind of memory problem, although the CPU version runs the same system without problems, and the memory usage is small according to the LSF output file (third code block below).
If I reduce the system size (40 atoms, 2x2x2 k-points), it runs without errors, but very slowly: about 10 times slower than the CPU version. I even aborted the run with 4x4x4 KPOINTS because it was just too slow. Playing around with NSIM doesn't seem to change much.
My guess is that it has something to do with the compilation. I would like to experiment with the values set in the first part of the Makefile, the CPP_OPTIONS (-DCACHE_SIZE=4000, -DMPI_BLOCK=8000), but I have no idea which values to plug in.
Help would be very much appreciated, thank you,
Kai Meyer
Makefile:
# Precompiler options
CPP_OPTIONS= -DMPI -DHOST=\"Lichteb-5.41-gpu-half\" -DIFC \
-DNGXhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
-DMPI_BLOCK=8000 -Duse_collective \
-DnoAugXCmeta -Duse_bse_te \
-Duse_shmem -Dkind8
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = mpiifort
FCL = mpiifort -mkl -lstdc++
FREE = -free -names lowercase
FFLAGS = -assume byterecl
OFLAG = -O2
OFLAG_IN = $(OFLAG)
DEBUG = -O0
MKL_PATH = $(MKLROOT)/lib/intel64
BLAS =
LAPACK =
BLACS = -lmkl_blacs_intelmpi_lp64
SCALAPACK = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)
OBJECTS = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o \
/home/km87kymy/fft-intel/libfftw3xf_intel.a
INCS =-I$(MKLROOT)/include/fftw
LLIBS = $(SCALAPACK) $(LAPACK) $(BLAS)
#OBJECTS_O1 += fft3dfurth.o fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
#================================================
# GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28
OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o \
/home/km87kymy/fft-intel/libfftw3xf_intel.a
CUDA_ROOT := /shared/apps/cuda/7.5
NVCC := $(CUDA_ROOT)/bin/nvcc
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -L$(CUDA_ROOT)/lib64/stubs -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\" -gencode=arch=compute_35,code=\"sm_35,compute_35\"
MPI_INC =/shared/apps/intel/2016u2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include
Error message:
creating 32 CUFFT plans with grid size 126 x 126 x 126...
CUFFT Error in cuda_fft.cu, line 98: CUFFT_ALLOC_FAILED
Failed to create CUFFT plan!
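For a rough sense of scale, the grid size and plan count in the message above can be turned into a memory figure. This is only a back-of-the-envelope sketch: it assumes complex double-precision data (16 bytes per grid point) and one grid-sized work area per plan, neither of which is confirmed by the log. The function names are made up for illustration and are not part of VASP or cuFFT.

```python
# Back-of-the-envelope estimate; not taken from VASP or cuFFT output.
# Assumptions (hypothetical): complex double-precision data (16 bytes per
# grid point) and one grid-sized work area per cuFFT plan.

MIB = 1024 ** 2


def grid_bytes(nx: int, ny: int, nz: int, bytes_per_point: int = 16) -> int:
    """Bytes needed to hold one nx x ny x nz grid."""
    return nx * ny * nz * bytes_per_point


def plans_bytes(n_plans: int, nx: int, ny: int, nz: int,
                bytes_per_point: int = 16) -> int:
    """Bytes needed if every plan holds one grid-sized work area."""
    return n_plans * grid_bytes(nx, ny, nz, bytes_per_point)


# Values from the error message: 32 plans, 126 x 126 x 126 grid.
one_grid = grid_bytes(126, 126, 126)
all_plans = plans_bytes(32, 126, 126, 126)

print(f"one grid: {one_grid / MIB:.1f} MiB")
print(f"32 plans: {all_plans / MIB:.1f} MiB")
```

Under these assumptions one grid is about 30 MiB and 32 plans come to just under 1 GiB per process. A K20X has nominally 6 GB of device memory, so if several MPI ranks share one GPU, the plans alone could take a noticeable share of it. The real work-area size chosen by cuFFT may differ.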
LSF output:
Exited with exit code 255.
Resource usage summary:
CPU time : 35.24 sec.
Max Memory : 195 MB
Average Memory : 195.00 MB
Total Requested Memory : 28016.00 MB
Delta Memory : 27821.00 MB
(Delta: the difference between total requested memory and actual max usage.)
Max Processes : 8
Max Threads : 9