My Community

Posted: **Thu Feb 25, 2016 5:52 pm**

Hello dear VASP team,

last week I compiled the GPU version of VASP with this Makefile:

# Precompiler options
CPP_OPTIONS= -DMPI -DHOST=\"Lichteb-5.41-gpu-half\" -DIFC \
             -DNGXhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc \
             -DMPI_BLOCK=8000 -Duse_collective \
             -DnoAugXCmeta -Duse_bse_te \
             -Duse_shmem -Dkind8

CPP        = fpp -f_com=no -free -w0  $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)

FC         = mpiifort
FCL        = mpiifort -mkl -lstdc++

FREE       = -free -names lowercase

FFLAGS     = -assume byterecl
OFLAG      = -O2
OFLAG_IN   = $(OFLAG)
DEBUG      = -O0

MKL_PATH   = $(MKLROOT)/lib/intel64
BLAS       =
LAPACK     =
BLACS      = -lmkl_blacs_intelmpi_lp64
SCALAPACK  = $(MKL_PATH)/libmkl_scalapack_lp64.a $(BLACS)

OBJECTS    = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d.o \
             /home/km87kymy/fft-intel/libfftw3xf_intel.a

INCS       =-I$(MKLROOT)/include/fftw

LLIBS      = $(SCALAPACK) $(LAPACK) $(BLAS)


#OBJECTS_O1 += fft3dfurth.o fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O1 += fftw3d.o fftmpi.o fftmpiw.o
OBJECTS_O2 += fft3dlib.o

# For what used to be vasp.5.lib
CPP_LIB    = $(CPP)
FC_LIB     = $(FC)
CC_LIB     = icc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB   = $(FREE)

OBJECTS_LIB= linpack_double.o getshmem.o

# Normally no need to change this
SRCDIR     = ../../src
BINDIR     = ../../bin

#================================================
# GPU Stuff

CPP_GPU    = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28

OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o \
              /home/km87kymy/fft-intel/libfftw3xf_intel.a

CUDA_ROOT  := /shared/apps/cuda/7.5
NVCC       := $(CUDA_ROOT)/bin/nvcc
CUDA_LIB   := -L$(CUDA_ROOT)/lib64 -L$(CUDA_ROOT)/lib64/stubs -lnvToolsExt -lcudart -lcuda -lcufft -lcublas

GENCODE_ARCH    := -gencode=arch=compute_30,code=\"sm_30,compute_30\" -gencode=arch=compute_35,code=\"sm_35,compute_35\"

MPI_INC    =/shared/apps/intel/2016u2/compilers_and_libraries_2016.2.181/linux/mpi/intel64/include

However, when I try to run a calculation on one node (16 cores, 32 GByte main memory, 2x “NVIDIA® Tesla™ K20X”), I get the following error message for a system of 318 atoms:

creating 32 CUFFT plans with grid size 126 x 126 x 126...

CUFFT Error in cuda_fft.cu, line 98: CUFFT_ALLOC_FAILED
 Failed to create CUFFT plan!

This points to some kind of memory problem, although the CPU version runs the same system without problems. The memory usage reported in the LSF output file is also small:

Exited with exit code 255.

Resource usage summary:

    CPU time :               35.24 sec.
    Max Memory :             195 MB
    Average Memory :         195.00 MB
    Total Requested Memory : 28016.00 MB
    Delta Memory :           27821.00 MB
    (Delta: the difference between total requested memory and actual max usage.)
    Max Processes :          8
    Max Threads :            9

If I reduce the system size (40 atoms, 2x2x2 k-points), it runs without errors, but very slowly: about 10 times slower than the CPU version. I even aborted the run with 4x4x4 KPOINTS because it was just too slow. Playing around with NSIM doesn't seem to change much.

My guess is that it has something to do with the compilation. I would like to experiment with the values given in the first part of the Makefile, the CPP_OPTIONS (-DCACHE_SIZE=4000, -DMPI_BLOCK=8000), but I have no idea which values to plug in.

Help would be very much appreciated, thank you,
Kai Meyer

Posted: **Thu Dec 21, 2017 3:29 am**

Hi Kai,

I have run into the same issue. Have you solved it? Thanks.

Yecheng Zhou

Posted: **Wed Jun 13, 2018 4:19 pm**

Hello, I have the same issue with VASP 5.4.4 and CUDA 9.1.

Is there any hint or solution to this issue?

Thanks in advance

Posted: **Thu Jul 11, 2019 12:35 am**

You can try recompiling the source after adding the following to the GPU section:

#GPU Stuff
CPP_GPU = -DCUDA_GPU -DRPROMU_CPROJ_OVERLAP -DUSE_PINNED_MEMORY -DCUFFT_MIN=28 -UscaLAPACK
OBJECTS_GPU = fftmpiw.o fftmpi_map.o fft3dlib.o fftw3d_gpu.o fftmpiw_gpu.o
CC = icc
CXX = icpc
CFLAGS = -fPIC -DADD_ -Wall -qopenmp -DMAGMA_WITH_MKL -DMAGMA_SETAFFINITY -DGPUSHMEM=300 -DHAVE_CUBLAS

CUDA_ROOT := /usr/cuda
NVCC := $(CUDA_ROOT)/bin/nvcc -ccbin=icc -std=c++11
CUDA_LIB := -L$(CUDA_ROOT)/lib64 -lnvToolsExt -lcudart -lcuda -lcufft -lcublas
GENCODE_ARCH := -gencode=arch=compute_30,code=\"sm_30,compute_30\"
MPI_INC = $(I_MPI_ROOT)/intel64/include

Posted: **Mon Nov 25, 2019 5:10 pm**

The CPP_OPTIONS (-DCACHE_SIZE=4000, -DMPI_BLOCK=8000) should be irrelevant to your problem.

You can set them to other values, for example 1024 or 2048, and recompile to check the effect on performance.
In my tests, they had only a small influence on performance.
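
If you want to try several values, the CPP_OPTIONS line for each build can be generated rather than edited by hand. This is only a rough Python sketch: the helper `with_define` is a made-up name, and the option string is a shortened copy of the one from the first post. Each variant still has to be pasted into the makefile and compiled separately.

```python
import re

def with_define(cpp_options: str, name: str, value: int) -> str:
    """Return cpp_options with -D<name>=<old> replaced by -D<name>=<value>.

    Hypothetical helper for illustration; it only edits a string.
    """
    pattern = re.compile(r"-D" + re.escape(name) + r"=\S+")
    if not pattern.search(cpp_options):
        raise ValueError(f"-D{name}=... not found in CPP_OPTIONS")
    return pattern.sub(f"-D{name}={value}", cpp_options)

# Shortened copy of the CPP_OPTIONS from the original makefile
base = ("-DNGXhalf -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc "
        "-DMPI_BLOCK=8000 -Duse_collective")

# One CPP_OPTIONS line per CACHE_SIZE value to be tested
variants = [with_define(base, "CACHE_SIZE", v) for v in (1024, 2048, 4000)]
for line in variants:
    print("CPP_OPTIONS =", line)
```

The same helper works for MPI_BLOCK by passing `"MPI_BLOCK"` as the name.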

There are many makefile.include files on the website for Intel CPUs with GPUs. You can pick one of them.

I do not have access to a system with Intel CPUs and GPUs. I can only access IBM POWER9 systems with V100 GPUs.

According to my tests, MAGMA has a very large influence on the speed of GPU VASP. I recommend installing it.

Your problem seems to be related to insufficient memory, since the calculation runs without errors when the system size is smaller.
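
To get a feeling for the numbers, here is a back-of-the-envelope estimate in Python. The plan count and grid size are taken from the error log in the first post. Everything else is an assumption for illustration: double-precision complex data (16 bytes per grid point) and one grid-sized buffer per plan. The real work-area size that cuFFT allocates depends on the plan and is not known here.

```python
def grid_bytes(nx: int, ny: int, nz: int, bytes_per_element: int = 16) -> int:
    """Size of one FFT grid in bytes.

    16 bytes per element assumes one double-precision complex number.
    """
    return nx * ny * nz * bytes_per_element

N_PLANS = 32             # from the log: "creating 32 CUFFT plans"
GRID = (126, 126, 126)   # from the log: "grid size 126 x 126 x 126"

one_grid = grid_bytes(*GRID)
# Assumption: each plan needs one buffer the size of the grid
all_plans = N_PLANS * one_grid

print(f"one grid : {one_grid / 2**20:.1f} MiB")
print(f"{N_PLANS} plans : {all_plans / 2**30:.2f} GiB per MPI rank")
```

Under these assumptions one grid is about 30 MiB and 32 of them come to just under 1 GiB. That amount is needed per MPI rank, so several ranks sharing one GPU multiply it accordingly, on top of everything else stored on the device.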

If the compilation itself is fine, you can reduce the number of MPI ranks per GPU.
Fewer MPI ranks require less memory.
For example, you can use as many CPU cores as there are GPUs, that is, only one MPI rank per GPU.
If memory is still insufficient, KPAR can be decreased to 1.
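
To see how much device memory is left for each rank, the output of `nvidia-smi` can be parsed. The sketch below is in Python and uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`, which report memory in MiB. The function names are made up for this example, and the sample numbers are invented, not measured on any real card.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text: str) -> dict:
    """Map GPU index -> (used MiB, total MiB) from nvidia-smi CSV output."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        result[int(index)] = (int(used), int(total))
    return result

def free_mib_per_rank(used: int, total: int, ranks_on_gpu: int) -> float:
    """Free device memory divided evenly among the ranks sharing one GPU."""
    return (total - used) / ranks_on_gpu

def query_gpus() -> dict:
    """Run nvidia-smi on the current node (needs an NVIDIA driver installed)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)

# Invented sample output for two GPUs, for illustration only
sample = "0, 100, 6000\n1, 100, 6000\n"
gpus = parse_gpu_memory(sample)
for index, (used, total) in gpus.items():
    for ranks in (4, 1):
        share = free_mib_per_rank(used, total, ranks)
        print(f"GPU {index}: {ranks} rank(s) -> {share:.0f} MiB each")
```

On the compute node, replace `parse_gpu_memory(sample)` with `query_gpus()`. Running it while the job is active shows how close each GPU is to its limit.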

If MAGMA is used, multithreading can improve the performance significantly.
Without MAGMA, multithreading may have a negative effect.

NSIM can be tuned as well. If memory is insufficient, a smaller NSIM can be used. I set NSIM=14.
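
For comparing settings, the relevant INCAR lines can be written out with a few lines of Python. This is a minimal sketch: `render_incar` is a made-up helper, NSIM=14 is the value mentioned above, and NSIM=4 is just an example of a smaller value to try when memory runs out.

```python
def render_incar(tags: dict) -> str:
    """Format tags as INCAR lines of the form 'TAG = value'."""
    return "\n".join(f"{name} = {value}" for name, value in tags.items()) + "\n"

# Setting mentioned in this post
tuned = {"KPAR": 1, "NSIM": 14}
# Example of a smaller NSIM for the case that memory is insufficient
low_memory = {"KPAR": 1, "NSIM": 4}

for label, tags in (("tuned", tuned), ("low memory", low_memory)):
    print(f"# {label}")
    print(render_incar(tags), end="")
```

The printed lines are meant to be merged into an existing INCAR, not to replace it.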

CUFFT_ALLOC_FAILED
