Very slow parallel version on multiple nodes
Posted: Tue Jan 20, 2009 3:14 am
Dear Admin,
I have successfully compiled both the serial and the parallel versions of vasp.4.6.35 on my cluster. The serial version works well, but the parallel version has a serious problem: it runs well on one node (8 cores), yet it slows down when run on two nodes (16 cores). It drives me crazy that the calculation gets slower as the number of computing nodes increases.
------------------------------------------------------------------------------------------
The hardware and software configuration of each node is listed below:
Intel Xeon E5420 2.5 GHz CPUs (2 x 4 cores), 4 GB memory, 146 GB disk space, 1 Gbit network card and 1 Gbit switch
SUSE Linux 10.0
Intel Fortran compiler 10.1.021
LAM/MPI 7.1.4
BLAS: Supplied by Intel MKL 10.1.0.015
(setting in makefile: BLAS=-L/opt/intel/mkl/10.1.0.015/lib/32 -lmkl -lmkl_blacs -lmkl_core -lmkl_intel_thread -lsvml -liomp5 -lguide -lpthread)
LAPACK: Supplied by Intel MKL 10.1.0.015
(setting in makefile: LAPACK=-L/opt/intel/mkl/10.1.0.015/lib/32/libmkl_lapack.a)
------------------------------------------------------------------------------------------
For bench.Hg, the running times on one node and on two nodes are listed below:
One node (8 cores):
Total CPU time used (sec): 19.897
User time (sec): 19.717
System time (sec): 0.180
Elapsed time (sec): 19.916
Two nodes (16 cores):
Total CPU time used (sec): 15.069
User time (sec): 11.309
System time (sec): 3.760
Elapsed time (sec): 91.696
The total CPU time clearly decreases, but the elapsed time increases greatly. I have checked the utilization of each CPU: when running on one node it is almost 100%, while on two nodes it is less than 20%.
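As a rough cross-check of the utilization figures, the ratio of total CPU time to elapsed time can be computed directly from the timings above. The Python sketch below is only illustrative: the function name `cpu_utilization` is my own, and it assumes that the reported CPU time and elapsed time refer to the same single MPI process.

```python
def cpu_utilization(cpu_seconds, elapsed_seconds):
    """Return total CPU time as a fraction of elapsed (wall-clock) time."""
    return cpu_seconds / elapsed_seconds


# (total CPU time, elapsed time) in seconds, copied from the bench.Hg runs above
runs = {
    "one node (8 cores)": (19.897, 19.916),
    "two nodes (16 cores)": (15.069, 91.696),
}

for label, (cpu, elapsed) in runs.items():
    busy = cpu_utilization(cpu, elapsed)
    idle = elapsed - cpu  # time the process was not running on a CPU
    print(f"{label}: {busy:.1%} busy, {idle:.3f} s not on CPU")
```

Under that assumption the ratio is about 99.9% on one node and about 16.4% on two nodes, which agrees with the CPU usage I observed. On two nodes roughly 76.6 s of the 91.7 s elapsed is not spent computing.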
------------------------------------------------------------------------------------------
Is this a VASP-related problem? I have tested the simple parallel examples supplied with LAM/MPI; they show no problem at all and run faster when I increase the number of computing nodes.
Can anyone suggest a solution? It is very important to me! Many thanks in advance. I can supply more detailed information if required.