When I try a run on two nodes, each with one GPU, I get:
" Data for JOB [41425,1] offset 0 Total slots allocated 40
Mapper requested: NULL Last mapper: round_robin Mapping policy: BYCORE:NOOVERSUBSCRIBE Ranking policy: SLOT
Binding policy: CORE:IF-SUPPORTED Cpu set: NULL PPR: NULL Cpus-per-rank: 0
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1
Data for node: gpu001 State: 3 Flags: 11
Daemon: [[41425,0],0] Daemon launched: True
Num slots: 40 Slots in use: 2 Oversubscribed: FALSE
Num slots allocated: 40 Max slots: 0
Num procs: 2 Next node_rank: 2
Data for proc: [[41425,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Locale: [B/././././././././././././././././././.][./././././././././././././././././././.]
Binding: [B/././././././././././././././././././.][./././././././././././././././././././.]
Data for proc: [[41425,1],1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED App_context: 0
Locale: [./B/./././././././././././././././././.][./././././././././././././././././././.]
Binding: [./B/./././././././././././././././././.][./././././././././././././././././././.]
----------------------------------------------------
OOO PPPP EEEEE N N M M PPPP
O O P P E NN N MM MM P P
O O P E N NN M M P
OOO P EEEEE N N M M P
----------------------------------------------------
running 2 mpi-ranks, with 1 threads/rank
distrk: each k-point on 2 cores, 1 groups
distr: one band on 1 cores, 2 groups
OpenACC runtime initialized ... 1 GPUs detected
-----------------------------------------------------------------------------
| |
| EEEEEEE RRRRRR RRRRRR OOOOOOO RRRRRR ### ### ### |
| E R R R R O O R R ### ### ### |
| E R R R R O O R R ### ### ### |
| EEEEE RRRRRR RRRRRR O O RRRRRR # # # |
| E R R R R O O R R |
| E R R R R O O R R ### ### ### |
| EEEEEEE R R R R OOOOOOO R R ### ### ### |
| |
| M_init_nccl: Error in ncclCommInitRank |
| |
| ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- |"
| |
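
One detail in the pasted map: it reports "Num nodes: 1", places both ranks on gpu001, and the runtime then detects 1 GPU. Whether that is connected to the ncclCommInitRank failure is an assumption; the log does not confirm it. As a minimal sketch for checking the placement, the Python snippet below counts ranks per node from map output of this shape. The function name `summarize_map` and the shortened sample text are hypothetical, introduced only for illustration.

```python
import re

def summarize_map(text):
    """Return {node_name: num_procs} parsed from a job-map dump like the one above.

    Assumes each node section starts with 'Data for node: <name>' and that a
    'Num procs: <n>' entry follows it before the next node section begins.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        m = re.search(r"Data for node:\s*(\S+)", line)
        if m:
            current = m.group(1)
            nodes[current] = 0
            continue
        m = re.search(r"Num procs:\s*(\d+)", line)
        if m and current is not None:
            nodes[current] = int(m.group(1))
    return nodes

# Shortened excerpt of the map from the post (hypothetical sample for the sketch).
sample = """\
Num nodes: 1
Data for node: gpu001 State: 3 Flags: 11
Daemon: [[41425,0],0] Daemon launched: True
Num slots: 40 Slots in use: 2 Oversubscribed: FALSE
Num procs: 2 Next node_rank: 2
"""

print(summarize_map(sample))
```

If a two-node run is mapped as intended, a summary like this would be expected to list two node names with one rank each, instead of a single node holding both ranks.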