UCX ERROR Invalid active_width on mlx5_0:1: 16

Questions regarding the compilation of VASP on various platforms: hardware, compilers and libraries, etc.


Moderators: Global Moderator, Moderator

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

UCX ERROR Invalid active_width on mlx5_0:1: 16

#1 Post by amihai_silverman1 » Thu Jun 29, 2023 7:43 am

Hello,
I have compiled vasp.6.4.1 on our HPC cluster.
When I run the NaCl example (from the /testsuite/tests directory), I get the following error in the output:

[1688024082.874154] [n017:228578:0] ib_iface.c:947 UCX ERROR Invalid active_width on mlx5_0:1: 16
Abort(1614991) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1901):
create_endpoint(2593)........: OFI endpoint open failed (ofi_init.c:2593:create_endpoint:Input/output error)

I am not sure whether the problem is in the cluster or in my compilation.
I would be grateful for your help.
Attached are the makefile.include used for the compilation and a tar of the run folder with the PBS submit script and the input and output files.
Thanks a lot,
Amihai

fabien_tran1
Global Moderator
Posts: 417
Joined: Mon Sep 13, 2021 11:02 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#2 Post by fabien_tran1 » Thu Jun 29, 2023 9:02 am

Hi,

Can you please provide information about the HPC cluster like the processor type and RAM?

Is the error occurring for all examples that you have tried or only NaCl?

A Google search for "UCX ERROR Invalid active_width" turns up only
https://github.com/openucx/ucx/issues/4556
https://forums.developer.nvidia.com/t/u ... 7-9/206236
Have you tried the possible solutions provided there?
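For what it's worth, the number in the error message appears to be the InfiniBand LinkWidthActive code that UCX reads from the port attributes. Assuming the standard IBTA encoding (this is my reading of the GitHub issue above, not something I have verified on your hardware), 16 denotes a 2x link, an encoding added in a later spec revision that older UCX releases reject as invalid. A minimal sketch of the mapping:

```shell
# Sketch of the InfiniBand LinkWidthActive encoding (assumption: standard
# IBTA values; the 2x width, code 16, was added later, and UCX builds that
# predate it report "Invalid active_width").
decode_width() {
    case "$1" in
        1)  echo "1x" ;;
        2)  echo "4x" ;;
        4)  echo "8x" ;;
        8)  echo "12x" ;;
        16) echo "2x" ;;   # the value from the error message
        *)  echo "unknown" ;;
    esac
}

decode_width 16
```

If that reading is right, asking your cluster support whether the UCX library on the compute nodes can be updated might be worthwhile.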

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#3 Post by amihai_silverman1 » Thu Jun 29, 2023 10:05 am

Hi,
The same error occurs in every run, which is why I tried a simple test.
The cluster has Lenovo compute nodes with Intel(R) Xeon(R) Gold 6230 CPUs @ 2.10GHz,
384 GB of RAM per node, and an InfiniBand network.
I compiled with the Intel oneAPI 2022 compiler.

Regarding the links you provided: in my case, the command
ucx_info -d
does not show any error. Its output is attached.
Thanks, Amihai

fabien_tran1
Global Moderator
Posts: 417
Joined: Mon Sep 13, 2021 11:02 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#4 Post by fabien_tran1 » Thu Jun 29, 2023 11:38 am

Is the error also occurring in a non-parallel calculation with "mpirun -np 1"? Are you running with OpenMP, i.e., with OMP_NUM_THREADS>1?

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#5 Post by amihai_silverman1 » Sun Jul 02, 2023 9:23 am

Yes, the run also failed with "mpirun -np 1".
It failed with the same error when I set
export OMP_NUM_THREADS=12
mpirun -np 12 ...

I thought it might be a problem with the Intel oneAPI compiler, so I downloaded and installed the latest version and recompiled VASP with it.
That did not solve the problem; the run fails with the same error.
Thanks, Amihai

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#6 Post by amihai_silverman1 » Sun Jul 02, 2023 9:38 am

Please see the output of ucx_info -d. It looks like device mlx5_0:1 is OK:
#
# Memory domain: self
# component: self
# register: unlimited, cost: 0 nsec
# remote key: 8 bytes
#
# Transport: self
#
# Device: self
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8k
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: tcp
# component: tcp
#
# Transport: tcp
#
# Device: eno6
#
# capabilities:
# bandwidth: 1131.64 MB/sec
# latency: 5258 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: eno2
#
# capabilities:
# bandwidth: 113.16 MB/sec
# latency: 5776 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: ib0
#
# capabilities:
# bandwidth: 5571.26 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
#
# Memory domain: ib/mlx5_0
# component: ib
# register: unlimited, cost: 90 nsec
# remote key: 16 bytes
# local memory handle is required for zcopy
#
# Transport: rc
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 3 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 3 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 123
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 2 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 127
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: peer failure
#
#
# Transport: rc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 40 nsec
# put_short: <= 220
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 235
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 187
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: dc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 172
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 187
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 139
# domain: device
# connection: to iface
# priority: 30
# device address: 3 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: ud
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 3984
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Transport: ud_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 132
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Memory domain: rdmacm
# component: rdmacm
# supports client-server connection establishment via sockaddr
# < no supported devices found >
#
# Memory domain: sysv
# component: sysv
# allocate: unlimited
# remote key: 32 bytes
#
# Transport: mm
#
# Device: sysv
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: posix
# component: posix
# allocate: unlimited
# remote key: 37 bytes
#
# Transport: mm
#
# Device: posix
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: cma
# component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
#
# Device: cma
#
# capabilities:
# bandwidth: 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 4 bytes
# error handling: none
#

fabien_tran1
Global Moderator
Posts: 417
Joined: Mon Sep 13, 2021 11:02 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#7 Post by fabien_tran1 » Sun Jul 02, 2023 9:38 am

Hi,

I will ask my colleagues if they have an idea of what the problem could be. Meanwhile, you could try what is mentioned in the last post of
https://forums.developer.nvidia.com/t/u ... 9/206236/4
where it is suggested to add "soft memlock unlimited" and "hard memlock unlimited" to /etc/security/limits.d/rdma.conf. I suspect it is not related and will not help, but try it just in case.
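For reference, a sketch of what that could look like (file path as given in the linked post; confirm the exact procedure with your system administrator before changing limits on the compute nodes):

```shell
# Write the locked-memory limits suggested in the linked post to
# /etc/security/limits.d/rdma.conf (requires root; applies to new logins).
cat <<'EOF' | sudo tee /etc/security/limits.d/rdma.conf
* soft memlock unlimited
* hard memlock unlimited
EOF

# In a fresh session on a compute node, verify the limit took effect:
ulimit -l    # should report "unlimited"
```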

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#8 Post by amihai_silverman1 » Sun Jul 02, 2023 9:56 am

Please see the following output; it may help to debug the problem:

$ lspci | grep Mellanox
12:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.22.4030
node_guid: 0409:73ff:ffe1:bbd8
sys_image_guid: 0409:73ff:ffe1:bbd8
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 98
port_lid: 62
port_lmc: 0x00
link_layer: InfiniBand
$ fi_info -l
psm2:
version: 1.7
psm:
version: 1.7
usnic:
version: 1.0
ofi_rxm:
version: 1.0
ofi_rxd:
version: 1.0
verbs:
version: 1.0
UDP:
version: 1.1
sockets:
version: 2.0
tcp:
version: 0.1
ofi_perf_hook:
version: 1.0
ofi_noop_hook:
version: 1.0
shm:
version: 1.0
ofi_mrail:
version: 1.0

fabien_tran1
Global Moderator
Posts: 417
Joined: Mon Sep 13, 2021 11:02 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#9 Post by fabien_tran1 » Mon Jul 03, 2023 7:12 am

Hi,

For running a calculation, do you load the required Intel modules or set the environment variables correctly?
Does the problem also occur if OpenMP is switched off with "export OMP_NUM_THREADS=1"?
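For example, a sketch of what such a minimal serial test could look like in your submit script (the setvars.sh path below is an assumption for a typical oneAPI installation; substitute your site's module names and the actual path to your VASP binary):

```shell
#!/bin/bash
# Minimal serial test: load the Intel oneAPI environment, switch OpenMP
# threading off, and run a single MPI rank.
source /opt/intel/oneapi/setvars.sh   # or: module load <your oneAPI module>

export OMP_NUM_THREADS=1              # OpenMP off
mpirun -np 1 /path/to/vasp_std        # adjust to your VASP binary location
```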

amihai_silverman1
Newbie
Posts: 22
Joined: Tue May 16, 2023 11:14 am

Re: UCX ERROR Invalid active_width on mlx5_0:1: 16

#10 Post by amihai_silverman1 » Tue Jul 04, 2023 5:58 am

Hi,
Thank you for your replies. I think the problem is with the compute nodes; I have asked the cluster support to check this with the cluster integrator.
