UCX ERROR Invalid active_width on mlx5_0:1: 16
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
UCX ERROR Invalid active_width on mlx5_0:1: 16
Hello,
I have compiled vasp.6.4.1 on our HPC cluster.
When I run the NaCl example (from /testsuite/tests), I get an error in the output:
[1688024082.874154] [n017:228578:0] ib_iface.c:947 UCX ERROR Invalid active_width on mlx5_0:1: 16
Abort(1614991) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(143)........:
MPID_Init(1310)..............:
MPIDI_OFI_mpi_init_hook(1901):
create_endpoint(2593)........: OFI endpoint open failed (ofi_init.c:2593:create_endpoint:Input/output error)
I am not sure whether the problem lies in the cluster or in my compilation.
I will be grateful for your help.
Attached are the makefile.include used for the compilation and a tar of the run folder with the PBS submit script and the input and output files.
Thanks a lot
Amihai
-
- Global Moderator
- Posts: 417
- Joined: Mon Sep 13, 2021 11:02 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Hi,
Can you please provide information about the HPC cluster like the processor type and RAM?
Is the error occurring for all examples that you have tried or only NaCl?
A Google search for "UCX ERROR Invalid active_width" turns up only
https://github.com/openucx/ucx/issues/4556
https://forums.developer.nvidia.com/t/u ... 7-9/206236
Have you tried the possible solutions provided there?
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Hi,
The same error occurs in every run. That is why I tried a simple test.
The cluster has Lenovo compute nodes with Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz processors,
384 GB RAM per node, and an InfiniBand network.
I compiled with the Intel oneAPI 2022 compiler.
Regarding the links you provided: in my case the command
ucx_info -d
does not report any error. Its output is attached.
Thanks, Amihai
-
- Global Moderator
- Posts: 417
- Joined: Mon Sep 13, 2021 11:02 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Is the error also occurring in a non-parallel calculation with "mpirun -np 1"? Are you running with OpenMP, i.e., with OMP_NUM_THREADS>1?
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Yes, the run also failed with "mpirun -np 1".
It also failed with the same error when I put
export OMP_NUM_THREADS=12
mpirun -np 12 ...
I suspected a problem with the Intel oneAPI compiler, so I downloaded and installed the latest version and recompiled VASP with it.
This did not solve the problem; the run fails with the same error.
Thanks, Amihai
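Editor's note: since the abort happens inside MPIDI_OFI_mpi_init_hook, one way to narrow this down is to force the OFI layer onto plain TCP and see whether MPI then initializes. This is a sketch using standard libfabric / Intel MPI environment variables; the vasp_std path in the comment is a placeholder for the actual binary.

```shell
# Force libfabric onto the TCP provider, bypassing the verbs/mlx5 path,
# and ask Intel MPI to report which OFI provider it actually selects.
export I_MPI_DEBUG=5
export FI_PROVIDER=tcp
# then, e.g.:  mpirun -np 1 ./vasp_std
# If the run now starts, the failure is in the InfiniBand/UCX stack,
# not in the VASP build itself.
```

If the TCP run succeeds, that would point at the fabric configuration rather than the compilation.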
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Please see the output of ucx_info -d; it looks like Device mlx5_0:1 is OK:
#
# Memory domain: self
# component: self
# register: unlimited, cost: 0 nsec
# remote key: 8 bytes
#
# Transport: self
#
# Device: self
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8k
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: tcp
# component: tcp
#
# Transport: tcp
#
# Device: eno6
#
# capabilities:
# bandwidth: 1131.64 MB/sec
# latency: 5258 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: eno2
#
# capabilities:
# bandwidth: 113.16 MB/sec
# latency: 5776 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
# Device: ib0
#
# capabilities:
# bandwidth: 5571.26 MB/sec
# latency: 5212 nsec
# overhead: 50000 nsec
# am_bcopy: <= 8k
# connection: to iface
# priority: 1
# device address: 4 bytes
# iface address: 2 bytes
# error handling: none
#
#
# Memory domain: ib/mlx5_0
# component: ib
# register: unlimited, cost: 90 nsec
# remote key: 16 bytes
# local memory handle is required for zcopy
#
# Transport: rc
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 75 nsec
# put_short: <= 124
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 3 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 3 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 123
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 2 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 127
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: peer failure
#
#
# Transport: rc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 600 nsec + 1 * N
# overhead: 40 nsec
# put_short: <= 220
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 235
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 187
# domain: device
# connection: to ep
# priority: 30
# device address: 3 bytes
# ep address: 4 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: dc_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 660 nsec
# overhead: 40 nsec
# put_short: <= 172
# put_bcopy: <= 8k
# put_zcopy: <= 1g, up to 8 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4k
# get_bcopy: <= 8k
# get_zcopy: 33..1g, up to 8 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4k
# am_short: <= 187
# am_bcopy: <= 8191
# am_zcopy: <= 8191, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 139
# domain: device
# connection: to iface
# priority: 30
# device address: 3 bytes
# iface address: 5 bytes
# error handling: buffer (zcopy), remote access, peer failure
#
#
# Transport: ud
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 105 nsec
# am_short: <= 116
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 3984
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Transport: ud_mlx5
#
# Device: mlx5_0:1
#
# capabilities:
# bandwidth: 47176.93 MB/sec
# latency: 610 nsec
# overhead: 80 nsec
# am_short: <= 180
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 3 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4k
# am header: <= 132
# connection: to ep, to iface
# priority: 30
# device address: 3 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure
#
#
# Memory domain: rdmacm
# component: rdmacm
# supports client-server connection establishment via sockaddr
# < no supported devices found >
#
# Memory domain: sysv
# component: sysv
# allocate: unlimited
# remote key: 32 bytes
#
# Transport: mm
#
# Device: sysv
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: posix
# component: posix
# allocate: unlimited
# remote key: 37 bytes
#
# Transport: mm
#
# Device: posix
#
# capabilities:
# bandwidth: 6911.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 16 bytes
# error handling: none
#
#
# Memory domain: cma
# component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
#
# Device: cma
#
# capabilities:
# bandwidth: 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# priority: 0
# device address: 8 bytes
# iface address: 4 bytes
# error handling: none
#
-
- Global Moderator
- Posts: 417
- Joined: Mon Sep 13, 2021 11:02 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Hi,
I will ask my colleagues whether they have an idea of what the problem could be. Meanwhile, you could try what is mentioned in the last post of
https://forums.developer.nvidia.com/t/u ... 9/206236/4
which suggests adding "soft memlock unlimited" and "hard memlock unlimited" to /etc/security/limits.d/rdma.conf. I suspect it is unrelated and will not help, but try it just in case.
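Editor's note: for reference, the suggested entries would look like this (a sketch; the syntax follows pam_limits and the filename rdma.conf comes from the post above, so adjust to your distribution):

```
*  soft  memlock  unlimited
*  hard  memlock  unlimited
```

After logging in again, "ulimit -l" should then report "unlimited"; RDMA transports need to pin (lock) memory, which is why a low memlock limit can break InfiniBand communication.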
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Please see the following; it may help to debug the problem:
$ lspci | grep Mellanox
12:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
$ ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.22.4030
node_guid: 0409:73ff:ffe1:bbd8
sys_image_guid: 0409:73ff:ffe1:bbd8
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: HP_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 98
port_lid: 62
port_lmc: 0x00
link_layer: InfiniBand
$ fi_info -l
psm2:
version: 1.7
psm:
version: 1.7
usnic:
version: 1.0
ofi_rxm:
version: 1.0
ofi_rxd:
version: 1.0
verbs:
version: 1.0
UDP:
version: 1.1
sockets:
version: 2.0
tcp:
version: 0.1
ofi_perf_hook:
version: 1.0
ofi_noop_hook:
version: 1.0
shm:
version: 1.0
ofi_mrail:
version: 1.0
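Editor's note: if the UCX issue linked earlier in the thread applies here, the trailing 16 in the error message is the raw InfiniBand active_width bitmask. Older UCX releases recognize only the values 1, 2, 4, 8 (1x, 4x, 8x, 12x) and reject 16, the 2x link width added in a later InfiniBand revision, so upgrading UCX may resolve this. A small sketch of that decoding (the decode_width helper is hypothetical, for illustration only):

```shell
# Decode the raw IB active_width bitmask that UCX prints in the error.
# Older UCX releases only knew 1, 2, 4, 8 and rejected 16 (the newer 2x
# width), which is why the message calls 16 "invalid".
decode_width() {
  case "$1" in
    1)  echo "1x" ;;
    2)  echo "4x" ;;
    4)  echo "8x" ;;
    8)  echo "12x" ;;
    16) echo "2x (unsupported by older UCX)" ;;
    *)  echo "unknown" ;;
  esac
}
decode_width 16
```

On the node itself, "cat /sys/class/infiniband/mlx5_0/ports/1/rate" shows the negotiated width and speed the kernel reports for that port.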
-
- Global Moderator
- Posts: 417
- Joined: Mon Sep 13, 2021 11:02 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Hi,
When running a calculation, do you load the required Intel modules and set the environment variables correctly?
Does the problem also occur if OpenMP is switched off with "export OMP_NUM_THREADS=1"?
-
- Newbie
- Posts: 22
- Joined: Tue May 16, 2023 11:14 am
Re: UCX ERROR Invalid active_width on mlx5_0:1: 16
Hi,
Thank you for your replies. I think the problem is with the compute nodes; I have asked cluster support to check this with the cluster integrator.