ompi_info
Package: Open MPI
Distribution
Open MPI: 4.1.1
Open MPI repo revision: v4.1.1
Open MPI release date: Apr 24, 2021
Open RTE: 4.1.1
Open RTE repo revision: v4.1.1
Open RTE release date: Apr 24, 2021
OPAL: 4.1.1
OPAL repo revision: v4.1.1
OPAL release date: Apr 24, 2021
MPI API: 3.1.0
Ident string: 4.1.1
Prefix: /usr/local/mpi
Configured architecture: x86_64-pc-linux-gnu
Configure host: ngt-003-h100-nvl--rnmcnohij2v7-node-2
Configured by: hannesja
Configured on: Tue Jan 13 14:51:48 UTC 2026
Configure host: ngt-003-h100-nvl--rnmcnohij2v7-node-2
Configure command line: '--prefix=/usr/local/mpi'
'--with-ucx=/usr/local/ucx'
'--with-gdrcopy=/usr/local/gdrcopy'
'--with-cuda=/usr/local/cuda-12.8'
'--enable-mca-no-build=btl-uct'
'--enable-mca-no-build=btl-openib'
'--without-verbs'
Built by: root
Built on: Tue Jan 13 15:02:40 UTC 2026
Built host: ngt-003-h100-nvl--rnmcnohij2v7-node-2
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler and/or Open
MPI, does not support the following: array
subsections, direct passthru (where possible) to
underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 11.5.0
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA event: libevent2022 (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA hwloc: hwloc201 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v4.1.1)
MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pmix: pmix3x (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.1)
MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.1)
MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.1)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA schizo: singularity (MCA v2.1.0, API v1.0.0, Component
v4.1.1)
MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.1)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
v4.1.1)
MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.1)
MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.1)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.1)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v4.1.1)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
v4.1.1)
ucx_info -d
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: tcp
# Component: tcp
# memory types:
#
# Transport: tcp
# Device: enp194s0f0np0
# Type: network
# System device: enp194s0f0np0 (0)
#
# capabilities:
# bandwidth: 2200.00/ppn + 0.00 MB/sec
# latency: 5206 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: enp3s0f4u1u2c2
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.32/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: tunl0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.60/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 8 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 1406352400K
# remote key: 32 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 16 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: cuda_cpy
# Component: cuda_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
# memory types: host (access,reg), cuda (access,alloc,reg,detect,dmabuf), cuda-managed (access,alloc,reg,cache,detect)
#
# Transport: cuda_copy
# Device: cuda
# Type: accelerator
# System device: <unknown>
#
# capabilities:
# bandwidth: 10000.00/ppn + 0.00 MB/sec
# latency: 8000 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
# device mem_element: 0 bytes
#
#
# Memory domain: cuda_ipc
# Component: cuda_ipc
# register: unlimited, cost: 0 nsec
# remote key: 192 bytes
# memory invalidation is supported
# memory types: cuda (access,reg,cache)
#
# Transport: cuda_ipc
# Device: cuda
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 400000.00/ppn + 0.00 MB/sec
# latency: 1000 nsec
# overhead: 7000 nsec
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: <= 0, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure
# device mem_element: 8 bytes
#
#
# Memory domain: gdr_copy
# Component: gdr_copy
# register: unlimited, cost: 50 nsec
# remote key: 24 bytes
# memory types: cuda (access,reg)
#
# Transport: gdr_copy
# Device: cuda
# Type: accelerator
# System device: <unknown>
#
# capabilities:
# bandwidth: 6911.00/ppn + 250.00 MB/sec
# latency: 400 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# get_short: <= 4294967295
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
# device mem_element: 0 bytes
#
#
# Memory domain: bnxt_re0
# Component: ib
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory types: host (access,reg,cache)
#
# Transport: rc_verbs
# Device: bnxt_re0:1
# Type: network
# System device: bnxt_re0 (0)
#
# capabilities:
# bandwidth: 10957.84/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 96
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 1K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 1K
# am_short: <= 95
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 127
# domain: cpu
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# ep address: 4 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
#
#
# Transport: ud_verbs
# Device: bnxt_re0:1
# Type: network
# System device: bnxt_re0 (0)
#
# capabilities:
# bandwidth: 10957.84/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 88
# am_bcopy: <= 1016
# am_zcopy: <= 1016, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 1K
# am header: <= 920
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 18 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: bnxt_re1
# Component: ib
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory types: host (access,reg,cache)
# < no supported devices found >
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: cma
# Component: cma
# memory types:
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 16 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
#
Describe the bug
We are running a setup with two nodes using Broadcom RoCEv2 NICs (100 Gb) and NVIDIA GPUs. Initially both nodes are configured with two NUMA nodes.
Running ib_write_bw indicates that GPUDirect RDMA works between all GPUs, independent of which NUMA node is used; we get the expected bandwidth of around 85 Gb/s.
With UCX, however, it does not work if the GPU is on a different NUMA node than the NIC. In this case the bandwidth drops to 65 Gb/s and we see staging through host memory.
If we configure the nodes to use a single NUMA node, GPUDirect RDMA works again for all GPUs.
We assume that UCX has some logic, based on topology, that decides whether to stage through host memory or use RDMA directly.
But given that RDMA appears to work across NUMA nodes, we were wondering whether this behavior can be configured, or whether there is another problem.
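To make UCX's choice visible, its transport and protocol selection can be logged and partly steered through environment variables. This is a diagnostic sketch, not a fix: the variable names below exist in current UCX releases, but the exact transport list for UCX_TLS depends on the build, and `./app` is a hypothetical stand-in for any CUDA-aware benchmark:

```shell
# Print the protocol UCX chooses for each transfer (protocols v2):
export UCX_PROTO_INFO=y

# Verbose logging of transport selection and memory-type handling:
export UCX_LOG_LEVEL=info

# Explicitly allow GPUDirect RDMA in the IB memory domain
# (the default "try" lets UCX's own heuristics decide):
export UCX_IB_GPU_DIRECT_RDMA=yes

# Optionally pin the transport set, e.g. RC verbs plus CUDA support:
export UCX_TLS=rc,cuda_copy,gdr_copy

# Inspect the full effective configuration, including the IB
# memory domain's GPU Direct setting:
ucx_info -c -f | grep -i GPU_DIRECT

mpirun -np 2 --mca pml ucx ./app   # ./app: any CUDA-aware benchmark
```

Comparing the UCX_PROTO_INFO output for a NIC-local and a NIC-remote GPU should show directly whether UCX switches from a zero-copy rendezvous to a host-memory staging pipeline.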
Steps to Reproduce

ib_write_bw
On the server node:
On the client node:
This should report the expected bandwidth and no cuMemCopy calls from CUDA.

UCX
This should report lower performance if the GPU and the NIC are on different NUMA nodes, and cuMemCopy calls from CUDA.

UCX version used (from ucx_info -v): 1.20.1 (release)

Setup and versions
cat /etc/redhat-release: AlmaLinux release 9.6 (Sage Margay)
uname -a: Linux ngt-003-h100-nvl--rnmcnohij2v7-node-2 6.8.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 20:45:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
rpm -q rdma-core: rdma-core-54.0-1.el9.x86_64
rpm -q libibverbs: libibverbs-57.0-2.el9.x86_64
ibstat:
lsmod | grep nv_peer_mem: dmabuf is used instead

Additional information (depending on the issue)
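The concrete server/client commands were not captured above; a typical reproduction, assuming perftest was built with CUDA support and using the device name from the logs (bnxt_re0, GPU 0), might look like the following. `<server>` stands for the peer's address, and `nvidia-smi topo -m` first confirms whether a GPU reaches the NIC through the same PCIe root complex (PIX/PXB) or across NUMA nodes (SYS):

```shell
# Show GPU/NIC PCIe and NUMA affinity:
nvidia-smi topo -m

# GPUDirect RDMA via perftest:
ib_write_bw -d bnxt_re0 --use_cuda=0 --report_gbits            # server
ib_write_bw -d bnxt_re0 --use_cuda=0 --report_gbits <server>   # client

# The same path through UCX, sending from CUDA device memory:
ucx_perftest -t tag_bw -m cuda -s 1048576             # server
ucx_perftest <server> -t tag_bw -m cuda -s 1048576    # client
```

Running both pairs once with the GPU on the NIC's NUMA node and once on the remote one should separate the verbs-level behavior from UCX's transport selection.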