Releases: openucx/ucx
v1.19.1
v1.19.1-rc2
1.19.1 (October 21, 2025)
Features:
UCP
- Do not require transport memory support if rendezvous protocol is not used
Build
- Added CUDA 13 support to the release pipeline
- Added Rocky Linux support to the release pipeline
Bugfixes:
UCS
- Fixed Netlink fetch mechanism
v1.19.1-rc1
1.19.1 (September 18, 2025)
Features:
UCP
- Do not require transport memory support if rendezvous protocol is not used
Build
- Added CUDA 13 support to the release pipeline
v1.19.0
1.19.0 (August 6, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
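Two items in this release change how UCX is driven from the command line: the `UCX_TLS=^ib` fix and the new `-F 'str'` name filter in ucx_info. The Python sketch below only builds the environment and the argument list for them. The helper names `ucx_env` and `ucx_info_config_cmd` are hypothetical, the filter string `RNDV` is illustrative, and combining `-F` with the existing `-c` (show configuration) flag is an assumption, not something these notes confirm.

```python
import os
import shlex


def ucx_env(excluded, base_env=None):
    """Copy an environment and set UCX_TLS to exclude the given transports.

    A leading '^' in UCX_TLS negates the list, so ["ib"] yields "^ib".
    (Hypothetical helper, not part of UCX.)
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_TLS"] = "^" + ",".join(excluded)
    return env


def ucx_info_config_cmd(name_filter):
    """Build a ucx_info config dump command restricted by a name filter.

    Assumption: -F composes with the existing -c flag.
    (Hypothetical helper, not part of UCX.)
    """
    return ["ucx_info", "-c", "-F", name_filter]


# Build, but do not run, the environment and command line.
env = ucx_env(["ib"], base_env={"PATH": "/usr/bin"})
cmd = ucx_info_config_cmd("RNDV")
print("UCX_TLS=" + env["UCX_TLS"], shlex.join(cmd))
```

On a machine with UCX installed, the pair could be passed to `subprocess.run(cmd, env=env)`. Nothing is executed here, so the sketch runs without UCX.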
v1.19.0-rc2
1.19.0 (June 18, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
v1.19.0-rc1
1.19.0 (June 18, 2025)
Features:
UCP
- Enabled multi-GPU support within a single process
- Added dynamic selection between strong and weak fences in RMA flush operations
- Improved endpoint reconfiguration capabilities
- Added All2All lane selection for multi-NIC-GPU systems
- Improved rkey debug info when config cache limit is reached
- Improved UCP protocol selection based on available memory types
- Removed dummy memory key from irrelevant transports (TCP, CMA and CUDA)
- Improved RNDV performance with device-local staging buffers
- Enabled error handling for RMA get_offload protocols
UCT
- Defined uct_rkey_unpack_v2 API to support passing sys-dev
RDMA CORE (IB, ROCE, etc.)
- Added SRD transport support in EFA with reordering, AM, and control operations
- Removed XGVMI BF2 support (umem)
- Removed device memory indirect key
- Fixed VFS objects for DCIs and pools
- Added routing table cache to the reachability check
- Fixed strict order usage in IB auxiliary rkeys
- Improved various init logging messages
CUDA
- Added multi-context support for remote key unpacking to CUDA IPC
- Added context switching aware resource management to CUDA IPC
- Use buffer ID to detect VA recycling in CUDA IPC
- Added support for allocating CUDA memory on specific system devices
- Added multi-device support in CUDA copy
- Improved protocol lane selection for GPU memory operations
- Relaxed CUDA context requirements in CUDA copy
- Added deadlock prevention in CUDA copy
- Added support for address range detection for VMM
- Enabled memory attributes query after switching CUDA GPU
- Added multi-GPU send tests for CUDA transports
- Removed host-to-host performance estimation from CUDA copy transport
- Replaced cuCtxCreate with cuDevicePrimaryCtxRetain
- Improved various init logging messages
ROCM
- Added control parameters for IPC handle cache and signal pool size
- Optimized ROCm memory type detection with caching
UCS
- Removed compilation warnings
Tools
- Added name filter option (-F 'str') to ucx_info for config and feature dumps
- Improved ucx_info input validation
Bugfixes:
UCP
- Made UCX_TLS=^ib disable all transports including auxiliary
- Fixed send request status handling
- Fixed performance degradation in RNDV by optimizing md cache updates
- Fixed protocol selection when first lane is filtered out by fragment size
- Fixed rkey selection by using memory registration flag
UCT
RDMA CORE (IB, ROCE, etc.)
- Improved reliability of DC transport by adding DCI validation and separating connection logic
- Fixed segfault in DC fence operation
GPU (CUDA, ROCM)
- Updated ROCm configuration for ROCm 6.3 compatibility
- Fixed system device detection for CUDA async memory operations
- Fixed legacy type detection during CUDA IPC mpack
- Fixed CUDA IPC RMA operations by using correct context for local buffers
UCS
- Use UCS function for counting leading zeros on x86 architecture
- Fixed a compilation warning
Shared Memory
- Fixed FIFO availability check for sm transport
Documentation
- Fixed open-mpi clone instruction
Build
- Fixed enum-int-mismatch warnings with GCC 15
v1.18.1
1.18.1 (April 28, 2025)
Features:
CUDA
- Added config keys to update cuda_copy bandwidth for coherent platforms
- Improved cache invalidation of memory allocated using CUDA memory pool
AZP
- Added Ubuntu 24.04 to build and release pipeline
Bugfixes:
UCP
- Fixed assertion failure when maximum lane fragment is smaller than AM header
- Fixed potential use-after-free of the active message user header during protocol reconfiguration
CUDA
- Fixed registration of CUDA Fabric memory allocated by UCT
- Fixed VA recycling check of memory allocated using VMM and CUDA memory pool
RDMA CORE (IB, ROCE, etc.)
- Do not use ConnectX-8 SMI subdevices for communication
- Fixed remote access error by disabling ODP when the device supports DDP
- Fixed configuration logic by disabling DDP when AR is disabled
UCM
- Fixed crash with bistro hooks for CUDA 12.9 on amd64
v1.18.1 RC3
1.18.1-rc3 (April 17, 2025)
Bugfixes:
UCM
- Fixed crash with bistro hooks for CUDA 12.9 on amd64
v1.18.1 RC2
1.18.1-rc2 (April 9, 2025)
Features:
CUDA
- Added config keys to update cuda_copy bandwidth for coherent platforms
- Improved cache invalidation of memory allocated using CUDA memory pool
Bugfixes:
UCP
- Fixed assertion failure when maximum lane fragment is smaller than AM header
CUDA
- Fixed registration of CUDA Fabric memory allocated by UCT
- Fixed VA recycling check of memory allocated using VMM and CUDA memory pool
RDMA CORE (IB, ROCE, etc.)
- Do not use ConnectX-8 SMI subdevices for communication
- Fixed remote access error by disabling ODP when the device supports DDP
- Fixed configuration logic by disabling DDP when AR is disabled
v1.18.1 RC1
1.18.1-rc1 (February 20, 2025)
Features:
AZP
- Added Ubuntu 24.04 to build and release pipeline
Bugfixes:
UCP
- Fixed potential use-after-free of the active message user header during protocol reconfiguration