I have a question regarding PyTorch + UCC + UCX:
Background:
PyTorch can use UCC (Unified Collective Communication) as a backend for distributed training, and UCC can use UCX TCP transport as the underlying transport layer.
Issue:
The UCX documentation mentions that the TCP transport is not thread-safe for concurrent access.
Question:
PyTorch allows collective operations (e.g., allreduce) to be issued from multiple threads. How does it ensure thread safety when UCC runs over the UCX TCP transport?
- Is it guaranteed by serializing calls at the PyTorch or UCC layer?
- Or is there some other mechanism that ensures safe multi-threaded usage?
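
To make the first option concrete, here is a minimal sketch of what I mean by "serializing calls". It is not actual PyTorch, UCC, or UCX code: `SerializedTransport`, `fake_send`, and `run_demo` are hypothetical names, and `fake_send` only stands in for a transport call that is not safe to enter from two threads at once.

```python
import threading


class SerializedTransport:
    """Hypothetical wrapper: only one thread at a time may call the transport."""

    def __init__(self, send_fn):
        self._send_fn = send_fn        # stand-in for a non-thread-safe transport call
        self._lock = threading.Lock()  # a single lock serializes all callers

    def send(self, payload):
        # Every caller takes the same lock, so calls never overlap.
        with self._lock:
            return self._send_fn(payload)


def run_demo(num_threads=4, calls_per_thread=100):
    """Issue calls from several threads and record the peak concurrency seen."""
    state = {"active": 0, "max_active": 0, "calls": 0}

    def fake_send(payload):
        # Unsynchronized read-modify-write: only safe because the wrapper holds the lock.
        state["active"] += 1
        state["max_active"] = max(state["max_active"], state["active"])
        state["calls"] += 1
        state["active"] -= 1
        return payload

    transport = SerializedTransport(fake_send)

    def worker(thread_id):
        for i in range(calls_per_thread):
            transport.send((thread_id, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state
```

If something equivalent to this lock exists in the real stack, I would like to know at which layer it lives (PyTorch's process group, UCC, or UCX itself).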
Thanks!