
How does PyTorch ensure thread safety when using UCX TCP transport? #11006

@TroyMitchell911


I have a question regarding PyTorch + UCC + UCX:

Background:
PyTorch can use UCC (Unified Collective Communication) as a backend for distributed training, and UCC can in turn use the UCX TCP transport as its underlying transport layer.

Issue:
The UCX documentation mentions that the TCP transport is not thread-safe for concurrent access.

Question:
PyTorch allows multi-threaded execution of collective operations (e.g., allreduce). How does it ensure thread safety when using the UCX TCP transport?

  • Is it guaranteed by serializing calls at the PyTorch or UCC layer?
  • Or is there some other mechanism that ensures safe multi-threaded usage?
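
To make the first option concrete, here is a minimal sketch of what I mean by "serializing calls". It is only an illustration of the idea, not how PyTorch, UCC, or UCX actually behave. `FakeTransport`, `SerializedTransport`, and `run_workers` are hypothetical names I made up, and `FakeTransport.allreduce` is a local stand-in that just sums a list; nothing here calls a real PyTorch, UCC, or UCX API.

```python
import threading


class FakeTransport:
    """Hypothetical stand-in for a transport that must not be entered concurrently.

    It records how many threads were inside allreduce() at the same time.
    """

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self._guard = threading.Lock()  # protects only the bookkeeping counters

    def allreduce(self, values):
        with self._guard:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        result = sum(values)  # placeholder for the real communication work
        with self._guard:
            self.active -= 1
        return result


class SerializedTransport:
    """Wrapper that lets only one thread at a time into the wrapped transport."""

    def __init__(self, transport):
        self._transport = transport
        self._lock = threading.Lock()

    def allreduce(self, values):
        # Every caller takes the same lock, so calls run strictly one after another.
        with self._lock:
            return self._transport.allreduce(values)


def run_workers(transport, num_threads=8, calls_per_thread=50):
    """Issue allreduce calls from several threads and collect the results."""
    results = []
    results_lock = threading.Lock()

    def worker():
        for _ in range(calls_per_thread):
            value = transport.allreduce([1, 2, 3])
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    inner = FakeTransport()
    run_workers(SerializedTransport(inner))
    print("max concurrent callers:", inner.max_active)
```

With the wrapper in place, `inner.max_active` never exceeds 1, since every call goes through the same lock. What I don't know is whether an equivalent lock (or a single progress thread) exists somewhere in the PyTorch or UCC layer, or whether the user is expected to provide it.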

Thanks!
