[prov/verbs] Compatibility between RDMA CAS and CPU CAS #11610

youtive10 · 2025-11-12T09:24:13Z

I would appreciate some clarification on how RDMA CAS and CPU CAS operations interact.

The scenario:

Node A wants to transmit data to Node B.
Writes to Node B's memory are guarded by an atomic variable and CAS operations:
- Node A performs RDMA CAS operations on the atomic variable of Node B.
- Node B, where the memory resides, uses the C++ compare_exchange_* functions.

Correct me if I'm wrong, but because the two CAS operations take different data paths, the locking mechanism between Node A and Node B will not work properly.

I was thinking about letting Node B create an endpoint that is connected to itself, so that it also performs the RDMA CAS operation and Node A and Node B use the same data path. This seems unnecessarily complicated to me, though.

What is the best practice here? Maybe libfabric provides a solution which I'm not aware of?

aingerson (Collaborator) · 2025-11-13T14:47:11Z

@youtive10 Can you elaborate a bit more on what you're trying to do and when the atomic variable and CAS variables are being accessed and why your case would not work properly?

1 reply

youtive10 Nov 13, 2025
Author

> Can you elaborate a bit more on what you're trying to do

I guess I can try to put it into other words:

Node B hosts data that must be protected from concurrent access and modification. The memory layout is:
- [atomic_uint64] [data ...]
Both Node A and Node B want to read and modify the data on Node B.
To protect this memory from concurrent access, an atomic variable and CAS operations are used:
- If nobody is currently working on the memory, the atomic variable is 0.
- Whoever wants to access the data executes a CAS that swaps the 0 for a 1. On success, the caller starts working on the data.
- When the caller is done with its task, it writes 0 back to the atomic variable.
Because the data resides on Node B, Node B uses the C++ atomic compare-exchange functions.
Node A uses the libfabric function fi_compare_atomic.
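
For illustration, the locking protocol above could look roughly like this on Node B's side. This is a minimal sketch, not the actual implementation: `SharedRegion`, `try_acquire`, `release`, and `with_lock` are hypothetical names, and the data size is arbitrary. It covers only the CPU path; Node A would issue its compare-and-swap through fi_compare_atomic instead, and mixing those two paths is exactly what this thread is about.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout mirroring "[atomic_uint64] [data ...]".
struct SharedRegion {
    std::atomic<std::uint64_t> lock{0};  // 0 = free, 1 = held
    std::uint64_t data[8] = {};          // payload size chosen arbitrarily
};

// Try once to swap 0 -> 1. Returns true if the caller now owns the region.
inline bool try_acquire(SharedRegion& r) {
    std::uint64_t expected = 0;
    return r.lock.compare_exchange_strong(expected, 1,
                                          std::memory_order_acquire,
                                          std::memory_order_relaxed);
}

// Write 0 back so that the next caller's CAS can succeed.
inline void release(SharedRegion& r) {
    r.lock.store(0, std::memory_order_release);
}

// Run fn on the region only if the lock could be taken.
// Returns false, without calling fn, when the region is busy.
template <typename Fn>
bool with_lock(SharedRegion& r, Fn&& fn) {
    if (!try_acquire(r)) {
        return false;
    }
    fn(r);
    release(r);
    return true;
}
```

A local caller would retry `with_lock` until it returns true. The sketch is only sound as long as every party that touches `lock` goes through the CPU.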

> why your case would not work properly

The scenario I described is the implementation I am currently working with. Debugging showed that the locking mechanism does not work properly: it often works, but not always, so it is unreliable.

I've found others with similar questions and problems. The answers to these forum questions explain the root cause:

https://forums.developer.nvidia.com/t/does-cpu-cas-work-correctly-with-rdma-cas/325673
https://stackoverflow.com/questions/28793486/rdma-atomic-operations-implementation

aingerson (Collaborator) · 2025-11-13T16:17:29Z

@youtive10 Thanks for the context!
Yes, you're absolutely right. All accesses to the atomic need to use the same mechanism.
If you want to target the RDMA CAS directly, then I think you're going to have to open a separate connection for the local access to the variable, unfortunately.
If it's an option for you, you could use an RDM endpoint instead and let RxM manage your atomics. RxM emulates the CAS with CPU atomics on top of the underlying provider, so the remote and local access mechanisms would be aligned in that case.
https://github.com/ofiwg/libfabric/blob/main/prov/rxm/src/rxm_cq.c#L1154
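
To illustrate the idea of aligning both access paths on CPU atomics, here is a minimal sketch of a target-side compare-and-swap handler. It is not RxM's code: `CswapRequest` and `handle_cswap` are hypothetical names, and the message transport that would deliver a remote request to Node B is left out. The point is only that the remote request and Node B's own accesses both end up in the same CPU atomic operation.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical request a remote peer would send to Node B
// instead of issuing a NIC-level CAS.
struct CswapRequest {
    std::uint64_t compare;  // value the caller expects to find
    std::uint64_t swap;     // value to store if the comparison matches
};

// Executed by Node B's CPU for local and remote callers alike.
// Returns the value observed before the operation, so the caller
// can tell success (result == compare) from failure.
inline std::uint64_t handle_cswap(std::atomic<std::uint64_t>& target,
                                  const CswapRequest& req) {
    std::uint64_t observed = req.compare;
    // On failure, compare_exchange_strong stores the current value in
    // `observed`; on success, `observed` still equals req.compare.
    target.compare_exchange_strong(observed, req.swap,
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire);
    return observed;
}
```

With the 0/1 lock from this thread, a caller owns the region when `handle_cswap(lock, CswapRequest{0, 1})` returns 0.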

1 reply

youtive10 Nov 13, 2025
Author

Thank you for confirming the problem!
I will keep the RDM endpoint idea in the back of my mind.
