Ensuring reliable data delivery and CQ entry on receiver when closing sender endpoint #11180
Replies: 4 comments 2 replies
Which Libfabric provider are you using? The EFA provider has logic to ensure that the receive side gets a completion in the case you described.
Generally speaking, FI_TRANSMIT_COMPLETE does not guarantee that the receiver has received and processed the send, as FI_DELIVERY_COMPLETE would. Some providers may implement it that way, as @sunkuamzn mentions with EFA, but it is neither guaranteed nor required. You would need to implement a mechanism for this in your higher-layer protocol or application, such as a barrier, an out-of-band (OOB) sync, or an acknowledgement.
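A minimal sketch of the acknowledgement option, assuming the receiver sends a small ack message back once it has read its receive completion. It is written against a hypothetical callback table (`ack_transport`, `send_then_close`, and the `fake_*` helpers are illustrative names, not libfabric API) so that it compiles without the library. In a real program the hooks would presumably wrap `fi_recv`, `fi_send`, `fi_cq_read`, and `fi_close`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transport hooks (not libfabric API). In a real program they
 * might wrap fi_recv (post_ack_recv), fi_send plus its send completion
 * (send_data), fi_cq_read on the receive CQ (poll_ack), and fi_close. */
struct ack_transport {
    int  (*post_ack_recv)(void *ctx);
    int  (*send_data)(void *ctx, const void *buf, size_t len, uint64_t seq);
    int  (*poll_ack)(void *ctx, uint64_t *seq_out); /* 1 = ack, 0 = none yet, <0 = error */
    void (*close_ep)(void *ctx);
    void *ctx;
};

enum { SEND_ACKED = 0, SEND_UNCONFIRMED = -1, SEND_FAILED = -2 };

/* Send one message, then close the endpoint only after the peer's ack for
 * `seq` arrives or after max_polls polls without a matching ack. */
int send_then_close(const struct ack_transport *t, const void *buf,
                    size_t len, uint64_t seq, unsigned max_polls)
{
    int result = SEND_FAILED;

    /* Post the ack buffer first so the ack cannot arrive with no buffer posted. */
    if (t->post_ack_recv(t->ctx) == 0 &&
        t->send_data(t->ctx, buf, len, seq) == 0) {
        result = SEND_UNCONFIRMED;
        for (unsigned i = 0; i < max_polls; i++) {
            uint64_t got = 0;
            int rc = t->poll_ack(t->ctx, &got);
            if (rc < 0) { result = SEND_FAILED; break; }
            if (rc == 1 && got == seq) { result = SEND_ACKED; break; }
        }
    }
    t->close_ep(t->ctx); /* teardown happens only after the wait above */
    return result;
}

/* In-memory stand-in for a peer, for demonstration only. */
struct fake_peer {
    unsigned polls_before_ack; /* empty polls before the ack shows up */
    uint64_t last_seq;         /* sequence number carried by the last send */
    int recv_posted, sent, closed;
};

static int fake_post(void *c) { ((struct fake_peer *)c)->recv_posted = 1; return 0; }
static int fake_send(void *c, const void *b, size_t n, uint64_t seq) {
    struct fake_peer *p = c;
    (void)b; (void)n;
    p->last_seq = seq;
    p->sent = 1;
    return 0;
}
static int fake_poll(void *c, uint64_t *seq_out) {
    struct fake_peer *p = c;
    if (p->polls_before_ack > 0) { p->polls_before_ack--; return 0; }
    *seq_out = p->last_seq;
    return 1;
}
static void fake_close(void *c) { ((struct fake_peer *)c)->closed = 1; }

/* Build a transport whose hooks all operate on the fake peer. */
struct ack_transport fake_transport(struct fake_peer *p) {
    struct ack_transport t = { fake_post, fake_send, fake_poll, fake_close, p };
    return t;
}
```

The sketch closes the endpoint even when no ack arrives, and reports `SEND_UNCONFIRMED` so the caller knows delivery was not confirmed. A real implementation would likely bound the wait by elapsed time rather than a poll count, and would also check the CQ for error completions.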
I'm using the verbs provider.

@sunkuamzn Would you be able to share some implementation details on how EFA ensures a receive completion in the case of a fast sender-side endpoint teardown? I’d be very interested to understand how that logic is structured internally.

@ooststep Just to clarify: does FI_DELIVERY_COMPLETE imply that a send-side completion entry will only be generated after the corresponding receive-side completion has been posted in the receiver’s CQ?

Regarding the higher-layer synchronization mechanisms you mentioned: if I understand correctly, in order to implement a reliable mechanism (for example, a barrier) that guarantees the receiver has actually received and processed the data, I either need to use a provider that supports FI_DELIVERY_COMPLETE, or use a different library/framework that provides such delivery guarantees. Is that correct?
@piotrchmiel The EFA device guarantees delivery of packets. The Libfabric EFA provider can split an
I’m using libfabric with endpoints operating in FI_TRANSMIT_COMPLETE mode (as FI_DELIVERY_COMPLETE is unavailable). I'm using fi_send/fi_recv.
I’d like to ask:
What is the correct and safe way to remove an endpoint on the sender side such that the receiver always:
- receives the data, and
- gets a completion entry in its receive CQ?
I observe that if the sender removes the endpoint (closes it) shortly after reading the completion entry for fi_send, then on the receiver side, the completion for the corresponding fi_recv sometimes does not appear — this happens nondeterministically.
Is there any recommended synchronization or flushing mechanism to ensure the data is delivered and acknowledged before destroying the endpoint?