Conversation

@darrylabbate
Member

This adds support for fi_cq_sread{,from}, fi_control and fi_signal for the efa fabric. This also expands EFA's fi_trywait() to handle wait objects for Util CQs.

Since the endpoint needs to progress in order to generate completions to read from, an exponential backoff strategy is employed: progress is driven periodically without excessively spinning the CPU in higher-latency scenarios. The timeout specified by the user caps the total time spent waiting for completions.
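
A minimal sketch of that strategy, for illustration only (this is not the actual EFA implementation; it assumes the CQ exposes a wait FD via FI_GETWAIT, and the 100 ms cap is a made-up constant, while the 1 ms floor matches the minimum interval used in this PR):

```c
#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Illustrative sketch only; not the actual efa_rdm_cq_sreadfrom() code. */
static ssize_t sread_with_backoff(struct fid_cq *cq, void *buf, size_t count,
				  int timeout_ms)
{
	int fd, interval = 1, elapsed = 0;	/* 1 ms floor, per this PR */
	struct pollfd pfd;
	ssize_t ret;

	ret = fi_control(&cq->fid, FI_GETWAIT, &fd);
	if (ret)
		return ret;

	pfd.fd = fd;
	pfd.events = POLLIN;

	for (;;) {
		/* Drive progress and check for completions */
		ret = fi_cq_read(cq, buf, count);
		if (ret != -FI_EAGAIN)
			return ret;	/* completions read, or a real error */

		/* The user's timeout caps the total time spent waiting */
		if (timeout_ms >= 0 && elapsed >= timeout_ms)
			return -FI_EAGAIN;

		/* Block on the wait object between retries; no busy spin */
		poll(&pfd, 1, interval);
		elapsed += interval;
		if (interval < 100)	/* exponential backoff, capped */
			interval *= 2;
	}
}
```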

@darrylabbate darrylabbate requested review from a team and j-xiong October 31, 2025 20:53
@darrylabbate
Copy link
Member Author

CCing @j-xiong for the fi_tostr datatypes additions

j-xiong previously approved these changes Oct 31, 2025
Contributor

@j-xiong j-xiong left a comment


The fi_tostr changes look good to me.


@shijin-aws shijin-aws left a comment


May I know if this PR is following some existing code examples in other providers? The logic looks reasonable to me, but I am not super familiar with the ofi_wait functions yet.

			return ret;
		util_cq = container_of(fids[i], struct util_cq, cq_fid.fid);

		/* RDM CQs use util wait objects, not hardware CQ events */
Contributor

What are the benefits of using a util wait object instead of an EFA device interrupt to implement wait? Is it because there can be too many device-level completions that map to only one libfabric completion?

Contributor

The major requirement of the blocking CQ read is to avoid burning the CPU in the busy loop the application would otherwise run when sread is not available. Are we confident this approach meets that requirement? cc @bwbarrett

Member Author

My understanding is we'd need SHM to expose some kind of wait object (e.g. FI_WAIT_FD) in order to implement any kind of multiplexing with EFA device interrupts. I'd be more than happy to engineer a solution but I figured that would be out of scope given the context/urgency of the feature. Could we consider this a follow-up/future improvement?

Short of that, don't we need to periodically drive progress or risk missing completions? I figured the exponential backoff strategy with a minimum interval of 1ms was appropriate since it matches backoff strategies we employ elsewhere. The ofi_wait() calls block on the util CQ FD between retries, so it's not "spinning," per se.

@darrylabbate
Member Author

May I know if this PR is following some existing code examples in other providers? The logic looks reasonable to me, but I am not super familiar with the ofi_wait functions yet.

I referenced util, CXI, and a few others. I also originally added some defensive logic for a wait object for the SHM CQ FD here, but I opted to exclude it from the PR since it's purely future-proofing; SHM doesn't support FI_GETWAIT on the CQ.

		/* RDM CQs use util wait objects, not hardware CQ events */
		if (util_cq->wait) {
			rdm_cq = container_of(util_cq, struct efa_rdm_cq, efa_cq.util_cq);
			ret = efa_rdm_cq_trywait(rdm_cq);
Contributor

How will this work on a single node when shm is on? AFAICT it only calls efa_rdm_cq_progress, which doesn't progress the shm CQ.

Member Author

efa_rdm_cq_trywait() should now be progressing the SHM CQ via fi_cq_read()

Contributor

@shijin-aws shijin-aws Nov 14, 2025

Now this seems to contradict your earlier fabtests change that added progress to the waitfd code path. If we think calling a progress function outside fi_trywait is required for a manual-progress provider, do we still need a progress function inside fi_trywait?

Member Author

fi_trywait won't process SHM completions and move them into the util CQ. My comment in efa_rdm_cq_trywait is probably misleading.

Contributor

fi_trywait won't process SHM completions and move them into the util CQ.

Are you saying fi_trywait for a shm CQ (no efa) will not progress shm completions, so you need that logic in fabtests? With your PR, fi_trywait for an efa CQ (+shm as peer) will progress both, right?
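
For reference, the canonical pattern from the fi_trywait man page, which is essentially what this thread is debating (a sketch with error handling elided; fabric, cq, and fd are whatever the application opened and retrieved):

```c
#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Only block on the native wait object once fi_trywait says it is safe;
 * -FI_EAGAIN means completions may be pending, so read the CQ and retry. */
static int wait_for_cq(struct fid_fabric *fabric, struct fid_cq *cq,
		       int fd, int timeout_ms)
{
	struct fid *fids[1] = { &cq->fid };
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (fi_trywait(fabric, fids, 1) == FI_SUCCESS)
		return poll(&pfd, 1, timeout_ms);

	return 0;	/* caller should drain the CQ before blocking again */
}
```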

	}

	/* Progress then wait */
	efa_rdm_cq_progress(util_cq);
Contributor

This only progresses efa, not shm.

Member Author

This was a miss. I'll add it here.

Contributor

If you change it to something like fi_cq_read(efa, NULL, 0), it should progress both efa and shm.
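
For reference, the zero-count read being suggested is a common libfabric idiom for driving a provider's progress engine without dequeuing anything; roughly:

```c
#include <rdma/fi_eq.h>

/* Drive progress on the CQ (EFA plus its SHM peer) without consuming
 * completions; -FI_EAGAIN just means nothing is currently pending. */
static int drive_progress(struct fid_cq *cq)
{
	ssize_t ret = fi_cq_read(cq, NULL, 0);

	return (ret < 0 && ret != -FI_EAGAIN) ? (int)ret : 0;
}
```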

Member Author

Ack, replaced

Contributor

Is it updated yet?

Member Author

Sorry, I thought I was commenting on a different line. The efa_rdm_cq_readfrom() call at the top of the loop is already progressing and flushing the SHM completions into the util CQ. Then we progress the EFA CQ to generate completions.

Contributor

Yes, but you may want to be consistent; efa_rdm_cq_progress doesn't really progress the whole CQ.

Contributor

@shijin-aws shijin-aws Nov 21, 2025

Bumping this... efa_rdm_cq_progress only progresses EFA endpoints. If we want to progress the whole state machine here, it should be fi_cq_read(NULL, 0); if we do not need progress here (because there is a cq_readfrom earlier already), we can just remove it.

@darrylabbate darrylabbate force-pushed the feat/efa/rdm/cq-sread branch 3 times, most recently from 23446c1 to 5ac11a6 Compare November 10, 2025 19:45
@darrylabbate
Member Author

Added 65c9de6 after the previous changes exposed a potential use-after-free issue upon SHM CQ init failure. More of a "nice-to-have," since the Libfabric spec doesn't make guarantees about the value of the CQ FID upon cq_open failure.


	domain = container_of(cq->efa_cq.util_cq.domain, struct efa_domain, util_domain);

	if (cq->shm_cq) {
Contributor

Why not just do fi_cq_read(NULL, 0) for efa? It should progress both efa and shm.

Member Author

Ack, replaced.

@darrylabbate
Member Author

darrylabbate commented Nov 10, 2025

Currently, any fabtests that explicitly test fd completions (-c fd) will hang since they unconditionally block on the provided FD before calling fi_cq_read(). Probably need to modify the shared code to call fi_cq_sread() or spawn a progress thread when the provider advertises FI_PROGRESS_MANUAL.
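
A rough sketch of the progress-thread option (hypothetical names such as progress_stop and progress_thread_fn, not the actual fabtests shared code):

```c
#include <sched.h>
#include <stdatomic.h>
#include <rdma/fi_eq.h>

/* Keep a manual-progress provider moving while the main thread blocks on
 * the wait FD; start via pthread_create() when FI_PROGRESS_MANUAL is set. */
static atomic_bool progress_stop;

static void *progress_thread_fn(void *arg)
{
	struct fid_cq *cq = arg;

	while (!atomic_load(&progress_stop)) {
		/* Zero-count read drives progress without stealing the
		 * completions the main thread reads after its FD wakes. */
		(void) fi_cq_read(cq, NULL, 0);
		sched_yield();	/* avoid monopolizing a core */
	}
	return NULL;
}
```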

@darrylabbate darrylabbate marked this pull request as draft November 10, 2025 21:06
@shijin-aws
Contributor

Currently, any fabtests that explicitly test fd completions (-c fd) will hang since they unconditionally block on the provided FD before calling fi_cq_read(). Probably need to modify the shared code to call fi_cq_sread() or spawn a progress thread when the provider advertises FI_PROGRESS_MANUAL.

If we do not need shm support for efa rdm blocking cq read right now, would it be easier to turn off shm when wait is requested in cq open?

@darrylabbate
Member Author

If we do not need shm support for efa rdm blocking cq read right now, would it be easier to turn off shm when wait is requested in cq open?

Sure, let's try that

@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

As I suspected, I don't think the SHM CQ is the issue here. Disabling SHM has no effect.

The actual issue is that AFAICT, FI_WAIT_FD is only possible for manual progress providers with a separate application thread driving progress. ft_fdwait_comp doesn't currently account for this. I ran into race condition issues with the CQ polling procedure trying to add this logic (surprise!), but I'll dig into it further.

The latest commits I've pushed disable fabtests in the EFA pytest suite with the fd completion method and the efa fabric. The fabtests code will need to be updated before this is merged.

@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

Looks like my changes to skip fabtests with -c fd -f efa didn't get picked up, so those failures are expected for now.

ETA: that'd be because I didn't add the logic for the RMA BW tests. I thought I implemented the skip for those but not RDM pingpong. Doesn't matter.

@darrylabbate darrylabbate marked this pull request as ready for review November 11, 2025 22:49
@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

CC @j-xiong and/or @aingerson for the fabtests change: 9e7eb4c 1610850

@darrylabbate darrylabbate force-pushed the feat/efa/rdm/cq-sread branch 3 times, most recently from b1a045d to bb5b409 Compare November 12, 2025 18:55
@darrylabbate darrylabbate requested a review from a team November 13, 2025 21:03

	while (total - *cur > 0) {
		ret = fi_cq_sread(cq, &comp, 1, NULL, timeout);
		if (ft_fdwait_need_progress)
			ret = fi_cq_read(cq, &comp, 1);
Contributor

I am not sure why we need this... cq_sread should progress completions for a manual-progress provider anyway. Also, sread is designed for the FI_WAIT_UNSPEC wait object; we shouldn't have an fdwait leak here even if the provider uses wait_fd to implement sread.

Member Author

This is solely a workaround for -U when the provider advertises manual progress. But now I'm thinking -U may not be valid for manual progress.

Contributor

Not sure I follow; -U runs fabtests with FI_DELIVERY_COMPLETE set. Is it relevant here?

Member Author

Delivery-complete sends only complete after the peer consumes them. If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.
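
To make the failure mode concrete, a hypothetical sketch of what both peers effectively do under -U (illustrative names, not fabtests code):

```c
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Both peers run this with FI_DELIVERY_COMPLETE: each blocks on its TX CQ,
 * nobody drives RX progress, so neither DC completion is ever generated. */
static int send_and_wait(struct fid_ep *ep, struct fid_cq *txcq,
			 const void *buf, size_t len, fi_addr_t peer)
{
	struct fi_cq_entry comp;
	ssize_t ret;

	ret = fi_send(ep, buf, len, NULL, peer, NULL);
	if (ret)
		return (int)ret;

	/* Infinite timeout: without periodic progress this never returns */
	ret = fi_cq_sread(txcq, &comp, 1, NULL, -1);
	return ret < 0 ? (int)ret : 0;
}
```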

Contributor

If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.

How to implement DC is provider-specific; the application doesn't have to poll its RX CQ for the peer's TX CQ to get the delivery-completion ack. The EFA provider fills this gap by making the libfabric TX CQ poll also poll the RX device CQ internally. So the behavior you described should never happen?

Contributor

But you raised a good question. Today, the receiver needs to poll its CQ so the sender can get completions for DC. I haven't checked the man page for whether that is legal behavior for manual progress.

Member Author

@darrylabbate darrylabbate Nov 21, 2025

The deadlock is technically compliant from the provider's perspective.

I see what happened here. I removed the unconditional exponential backoff when the wait object is FI_WAIT_FD after your initial comments re: spinning the CPU. For both the fd and sread completion methods, the wait object is set to FI_WAIT_FD, so given an infinite timeout, it will hang forever after the initial manual progress. Perhaps for an infinite timeout, we fall back to the exponential backoff? On second thought, I think the deadlock may ultimately be more "correct" here. As you mentioned earlier, the app probably has a good reason to utilize the blocking CQ read for resource efficiency, so this seems like a necessary tradeoff to that end.


man/fi_efa.7.md Outdated
of RDM endpoint.

The `efa` fabric of RDM endpoint supports *FI_WAIT_FD* and *FI_WAIT_NONE* wait
objects for blocking CQ operations (*fi_cq_sread*). Applications should use
Contributor

Also FI_WAIT_UNSPEC?

Member Author

Added


	efa_rdm_cq_progress(util_cq);

	/* Check if progress produced completions before blocking */
	if (OFI_LIKELY(!ofi_cirque_isempty(util_cq->cirq)))
Contributor

Ugh, so you have an is-empty check here... then a progress call is required before this?

Member Author

Removed this and the preceding progress call

This detects when manual progress is required (FI_PROGRESS_MANUAL
with FT_COMP_WAIT_FD, FT_COMP_SREAD, or FT_COMP_YIELD) and periodically
calls fi_cq_read() to drive progress while waiting for completions.

Signed-off-by: Darryl Abbate <[email protected]>
This prevents a dangling CQ FID pointer upon CQ open failure. While it's
not strictly against the Libfabric spec, it's probably best not to leave
the CQ FID set after a failure.

Signed-off-by: Darryl Abbate <[email protected]>
This adds support for `fi_cq_sread{,from}`, `fi_control` and `fi_signal`
for the `efa` fabric. This also expands EFA's `fi_trywait()` to handle
wait objects for Util CQs.

Since the endpoint needs to progress in order to generate completions to
read from, an exponential backoff strategy is employed: progress is
driven periodically without excessively spinning the CPU in
higher-latency scenarios. The `timeout` specified by the user caps the
total time spent waiting for completions.

A CQ requested with FI_WAIT_FD indicates the app intends to drive
progress itself as needed, since EFA advertises FI_PROGRESS_MANUAL. In
this scenario, sreadfrom will block without periodically driving
progress.

Signed-off-by: Darryl Abbate <[email protected]>
Contributor

@shijin-aws shijin-aws left a comment

@j-xiong can you review 8d02d10

Comment on lines +1910 to +1914
	if (remaining < 0)
		return FT_FDWAIT_PROGRESS_INTERVAL_MS;

	interval = MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);
	return interval > 0 ? interval : FT_FDWAIT_PROGRESS_INTERVAL_MS;
Contributor

Can be simplified as:

	if (remaining <= 0)
		return FT_FDWAIT_PROGRESS_INTERVAL_MS;

	return MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);

And the interval variable can be removed.

Comment on lines +2814 to +2816
	int wait_timeout = ft_fdwait_need_progress
		? ft_fdwait_poll_timeout(remaining)
		: remaining;
Contributor

The same condition is checked inside ft_fdwait_poll_timeout(); this variable can be eliminated and ft_fdwait_poll_timeout(remaining) used directly.

Comment on lines -2796 to +2848
return 0;
return ret;
Contributor

No error path reaches here; this can just return 0.

	} else if (ret < 0 && ret != -FI_EAGAIN) {
		return ret;
	}
	ret = 0;
Contributor

This becomes unnecessary as well.
