Conversation

@darrylabbate
Member

This adds support for fi_cq_sread{,from}, fi_control and fi_signal for the efa fabric. This also expands EFA's fi_trywait() to handle wait objects for Util CQs.

Since the endpoint needs to progress in order to generate completions to read from, an exponential backoff strategy is employed: progress is driven periodically without excessively spinning the CPU in higher-latency scenarios. The timeout specified by the user caps the total time spent waiting for completions.
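
A minimal sketch of that strategy, for illustration only (this is not the actual EFA implementation; it assumes the CQ exposes a wait FD via FI_GETWAIT, and the 100 ms cap is a made-up constant, while the 1 ms floor matches the minimum interval used in this PR):

```c
#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Illustrative sketch only; not the actual efa_rdm_cq_sreadfrom() code. */
static ssize_t sread_with_backoff(struct fid_cq *cq, void *buf, size_t count,
				  int timeout_ms)
{
	int fd, interval = 1, elapsed = 0;	/* 1 ms floor, per this PR */
	struct pollfd pfd;
	ssize_t ret;

	ret = fi_control(&cq->fid, FI_GETWAIT, &fd);
	if (ret)
		return ret;

	pfd.fd = fd;
	pfd.events = POLLIN;

	for (;;) {
		/* Drive progress and check for completions */
		ret = fi_cq_read(cq, buf, count);
		if (ret != -FI_EAGAIN)
			return ret;	/* completions read, or a real error */

		/* The user's timeout caps the total time spent waiting */
		if (timeout_ms >= 0 && elapsed >= timeout_ms)
			return -FI_EAGAIN;

		/* Block on the wait object between retries; no busy spin */
		poll(&pfd, 1, interval);
		elapsed += interval;
		if (interval < 100)	/* exponential backoff, capped */
			interval *= 2;
	}
}
```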

@darrylabbate darrylabbate requested review from a team and j-xiong October 31, 2025 20:53
@darrylabbate
Copy link
Member Author

CCing @j-xiong for the fi_tostr datatypes additions

j-xiong previously approved these changes Oct 31, 2025
Contributor

@j-xiong j-xiong left a comment


The fi_tostr changes look good to me.


@shijin-aws shijin-aws left a comment


May I know if this PR is following some existing code examples in other providers? The logic looks reasonable to me, but I am not super familiar with the ofi_wait functions yet.

			return ret;
		util_cq = container_of(fids[i], struct util_cq, cq_fid.fid);

		/* RDM CQs use util wait objects, not hardware CQ events */
Contributor

What are the benefits of using a util wait object instead of an EFA device interrupt to implement wait? Is it because there can be too many device-level completions that map to only one libfabric completion?

Contributor

The major requirement of the blocking CQ read is to avoid burning the CPU in the busy loop the application would otherwise run when sread is not available. Are we confident this approach meets that requirement? cc @bwbarrett

Member Author

My understanding is we'd need SHM to expose some kind of wait object (e.g. FI_WAIT_FD) in order to implement any kind of multiplexing with EFA device interrupts. I'd be more than happy to engineer a solution but I figured that would be out of scope given the context/urgency of the feature. Could we consider this a follow-up/future improvement?

Short of that, don't we need to periodically drive progress or risk missing completions? I figured the exponential backoff strategy with a minimum interval of 1ms was appropriate since it matches backoff strategies we employ elsewhere. The ofi_wait() calls block on the util CQ FD between retries, so it's not "spinning," per se.

@darrylabbate
Member Author

May I know if this PR is following some existing code examples in other providers? The logic looks reasonable to me, but I am not super familiar with the ofi_wait functions yet.

I referenced util, CXI, and a few others. I also originally added some defensive logic for a wait object for the SHM CQ FD here, but I opted to exclude it from the PR since it's purely future-proofing; SHM doesn't support FI_GETWAIT on the CQ.

		/* RDM CQs use util wait objects, not hardware CQ events */
		if (util_cq->wait) {
			rdm_cq = container_of(util_cq, struct efa_rdm_cq, efa_cq.util_cq);
			ret = efa_rdm_cq_trywait(rdm_cq);
Contributor

How will this work on a single node when shm is on? AFAICT it only calls efa_rdm_cq_progress, which doesn't progress the shm CQ.

Member Author

efa_rdm_cq_trywait() should now be progressing the SHM CQ via fi_cq_read()

Contributor

@shijin-aws shijin-aws Nov 14, 2025

Now this seems to contradict your earlier fabtests change that added progress to the waitfd code path. If we think calling a progress function outside fi_trywait is required for a manual-progress provider, do we still need a progress function inside fi_trywait?

Member Author

fi_trywait won't process SHM completions and move them into the util CQ. My comment in efa_rdm_cq_trywait is probably misleading.

Contributor

fi_trywait won't process SHM completions and move them into the util CQ.

Are you saying fi_trywait for a shm CQ (no efa) will not progress shm completions, so you need that logic in fabtests? With your PR, fi_trywait for an efa CQ (+shm as peer) will progress both, right?
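
For reference, the canonical pattern from the fi_trywait man page, which is essentially what this thread is debating (a sketch with error handling elided; fabric, cq, and fd are whatever the application opened and retrieved):

```c
#include <poll.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

/* Only block on the native wait object once fi_trywait says it is safe;
 * -FI_EAGAIN means completions may be pending, so read the CQ and retry. */
static int wait_for_cq(struct fid_fabric *fabric, struct fid_cq *cq,
		       int fd, int timeout_ms)
{
	struct fid *fids[1] = { &cq->fid };
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (fi_trywait(fabric, fids, 1) == FI_SUCCESS)
		return poll(&pfd, 1, timeout_ms);

	return 0;	/* caller should drain the CQ before blocking again */
}
```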

	}

	/* Progress then wait */
	efa_rdm_cq_progress(util_cq);
Contributor

This only progresses efa, not shm.

Member Author

This was a miss. I'll add it here.

Contributor

If you change it to something like fi_cq_read(efa, NULL, 0), it should progress both efa and shm.
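
For reference, the zero-count read being suggested is a common libfabric idiom for driving a provider's progress engine without dequeuing anything; roughly:

```c
#include <rdma/fi_eq.h>

/* Drive progress on the CQ (EFA plus its SHM peer) without consuming
 * completions; -FI_EAGAIN just means nothing is currently pending. */
static int drive_progress(struct fid_cq *cq)
{
	ssize_t ret = fi_cq_read(cq, NULL, 0);

	return (ret < 0 && ret != -FI_EAGAIN) ? (int)ret : 0;
}
```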

Member Author

Ack, replaced

Contributor

Is it updated yet?

Member Author

Sorry, I thought I was commenting on a different line. The efa_rdm_cq_readfrom() call at the top of the loop is already progressing and flushing the SHM completions into the util CQ. Then we progress the EFA CQ to generate completions.

Contributor

Yes, but you may want to be consistent; efa_rdm_cq_progress doesn't really progress the whole CQ.

Contributor

@shijin-aws shijin-aws Nov 21, 2025

Bumping this... efa_rdm_cq_progress only progresses EFA endpoints. If we want to progress the whole state machine here, it should be fi_cq_read(NULL, 0); if we do not need progress here (because there is a cq_readfrom earlier already), we can just remove it.

@darrylabbate darrylabbate force-pushed the feat/efa/rdm/cq-sread branch 3 times, most recently from 23446c1 to 5ac11a6 Compare November 10, 2025 19:45
@darrylabbate
Member Author

Added 65c9de6 after the previous changes exposed a potential use-after-free issue upon SHM CQ init failure. More of a "nice-to-have," since the Libfabric spec doesn't make guarantees about the value of the CQ FID upon cq_open failure.


	domain = container_of(cq->efa_cq.util_cq.domain, struct efa_domain, util_domain);

	if (cq->shm_cq) {
Contributor

Why not just do fi_cq_read(NULL, 0) for efa? It should progress both efa and shm.

Member Author

Ack, replaced.

@darrylabbate
Member Author

darrylabbate commented Nov 10, 2025

Currently, any fabtests that explicitly test fd completions (-c fd) will hang since they unconditionally block on the provided FD before calling fi_cq_read(). Probably need to modify the shared code to call fi_cq_sread() or spawn a progress thread when the provider advertises FI_PROGRESS_MANUAL.
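
A rough sketch of the progress-thread option (hypothetical names such as progress_stop and progress_thread_fn, not the actual fabtests shared code):

```c
#include <sched.h>
#include <stdatomic.h>
#include <rdma/fi_eq.h>

/* Keep a manual-progress provider moving while the main thread blocks on
 * the wait FD; start via pthread_create() when FI_PROGRESS_MANUAL is set. */
static atomic_bool progress_stop;

static void *progress_thread_fn(void *arg)
{
	struct fid_cq *cq = arg;

	while (!atomic_load(&progress_stop)) {
		/* Zero-count read drives progress without stealing the
		 * completions the main thread reads after its FD wakes. */
		(void) fi_cq_read(cq, NULL, 0);
		sched_yield();	/* avoid monopolizing a core */
	}
	return NULL;
}
```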

@darrylabbate darrylabbate marked this pull request as draft November 10, 2025 21:06
@shijin-aws
Contributor

Currently, any fabtests that explicitly test fd completions (-c fd) will hang since they unconditionally block on the provided FD before calling fi_cq_read(). Probably need to modify the shared code to call fi_cq_sread() or spawn a progress thread when the provider advertises FI_PROGRESS_MANUAL.

If we do not need shm support for efa rdm blocking cq read right now, would it be easier to turn off shm when wait is requested in cq open?

@darrylabbate
Member Author

If we do not need shm support for efa rdm blocking cq read right now, would it be easier to turn off shm when wait is requested in cq open?

Sure, let's try that

@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

As I suspected, I don't think the SHM CQ is the issue here. Disabling SHM has no effect.

The actual issue is that AFAICT, FI_WAIT_FD is only possible for manual progress providers with a separate application thread driving progress. ft_fdwait_comp doesn't currently account for this. I ran into race condition issues with the CQ polling procedure trying to add this logic (surprise!), but I'll dig into it further.

The latest commits I've pushed disable fabtests in the EFA pytest suite with the fd completion method and the efa fabric. The fabtests code will need to be updated before this is merged.

@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

Looks like my changes to skip fabtests with -c fd -f efa didn't get picked up, so those failures are expected for now.

ETA: that'd be because I didn't add the logic for the RMA BW tests. I thought I implemented the skip for those but not RDM pingpong. Doesn't matter.

@darrylabbate darrylabbate marked this pull request as ready for review November 11, 2025 22:49
@darrylabbate
Member Author

darrylabbate commented Nov 11, 2025

CC @j-xiong and/or @aingerson for the fabtests change: 9e7eb4c 1610850

@darrylabbate darrylabbate force-pushed the feat/efa/rdm/cq-sread branch 3 times, most recently from b1a045d to bb5b409 Compare November 12, 2025 18:55
@darrylabbate darrylabbate requested a review from a team November 13, 2025 21:03

	while (total - *cur > 0) {
		ret = fi_cq_sread(cq, &comp, 1, NULL, timeout);
		if (ft_fdwait_need_progress)
			ret = fi_cq_read(cq, &comp, 1);
Contributor

I am not sure why we need this... cq_sread should progress completions for a manual-progress provider anyway. Also, sread is designed for the FI_WAIT_UNSPEC wait object; we shouldn't have an fdwait leak here even if the provider uses wait_fd to implement sread.

Member Author

This is solely a workaround for -U when the provider advertises manual progress. But now I'm thinking -U may not be valid for manual progress.

Contributor

Not sure I follow; -U runs fabtests with FI_DELIVERY_COMPLETE set. Is it relevant here?

Member Author

Delivery-complete sends only complete after the peer consumes them. If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.
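
To make the failure mode concrete, a hypothetical sketch of what both peers effectively do under -U (illustrative names, not fabtests code):

```c
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Both peers run this with FI_DELIVERY_COMPLETE: each blocks on its TX CQ,
 * nobody drives RX progress, so neither DC completion is ever generated. */
static int send_and_wait(struct fid_ep *ep, struct fid_cq *txcq,
			 const void *buf, size_t len, fi_addr_t peer)
{
	struct fi_cq_entry comp;
	ssize_t ret;

	ret = fi_send(ep, buf, len, NULL, peer, NULL);
	if (ret)
		return (int)ret;

	/* Infinite timeout: without periodic progress this never returns */
	ret = fi_cq_sread(txcq, &comp, 1, NULL, -1);
	return ret < 0 ? (int)ret : 0;
}
```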

Contributor

If both EPs post sends and immediately block on their TX CQ via fi_cq_sread(), no thread ever calls into the RX CQ, so neither side makes progress. The periodic progress forces each side to re-enter libfabric, drain the RX CQ, and let the delivery-complete TX CQE appear. Under transmit-complete, the TX CQE is generated locally, so sread eventually returns without any extra help.

How to implement DC is provider-specific; the application doesn't have to poll its RX CQ for the peer's TX CQ to get the delivery-completion ack. The EFA provider fills this gap by making the libfabric TX CQ poll also poll the RX device CQ internally. So the behavior you described should never happen?

Contributor

But you raised a good question. Today, the receiver needs to poll its CQ so the sender can get completions for DC. I haven't checked the man page for whether that is legal behavior for manual progress.

Member Author

@darrylabbate darrylabbate Nov 21, 2025

The deadlock is technically compliant from the provider's perspective.

I see what happened here. I removed the unconditional exponential backoff when the wait object is FI_WAIT_FD after your initial comments re: spinning the CPU. For both the fd and sread completion methods, the wait object is set to FI_WAIT_FD, so given an infinite timeout, it will hang forever after the initial manual progress. Perhaps for an infinite timeout, we fall back to the exponential backoff? On second thought, I think the deadlock may ultimately be more "correct" here. As you mentioned earlier, the app probably has a good reason to utilize the blocking CQ read for resource efficiency, so this seems like a necessary tradeoff to that end.


man/fi_efa.7.md Outdated
of RDM endpoint.

The `efa` fabric of RDM endpoint supports *FI_WAIT_FD* and *FI_WAIT_NONE* wait
objects for blocking CQ operations (*fi_cq_sread*). Applications should use
Contributor

Also FI_WAIT_UNSPEC?

Member Author

Added


	efa_rdm_cq_progress(util_cq);

	/* Check if progress produced completions before blocking */
	if (OFI_LIKELY(!ofi_cirque_isempty(util_cq->cirq)))
Contributor

Ugh, so you have an is-empty check here... then a progress call is required before this?

Member Author

Removed this and the preceding progress call

This detects when manual progress is required (FI_PROGRESS_MANUAL
with FT_COMP_WAIT_FD, FT_COMP_SREAD, or FT_COMP_YIELD) and periodically
calls fi_cq_read() to drive progress while waiting for completions.

Signed-off-by: Darryl Abbate <[email protected]>
This prevents a dangling CQ FID pointer upon CQ open failure. While it's
not strictly against the Libfabric spec, it's probably best not to leave
the CQ FID set after a failure.

Signed-off-by: Darryl Abbate <[email protected]>
This adds support for `fi_cq_sread{,from}`, `fi_control` and `fi_signal`
for the `efa` fabric. This also expands EFA's `fi_trywait()` to handle
wait objects for Util CQs.

Since the endpoint needs to progress in order to generate completions to
read from, an exponential backoff strategy is employed: progress is
driven periodically without excessively spinning the CPU in
higher-latency scenarios. The `timeout` specified by the user caps the
total time spent waiting for completions.

A CQ requested with FI_WAIT_FD indicates the app intends to drive
progress itself as needed, since EFA advertises FI_PROGRESS_MANUAL. In
this scenario, sreadfrom will block without periodically driving
progress.

Signed-off-by: Darryl Abbate <[email protected]>
Contributor

@shijin-aws shijin-aws left a comment

@j-xiong can you review 8d02d10

Comment on lines +1910 to +1914
	if (remaining < 0)
		return FT_FDWAIT_PROGRESS_INTERVAL_MS;

	interval = MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);
	return interval > 0 ? interval : FT_FDWAIT_PROGRESS_INTERVAL_MS;
Contributor

Can be simplified as:

	if (remaining <= 0)
		return FT_FDWAIT_PROGRESS_INTERVAL_MS;

	return MIN(remaining, FT_FDWAIT_PROGRESS_INTERVAL_MS);

And the interval variable can be removed.

Comment on lines +2814 to +2816
	int wait_timeout = ft_fdwait_need_progress
		? ft_fdwait_poll_timeout(remaining)
		: remaining;
Contributor

The same condition is checked inside ft_fdwait_poll_timeout(); this variable can be eliminated and ft_fdwait_poll_timeout(remaining) used directly.

Comment on lines -2796 to +2848
return 0;
return ret;
Contributor

No error path reaches here; this can just return 0.

	} else if (ret < 0 && ret != -FI_EAGAIN) {
		return ret;
	}
	ret = 0;
Contributor

This becomes unnecessary as well.
