Skip to content

Conversation

sunkuamzn
Copy link
Contributor

The EFA provider AV keeps track of peers that any endpoint bound to the AV has received messages from but were not explicitly inserted into the AV by the application. The provider limits the number of such peers to prevent runaway memory usage.

When the EFA provider has to remove such an implicitly inserted peer, it also needs to remove any unmatched entries in the SRX queues associated with the peer. The EFA provider calls free_entry to remove the rx entry. But free_entry does not remove the rx entry from the unspec_unexp_[msg,tag]_queue dlist which corrupts the dlist.

This commit adds logic to remove any entry that is about to be freed from the unspec_unexp_[msg,tag]_queue.

The EFA provider AV keeps track of peers that any endpoint bound to the
AV has received messages from but were not explicitly inserted into the
AV by the application. The provider limits the number of such peers to
prevent runaway memory usage.

When the EFA provider has to remove such an implicitly inserted peer,
it also needs to remove any unmatched entries in the SRX queues
associated with the peer. The EFA provider calls free_entry to remove
the rx entry. But free_entry does not remove the rx entry from the
unspec_unexp_[msg,tag]_queue dlist which corrupts the dlist.

This commit adds logic to remove any entry that is about to be freed
from the unspec_unexp_[msg,tag]_queue.

Signed-off-by: Sai Sunku <[email protected]>
@sunkuamzn sunkuamzn requested review from aingerson and a team September 11, 2025 00:55
* the queue does not get corrupted. */
if (entry->addr == FI_ADDR_UNSPEC) {
dlist_remove(&util_entry->d_entry);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The braces can be got rid of here.

/* If free_entry was called but the addr is still FI_ADDR_UNSPEC, the rx
* entry must be in the unspec_unexp_[msg,tag]_queue. Remove it so that
* the queue does not get corrupted. */
if (entry->addr == FI_ADDR_UNSPEC) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if an fi_trecv is called with FI_ADDR_UNSPEC, the util_rx_entry will be created with addr as FI_ADDR_UNSPEC as well and put in the expected rx queue. Then is this logic still correct? I forgot when free_entry will be called, it's before matching or after, or both

@sunkuamzn
Copy link
Contributor Author

Discussed offline with @aingerson. This PR also needs to handle the case when the rx entry is matched to a packet. This is related to @shijin-aws's comment. There doesn't seem to be an easy fix that covers all cases.

My previous PR (#11401) is enough to stabilize our tests on the v2.3.x branch. So I'm removing the v2.3.x label from this PR to unblock the v2.3 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants