Skip to content

Conversation

@a-szegel
Copy link
Contributor

@a-szegel a-szegel commented Sep 30, 2025

There is still some work to do, but this makes our story better. The biggest thing this PR is missing is UT's which I didn't have time to add yet.

prov/efa: Stop 1-sided emulated protocols from abort on RX error

Emulated write, read and atomic are single sided operations. They should
never cause an abort on an RX side error because the RX side is "should
not" be involved in the protocol that we are emulating.  We emulate due
to the NIC not supporting the protocols, and we need to emulate the
functionality as if the NIC did support the protocols.

This change modifies the emulated write protocols, emulated read
protocols, and emulated atomics to not abort when an error happens on
the RX side. Error CQ's should only be written to the target CQ if write
w/imm is used, otherwise, the operation should fail silently on the RX
side and a NACK should be delivered to the TX side to indicate a failure
of the operation depending on the TX's requested completion level.

Signed-off-by: Seth Zegelstein <[email protected]>

prov/efa: Slide recv-win on RTM/RTA error

In order to continue processing future packets correctly, we need to
progress the recv window when a RTM or RTA packet has been successfully
delivered. If we fail processing a packet, we need to write the error to
the CQ (when applicable), and continue to do work. Returning early here
makes our recv window index be off by 1.

Signed-off-by: Seth Zegelstein <[email protected]>

EFA_RDM_LONGREAD_RTA_PKT was supposed to be EFA_RDM_LONGREAD_RTW_PKT in
comment.

Signed-off-by: Seth Zegelstein <[email protected]>
In order to continue processing future packets correctly, we need to
progress the recv window when a RTM or RTA packet has been successfully
delivered. If we fail processing a packet, we need to write the error to
the CQ (when applicable), and continue to do work. Returning early here
makes our recv window index be off by 1.

Signed-off-by: Seth Zegelstein <[email protected]>
This commit adds bool should_abort to efa_base_ep_write_eq_error() which
allows the caller to dictate whether the function should abort in the
case that the user has not bound an EQ.  This commit sets
should_abort=true everywhere, so it should not be a functional change.

Signed-off-by: Seth Zegelstein <[email protected]>
Emulated write, read and atomic are single sided operations. They should
never cause an abort on an RX side error because the RX side is "should
not" be involved in the protocol that we are emulating.  We emulate due
to the NIC not supporting the protocols, and we need to emulate the
functionality as if the NIC did support the protocols.

This change modifies the emulated write protocols, emulated read
protocols, and emulated atomics to not abort when an error happens on
the RX side. Error CQ's should only be written to the target CQ if write
w/imm is used, otherwise, the operation should fail silently on the RX
side and a NACK should be delivered to the TX side to indicate a failure
of the operation depending on the TX's requested completion level.

Signed-off-by: Seth Zegelstein <[email protected]>
@a-szegel a-szegel requested a review from a team September 30, 2025 19:40
* @param[in] prov_errno provider error code
* @param[in] should_abort If true, aborts when no EQ is available
*/
void efa_base_ep_write_eq_error(struct efa_base_ep *ep, ssize_t err, ssize_t prov_errno)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we need this commit at all, we shouldn't even call this function when the tx error can be ignored

EFA_WARN(FI_LOG_CQ,
"Writing eq error for rxe from internal operations\n");
efa_base_ep_write_eq_error(&ep->base_ep, err, prov_errno, true);
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the last commit said, we shouldn't even call efa_base_ep_write_eq_error

ep_addr_str);

efa_base_ep_write_eq_error(&ep->base_ep, err, prov_errno, true);
if (efa_rdm_pkt_type_of(pkt_entry) == EFA_RDM_WRITE_RTA_PKT ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any rx error are unexpected, and I doubt whether you can really derive that pkt type here since the rx is failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the RX error to be handled for 1 sided emulated atomics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants