-
Notifications
You must be signed in to change notification settings - Fork 456
prov/efa: Improve emulated 1 sided protocol error case behavior #11452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
prov/efa: Improve emulated 1 sided protocol error case behavior #11452
Conversation
EFA_RDM_LONGREAD_RTA_PKT was supposed to be EFA_RDM_LONGREAD_RTW_PKT in comment. Signed-off-by: Seth Zegelstein <[email protected]>
In order to continue processing future packets correctly, we need to progress the recv window when a RTM or RTA packet has been successfully delivered. If we fail processing a packet, we need to write the error to the CQ (when applicable), and continue to do work. Returning early here makes our recv window index be off by 1. Signed-off-by: Seth Zegelstein <[email protected]>
This commit adds bool should_abort to efa_base_ep_write_eq_error() which allows the caller to dictate whether the function should abort in the case that the user has not bound an EQ. This commit sets should_abort=true everywhere, so it should not be a functional change. Signed-off-by: Seth Zegelstein <[email protected]>
Emulated write, read and atomic are single sided operations. They should never cause an abort on an RX side error because the RX side is "should not" be involved in the protocol that we are emulating. We emulate due to the NIC not supporting the protocols, and we need to emulate the functionality as if the NIC did support the protocols. This change modifies the emulated write protocols, emulated read protocols, and emulated atomics to not abort when an error happens on the RX side. Error CQ's should only be written to the target CQ if write w/imm is used, otherwise, the operation should fail silently on the RX side and a NACK should be delivered to the TX side to indicate a failure of the operation depending on the TX's requested completion level. Signed-off-by: Seth Zegelstein <[email protected]>
| * @param[in] prov_errno provider error code | ||
| * @param[in] should_abort If true, aborts when no EQ is available | ||
| */ | ||
| void efa_base_ep_write_eq_error(struct efa_base_ep *ep, ssize_t err, ssize_t prov_errno) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we need this commit at all, we shouldn't even call this function when the tx error can be ignored
| EFA_WARN(FI_LOG_CQ, | ||
| "Writing eq error for rxe from internal operations\n"); | ||
| efa_base_ep_write_eq_error(&ep->base_ep, err, prov_errno, true); | ||
| /* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the last commit said, we shouldn't even call efa_base_ep_write_eq_error
| ep_addr_str); | ||
|
|
||
| efa_base_ep_write_eq_error(&ep->base_ep, err, prov_errno, true); | ||
| if (efa_rdm_pkt_type_of(pkt_entry) == EFA_RDM_WRITE_RTA_PKT || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any rx error are unexpected, and I doubt whether you can really derive that pkt type here since the rx is failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the RX error to be handled for 1 sided emulated atomics.
There is still some work to do, but this makes our story better. The biggest thing this PR is missing is UT's which I didn't have time to add yet.