-
Notifications
You must be signed in to change notification settings - Fork 456
prov/efa: Fail efa_rdm_ep_open when user passes FI_RX_CQ_DATA #11668
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
efa protocol does not support FI_RX_CQ_DATA. Signed-off-by: Jessie Yang <[email protected]>
Signed-off-by: Jessie Yang <[email protected]>
| int ret, retv, i; | ||
| enum fi_hmem_iface iface; | ||
|
|
||
| if (info->mode & FI_RX_CQ_DATA) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't fail, we should ignore it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we should ignore it? We should definitely clear it on return from getinfo(). But at ep_open() point, it seems like the upper layer needs to know one way or the other and silently ignoring it seems wrong. But the man page is unhelpfully unclear here.
prov/efa/src/efa_base_ep.c
Outdated
| use_unsolicited_write_recv = tx_cq->unsolicited_write_recv_enabled && !(base_ep->info->mode & FI_RX_CQ_DATA); | ||
| if (EFA_INFO_TYPE_IS_DIRECT(base_ep->info)) { | ||
| /* If user intend to post rx buffer for cq data, we shouldn't enable unsolicited write recv */ | ||
| use_unsolicited_write_recv = tx_cq->unsolicited_write_recv_enabled && !(base_ep->info->mode & FI_RX_CQ_DATA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic would be: unsolicited write recv should always be turned off when RM is enabled, and for efa-direct, it has to work with FI_RX_CQ_DATA together. So if RM_ENABLED is on while RX_CQ_DATA is off, we should fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| int ret, retv, i; | ||
| enum fi_hmem_iface iface; | ||
|
|
||
| if (info->mode & FI_RX_CQ_DATA) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we should ignore it? We should definitely clear it on return from getinfo(). But at ep_open() point, it seems like the upper layer needs to know one way or the other and silently ignoring it seems wrong. But the man page is unhelpfully unclear here.
prov/efa/src/efa_user_info.c
Outdated
| hints->domain_attr->resource_mgmt == FI_RM_UNSPEC) | ||
| info->domain_attr->resource_mgmt = FI_RM_ENABLED; | ||
| if (!hints || !hints->domain_attr) | ||
| info->domain_attr->resource_mgmt = FI_RM_UNSPEC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is legal. UNSPEC is a valid input value, but does not appear to be a valid output value. I think the previous conditional was correct, but we need to default to FI_RM_DISABLED.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some code in protocol path today that uses this bit to determine whether we retry when hitting RNR. If it is disabled, it will generate cq error instead of retry. Today when application doesn't specify it, it retries. I do not think we want to change this behavior. It looks both MPICH and Ompi explicitly request RM_ENABLED, but ofi-nccl plugin doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that using UNSPEC is not a good output, just want to evaluate the impact if we default to DISABLED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then return ENABLED (as much as I dislike that option). But returning UNSPEC does not appear to be a valid choice.
| bool use_unsolicited_write_recv = true; | ||
|
|
||
| efa_base_ep_construct_ibv_qp_init_attr_ex(base_ep, &attr_ex, tx_cq->ibv_cq_ex, rx_cq->ibv_cq_ex); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a nit on the Disable unsolicited write recv with FI_RM_ENABLED commit message, while disabling unsolicited write is necessary for preventing CQ overruns, it is not sufficient for doing so. It's just one piece of a larger puzzle, and that should be called out in the commit message.
prov/efa/src/efa_base_ep.c
Outdated
| use_unsolicited_write_recv = tx_cq->unsolicited_write_recv_enabled && !(base_ep->info->mode & FI_RX_CQ_DATA); | ||
| if (EFA_INFO_TYPE_IS_DIRECT(base_ep->info)) { | ||
| /* If user intend to post rx buffer for cq data, we shouldn't enable unsolicited write recv */ | ||
| use_unsolicited_write_recv = tx_cq->unsolicited_write_recv_enabled && !(base_ep->info->mode & FI_RX_CQ_DATA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
efa protocol does not support FI_RX_CQ_DATA.