Description
Describe the bug
When modifying the fi_recv_cancel fabtest by setting hints->tx_attr->op_flags to FI_DELIVERY_COMPLETE and removing the reposting of canceled recv and completion, the client hangs on the first fi_cq_read in ~99% of runs.
To Reproduce
Steps to reproduce the behavior:
- Take the fi_recv_cancel fabtest from libfabric.
- Modify the test:
  - Set hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE.
  - Remove reposting of canceled recv and completion.
- Run the modified test with the provider set to shm.
A reproducer is available here (git clone -b dariuszs/cancel_reproducer [email protected]:dsciebu/libfabric.git).
Expected behavior
The test should complete successfully without the client hanging on fi_cq_read.
Output
The client hangs on the first fi_cq_read. No additional output is produced unless debug logs are enabled.
Environment
Libfabric version: v2.2.0
OS: Ubuntu 22.04
Provider: shm
Endpoint type: (please specify if known)
Additional context
The issue occurs consistently (~99% of runs) after the described modifications. It seems related to the interaction between FI_DELIVERY_COMPLETE and canceled recv handling. In the remaining ~1% of runs the test succeeds unpredictably; in those cases the CQ entry shows up after several retries.
Note that when FI_DELIVERY_COMPLETE is NOT set but the repost is still removed, the hang does NOT occur. Only the combination of the two changes triggers it.
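To tell a real hang apart from a completion that just arrives late, the client's wait could be bounded. The following is only a rough sketch and is not code from the fabtest: `poll_cq_bounded`, `cq_read_fn`, and `fake_cq` are hypothetical names, and the callback is assumed to wrap a single non-blocking fi_cq_read attempt that returns -EAGAIN while the CQ is empty.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback: one non-blocking CQ read attempt.
 * Assumed contract: > 0 when entries were read, -EAGAIN when the CQ
 * is empty, any other negative value on error. */
typedef ssize_t (*cq_read_fn)(void *ctx);

/* Retry read_once up to max_attempts times instead of spinning forever.
 * Returns the first result that is not -EAGAIN, or -ETIMEDOUT when the
 * attempt budget is exhausted. The number of attempts made is stored in
 * *attempts_used when that pointer is not NULL. */
ssize_t poll_cq_bounded(cq_read_fn read_once, void *ctx,
                        long max_attempts, long *attempts_used)
{
    long i;

    for (i = 1; i <= max_attempts; i++) {
        ssize_t ret = read_once(ctx);
        if (ret != -EAGAIN) {
            if (attempts_used)
                *attempts_used = i;
            return ret;
        }
    }
    if (attempts_used)
        *attempts_used = max_attempts > 0 ? max_attempts : 0;
    return -ETIMEDOUT;
}

/* Stand-in for a real CQ, for illustration only: reports one
 * completion starting with the ready_after-th read attempt. */
struct fake_cq {
    long ready_after;
    long calls;
};

ssize_t fake_read(void *ctx)
{
    struct fake_cq *cq = ctx;

    cq->calls++;
    return cq->calls >= cq->ready_after ? 1 : -EAGAIN;
}
```

In the real test the callback would call fi_cq_read on the client's CQ. A -ETIMEDOUT result together with the attempt count would show whether a given run falls into the hanging ~99% or the delayed ~1%.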