prov/shm: FI_DELIVERY_COMPLETE breaks fi_cancel #11428

@dsciebu

Describe the bug
After modifying the fi_recv_cancel fabtest to set hints->tx_attr->op_flags to FI_DELIVERY_COMPLETE and to remove the repost of the canceled recv and its completion, the client hangs on the first fi_cq_read in ~99% of runs.

To Reproduce
Steps to reproduce the behavior:

  1. Take the fi_recv_cancel fabtest from libfabric.
  2. Modify the test:
     a. Set hints->tx_attr->op_flags = FI_DELIVERY_COMPLETE.
     b. Remove the reposting of the canceled recv and its completion.
  3. Run the modified test with the shm provider.

A reproducer is available here (git clone -b dariuszs/cancel_reproducer [email protected]:dsciebu/libfabric.git).

Expected behavior
The test should complete successfully without the client hanging on fi_cq_read.

Output
The client hangs on the first fi_cq_read. No additional output is produced unless debug logs are enabled.

Environment

Libfabric version: v2.2.0
OS: Ubuntu 22.04
Provider: shm
Endpoint type: (please specify if known)

Additional context
The hang reproduces consistently (~99% of runs) after the modifications described above. It appears to be related to the interaction between FI_DELIVERY_COMPLETE and the handling of canceled receives. The remaining ~1% of runs succeed unpredictably; in those cases the CQ entry shows up after several retries.
Notably, when FI_DELIVERY_COMPLETE is NOT set but the repost is still removed, the hang does NOT occur. Only the combination of the two changes triggers it.
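
Since a small fraction of runs do get the CQ entry after several retries, it may help to bound the poll loop in the reproducer so that a true hang can be told apart from a late completion. The following is only a sketch and is not taken from the fabtest: poll_bounded, cq_poll_fn, struct poll_report, and fake_cq_poll are hypothetical names, and the callback is assumed to wrap fi_cq_read, which reports an empty queue as -FI_EAGAIN (FI_EAGAIN has the same value as EAGAIN). The stand-in poller keeps the example free of any libfabric dependency.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical callback type. In the reproducer it would wrap fi_cq_read():
 * > 0 means entries were read, -EAGAIN means the queue is empty,
 * any other negative value is treated as an error. */
typedef ssize_t (*cq_poll_fn)(void *ctx);

enum poll_result { POLL_GOT_ENTRY, POLL_TIMED_OUT, POLL_FAILED };

struct poll_report {
    enum poll_result result;
    unsigned long attempts;  /* polls issued, including the last one */
    ssize_t last_ret;        /* return value of the last poll */
};

/* Poll at most max_attempts times instead of spinning forever. */
struct poll_report poll_bounded(cq_poll_fn poll, void *ctx,
                                unsigned long max_attempts)
{
    struct poll_report rep = { POLL_TIMED_OUT, 0, -EAGAIN };

    assert(poll != NULL);
    while (rep.attempts < max_attempts) {
        rep.last_ret = poll(ctx);
        rep.attempts++;
        if (rep.last_ret > 0) {
            rep.result = POLL_GOT_ENTRY;
            break;
        }
        /* 0 and -EAGAIN both mean "nothing yet": keep polling. */
        if (rep.last_ret != 0 && rep.last_ret != -EAGAIN) {
            rep.result = POLL_FAILED;
            break;
        }
    }
    return rep;
}

/* Stand-in poller for illustration only: reports an empty queue
 * until the counter behind ctx reaches zero, then reports one entry. */
ssize_t fake_cq_poll(void *ctx)
{
    unsigned long *remaining_empty = ctx;

    if (*remaining_empty > 0) {
        (*remaining_empty)--;
        return -EAGAIN;
    }
    return 1;
}
```

Logging the attempts field on each run would show whether the successful ~1% of runs complete after a handful of polls or only after a long delay, and a POLL_TIMED_OUT result would mark the runs that currently hang.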
