@Matthew-Whitlock Matthew-Whitlock commented Oct 9, 2025

Tested on a Slingshot 11 cluster with synthetic failures (SIGTERM). I'm not sure how consistent the error code will be across different libfabric backends or types of faults. It may be that we could treat more than just FI_EIO as a lost rank, but the error code documentation is a bit lacking.
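To make the open question concrete, here is a minimal sketch of the kind of classification being discussed; it is not the code in this PR. The helper name `is_lost_rank_error` is hypothetical, and the `FI_EIO` stand-in is only there so the snippet builds without the libfabric headers (in libfabric's `rdma/fi_errno.h`, `FI_EIO` maps to the POSIX `EIO` value). Only `FI_EIO` is listed because that is the only code observed in the testing above; other codes would need to be confirmed per backend before being added.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in so this sketch compiles without libfabric installed.
 * With libfabric available, include <rdma/fi_errno.h> instead. */
#ifndef FI_EIO
#define FI_EIO EIO
#endif

/* Hypothetical helper (not part of libfabric or of this PR):
 * decide whether an error code reported for a failed operation
 * should be treated as "the peer rank is gone". */
bool is_lost_rank_error(int err)
{
    /* Accept either sign convention: a positive errno-style value
     * or its negation. */
    if (err < 0)
        err = -err;

    switch (err) {
    case FI_EIO:
        /* The only code seen so far with SIGTERM-induced failures
         * on Slingshot 11. */
        return true;
    default:
        /* Everything else is left to the normal error path until
         * its meaning is confirmed for the backend in use. */
        return false;
    }
}
```

Keeping the decision in a single switch means that widening the set of "lost rank" codes later is a one-line change per code, which fits the uncertainty about how other backends and fault types report failures.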

Signed-off-by: Matthew Whitlock <[email protected]>