fabtests/common: minor fabtest fixes #11447

alexander-sannikov · 2025-09-26T18:53:07Z

Content:

"rocr" value is not properly handled by -D option.
A receive is still pre-posted for the address exchange even when addresses are exchanged out of band (-b/-E options).
Since no corresponding send operation is issued in this scenario, the pre-posted receive consumes one of the later messages with a matching tag. To reproduce the issue:
```
~> fi_rdm_tagged_bw -p "tcp" -b &
[1] 606285
~> fi_rdm_tagged_bw -p "tcp" -b 127.0.0.1
...
[error] fabtests:common/shared.c:3046: cq_readerr 265 (Truncation error), provider errno: 11 (Resource temporarily unavailable)
```
The problem is provider-agnostic.
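To make the failure mode concrete, here is a minimal toy model of tag matching in C. It is only a sketch and not libfabric or fabtests code: every name in it (`toy_queue`, `toy_post_recv`, `toy_deliver`, `toy_first_payload_status`), the tag value, and the buffer sizes are hypothetical. It assumes that the oldest posted receive with a matching tag is consumed first, and that a message larger than the matched buffer is reported as truncated.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: not libfabric code. All names here are hypothetical. */

enum toy_status { TOY_OK = 0, TOY_NO_MATCH = 1, TOY_TRUNCATED = 2 };

struct toy_recv {
    uint64_t tag;   /* tag the receive was posted with */
    size_t   len;   /* size of the posted buffer */
    int      used;  /* 1 once the receive has been consumed */
};

#define TOY_MAX_RECVS 8

struct toy_queue {
    struct toy_recv recvs[TOY_MAX_RECVS];
    size_t count;
};

/* Post a receive; returns 0 on success, -1 if the queue is full. */
static int toy_post_recv(struct toy_queue *q, uint64_t tag, size_t len)
{
    if (q->count == TOY_MAX_RECVS)
        return -1;
    q->recvs[q->count].tag = tag;
    q->recvs[q->count].len = len;
    q->recvs[q->count].used = 0;
    q->count++;
    return 0;
}

/* Deliver a message: the oldest unused receive with the same tag wins. */
static enum toy_status toy_deliver(struct toy_queue *q, uint64_t tag,
                                   size_t msg_len)
{
    for (size_t i = 0; i < q->count; i++) {
        struct toy_recv *r = &q->recvs[i];

        if (r->used || r->tag != tag)
            continue;
        r->used = 1;
        return msg_len > r->len ? TOY_TRUNCATED : TOY_OK;
    }
    return TOY_NO_MATCH;
}

/*
 * Run the scenario and return the status of the first payload message.
 * oob != 0 means addresses were exchanged out of band, so no in-band
 * address message is ever sent.
 */
static enum toy_status toy_first_payload_status(int oob,
                                                int skip_prepost_when_oob)
{
    struct toy_queue q = { .count = 0 };
    const uint64_t tag = 0x1;          /* hypothetical tag value */
    const size_t addr_len = 16;        /* hypothetical address size */
    const size_t payload_len = 4096;   /* hypothetical payload size */

    if (!(oob && skip_prepost_when_oob))
        toy_post_recv(&q, tag, addr_len);   /* address-exchange prepost */
    if (!oob)
        toy_deliver(&q, tag, addr_len);     /* in-band address message */

    toy_post_recv(&q, tag, payload_len);    /* benchmark receive */
    return toy_deliver(&q, tag, payload_len);
}
```

In this model, `toy_first_payload_status(1, 0)` returns `TOY_TRUNCATED` because the stale address receive is matched first, while `toy_first_payload_status(1, 1)` and `toy_first_payload_status(0, 0)` return `TOY_OK`.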

shijin-aws · 2025-09-29T22:39:33Z

bot:aws:retest

shijin-aws · 2025-09-29T22:41:04Z

@alexander-sannikov thanks for the PR, it seems acaf975 is fixing #10118?

fabtests/common/shared.c

Signed-off-by: alexander-sannikov <[email protected]>

alexander-sannikov · 2025-09-30T12:25:09Z

@shijin-aws exactly, it fixes #10118.
It may be less relevant for other providers, but according to #10148, OOB sync/address exchange is mandatory for the UCX provider, which makes the fix critical in that case.

shijin-aws · 2025-09-30T20:38:54Z

It seems this PR breaks the AWS CI with the shm provider:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_atomic -I 5 -U -p shm -E=9232'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_atomic -I 5 -U -p shm -E=9232 172.31.45.100'"'"''
client_stdout:
libfabric:244648:1759236106::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:244648:1759236106::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:244648:1759236107::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
libfabric:244648:1759236107::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

client returncode: 124
server_stdout:
libfabric:244596:1759236105::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:244596:1759236105::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:244596:1759236106::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
libfabric:244596:1759236107::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

alexander-sannikov · 2025-09-30T21:13:53Z

@shijin-aws that looks pretty strange. Let me check it with the shm provider on my side.

shijin-aws · 2025-10-01T01:50:31Z

The error

libfabric:249238:1759237910::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

indicates that the key exchange was not successful. The key exchange today is implemented as an in-band exchange.

Also, I see this error with the simple fi_rdm test only:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447-debug/install/fabtests/bin/fi_rdm -U -p shm -E=9232'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447-debug/install/fabtests/bin/fi_rdm -U -p shm -E=9232 172.31.45.100'"'"''
client_stdout:
libfabric:254437:1759239004::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:254437:1759239004::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:254437:1759239005::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

client returncode: 124
server_stdout:
libfabric:254263:1759239003::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:254263:1759239003::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:254263:1759239004::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
Received length does not match expected length.
Waiting for message from client...

server returncode: 1

This just means the recv buffer was too small to hold the incoming send message.

…lution case

A preposted receive for the peer address is not necessary in the case of out-of-band address resolution (-b/-E options). The corresponding send won't be issued, so a random message with a matching tag will be consumed instead, which leads to a crash or hang.

Signed-off-by: alexander-sannikov <[email protected]>

alexander-sannikov · 2025-10-01T15:10:13Z

@shijin-aws the receive pre-posted from ft_enable_ep_recv, which is used for in-band address exchange, is still used by tests. It is therefore now disabled for the relevant benchmarks only, and is still posted for test cases.
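As a rough illustration of that decision, a sketch of the condition could look like the following. This is not the code from the PR: `toy_opts`, its fields, and `toy_should_prepost_addr_recv` are hypothetical names, and the assumption that the prepost is skipped only when a benchmark is run with out-of-band address exchange is inferred from the discussion here.

```c
#include <stdbool.h>

/*
 * Hypothetical flags for illustration only; the real fabtests option
 * structure is not reproduced here.
 */
struct toy_opts {
    bool oob_addr_exchange;  /* assumed: set when -b or -E is given */
    bool is_benchmark;       /* assumed: benchmark that never sends the
                              * in-band address message under OOB */
};

/*
 * Decide whether the address-exchange receive should be pre-posted.
 * Assumption: it is skipped only for benchmarks using out-of-band
 * address exchange, and kept in every other case.
 */
static bool toy_should_prepost_addr_recv(const struct toy_opts *o)
{
    return !(o->oob_addr_exchange && o->is_benchmark);
}
```

Keeping the prepost as the default and skipping it only in the one case where no matching send exists is the conservative choice, since tests that still rely on in-band exchange are unaffected.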

shijin-aws · 2025-10-03T20:50:23Z

AWS CI is still failing with the shm provider:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.44 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_tagged_bw --data-progress manual --control-progress unified -I 5 -U -v -D cuda -i 0 -p shm -E=9234'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.44 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_tagged_bw --data-progress manual --control-progress unified -I 5 -U -v -D cuda -i 0 -p shm -E=9234 172.31.45.44'"'"''
client_stdout:
libfabric:18658:1759332433::efa:core:efa_hmem_info_check_p2p_support_cuda():161<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

client returncode: 124
server_stdout:
libfabric:18352:1759332432::efa:core:efa_hmem_info_check_p2p_support_cuda():161<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

server returncode: 124

shijin-aws mentioned this pull request Sep 29, 2025

rdm_tagged_bw is broken with OOB sync #10118

Closed

shijin-aws reviewed Sep 29, 2025

fabtests/common/shared.c (outdated, resolved)

fabtests/common: enable rocr device interface for fabtests

79bfea2

Signed-off-by: alexander-sannikov <[email protected]>
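For context on what handling a "rocr" value for -D involves, below is a minimal sketch of mapping a device interface name to an enum value. It is an illustration under stated assumptions, not the fabtests implementation: `toy_hmem_iface` is a hypothetical stand-in for libfabric's `enum fi_hmem_iface`, `toy_parse_device_iface` is a made-up helper, and the list of accepted names is only illustrative.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for libfabric's enum fi_hmem_iface. */
enum toy_hmem_iface {
    TOY_HMEM_SYSTEM,
    TOY_HMEM_CUDA,
    TOY_HMEM_ROCR,
    TOY_HMEM_ZE,
    TOY_HMEM_NEURON,
    TOY_HMEM_SYNAPSEAI
};

struct toy_iface_name {
    const char *name;
    enum toy_hmem_iface iface;
};

/* Illustrative name table; the names accepted by fabtests may differ. */
static const struct toy_iface_name toy_iface_names[] = {
    { "cuda",      TOY_HMEM_CUDA },
    { "rocr",      TOY_HMEM_ROCR },
    { "ze",        TOY_HMEM_ZE },
    { "neuron",    TOY_HMEM_NEURON },
    { "synapseai", TOY_HMEM_SYNAPSEAI }
};

/*
 * Map a -D style argument to an interface value (case-insensitive).
 * Returns 0 and stores the result in *iface on success, or -1 if the
 * name is NULL or unknown, so the caller can report an error instead
 * of silently falling back to a default.
 */
static int toy_parse_device_iface(const char *arg,
                                  enum toy_hmem_iface *iface)
{
    size_t n = sizeof(toy_iface_names) / sizeof(toy_iface_names[0]);

    if (!arg || !iface)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(arg, toy_iface_names[i].name) == 0) {
            *iface = toy_iface_names[i].iface;
            return 0;
        }
    }
    return -1;
}
```

A table-driven lookup like this makes a missing entry (such as "rocr") a one-line fix and keeps the unknown-name path explicit.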

alexander-sannikov force-pushed the fabtest-fixes branch from acaf975 to b5af7f8 Compare September 30, 2025 12:16

shijin-aws previously approved these changes Sep 30, 2025

darrylabbate previously approved these changes Sep 30, 2025

alexander-sannikov dismissed stale reviews from darrylabbate and shijin-aws via e19f0a9 October 1, 2025 15:04

alexander-sannikov force-pushed the fabtest-fixes branch from b5af7f8 to e19f0a9 Compare October 1, 2025 15:04
