Conversation

@alexander-sannikov

Content:

  1. The "rocr" value is not properly handled by the -D option.
  2. An excessive recv prepost used for address exchange still occurs in the case of out-of-band address exchange (-b/-E options).
    Since no corresponding send operation is issued in this scenario, the preposted recv consumes a later message with a matching tag. To reproduce the issue:
    ~> fi_rdm_tagged_bw -p "tcp" -b &
    [1] 606285
    ~> fi_rdm_tagged_bw -p "tcp" -b 127.0.0.1
    ...
    [error] fabtests:common/shared.c:3046: cq_readerr 265 (Truncation error), provider errno: 11 (Resource temporarily unavailable)
    
    The problem is provider-agnostic.
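
The failure mode in item 2 can be illustrated with a toy model of tagged receive matching. This is a sketch only, not libfabric or fabtests code: the queue type, the function names, the tag value, and the buffer sizes are all hypothetical, and it assumes that receives are matched in posting order and that the address receive and the payload receive share the same tag.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of tagged receive matching; NOT libfabric code. */
#define MAX_POSTED 8

struct posted_recv {
    uint64_t tag;      /* tag this receive matches */
    size_t   len;      /* size of the posted buffer */
    int      consumed; /* set once a message has matched it */
};

struct rx_queue {
    struct posted_recv entries[MAX_POSTED];
    int count;
};

enum match_result { MATCH_OK, MATCH_TRUNCATED, MATCH_NONE };

/* Append a receive to the queue; returns its index, or -1 if full. */
static int post_recv(struct rx_queue *q, uint64_t tag, size_t len)
{
    if (q->count == MAX_POSTED)
        return -1;
    q->entries[q->count] = (struct posted_recv){ .tag = tag, .len = len };
    return q->count++;
}

/* Deliver a message to the oldest unconsumed receive with the same tag. */
static enum match_result deliver(struct rx_queue *q, uint64_t tag,
                                 size_t msg_len)
{
    for (int i = 0; i < q->count; i++) {
        struct posted_recv *r = &q->entries[i];
        if (r->consumed || r->tag != tag)
            continue;
        r->consumed = 1;
        return msg_len > r->len ? MATCH_TRUNCATED : MATCH_OK;
    }
    return MATCH_NONE;
}

/*
 * Result of delivering the first payload message.
 * prepost_addr_recv: a small receive for the peer address is pre-posted.
 * peer_sends_addr:   the peer sends its address in band (0 with -b/-E).
 */
static enum match_result first_payload_result(int prepost_addr_recv,
                                              int peer_sends_addr)
{
    struct rx_queue q = { .count = 0 };
    const uint64_t tag = 0x1234;     /* assumed shared tag */
    const size_t addr_len = 16;      /* assumed address buffer size */
    const size_t payload_len = 4096; /* assumed payload size */

    if (prepost_addr_recv)
        post_recv(&q, tag, addr_len);
    if (peer_sends_addr)
        deliver(&q, tag, addr_len);  /* consumes the pre-posted receive */

    post_recv(&q, tag, payload_len);
    return deliver(&q, tag, payload_len);
}
```

Under these assumptions, `first_payload_result(1, 1)` (in-band exchange) and `first_payload_result(0, 0)` (no prepost) return `MATCH_OK`, while `first_payload_result(1, 0)` (prepost with no address send) returns `MATCH_TRUNCATED`, which is analogous to the truncation error in the log above.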

@shijin-aws
Contributor

bot:aws:retest

@shijin-aws
Contributor

@alexander-sannikov thanks for the PR, it seems acaf975 is fixing #10118?

@alexander-sannikov
Author

@shijin-aws exactly, it is fixing #10118.
For other providers it may be less relevant, but according to #10148, OOB sync/address exchange is mandatory for the UCX provider, which makes the fix critical in that case.

shijin-aws previously approved these changes Sep 30, 2025
darrylabbate previously approved these changes Sep 30, 2025
@shijin-aws
Contributor

It seems this PR breaks the AWS CI with the shm provider:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_atomic -I 5 -U -p shm -E=9232'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_atomic -I 5 -U -p shm -E=9232 172.31.45.100'"'"''
client_stdout:
libfabric:244648:1759236106::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:244648:1759236106::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:244648:1759236107::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
libfabric:244648:1759236107::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

client returncode: 124
server_stdout:
libfabric:244596:1759236105::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:244596:1759236105::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:244596:1759236106::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
libfabric:244596:1759236107::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

@alexander-sannikov
Author

@shijin-aws this looks pretty strange. Let me check it with the shm provider on my side.

@shijin-aws
Contributor

shijin-aws commented Oct 1, 2025

The error

libfabric:249238:1759237910::shm:mr:ofi_mr_map_verify():123<warn> unknown key: 0

indicates that the key exchange was not successful. The key exchange today is implemented as an in-band exchange.

Also, I see the following error with the simple fi_rdm test only:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447-debug/install/fabtests/bin/fi_rdm -U -p shm -E=9232'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.100 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447-debug/install/fabtests/bin/fi_rdm -U -p shm -E=9232 172.31.45.100'"'"''
client_stdout:
libfabric:254437:1759239004::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:254437:1759239004::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:254437:1759239005::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

client returncode: 124
server_stdout:
libfabric:254263:1759239003::core:core:cuda_gdrcopy_hmem_init():201<warn> gdr_open failed!
libfabric:254263:1759239003::core:core:cuda_hmem_init():816<warn> gdrcopy initialization failed! gdrcopy will not be used.
libfabric:254263:1759239004::efa:core:efa_hmem_info_check_p2p_support_cuda():160<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.
Received length does not match expected length.
Waiting for message from client...

server returncode: 1

It just means the recv buffer was too small to hold the incoming send message

…lution case

A preposted receive for the peer address is not necessary in the case of out-of-band address resolution (-b/-E options).
The corresponding send won't be issued, and a random message with a matching tag will be consumed instead,
which leads to a crash or hang.

Signed-off-by: alexander-sannikov <[email protected]>
@alexander-sannikov
Author

alexander-sannikov commented Oct 1, 2025

@shijin-aws the pre-posted recv comes from ft_enable_ep_recv, which is used for in-band address exchange and is still relied on by tests. It is therefore now disabled for the relevant benchmarks only, and is still posted for test cases.
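
One reading of this policy, written as a small predicate. This is a hedged sketch, not the actual patch: the enum, the function name, and the parameters are hypothetical, and it assumes the prepost is skipped only when a benchmark runs with out-of-band address exchange.

```c
#include <stdbool.h>

/* Hypothetical classification; fabtests may not expose such an enum. */
enum prog_kind { PROG_TEST, PROG_BENCHMARK };

/*
 * Decide whether the setup receive should be pre-posted.
 * oob_addr_exchange: addresses are exchanged out of band (-b/-E).
 */
static bool should_prepost_setup_recv(enum prog_kind kind,
                                      bool oob_addr_exchange)
{
    if (!oob_addr_exchange)
        return true;          /* in-band: the peer will send its address */
    return kind == PROG_TEST; /* OOB: only tests still expect the buffer */
}
```

With this predicate, only the benchmark plus out-of-band combination skips the prepost; every other combination keeps the previous behavior.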

@shijin-aws
Contributor

AWS CI is still failing with the shm provider:

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.44 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_tagged_bw --data-progress manual --control-progress unified -I 5 -U -v -D cuda -i 0 -p shm -E=9234'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.45.44 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr11447/install/fabtests/bin/fi_rdm_tagged_bw --data-progress manual --control-progress unified -I 5 -U -v -D cuda -i 0 -p shm -E=9234 172.31.45.44'"'"''
client_stdout:
libfabric:18658:1759332433::efa:core:efa_hmem_info_check_p2p_support_cuda():161<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

client returncode: 124
server_stdout:
libfabric:18352:1759332432::efa:core:efa_hmem_info_check_p2p_support_cuda():161<warn> Failed to register CUDA buffer with the EFA device, FI_HMEM transfers that require peer to peer support will fail.

server returncode: 124
