If reading the sysfs file fails, we should not return "UCS_OK", because we did not obtain any useful information. That is why an unsuccessful read returns "UCS_ERR_INVALID_PARAM".
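A minimal sketch of that idea, assuming a generic file read rather than the actual UCX helper: the status enum and the function name `attr_read_line` are hypothetical and only mirror the names used in this comment. Every failure path returns an error, so the caller never treats an unfilled buffer as valid data.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not the UCX definitions. */
typedef enum {
    ATTR_OK                =  0,
    ATTR_ERR_INVALID_PARAM = -1
} attr_status_t;

/* Read the first line of a sysfs-like attribute file into buf.
 * Any failure to open or read the file is reported as an error. */
attr_status_t attr_read_line(const char *path, char *buf, size_t max)
{
    FILE *f;

    if ((buf == NULL) || (max == 0)) {
        return ATTR_ERR_INVALID_PARAM;
    }

    buf[0] = '\0';
    f      = fopen(path, "r");
    if (f == NULL) {
        return ATTR_ERR_INVALID_PARAM; /* file missing or not accessible */
    }

    if (fgets(buf, (int)max, f) == NULL) {
        fclose(f);
        return ATTR_ERR_INVALID_PARAM; /* empty file or read error */
    }

    fclose(f);
    buf[strcspn(buf, "\n")] = '\0'; /* strip the trailing newline, if any */
    return ATTR_OK;
}
```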
When we walk all GID table entries looking for accessible GID indexes, we may query an index where no entry exists, or one that belongs to a different network namespace (RoCE), so the query call returns a "UCX_ERR" status. That should not make us exit the function: it only means this index is not usable for us, so we move on to the next one. That is why the function continues even when it cannot query the GID table entry at some index.
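The loop described here can be sketched roughly as follows. This is not the UCX code: the callback type, the status enum, and the mock query are all hypothetical stand-ins, with the mock treating a bit mask as the set of indexes visible from the current namespace.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch; not the UCX definitions. */
typedef enum {
    SCAN_OK  =  0,
    SCAN_ERR = -1
} scan_status_t;

/* Hypothetical callback standing in for the real GID query. */
typedef scan_status_t (*scan_query_fn)(unsigned gid_index, void *arg);

/* Collect accessible GID indexes into out[]; returns how many were found.
 * A failed query skips that index instead of aborting the whole scan. */
size_t scan_accessible_gids(scan_query_fn query, void *arg,
                            unsigned table_len, unsigned *out, size_t out_max)
{
    size_t count = 0;
    unsigned i;

    for (i = 0; (i < table_len) && (count < out_max); ++i) {
        if (query(i, arg) != SCAN_OK) {
            continue; /* missing entry or foreign netns: try the next index */
        }
        out[count++] = i;
    }

    return count;
}

/* Mock query: only indexes whose bit is set in the mask are accessible. */
scan_status_t scan_mock_query(unsigned gid_index, void *arg)
{
    unsigned mask = *(const unsigned *)arg;

    if (gid_index >= 32) {
        return SCAN_ERR;
    }
    return ((mask >> gid_index) & 1u) ? SCAN_OK : SCAN_ERR;
}
```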
If the user does not configure the UCX_IB_GID_INDEX parameter, the function "device_port_check" uses the default value, and that GID index may belong to a different network namespace (RoCE). That triggers an error in the GID info query call. But nothing is wrong with the port; we are the ones who chose the wrong GID index. That is why we try to find an accessible GID index by iterating over the GID table entries, and only if none of them is usable do we report an error on the port.
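A rough sketch of that fallback, under the assumption that the query can be modelled as a callback: `pick_gid_index`, the status enum, and the mock query are hypothetical names for illustration, not the actual "device_port_check" logic. The preferred (default) index is tried first, the rest of the table is scanned next, and an error is reported only when nothing is accessible.

```c
/* Hypothetical status codes for this sketch; not the UCX definitions. */
typedef enum {
    PICK_OK  =  0,
    PICK_ERR = -1
} pick_status_t;

/* Hypothetical callback standing in for the real GID info query. */
typedef pick_status_t (*pick_query_fn)(unsigned gid_index, void *arg);

/* Try the preferred index first, then fall back to scanning the table.
 * The port is reported as unusable only if no index can be queried. */
pick_status_t pick_gid_index(pick_query_fn query, void *arg,
                             unsigned preferred, unsigned table_len,
                             unsigned *chosen)
{
    unsigned i;

    if (query(preferred, arg) == PICK_OK) {
        *chosen = preferred;
        return PICK_OK;
    }

    for (i = 0; i < table_len; ++i) {
        if (i == preferred) {
            continue; /* already tried above */
        }
        if (query(i, arg) == PICK_OK) {
            *chosen = i;
            return PICK_OK;
        }
    }

    return PICK_ERR; /* nothing accessible: now it is a port-level error */
}

/* Mock query: only indexes whose bit is set in the mask are accessible. */
pick_status_t pick_mock_query(unsigned gid_index, void *arg)
{
    unsigned mask = *(const unsigned *)arg;

    if (gid_index >= 32) {
        return PICK_ERR;
    }
    return ((mask >> gid_index) & 1u) ? PICK_OK : PICK_ERR;
}
```

With this shape, a default index that lives in another namespace no longer fails the port check as long as some other entry is reachable.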
I think I understand why my PR cannot pass "EFA_Tests EFA on rhel90_ib". There is no IB on that machine, so it uses the "dull" version of verbs, like:
So of course "query gid" fails there. In this scenario that is okay, and I think this is why the return value in this case is "IB/RoCEv1". I do not know whether my PR is helpful if this "bug" is effectively a "feature" now. If the UCX team wants to solve it properly, I will help as best I can! If not, maybe it is what it is. I do not want to waste anyone's time on something unimportant <3
Maybe we can change the "uct_ib_device_query_gid_info()" function from a sysfs read to the ibverbs "query_gid_ex()" call? This means we can transform this (old):
into this (new):
This would make sure that the "no ib" test completes correctly and the real IB tests run accordingly.
UPD: I checked the "ibv_query_gid_ex()" function in "rdma_core" and it is relatively new (2020). This means that if we use it, people on older systems cannot use new UCX :-(
What?
Describe what this PR is doing.
I looked at the implementation of the function "uct_ib_device_query_gid_info" and saw
something strange, almost as if someone copy-pasted code and forgot to change it
(nothing wrong with that, everyone makes mistakes):
So I tried to fix the incorrect behaviour of the function and found out that other
functions are too comfortable passing incorrect parameters (GID index) into
"uct_ib_device_query_gid_info", so I fixed that too.
Why?
Justification for the PR. If there is an existing issue/bug, please reference it. For
bug fixes, the 'Why?' and 'What?' can be merged into a single item.
I have a host with a RoCE setup and dom0/container network namespaces. While
debugging a problem, I looked at the UCX debug logs and started to see some
strange behaviour...
The GID elements from my netns are correctly hashed:
The GID elements that do not exist are correctly discarded:
BUT FOR SOME REASON GIDs from all other network namespaces are hashed too! Even though they are not accessible to me (the MELLANOX driver has a verify function for it):
How?
It is optional, but for complex PRs, please provide information about the design,
architecture, approach, etc.
I think this error started with this commit: ad2c408
The old logic was (simplified): ONLY if "query_gid" succeeds do we configure the RoCE version:
The MELLANOX driver will not let us query a GID table element from a different namespace.
The new logic changed that (simplified):
After that it only got worse: the initial "query_gid" call was
discarded, and so on.