Utility providers #10346
Replies: 3 comments 6 replies
I gave a presentation (video) about 5 years ago with a general overview of the utility providers (rxm, rxd, shm). It is a bit out of date and doesn't go into the hardware-specific details you're asking about, like queue pairs, mlx, etc., but it might help a little.

The big benefit of using mlx5_0_dgram with rxd would show up in large-scale workloads. The idea is that you would use mlx5_0 with rxm up to a certain size, until the resources get strained, and then switch to mlx5_0_dgram with rxd. Right now, the ability to switch internally doesn't exist in OFI, but we're looking at adding it as part of the peer provider enhancements. We currently use those to target intranode offload for shm, but they could be expanded to handle rxm+rxd+shm integrated all together.

As it stands, however, there isn't an implementation of this in real use, and rxd is in reality not optimized (or maintained) well enough to be performant. I would recommend just sticking with mlx5_0 + rxm, as we have tested that for fairly large-scale jobs without hitting the limitation, but stay tuned for rxd improvements and offload!
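To make the "rxm until a certain size, then rxd" idea concrete, here is a minimal sketch of that policy applied at job launch time rather than inside OFI (which, as noted, cannot switch internally today). Everything named here is an assumption for illustration: `ASSUMED_PEER_THRESHOLD` is a made-up cutoff (the thread gives no number), the helper functions are hypothetical, and the layered provider strings `verbs;ofi_rxm` and `verbs;ofi_rxd` passed through `FI_PROVIDER` should be checked against your libfabric build. The resource estimate only reflects the general design difference: rxm layers on connected MSG endpoints (roughly one connection per peer a process talks to), while rxd layers on a single unconnected DGRAM endpoint.

```python
# Hypothetical cutoff: the discussion does not say what "large scale" means.
ASSUMED_PEER_THRESHOLD = 4096

RXM = "verbs;ofi_rxm"  # RDM semantics layered over connected MSG endpoints
RXD = "verbs;ofi_rxd"  # RDM semantics layered over one DGRAM endpoint


def connections_per_process(num_peers: int, provider: str) -> int:
    """Rough upper bound on transport endpoints one process may hold."""
    if provider == RXM:
        # Up to one connection per peer this process communicates with.
        return num_peers
    if provider == RXD:
        # A single datagram endpoint serves all peers.
        return 1
    raise ValueError(f"unknown provider: {provider}")


def choose_provider(num_peers: int, threshold: int = ASSUMED_PEER_THRESHOLD) -> str:
    """Pick rxm while the per-process connection count stays under the threshold."""
    if connections_per_process(num_peers, RXM) <= threshold:
        return RXM
    return RXD


def provider_env(num_peers: int, threshold: int = ASSUMED_PEER_THRESHOLD) -> dict:
    """Environment fragment that a launcher could export before starting ranks."""
    return {"FI_PROVIDER": choose_provider(num_peers, threshold)}
```

For example, `provider_env(128)` selects `verbs;ofi_rxm`, and `provider_env(10000)` selects `verbs;ofi_rxd`. Given the advice above that rxd is not currently tuned for performance, in practice this sketch would be used with the threshold set high enough that rxm is always chosen.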
@aingerson Thanks so much for sharing the presentation. Even though it's a few years old, it has a lot of interesting insights that would be great to capture on the RXM and RXD pages in the documentation. That would help clarify many concepts for people getting into the details of the different utility providers.

I was especially interested in your point about the benefit of switching to mlx5_0_dgram with rxd for large-scale workloads. What would you consider "large scale" in this context? Hundreds of nodes? Thousands? I'm trying to understand where that threshold might lie in practice.

You also mentioned that rxd isn't currently optimized or maintained to perform well. Are there any concrete plans to pick that work back up, or is it more of a "maybe someday" kind of thing?

Thanks again for all the info, it's really helpful.
One more thing I wanted to ask about – there's a note in the RxM documentation:
Do you happen to have any example or test case where this behavior can actually be observed in practice? I'm curious how to detect when this situation occurs and how to distinguish whether a CQ entry corresponds to a rendezvous protocol step versus the actual application-level message completion.
Is there a description somewhere of how the rxm and rxd utility providers work? The description in the man page (https://ofiwg.github.io/libfabric/v1.22.0/man/fi_rxm.7.html) is too terse for me.