Low bandwidth with 32 EFA NICs #10658
-
Hi, I'm writing a prototype with libfabric on EFA and ran into some unexpected behavior, so I'd like to ask for ideas. The prototype does something very simple: (1) register some CUDA memory on both machines; (2) one machine issues RMA WRITEs to the other. Each message is 65536 bytes and the sender submits 16000 ops. For this single-NIC test I can get 95 Gbps, which I think is decent.

The problem shows up when I start using multiple NICs. I extended the prototype to use 32 NICs (each group of 4 binds to one GPU). I'm still doing 65536-byte RMA WRITEs, 128000 ops in total (16000 ops per GPU, or 4000 ops per NIC). However, I only get 179 Gbps in aggregate. I added topology detection and found that the default enumeration order is correct. I also tried DMA-BUF, but it didn't seem to help. Do you have any guess as to where I went wrong?

Here's the code in case you'd like to take a look: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f

And here's an example run.

Compile: `g++ -Wall -Werror -std=c++17 -O2 -g -I./build/libfabric/include -I/usr/local/cuda/include -o build/6_topo src/6_topo.cpp -Wl,-rpath,'$ORIGIN' -L./build -L/usr/local/cuda/lib64 -lfabric -lpthread -lcudart -lcuda`

1 NIC + 1 GPU:
32 NICs + 8 GPUs:
Thanks for all the help!
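For context, here's roughly what each NIC does (a simplified sketch, not the exact code; the gist above is authoritative, and `domain`, `ep`, the peer's `fi_addr_t`, remote address, and remote key are assumed to be set up elsewhere):

```cpp
// Simplified sketch of the per-NIC hot path described above.
// Error handling is omitted.
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

// Register a CUDA buffer so the EFA provider knows it is GPU memory.
struct fid_mr *register_cuda_buffer(struct fid_domain *domain, void *cuda_ptr,
                                    size_t len, int cuda_device_id) {
  struct iovec iov = {cuda_ptr, len};
  struct fi_mr_attr attr = {};
  attr.mr_iov = &iov;
  attr.iov_count = 1;
  attr.access = FI_WRITE | FI_REMOTE_WRITE;
  attr.iface = FI_HMEM_CUDA;          // buffer lives in CUDA device memory
  attr.device.cuda = cuda_device_id;  // CUDA ordinal that owns the buffer
  struct fid_mr *mr = nullptr;
  fi_mr_regattr(domain, &attr, 0, &mr);
  return mr;
}

// Post one 64 KiB RMA WRITE from the registered CUDA buffer to the peer.
void post_one_write(struct fid_ep *ep, void *cuda_ptr, struct fid_mr *mr,
                    fi_addr_t peer, uint64_t remote_addr, uint64_t remote_key) {
  fi_write(ep, cuda_ptr, 65536, fi_mr_desc(mr), peer, remote_addr, remote_key,
           /*context=*/nullptr);
}
```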
-
I get the same performance as you reported when running on 2 p5 instances. Looking at your code closely now.
-
Can you try increasing the message size to something like 1M?
-
I have a few suspicions about the performance issue, but I'm not able to confirm them yet:
-
A couple of findings in your app: when you retry a pending op, do you poll the CQ after hitting EAGAIN? This is recommended per the man page, since the efa provider currently requires manual progress, which releases provider resources.
Can you give me a big picture of how you post the writes, poll the completions, and calculate the bandwidth?
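A rough sketch of the pattern I mean (not your app's exact structure; names are illustrative, and it assumes the CQ was opened with `FI_CQ_FORMAT_DATA`):

```cpp
// If fi_write returns -FI_EAGAIN, read the CQ to let the provider make
// progress and free resources, then retry the write.
#include <rdma/fabric.h>
#include <rdma/fi_cq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

ssize_t write_with_retry(struct fid_ep *ep, struct fid_cq *cq, const void *buf,
                         size_t len, void *desc, fi_addr_t peer,
                         uint64_t remote_addr, uint64_t rkey, void *ctx) {
  ssize_t rc;
  while ((rc = fi_write(ep, buf, len, desc, peer, remote_addr, rkey, ctx)) ==
         -FI_EAGAIN) {
    // Draining the CQ drives manual progress in the provider.
    struct fi_cq_data_entry comp;
    ssize_t n = fi_cq_read(cq, &comp, 1);
    if (n > 0) {
      // A completion was consumed here; the caller's completion counting
      // needs to account for it (e.g. via a shared counter).
    } else if (n != -FI_EAGAIN) {
      return n;  // real CQ error (e.g. -FI_EAVAIL); inspect with fi_cq_readerr
    }
  }
  return rc;
}
```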
-
I updated the code a bit: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f
With these modifications, I can get much better results:
However, as you can see in the table, there's still a gap. One thing I realize is that, instead of hard-coded round-robin and submitting all ops at once, it might be better to write a round-robin scheduler for sending these non-contiguous blocks (a rough sketch of what I have in mind is below). I'll keep experimenting.
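Something like this is what I have in mind (just a sketch; `Nic::post_block` and `Nic::poll_completions` are placeholders that would wrap `fi_write` and `fi_cq_read`):

```cpp
// Sketch of a round-robin scheduler that caps in-flight ops per NIC and
// refills as completions drain.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block {
  const void *src;
  std::size_t len;
  std::uint64_t remote_addr;
};

struct Nic {
  std::size_t inflight = 0;
  // Placeholder: would wrap fi_write on this NIC's endpoint and return false
  // on -FI_EAGAIN so the scheduler backs off and drains first.
  bool post_block(const Block &) { return true; }
  // Placeholder: would wrap fi_cq_read and return how many ops completed.
  std::size_t poll_completions() { return inflight; }
};

void run_round_robin(std::vector<Nic> &nics, const std::vector<Block> &blocks,
                     std::size_t max_inflight_per_nic) {
  std::size_t next = 0, posted = 0, completed = 0;
  while (completed < blocks.size()) {
    // Refill: hand the next block to the next NIC in rotation, as long as it
    // still has room under the in-flight cap.
    while (posted < blocks.size() &&
           nics[next % nics.size()].inflight < max_inflight_per_nic) {
      Nic &nic = nics[next % nics.size()];
      if (!nic.post_block(blocks[posted])) break;  // e.g. provider said EAGAIN
      ++nic.inflight;
      ++posted;
      ++next;
    }
    // Drain: poll every NIC so each endpoint keeps making progress.
    for (Nic &nic : nics) {
      std::size_t n = nic.poll_completions();
      nic.inflight -= n;
      completed += n;
    }
  }
}
```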
-
Ah, finally able to reach close to full bandwidth! Thanks for the help, @shijin-aws. Perhaps the most important things are:

Updated code: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f

Performance Numbers (64 KiB message size):
Performance Numbers (8 GPUs, 32 NICs):
Screen.Recording.2024-12-24.at.4.23.22.PM.mp4
-
Small-packet performance is lower than I expected, though. It's around 0.22 Mpps per NIC. Do you have a reference number for small-message pps? @shijin-aws
-
@abcdabcd987 Hmm, I would expect something larger. I use
Between 2 GPU buffers on 2 p5 instances through 1 NIC, I get ~0.77M message rate (packet rate) for small messages. I used a window size of 64. Is that the reference you are looking for? Do you have data for 1 GPU + 1 NIC for your benchmark?