Low bandwidth with 32 EFA NICs #10658
-
Hi, I'm writing a prototype with libfabric on EFA and ran into some unexpected behavior, so I'd like to ask for ideas. The prototype does something very simple: (1) register some CUDA memory on both machines; (2) one machine issues RMA WRITEs to the other. Each message is 65536 bytes and the sender submits 16000 ops. For this single-NIC test I can get 95 Gbps, which I think is decent.

The problem shows up when I start using multiple NICs. I extended the prototype to use 32 NICs (each group of 4 binds to one GPU). I'm still doing 65536-byte RMA WRITEs, 128000 ops in total (16000 ops per GPU, or 4000 ops per NIC). However, I only get 179 Gbps in aggregate. I added topology detection and found that the default enumeration order is correct. I also tried DMA-BUF, but it didn't seem to help. Do you have any guess as to where I went wrong?

Here's the code in case you'd like to take a look: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f

And here's an example run.

Compile: `g++ -Wall -Werror -std=c++17 -O2 -g -I./build/libfabric/include -I/usr/local/cuda/include -o build/6_topo src/6_topo.cpp -Wl,-rpath,'$ORIGIN' -L./build -L/usr/local/cuda/lib64 -lfabric -lpthread -lcudart -lcuda`

1 NIC + 1 GPU:
32 NICs + 8 GPUs:
Thanks for all the help!
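For context, here's roughly what each NIC does (a simplified sketch, not the exact code; the gist above is authoritative, and `domain`, `ep`, the peer's `fi_addr_t`, remote address, and remote key are assumed to be set up elsewhere):

```cpp
// Simplified sketch of the per-NIC hot path described above.
// Error handling is omitted.
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

// Register a CUDA buffer so the EFA provider knows it is GPU memory.
struct fid_mr *register_cuda_buffer(struct fid_domain *domain, void *cuda_ptr,
                                    size_t len, int cuda_device_id) {
  struct iovec iov = {cuda_ptr, len};
  struct fi_mr_attr attr = {};
  attr.mr_iov = &iov;
  attr.iov_count = 1;
  attr.access = FI_WRITE | FI_REMOTE_WRITE;
  attr.iface = FI_HMEM_CUDA;          // buffer lives in CUDA device memory
  attr.device.cuda = cuda_device_id;  // CUDA ordinal that owns the buffer
  struct fid_mr *mr = nullptr;
  fi_mr_regattr(domain, &attr, 0, &mr);
  return mr;
}

// Post one 64 KiB RMA WRITE from the registered CUDA buffer to the peer.
void post_one_write(struct fid_ep *ep, void *cuda_ptr, struct fid_mr *mr,
                    fi_addr_t peer, uint64_t remote_addr, uint64_t remote_key) {
  fi_write(ep, cuda_ptr, 65536, fi_mr_desc(mr), peer, remote_addr, remote_key,
           /*context=*/nullptr);
}
```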
-
I get the same performance as you reported when running on 2 p5 instances. Looking at your code closely now.
-
Can you try increasing the message size to something like 1M?
-
I have a few suspicions about the performance issue, but I'm not able to confirm them yet:
-
A couple of findings in your app: when you retry a pending op, do you poll the CQ after hitting EAGAIN? This is recommended per the man page, since the efa provider currently requires manual progress, which releases provider resources.
Can you give me a big picture of how you post the writes, poll the completions, and calculate the bandwidth?
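A rough sketch of the pattern I mean (not your app's exact structure; names are illustrative, and it assumes the CQ was opened with `FI_CQ_FORMAT_DATA`):

```cpp
// If fi_write returns -FI_EAGAIN, read the CQ to let the provider make
// progress and free resources, then retry the write.
#include <rdma/fabric.h>
#include <rdma/fi_cq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

ssize_t write_with_retry(struct fid_ep *ep, struct fid_cq *cq, const void *buf,
                         size_t len, void *desc, fi_addr_t peer,
                         uint64_t remote_addr, uint64_t rkey, void *ctx) {
  ssize_t rc;
  while ((rc = fi_write(ep, buf, len, desc, peer, remote_addr, rkey, ctx)) ==
         -FI_EAGAIN) {
    // Draining the CQ drives manual progress in the provider.
    struct fi_cq_data_entry comp;
    ssize_t n = fi_cq_read(cq, &comp, 1);
    if (n > 0) {
      // A completion was consumed here; the caller's completion counting
      // needs to account for it (e.g. via a shared counter).
    } else if (n != -FI_EAGAIN) {
      return n;  // real CQ error (e.g. -FI_EAVAIL); inspect with fi_cq_readerr
    }
  }
  return rc;
}
```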
-
I updated the code a bit: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f
With these modifications, I can get much better results:
However, as you can see in the table, there's still a gap. One thing I realize is that, instead of hard-coded round-robin and submitting all ops at once, it might be better to write a round-robin scheduler for sending these non-contiguous blocks (a rough sketch of what I have in mind is below). I'll keep experimenting.
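Something like this is what I have in mind (just a sketch; `Nic::post_block` and `Nic::poll_completions` are placeholders that would wrap `fi_write` and `fi_cq_read`):

```cpp
// Sketch of a round-robin scheduler that caps in-flight ops per NIC and
// refills as completions drain.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Block {
  const void *src;
  std::size_t len;
  std::uint64_t remote_addr;
};

struct Nic {
  std::size_t inflight = 0;
  // Placeholder: would wrap fi_write on this NIC's endpoint and return false
  // on -FI_EAGAIN so the scheduler backs off and drains first.
  bool post_block(const Block &) { return true; }
  // Placeholder: would wrap fi_cq_read and return how many ops completed.
  std::size_t poll_completions() { return inflight; }
};

void run_round_robin(std::vector<Nic> &nics, const std::vector<Block> &blocks,
                     std::size_t max_inflight_per_nic) {
  std::size_t next = 0, posted = 0, completed = 0;
  while (completed < blocks.size()) {
    // Refill: hand the next block to the next NIC in rotation, as long as it
    // still has room under the in-flight cap.
    while (posted < blocks.size() &&
           nics[next % nics.size()].inflight < max_inflight_per_nic) {
      Nic &nic = nics[next % nics.size()];
      if (!nic.post_block(blocks[posted])) break;  // e.g. provider said EAGAIN
      ++nic.inflight;
      ++posted;
      ++next;
    }
    // Drain: poll every NIC so each endpoint keeps making progress.
    for (Nic &nic : nics) {
      std::size_t n = nic.poll_completions();
      nic.inflight -= n;
      completed += n;
    }
  }
}
```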
-
Ah, finally able to reach close to full bandwidth! Thanks for the help, @shijin-aws. Perhaps the most important things are:

Updated code: https://gist.github.com/abcdabcd987/ad02c376b60acedbca8a1f7c635fbf7f

Performance Numbers (64 KiB message size):
Performance Numbers (8 GPUs, 32 NICs):
Screen.Recording.2024-12-24.at.4.23.22.PM.mp4
-
Small-packet performance is lower than I expected, though. It's around 0.22 Mpps per NIC. Do you have a reference number for small-message pps? @shijin-aws
-
@abcdabcd987 Hmm, I would expect something larger. I use
Between 2 GPU buffers on 2 p5 instances through 1 NIC, I get ~0.77M message rate (packet rate) for small messages. I used a window size of 64. Is that the reference you are looking for? Do you have data for 1 GPU + 1 NIC for your benchmark?