MPI-IO (ROMIO) collective buffering performance bottleneck on Aurora #7645

@pkcoff

Description

I have an IOR test that shows a major performance bottleneck in the communication layer on Aurora for MPI-IO collective buffering, particularly on the read, possibly related to the size of the completion queues in cxi. The reproducer is here on Aurora:
/lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf

This reproducer runs on 256 nodes at 64 ppn against DAOS, where I have added hzhou to the perftesting pool. To run:
1.) Create your container (do this from a UAN just once):
module use /soft/modulefiles
module load daos
daos container create --type=POSIX --dir-oclass=RP_4G1 --file-oclass=EC_16P3GX --chunk-size=2097152 --properties=rd_fac:3,ec_cell_sz:131072,cksum:crc32,srv_cksum:on perftesting hzhou-ec16p3gx-crc32

2.) Run the IOR reproducer script against that container on 256 nodes at 64 ppn with the following parameters:
cd /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir
qsub -lselect=256 -q prod -v DAOS_POOL=perftesting,DAOS_CONT=hzhou-ec16p3gx-crc32,rpn=64,numthreads=list:4:56:5:57:6:58:7:59:8:60:9:61:10:62:11:63:12:64:13:65:14:66:15:67:16:68:17:69:18:70:19:71:20:72:21:73:22:74:23:75:24:76:25:77:26:78:27:79:28:80:29:81:30:82:31:83:32:84:33:85:34:86:35:87,blocksize=64M,transfersize=64M,iterations=5,numamode=ddr,cb=1 ../ior-mpiio-daos.pbs

Look at the results in stdout. Here is a sample run in /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir/ior-mpiio-daos.pbs.o8107276 (the stderr in /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir/ior-mpiio-daos.pbs.e8107276 shows the commands for running interactively):

-----------------------------------------------------
ior MPIIO write then read using cpu buffers
-----------------------------------------------------
IOR-4.1.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Oct 23 22:07:47 2025
Command line        : /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/ior -a MPIIO -b 64M -t 64M -c -w -W -r -C -g -i 5
Machine             : Linux x4113c2s7b0n0
TestID              : 0
StartTime           : Thu Oct 23 22:07:47 2025
Path                : testFile
FS                  : 89.2 TiB   Used FS: 10.7%   Inodes: -0.0 Mi   Used Inodes: 0.0%

Options:
api                 : MPIIO
apiVersion          : (4.1)
test filename       : testFile
access              : single-shared-file
type                : collective
segments            : 1
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
nodes               : 256
tasks               : 16384
clients per node    : 64
memoryBuffer        : CPU
dataAccess          : CPU
GPUDirect           : 0
repetitions         : 5
xfersize            : 64 MiB
blocksize           : 64 MiB
aggregate filesize  : 1 TiB

Results:

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     119910     4627       3.54        65536      65536      5.20       3.54       0.001723   8.74       0
read      333408     6828       2.40        65536      65536      0.743157   2.40       0.002211   3.15       0
write     153903     4382       3.74        65536      65536      3.07       3.74       0.002255   6.81       1
read      367707     7610       2.15        65536      65536      0.696465   2.15       0.002254   2.85       1
write     165389     4400       3.72        65536      65536      2.61       3.72       0.002183   6.34       2
read      361582     7247       2.26        65536      65536      0.636929   2.26       0.002247   2.90       2
write     175064     4923       3.33        65536      65536      2.66       3.33       0.002270   5.99       3
read      356129     7239       2.26        65536      65536      0.678813   2.26       0.002252   2.94       3
write     163665     4482       3.65        65536      65536      2.75       3.66       0.002301   6.41       4
read      340391     6724       2.44        65536      65536      0.641550   2.44       0.002283   3.08       4

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write      175063.57  119909.89  155586.28   19059.75    2735.37    1873.59    2431.04     297.81    6.85890         NA            NA     0  16384  64    5   0     1        1         0    0      1 67108864 67108864 1048576.0 MPIIO      0
read       367707.11  333407.74  351843.32   12929.67    5745.42    5209.50    5497.55     202.03    2.98431         NA            NA     0  16384  64    5   0     1        1         0    0      1 67108864 67108864 1048576.0 MPIIO      0
Finished            : Thu Oct 23 22:09:02 2025
-----------------------------------------------------
ior MPIIO write then read using alltoall transfer
-----------------------------------------------------
IOR-4.1.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Thu Oct 23 22:09:45 2025
Command line        : /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/ior-alltoalltransfer -a MPIIO -b 64M -t 64M -c -w -W -r -C -g -i 5 --mpiio.useStridedDatatype --mpiio.useFileView
Machine             : Linux x4113c2s7b0n0
TestID              : 0
StartTime           : Thu Oct 23 22:09:45 2025
Path                : testFile
FS                  : 89.2 TiB   Used FS: 26.4%   Inodes: -0.0 Mi   Used Inodes: 0.0%

Options:
api                 : MPIIO
apiVersion          : (4.1)
test filename       : testFile
access              : single-shared-file
type                : collective
segments            : 1
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
nodes               : 256
tasks               : 16384
clients per node    : 64
memoryBuffer        : CPU
dataAccess          : CPU
GPUDirect           : 0
repetitions         : 5
xfersize            : 64 MiB
blocksize           : 64 MiB
aggregate filesize  : 1 TiB

Results:

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
write     92835      2452.79    6.68        65536      65536      4.61       6.68       0.005492   11.30      0
read      74606      1268.57    12.91       65536      65536      1.14       12.92      0.002259   14.05      0
write     115819     2564.67    6.39        65536      65536      2.66       6.39       0.002228   9.05       1
read      80126      1320.26    12.41       65536      65536      0.675007   12.41      0.002122   13.09      1
write     109355     2394.21    6.84        65536      65536      2.74       6.84       0.002123   9.59       2
read      87435      1448.18    11.31       65536      65536      0.676990   11.31      0.002198   11.99      2
write     117646     2638.15    6.21        65536      65536      2.70       6.21       0.002328   8.91       3
read      37308      1339.55    12.23       65536      65536      15.87      12.23      0.002044   28.11      3
write     114811     2553.69    6.41        65536      65536      2.72       6.42       0.002190   9.13       4
read      87951      1449.62    11.30       65536      65536      0.617967   11.30      0.002182   11.92      4

Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)     StdDev   Max(OPs)   Min(OPs)  Mean(OPs)     StdDev    Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt   blksiz    xsize aggs(MiB)   API RefNum
write      117646.18   92834.53  110093.20    9060.87    1838.22    1450.54    1720.21     141.58    9.59668         NA            NA     0  16384  64    5   0     1        1         0    0      1 67108864 67108864 1048576.0 MPIIO      0
read        87951.41   37307.84   73485.08   18751.07    1374.24     582.93    1148.20     292.99   15.83250         NA            NA     0  16384  64    5   0     1        1         0    0      1 67108864 67108864 1048576.0 MPIIO      0
Finished            : Thu Oct 23 22:13:03 2025

This script runs two versions of IOR:

  1. The first run, labeled 'ior MPIIO write then read using cpu buffers', is the regular IOR, where each rank writes 64 MB in one contiguous block to a single shared file with collective MPI-IO (MPI_File_write_all) and then reads it back (MPI_File_read_all). ROMIO collective buffering is enabled with 16 MB collective buffers and 8 aggregators per node via the 'cb=1' parameter, for which the script creates a ROMIO hints file. It runs 5 iterations, as set by the 'iterations=5' parameter.
  2. The script then does the same run labeled 'ior MPIIO write then read using alltoall transfer', which is a different binary: instead of each rank writing 64 MB in one contiguous chunk, it splits the buffer into 16384 4 KB segments and writes them discontiguously across the entire shared file, setting a file view on the MPI_File with a vector datatype (see the sketch after this list).
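
For reference, here is a minimal sketch of what that second variant could look like. This is not the actual ior-alltoalltransfer source; the hint names assume ROMIO's standard cb_buffer_size/cb_config_list/romio_cb_* hints with the 16 MB buffer and 8-aggregator-per-node settings described above, and the file view is built from an MPI_Type_vector of 4 KiB blocks.

/*
 * Minimal sketch, NOT the actual ior-alltoalltransfer source: each rank
 * writes 16384 x 4 KiB segments strided across the shared file, with
 * collective-buffering hints matching the cb=1 settings described above
 * (standard ROMIO hint names assumed here).
 */
#include <mpi.h>
#include <stdlib.h>

#define SEG    4096        /* 4 KiB segment                    */
#define NSEG   16384       /* segments per rank (64 MiB total) */
#define NRANKS 16384       /* 256 nodes x 64 ppn               */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Hints assumed from the description: 16 MiB collective buffers,
     * 8 aggregators per node, collective buffering forced on. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "16777216");
    MPI_Info_set(info, "cb_config_list", "*:8");
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "romio_cb_read", "enable");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testFile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* File view: 4 KiB blocks strided NRANKS*SEG bytes apart, shifted by
     * rank, so every rank's data is interleaved across the whole file. */
    MPI_Datatype ftype;
    MPI_Type_vector(NSEG, SEG, NRANKS * SEG, MPI_BYTE, &ftype);
    MPI_Type_commit(&ftype);
    MPI_File_set_view(fh, (MPI_Offset)rank * SEG, MPI_BYTE, ftype,
                      "native", info);

    char *buf = calloc(NSEG, SEG);   /* 64 MiB per rank */
    MPI_File_write_all(fh, buf, NSEG * SEG, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_read_all(fh, buf, NSEG * SEG, MPI_BYTE, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Type_free(&ftype);
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}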

So, in the collective buffering phase:

  1. For the regular IOR, on the write each rank sends one contiguous 16 MB chunk to a 16 MB collective buffer, and on the read each collective buffer sends its entire 16 MB contents back to one rank.
  2. For the alltoalltransfer version
    • on the write, each rank segments its buffer and sends 4 KB of data to each of 512 collective buffers,
    • but on the read, each collective buffer sends 4 KB of data to each of 4096 ranks (see the arithmetic sketch after this list).
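
Those 512 and 4096 figures follow from simple arithmetic on the configuration above. The sketch below reproduces them, assuming one two-phase cycle drains every 16 MB collective buffer exactly once; that is an assumption about ROMIO's internals, not something measured.

/*
 * Back-of-the-envelope message counts per two-phase cycle; a sketch under
 * the assumptions stated above, not instrumentation of ROMIO itself.
 */
#include <stdio.h>

int main(void)
{
    const long long nodes     = 256;
    const long long aggs_node = 8;                    /* aggregators per node     */
    const long long cb_bytes  = 16LL << 20;           /* 16 MiB collective buffer */
    const long long seg_bytes = 4LL  << 10;           /* 4 KiB segment            */
    const long long blk_bytes = 64LL << 20;           /* 64 MiB per rank          */

    const long long aggs        = nodes * aggs_node;  /* 2048 aggregators     */
    const long long cycle_bytes = aggs * cb_bytes;    /* 32 GiB of file/cycle */

    /* Regular IOR: a rank's 64 MiB block maps onto whole 16 MiB file
     * domains, so each exchange is one 16 MiB message to one buffer.  */
    printf("contiguous: 1 message of %lld MiB per rank per buffer\n",
           cb_bytes >> 20);

    /* alltoalltransfer write: a rank's 4 KiB segments are strided 64 MiB
     * apart, so in one 32 GiB cycle they land in 512 distinct 16 MiB
     * collective buffers.                                              */
    printf("alltoall write: %lld messages of 4 KiB per rank per cycle\n",
           cycle_bytes / blk_bytes);                  /* 512  */

    /* alltoalltransfer read: each 16 MiB collective buffer holds 4096
     * distinct 4 KiB segments, one per rank, so it sends 4096 messages. */
    printf("alltoall read: %lld messages of 4 KiB per collective buffer\n",
           cb_bytes / seg_bytes);                     /* 4096 */

    return 0;
}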

The 'wr/rd(s)' column shows the latency of MPI_File_write_all for the write and MPI_File_read_all for the read. Comparing the 'ior MPIIO write then read using cpu buffers' run with the 'ior MPIIO write then read using alltoall transfer' run, the write latency roughly doubles and the read latency is about 5x higher.

From a DAOS perspective the actual file I/O is exactly the same, so the delta is entirely in the collective buffering.

All the data movement is MPI_Send/Recv. In the 'ior MPIIO write then read using alltoall transfer' case, on the write each rank queues completions for 512 MPI_Sends, and on the read each collective buffer queues 4096 completion entries for the MPI_Sends back to the ranks. In the 'ior MPIIO write then read using cpu buffers' case there is just 1 entry on the completion queue. The completion-queue depth roughly tracks the increase in write and read latency: the write and the read move exactly the same number of bytes, so the completion-queue depth is the major difference, which is why the read is so much worse than the write in the alltoalltransfer case.
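
To make the completion-queue pressure concrete, here is a simplified, runnable sketch of the read-side exchange pattern. It is not the actual ROMIO source and uses nonblocking sends purely for illustration: one "aggregator" rank scatters 4 KB pieces of its collective buffer to every other rank, keeping that many send completions in flight at once, whereas the contiguous case has only a single outstanding send per exchange.

/*
 * Simplified illustration, NOT the ROMIO source: rank 0 plays the role of
 * an aggregator scattering its collective buffer back to every other rank
 * in 4 KiB pieces, mimicking the read-side exchange described above.
 */
#include <mpi.h>
#include <stdlib.h>

#define SEG 4096   /* 4 KiB per destination rank */

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (rank == 0) {
        /* Aggregator side: one small send per peer.  With ~4096 peers
         * (the alltoalltransfer read) thousands of completions are in
         * flight at once; with 1 peer (the contiguous read) just one. */
        char *cb = calloc((size_t)nranks, SEG);     /* "collective buffer" */
        MPI_Request *reqs = malloc((nranks - 1) * sizeof(MPI_Request));
        for (int i = 1; i < nranks; i++)
            MPI_Isend(cb + (size_t)i * SEG, SEG, MPI_BYTE,
                      i, 0, MPI_COMM_WORLD, &reqs[i - 1]);
        MPI_Waitall(nranks - 1, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
        free(cb);
    } else {
        /* Non-aggregator rank: receive its 4 KiB piece. */
        char piece[SEG];
        MPI_Recv(piece, SEG, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}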
