Description
I have a test using IOR that shows a major performance bottleneck in the communication layer on Aurora for MPI-IO collective buffering, particularly on the read side, possibly related to the size of the completion queues in cxi. I have the reproducer here on Aurora:
/lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf
This reproducer runs on 256 nodes at 64 ppn against DAOS; I have added hzhou to the perftesting pool. To run:
1.) Create your container (do this just once, from a UAN):
module use /soft/modulefiles
module load daos
daos container create --type=POSIX --dir-oclass=RP_4G1 --file-oclass=EC_16P3GX --chunk-size=2097152 --properties=rd_fac:3,ec_cell_sz:131072,cksum:crc32,srv_cksum:on perftesting hzhou-ec16p3gx-crc32
2.) Run the IOR reproducer script against that container on 256 nodes at 64 ppn with the following parameters:
cd /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir
qsub -lselect=256 -q prod -v DAOS_POOL=perftesting,DAOS_CONT=hzhou-ec16p3gx-crc32,rpn=64,numthreads=list:4:56:5:57:6:58:7:59:8:60:9:61:10:62:11:63:12:64:13:65:14:66:15:67:16:68:17:69:18:70:19:71:20:72:21:73:22:74:23:75:24:76:25:77:26:78:27:79:28:80:29:81:30:82:31:83:32:84:33:85:34:86:35:87,blocksize=64M,transfersize=64M,iterations=5,numamode=ddr,cb=1 ../ior-mpiio-daos.pbs
Look at the results in stdout; here is a sample run in /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir/ior-mpiio-daos.pbs.o8107276 (you can look at the stderr in /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/rundir/ior-mpiio-daos.pbs.e8107276 to see the commands for running interactively):
-----------------------------------------------------
ior MPIIO write then read using cpu buffers
-----------------------------------------------------
IOR-4.1.0+dev: MPI Coordinated Test of Parallel I/O
Began : Thu Oct 23 22:07:47 2025
Command line : /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/ior -a MPIIO -b 64M -t 64M -c -w -W -r -C -g -i 5
Machine : Linux x4113c2s7b0n0
TestID : 0
StartTime : Thu Oct 23 22:07:47 2025
Path : testFile
FS : 89.2 TiB Used FS: 10.7% Inodes: -0.0 Mi Used Inodes: 0.0%
Options:
api : MPIIO
apiVersion : (4.1)
test filename : testFile
access : single-shared-file
type : collective
segments : 1
ordering in a file : sequential
ordering inter file : constant task offset
task offset : 1
nodes : 256
tasks : 16384
clients per node : 64
memoryBuffer : CPU
dataAccess : CPU
GPUDirect : 0
repetitions : 5
xfersize : 64 MiB
blocksize : 64 MiB
aggregate filesize : 1 TiB
Results:
access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
write 119910 4627 3.54 65536 65536 5.20 3.54 0.001723 8.74 0
read 333408 6828 2.40 65536 65536 0.743157 2.40 0.002211 3.15 0
write 153903 4382 3.74 65536 65536 3.07 3.74 0.002255 6.81 1
read 367707 7610 2.15 65536 65536 0.696465 2.15 0.002254 2.85 1
write 165389 4400 3.72 65536 65536 2.61 3.72 0.002183 6.34 2
read 361582 7247 2.26 65536 65536 0.636929 2.26 0.002247 2.90 2
write 175064 4923 3.33 65536 65536 2.66 3.33 0.002270 5.99 3
read 356129 7239 2.26 65536 65536 0.678813 2.26 0.002252 2.94 3
write 163665 4482 3.65 65536 65536 2.75 3.66 0.002301 6.41 4
read 340391 6724 2.44 65536 65536 0.641550 2.44 0.002283 3.08 4
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 175063.57 119909.89 155586.28 19059.75 2735.37 1873.59 2431.04 297.81 6.85890 NA NA 0 16384 64 5 0 1 1 0 0 1 67108864 67108864 1048576.0 MPIIO 0
read 367707.11 333407.74 351843.32 12929.67 5745.42 5209.50 5497.55 202.03 2.98431 NA NA 0 16384 64 5 0 1 1 0 0 1 67108864 67108864 1048576.0 MPIIO 0
Finished : Thu Oct 23 22:09:02 2025
-----------------------------------------------------
ior MPIIO write then read using alltoall transfer
-----------------------------------------------------
IOR-4.1.0+dev: MPI Coordinated Test of Parallel I/O
Began : Thu Oct 23 22:09:45 2025
Command line : /lus/flare/projects/Aurora_deployment/pkcoff/tickets/ior-cbperf/ior-alltoalltransfer -a MPIIO -b 64M -t 64M -c -w -W -r -C -g -i 5 --mpiio.useStridedDatatype --mpiio.useFileView
Machine : Linux x4113c2s7b0n0
TestID : 0
StartTime : Thu Oct 23 22:09:45 2025
Path : testFile
FS : 89.2 TiB Used FS: 26.4% Inodes: -0.0 Mi Used Inodes: 0.0%
Options:
api : MPIIO
apiVersion : (4.1)
test filename : testFile
access : single-shared-file
type : collective
segments : 1
ordering in a file : sequential
ordering inter file : constant task offset
task offset : 1
nodes : 256
tasks : 16384
clients per node : 64
memoryBuffer : CPU
dataAccess : CPU
GPUDirect : 0
repetitions : 5
xfersize : 64 MiB
blocksize : 64 MiB
aggregate filesize : 1 TiB
Results:
access bw(MiB/s) IOPS Latency(s) block(KiB) xfer(KiB) open(s) wr/rd(s) close(s) total(s) iter
------ --------- ---- ---------- ---------- --------- -------- -------- -------- -------- ----
write 92835 2452.79 6.68 65536 65536 4.61 6.68 0.005492 11.30 0
read 74606 1268.57 12.91 65536 65536 1.14 12.92 0.002259 14.05 0
write 115819 2564.67 6.39 65536 65536 2.66 6.39 0.002228 9.05 1
read 80126 1320.26 12.41 65536 65536 0.675007 12.41 0.002122 13.09 1
write 109355 2394.21 6.84 65536 65536 2.74 6.84 0.002123 9.59 2
read 87435 1448.18 11.31 65536 65536 0.676990 11.31 0.002198 11.99 2
write 117646 2638.15 6.21 65536 65536 2.70 6.21 0.002328 8.91 3
read 37308 1339.55 12.23 65536 65536 15.87 12.23 0.002044 28.11 3
write 114811 2553.69 6.41 65536 65536 2.72 6.42 0.002190 9.13 4
read 87951 1449.62 11.30 65536 65536 0.617967 11.30 0.002182 11.92 4
Summary of all tests:
Operation Max(MiB) Min(MiB) Mean(MiB) StdDev Max(OPs) Min(OPs) Mean(OPs) StdDev Mean(s) Stonewall(s) Stonewall(MiB) Test# #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggs(MiB) API RefNum
write 117646.18 92834.53 110093.20 9060.87 1838.22 1450.54 1720.21 141.58 9.59668 NA NA 0 16384 64 5 0 1 1 0 0 1 67108864 67108864 1048576.0 MPIIO 0
read 87951.41 37307.84 73485.08 18751.07 1374.24 582.93 1148.20 292.99 15.83250 NA NA 0 16384 64 5 0 1 1 0 0 1 67108864 67108864 1048576.0 MPIIO 0
Finished : Thu Oct 23 22:13:03 2025
This script runs 2 versions of IOR:
- The first run, labeled 'ior MPIIO write then read using cpu buffers', is the regular IOR: each rank writes 64 MB in one contiguous block to a single shared file with collective MPI-IO (MPI_File_write_all) and then reads it back (MPI_File_read_all). The 'cb=1' parameter has the script create a ROMIO hints file that enables collective buffering with 16 MB collective buffers and 8 aggregators per node, and 'iterations=5' runs 5 iterations.
- The script then does the same run, labeled 'ior MPIIO write then read using alltoall transfer', with a different binary: instead of writing its 64 MB in one contiguous chunk, each rank splits the buffer into 16384 4 KB segments and writes them discontiguously across the entire shared file, using a file view on the MPI_File built from a vector datatype (a minimal sketch of both access patterns follows this list).
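To make the two access patterns concrete, here is a minimal, hedged sketch at the MPI-IO level. It is not the IOR source: the file name, the hint keys and values (16 MB collective buffers, 8 aggregators per node), and the vector-datatype construction are illustrative assumptions based on the description above; the exact hints honored may differ on Cray MPICH.

```c
#include <mpi.h>
#include <stdlib.h>

#define BLOCK (64 * 1024 * 1024)   /* 64 MB written per rank              */
#define SEG   4096                 /* 4 KB segment in the strided version */
#define NSEG  (BLOCK / SEG)        /* 16384 segments per rank             */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *buf = malloc(BLOCK);     /* contents don't matter for the sketch */

    /* Collective-buffering hints mirroring the 16 MB collective buffers
     * and 8 aggregators per node described above.  The reproducer passes
     * these through a ROMIO hints file; treat the keys as illustrative.   */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "romio_cb_read", "enable");
    MPI_Info_set(info, "cb_buffer_size", "16777216");
    MPI_Info_set(info, "cb_config_list", "*:8");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testFile",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);

    /* Pattern 1 ('cpu buffers' run): each rank writes one contiguous
     * 64 MB block at its own offset in the shared file, then reads it.   */
    MPI_Offset off = (MPI_Offset)rank * BLOCK;
    MPI_File_write_at_all(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_read_at_all(fh, off, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    /* Pattern 2 ('alltoall transfer' run): the same 64 MB leaves each
     * rank as 16384 4 KB pieces strided across the whole file, expressed
     * as a file view built from a vector datatype.                       */
    MPI_Datatype ftype;
    MPI_Type_vector(NSEG, SEG, SEG * nprocs, MPI_BYTE, &ftype);
    MPI_Type_commit(&ftype);
    MPI_File_set_view(fh, (MPI_Offset)rank * SEG, MPI_BYTE, ftype,
                      "native", info);
    MPI_File_write_all(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_read_all(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_Type_free(&ftype);
    MPI_Info_free(&info);
    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```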
In terms of data movement in the collective buffering phase:
- For the regular IOR, on the write each rank sends one contiguous 16 MB chunk to a 16 MB collective buffer, and on the read each collective buffer sends its entire 16 MB contents back to one rank.
- For the alltoalltransfer version:
  - on the write, each rank segments its buffer and sends 4 KB of data to each of 512 collective buffers,
  - but on the read, each collective buffer sends 4 KB of data to each of 4096 ranks.
The 'wr/rd(s)' column shows the latency of MPI_File_write_all for the writes and MPI_File_read_all for the reads. Comparing the 'ior MPIIO write then read using cpu buffers' run with the 'ior MPIIO write then read using alltoall transfer' run, the write latency roughly doubles and the read latency is about 5x higher.
From a DAOS perspective the actual file I/O is exactly the same, so the delta is entirely in the collective buffering.
All of the data movement is MPI_Send/Recv. In the 'ior MPIIO write then read using alltoall transfer' case, on the write each rank has a completion queue holding 512 MPI_Sends, and on the read each collective buffer has a completion queue of 4096 entries for all the MPI_Sends back to the ranks; in the 'ior MPIIO write then read using cpu buffers' case there is just 1 entry on the completion queue. The size of the completion queue thus roughly tracks the increase in write and read latency: the write and the read move exactly the same number of bytes, so the completion queue depth is the major difference, which is why the read is so much worse than the write in the alltoalltransfer case. A minimal sketch of the two send patterns follows below.
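To make the completion-queue argument concrete, here is a hedged sketch of the two ship-back patterns on the read, with rank 0 standing in for an aggregator: one contiguous send of the whole collective buffer versus one 4 KB MPI_Isend per destination rank drained by MPI_Waitall. This is not the ROMIO code, just an illustration of the queue depths described above.

```c
#include <mpi.h>
#include <stdlib.h>

#define SEG 4096   /* 4 KB shipped back per destination rank on the read */

/* Run with at least 2 ranks; with 4097 ranks the "aggregator" has 4096
 * peers, matching the case described above.                             */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int npeers = nprocs - 1;                /* 4096 in the case above      */
    size_t cbsize = (size_t)npeers * SEG;   /* stands in for the 16 MB cb  */
    char *cb = malloc(cbsize);

    if (rank == 0) {
        /* 'cpu buffers' read pattern: the aggregator ships its whole
         * collective buffer to a single rank -- one completion to retire. */
        MPI_Send(cb, (int)cbsize, MPI_BYTE, 1, 0, MPI_COMM_WORLD);

        /* 'alltoall transfer' read pattern: the same buffer leaves as one
         * 4 KB send per destination rank, so npeers requests sit on the
         * completion queue until MPI_Waitall retires them.                */
        MPI_Request *reqs = malloc(npeers * sizeof *reqs);
        for (int i = 0; i < npeers; i++)
            MPI_Isend(cb + (size_t)i * SEG, SEG, MPI_BYTE,
                      i + 1, 1, MPI_COMM_WORLD, &reqs[i]);
        MPI_Waitall(npeers, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    } else {
        if (rank == 1)   /* the single destination in the contiguous case */
            MPI_Recv(cb, (int)cbsize, MPI_BYTE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* every destination rank gets its own 4 KB piece in the strided case */
        MPI_Recv(cb, SEG, MPI_BYTE, 0, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(cb);
    MPI_Finalize();
    return 0;
}
```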