Performance issue in ROMIO collective buffer aggregation for Parallel HDF5 on sunspot #6984

pkcoff · 2024-04-18T21:02:46Z

Using the mpich build 'mpich/20231026/icc-all-pmix-gpu' on sunspot and using darshan and vtune for performance analysis I am seeing what appears to be very bad performance in the messaging layer for the ROMIO collective buffering aggregation. I am usng the HDF5 h5bench exerciser benchmark which uses collective MPI-IO for the backend. This is just on 1 node so just intra-node communication, looking at darshan for example using 2 ranks I see:

POSIX   -1      14985684057340396765    POSIX_F_READ_TIME       0.101752        /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991806/hdf5TestFile-844165987    /lus/gila       lustre
POSIX   -1      14985684057340396765    POSIX_F_WRITE_TIME      0.272666        /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991806/hdf5TestFile-844165987    /lus/gila       lustre
MPI-IO  -1      14985684057340396765    MPIIO_F_WRITE_TIME      0.797941        /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991806/hdf5TestFile-844165987    /lus/gila       lustre

Time is in seconds, the total mpi-io time is 0.79 sec and within that the posix (lustre io) time is only 0.27 sec to write and then 0.10 sec to read (if doing rmw) so the delta is most likely messaging layer, and with 16 ranks it gets much worse:

POSIX   -1      4672546656109652293     POSIX_F_READ_TIME       0.774221        /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991811/hdf5TestFile-1491544850   /lus/gila       lustre
POSIX   -1      4672546656109652293     POSIX_F_WRITE_TIME      1.827263        /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991811/hdf5TestFile-1491544850   /lus/gila       lustre
MPI-IO  -1      4672546656109652293     MPIIO_F_WRITE_TIME      37.605015       /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991811/hdf5TestFile-1491544850   /lus/gila       lustre

So for 16 ranks the ratio is alot higher for the messaging layer time in mpiio. HDF5 is using collective mpi-io aggregation so there is a POSIX section which has all the times for the actual lustre filesystem interaction and then an MPIIO section with times that include all the messaging and the POSIX time, so taking the delta between them roughly gives the messaging time for aggregation. With Vtune I can see most all the time for mpi-io writing (MPI_File_write_at_all) is in ofi. So for 1 node 16 ranks the question is for MPIIO out of 37.61 secodes only 2.6 seconds are spent writing to lustre leaving over 35 seconds doing what I presume is mpi communication for the aggregation. To reproduce on sunspot running against lustre (gila):

Start interactive job on 1 node: qsub -lwalltime=60:00 -lselect=1 -A Aurora_deployment -q workq -I

cd /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir
module unload mpich/icc-all-pmix-gpu/52.2
module use /soft/preview-modulefiles/24.086.0
module load mpich/20231026/icc-all-pmix-gpu
export ROMIO_HINTS=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/romio_hints
export MPIR_CVAR_ENABLE_GPU=1
export MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM=mpir
export MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM=mpir
unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE
unset MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE
export LD_LIBRARY_PATH=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir:/soft/datascience/aurora_nre_models_frameworks-2024.0/lib/
export FI_PROVIDER=cxi
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_CQ_FILL_PERCENT=20
export FI_MR_CACHE_MONITOR=disabled
export FI_CXI_OVFLOW_BUF_SIZE=8388608
export DARSHAN_LOGFILE=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/darshan.log
LD_PRELOAD=/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libdarshan.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libhdf5.so:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/libpnetcdf.so mpiexec -np 16 -ppn 16 --cpu-bind=verbose,list:4:56:5:57:6:58:7:59:8:60:9:61:10:62:11:63 --no-vni -envall -genvall /soft/tools/mpi_wrapper_utils/gpu_tile_compact.sh ./hdf5Exerciser --numdims 3 --minels 128 128 128 --nsizes 1 --bufmult 2 2 2 --metacoll --addattr --usechunked --maxcheck 100000 --fileblocks 128 128 128 --filestrides 128 128 128 --memstride 128 --memblock 128

Then to get the darshan text file run this:

./darshan-parser darshan.log > darshan.txt

The text was updated successfully, but these errors were encountered:

raffenet · 2024-05-30T18:52:26Z

Talked to @pkcoff offline. He's going to test without unsetting the collective tuning file envvars to see if there's any impact on the performance and report back.

pkcoff · 2024-05-30T19:49:54Z

@raffenet advised me to NOT unset the collective tuning json vars, I did so:
pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE /soft/restricted/CNDA/updates/mpich/tuning/20230818-1024/CH4_coll_tuning.json pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_CH4_POSIX_COLL_SELECTION_TUNING_JSON_FILE /soft/restricted/CNDA/updates/mpich/20231026/mpich-ofi-all-icc-default-pmix-gpu-drop20231026/json-files/POSIX_coll_tuning.json pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir> echo $MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE /soft/restricted/CNDA/updates/mpich/tuning/20230818-1024/MPIR_Coll_tuning.json pkcoff@x1921c3s4b0n0:/lus/gila/projects/Aurora_deployment/pkcoff/tarurundir>
But that didn't impact performance at all - I still see a ton of time in what I think is MPIIO aggregation per the darshan results:
POSIX -1 9899278149064962649 POSIX_F_READ_TIME 0.829620 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre POSIX -1 9899278149064962649 POSIX_F_WRITE_TIME 1.672418 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre P MPI-IO -1 9899278149064962649 MPIIO_F_WRITE_TIME 36.682667 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-2129464776 /lus/gila lustre M

pkcoff · 2024-05-30T19:57:13Z

At the advice of Ken I also unset all these:
unset MPIR_CVAR_ENABLE_GPU unset MPIR_CVAR_BCAST_POSIX_INTRA_ALGORITHM unset MPIR_CVAR_ALLREDUCE_POSIX_INTRA_ALGORITHM unset MPIR_CVAR_BARRIER_POSIX_INTRA_ALGORITHM unset MPIR_CVAR_REDUCE_POSIX_INTRA_ALGORITHM
But performance was actually a bit worse:
POSIX -1 5703841420005174308 POSIX_F_READ_TIME 0.774989 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre POSIX -1 5703841420005174308 POSIX_F_WRITE_TIME 1.770138 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre MPI-IO -1 5703841420005174308 MPIIO_F_WRITE_TIME 38.300559 /lus/gila/projects/Aurora_deployment/pkcoff/tarurundir/hdf5TestFile-1873931539 /lus/gila lustre

roblatham00 · 2024-10-22T19:11:57Z

hey paul this is way late but can you try again with reasonable lustre striping? (lfs setstripe -c -1 /lus/gila/projects/Aurora_deployment/pkcoff/run/h5bench/rundir8991806 or whatever the directory is)

roblatham00 · 2024-10-22T19:51:55Z

no matter how many processes you run on a node, the default is for just one of those to be an i/o aggregator, so the 15 non-aggregator processes are going to increase their time score. more stripes increases parallelism as does more aggregators. so try setting these hints:

"romio_lustre_co_ratio" to 2 or 4
"cb_config_list" to "*:8" (or 2 or 4)
"striping_factor" to N, 2N, 4N where N is number of OSTs on /lus/gila

raffenet · 2024-10-22T20:01:15Z

@pkcoff another thing we should do is confirm this issue on actual Aurora vs. Sunspot. If the Aurora behavior is different we need to consider removing the aurora tag and/or closing the issue.

pkcoff · 2024-10-23T01:27:40Z

@raffenet There is similar behavior on aurora with daos, I only opened it on sunspot and gila because they were not available when I had time to open the issue. Once daos_user is stabilized I will retest at 16 ppn with 8 aggregators as @roblatham00 suggests and report back.

raffenet self-assigned this May 30, 2024

raffenet added the aurora label Jul 11, 2024

roblatham00 added the romio anything having to do with our MPI-IO implementation label Sep 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance issue in ROMIO collective buffer aggregation for Parallel HDF5 on sunspot #6984

Performance issue in ROMIO collective buffer aggregation for Parallel HDF5 on sunspot #6984

pkcoff commented Apr 18, 2024 •

edited by raffenet

Loading

raffenet commented May 30, 2024

pkcoff commented May 30, 2024

pkcoff commented May 30, 2024

roblatham00 commented Oct 22, 2024

roblatham00 commented Oct 22, 2024

raffenet commented Oct 22, 2024

pkcoff commented Oct 23, 2024

Performance issue in ROMIO collective buffer aggregation for Parallel HDF5 on sunspot #6984

Performance issue in ROMIO collective buffer aggregation for Parallel HDF5 on sunspot #6984

Comments

pkcoff commented Apr 18, 2024 • edited by raffenet Loading

raffenet commented May 30, 2024

pkcoff commented May 30, 2024

pkcoff commented May 30, 2024

roblatham00 commented Oct 22, 2024

roblatham00 commented Oct 22, 2024

raffenet commented Oct 22, 2024

pkcoff commented Oct 23, 2024

pkcoff commented Apr 18, 2024 •

edited by raffenet

Loading