ROMIO: excessive number of calls to memcpy() #6985
Comments
I haven't looked at the code, but a slowdown factor of 4 (from 8.27 to 33.07 seconds) from contiguous to noncontiguous seems normal to me, especially if the data consists of many small segments.
The focus of this issue is the number of memcpy calls, as indicated in the issue description. The test runs I provided were just to prove the point; they are small and reproducible.
Thank you for the details!
Hui
Writing down my notes after looking at the code:
The buffer in memory is a "dense" noncontiguous datatype -- in the reproducer it is two segments -- but the filetype is fairly fragmented. In the aggregation code, we calculate a contig_access_count, which is the number of segments resulting from the intersection of the memory buffer datatype and the file datatype. In the reproducer, this comes to 2097153 for each process. In ADIOI_Fill_send_buffer, each process memcpies those segments into a send buffer before sending to the aggregators, which results in 2097153 memcpy calls and significantly hurts performance.
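To make the copy pattern above concrete, here is a minimal, self-contained sketch of per-segment packing; it is not the actual ADIOI_Fill_send_buffer code, and the segment list (the flat_seg struct, NSEG, SEGLEN, GAP) is invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flattened segment list; offsets and lengths are made up. */
struct flat_seg { size_t mem_off; size_t len; };

int main(void)
{
    enum { NSEG = 4, SEGLEN = 8, GAP = 2 };
    struct flat_seg segs[NSEG];
    char user_buf[NSEG * (SEGLEN + GAP)];  /* noncontiguous: GAP bytes after each segment */
    char send_buf[NSEG * SEGLEN];
    size_t send_off = 0;

    memset(user_buf, 'x', sizeof(user_buf));
    for (int i = 0; i < NSEG; i++) {
        segs[i].mem_off = (size_t)i * (SEGLEN + GAP);
        segs[i].len     = SEGLEN;
    }

    /* One memcpy per flattened segment; in ROMIO the equivalent loop runs
     * contig_access_count times (2097153 in the reproducer) per process. */
    for (int i = 0; i < NSEG; i++) {
        memcpy(send_buf + send_off, user_buf + segs[i].mem_off, segs[i].len);
        send_off += segs[i].len;
    }
    printf("packed %zu bytes using %d memcpy calls\n", send_off, NSEG);
    return 0;
}
```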
@wkliao Does the above describe the issue?
I am not familiar with the ROMIO code, so I could be way off -- why don't we use MPI_Pack to prepare the send buffer?
Also, something I have been thinking about: if we supported a "partial datatype", e.g. MPIX_Type_create_partial(old_count, old_type, offset, size, &new_type), that might be useful. It would let middleware such as ROMIO use MPI directly for pipeline-like operations without messing with flat_list or contig_segments.
--
Hui
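For readers unfamiliar with it, below is a small, self-contained illustration of the MPI_Pack idea raised in the comment above. The vector layout (COUNT, BLKLEN, STRIDE) is an invented stand-in for the real memory/file-view intersection, and this is not proposed ROMIO code; MPI_Pack still traverses the segments internally, so whether it actually copies less than the current loop is a separate question.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Illustrative noncontiguous layout: COUNT blocks of BLKLEN ints, strided. */
    enum { COUNT = 1024, BLKLEN = 4, STRIDE = 5 };
    int *user_buf = malloc(sizeof(int) * COUNT * STRIDE);
    for (int i = 0; i < COUNT * STRIDE; i++) user_buf[i] = i;

    MPI_Datatype memtype;
    MPI_Type_vector(COUNT, BLKLEN, STRIDE, MPI_INT, &memtype);
    MPI_Type_commit(&memtype);

    int pack_size, position = 0;
    MPI_Pack_size(1, memtype, MPI_COMM_WORLD, &pack_size);
    char *send_buf = malloc(pack_size);

    /* A single call packs all segments into the contiguous send buffer;
     * the segment traversal happens inside the MPI library. */
    MPI_Pack(user_buf, 1, memtype, send_buf, pack_size, &position, MPI_COMM_WORLD);
    printf("packed %d bytes\n", position);

    free(send_buf);
    free(user_buf);
    MPI_Type_free(&memtype);
    MPI_Finalize();
    return 0;
}
```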
Your understanding of the issue is correct.
I think it is because of the memory footprint. In my test program, the additional memory space is 32 MB; for a bigger problem size, the footprint is bigger. I do not follow the idea of a "partial datatype". Will it help construct a datatype that is the intersection of two other datatypes (the user buffer type and the file view)?
@wkliao Hui implemented a way to work on datatypes without flattening the whole thing first. We would still have to compute the intersection of the memory type and the file view, but I think his hope is that the datatype data structures might be less memory intensive -- not as a solution to this issue, but an idea for a ROMIO enhancement that came to mind while looking at this code.
FYI, as the current implementation of collective I/O is done in multiple rounds of two-phase I/O, I added code inside ROMIO to measure the memory footprint and ran pio_noncontig.c. As for this issue, my own solution is to check whether or not the part of the user buffer …
A PnetCDF user reported poor performance of collective writes when using a
noncontiguous write buffer. The root of the problem is a large number of
calls to memcpy() in ADIOI_BUF_COPY in mpich/src/mpi/romio/adio/common/ad_write_coll.c.
A performance reproducer is available at
https://github.com/wkliao/mpi-io-examples/blob/master/tests/pio_noncontig.c
This program makes a single call to MPI_File_write_at_all. The user buffer can
be either contiguous (command-line option -g 0) or noncontiguous (the default).
The noncontiguous case adds a gap of 16 bytes into the buffer. The file view
consists of multiple subarray datatypes, appended one after another. A further
description of the I/O pattern can be found at the beginning of the program file.
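As a rough illustration of the buffer layout just described (this is not the actual pio_noncontig.c, and the segment sizes are made up), a noncontiguous user buffer consisting of two contiguous segments separated by a 16-byte gap can be described to MPI with a single hindexed datatype:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { SEG = 1024, GAP = 16 };
    static char buf[2 * SEG + GAP];       /* [SEG bytes][16-byte gap][SEG bytes] */
    memset(buf, 0, sizeof(buf));

    int          blocklens[2] = { SEG, SEG };
    MPI_Aint     disps[2]     = { 0, SEG + GAP };
    MPI_Datatype buftype;

    MPI_Type_create_hindexed(2, blocklens, disps, MPI_BYTE, &buftype);
    MPI_Type_commit(&buftype);

    /* A collective write would then pass (buf, 1, buftype) as the user buffer,
     * e.g. MPI_File_write_at_all(fh, offset, buf, 1, buftype, &status). */
    int type_size;
    MPI_Type_size(buftype, &type_size);
    printf("buffer datatype describes %d bytes in two segments\n", type_size);

    MPI_Type_free(&buftype);
    MPI_Finalize();
    return 0;
}
```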
Running this program with 16 MPI processes on a Linux machine using the UFS
ADIO driver reported run times of 33.07 and 8.27 seconds; the former is with the
noncontiguous user buffer and the latter with the contiguous one. The user buffer
on each process is 32 MB in size. The run command used:
The following patch, if applied to MPICH, prints the number of calls to memcpy():
https://github.com/wkliao/mpi-io-examples/blob/master/tests/0001-print-number-of-calls-to-memcpy.patch
The numbers of memcpy calls from the above two runs are 2097153 and 0, respectively.
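The actual patch is at the URL above and is not reproduced here. The sketch below only shows the general kind of counting instrumentation such a patch can apply; the COUNTED_MEMCPY wrapper and the idea of routing ADIOI_BUF_COPY through it are illustrative assumptions, not the patch's contents.

```c
#include <stdio.h>
#include <string.h>

static long memcpy_count;  /* incremented on every tracked copy */

/* Hypothetical wrapper: in ROMIO one could redefine the buffer-copy macro in
 * ad_write_coll.c to go through something like this and print the counter
 * after the collective write completes. */
#define COUNTED_MEMCPY(dst, src, n) \
    (memcpy_count++, memcpy((dst), (src), (n)))

int main(void)
{
    char src[64] = { 0 }, dst[64];
    COUNTED_MEMCPY(dst, src, sizeof(dst));
    printf("number of memcpy calls = %ld\n", memcpy_count);
    return 0;
}
```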