ROMIO: excessive number of calls to memcpy() #6985
Comments
I haven't looked at the code, but a slowdown factor of 4 (from 8.27 to 33.07 seconds) from contiguous to noncontiguous seems normal to me, especially if the data consists of many small segments.
The focus of this issue is the number of memcpy calls, as indicated in the issue description. The test runs I provided were just to prove the point; they are small and reproducible.
Thank you for the details!
Hui
Writing down my notes after looking at the code:
The buffer in memory is a "dense" noncontiguous datatype -- in the reproducer it is two segments -- but the filetype is fairly fragmented. In the aggregation code, we calculate a contig_access_count, which is the number of segments resulting from the intersection of the memory buffer datatype and the file datatype. In the reproducer, this comes to 2097153 for each process. In ADIOI_Fill_send_buffer, each process memcpies those segments into a send buffer before sending to the aggregators, which results in 2097153 memcpy calls and significantly hurts performance.
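To make the copy pattern above concrete, here is a minimal, self-contained sketch of per-segment packing; it is not the actual ADIOI_Fill_send_buffer code, and the segment list (the flat_seg struct, NSEG, SEGLEN, GAP) is invented for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical flattened segment list; offsets and lengths are made up. */
struct flat_seg { size_t mem_off; size_t len; };

int main(void)
{
    enum { NSEG = 4, SEGLEN = 8, GAP = 2 };
    struct flat_seg segs[NSEG];
    char user_buf[NSEG * (SEGLEN + GAP)];  /* noncontiguous: GAP bytes after each segment */
    char send_buf[NSEG * SEGLEN];
    size_t send_off = 0;

    memset(user_buf, 'x', sizeof(user_buf));
    for (int i = 0; i < NSEG; i++) {
        segs[i].mem_off = (size_t)i * (SEGLEN + GAP);
        segs[i].len     = SEGLEN;
    }

    /* One memcpy per flattened segment; in ROMIO the equivalent loop runs
     * contig_access_count times (2097153 in the reproducer) per process. */
    for (int i = 0; i < NSEG; i++) {
        memcpy(send_buf + send_off, user_buf + segs[i].mem_off, segs[i].len);
        send_off += segs[i].len;
    }
    printf("packed %zu bytes using %d memcpy calls\n", send_off, NSEG);
    return 0;
}
```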
@wkliao Does the above describe the issue?
I am not familiar with the ROMIO code, so I could be way off -- why don't we use MPI_Pack to prepare the send buffer?
Also, something I have been thinking about: if we supported a "partial datatype", e.g. MPIX_Type_create_partial(old_count, old_type, offset, size, &new_type), that might be useful. It would let middleware such as ROMIO use MPI directly for pipeline-like operations without messing with flat_list or contig_segments.
--
Hui
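For readers unfamiliar with it, below is a small, self-contained illustration of the MPI_Pack idea raised in the comment above. The vector layout (COUNT, BLKLEN, STRIDE) is an invented stand-in for the real memory/file-view intersection, and this is not proposed ROMIO code; MPI_Pack still traverses the segments internally, so whether it actually copies less than the current loop is a separate question.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Illustrative noncontiguous layout: COUNT blocks of BLKLEN ints, strided. */
    enum { COUNT = 1024, BLKLEN = 4, STRIDE = 5 };
    int *user_buf = malloc(sizeof(int) * COUNT * STRIDE);
    for (int i = 0; i < COUNT * STRIDE; i++) user_buf[i] = i;

    MPI_Datatype memtype;
    MPI_Type_vector(COUNT, BLKLEN, STRIDE, MPI_INT, &memtype);
    MPI_Type_commit(&memtype);

    int pack_size, position = 0;
    MPI_Pack_size(1, memtype, MPI_COMM_WORLD, &pack_size);
    char *send_buf = malloc(pack_size);

    /* A single call packs all segments into the contiguous send buffer;
     * the segment traversal happens inside the MPI library. */
    MPI_Pack(user_buf, 1, memtype, send_buf, pack_size, &position, MPI_COMM_WORLD);
    printf("packed %d bytes\n", position);

    free(send_buf);
    free(user_buf);
    MPI_Type_free(&memtype);
    MPI_Finalize();
    return 0;
}
```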
Your understanding of the issue is correct.
I think it is because of the memory footprint. In my test program, the additional memory space is 32 MB; for a bigger problem size, the footprint is bigger. I do not follow the idea of a "partial datatype". Will it help construct a datatype that is the intersection of two other datatypes (the user buffer type and the file view)?
@wkliao Hui implemented a way to work on datatypes without flattening the whole thing first. We would still have to compute the intersection of the memory type and the file view, but I think his hope is that the datatype data structures might be less memory intensive -- not as a solution to this issue, but an idea for a ROMIO enhancement that came to mind while looking at this code.
FYI, as the current implementation of collective I/O is done in multiple rounds of two-phase I/O, I added code inside ROMIO to measure the memory footprint and ran pio_noncontig.c. As for this issue, my own solution is to check whether or not the part of the user buffer …
A PnetCDF user reported poor performance of collective writes when using a
noncontiguous write buffer. The root of the problem is a large number of
calls to memcpy() in ADIOI_BUF_COPY in mpich/src/mpi/romio/adio/common/ad_write_coll.c.
A performance reproducer is available at
https://github.com/wkliao/mpi-io-examples/blob/master/tests/pio_noncontig.c
This program makes a single call to MPI_File_write_at_all. The user buffer can
be either contiguous (command-line option -g 0) or noncontiguous (the default).
The noncontiguous case adds a gap of 16 bytes into the buffer. The file view
consists of multiple subarray datatypes, appended one after another. A further
description of the I/O pattern can be found at the beginning of the program file.
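As a rough illustration of the buffer layout just described (this is not the actual pio_noncontig.c, and the segment sizes are made up), a noncontiguous user buffer consisting of two contiguous segments separated by a 16-byte gap can be described to MPI with a single hindexed datatype:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { SEG = 1024, GAP = 16 };
    static char buf[2 * SEG + GAP];       /* [SEG bytes][16-byte gap][SEG bytes] */
    memset(buf, 0, sizeof(buf));

    int          blocklens[2] = { SEG, SEG };
    MPI_Aint     disps[2]     = { 0, SEG + GAP };
    MPI_Datatype buftype;

    MPI_Type_create_hindexed(2, blocklens, disps, MPI_BYTE, &buftype);
    MPI_Type_commit(&buftype);

    /* A collective write would then pass (buf, 1, buftype) as the user buffer,
     * e.g. MPI_File_write_at_all(fh, offset, buf, 1, buftype, &status). */
    int type_size;
    MPI_Type_size(buftype, &type_size);
    printf("buffer datatype describes %d bytes in two segments\n", type_size);

    MPI_Type_free(&buftype);
    MPI_Finalize();
    return 0;
}
```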
Running this program with 16 MPI processes on a Linux machine using the UFS
ADIO driver reported run times of 33.07 and 8.27 seconds; the former is with the
noncontiguous user buffer and the latter with the contiguous one. The user buffer
on each process is 32 MB in size. The run command used:
The following patch, if applied to MPICH, prints the number of calls to memcpy():
https://github.com/wkliao/mpi-io-examples/blob/master/tests/0001-print-number-of-calls-to-memcpy.patch
The numbers of memcpy calls from the above two runs are 2097153 and 0, respectively.
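The actual patch is at the URL above and is not reproduced here. The sketch below only shows the general kind of counting instrumentation such a patch can apply; the COUNTED_MEMCPY wrapper and the idea of routing ADIOI_BUF_COPY through it are illustrative assumptions, not the patch's contents.

```c
#include <stdio.h>
#include <string.h>

static long memcpy_count;  /* incremented on every tracked copy */

/* Hypothetical wrapper: in ROMIO one could redefine the buffer-copy macro in
 * ad_write_coll.c to go through something like this and print the counter
 * after the collective write completes. */
#define COUNTED_MEMCPY(dst, src, n) \
    (memcpy_count++, memcpy((dst), (src), (n)))

int main(void)
{
    char src[64] = { 0 }, dst[64];
    COUNTED_MEMCPY(dst, src, sizeof(dst));
    printf("number of memcpy calls = %ld\n", memcpy_count);
    return 0;
}
```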