part-persist: implement message aggregation #13039
AxelSchneewind wants to merge 14 commits into open-mpi:main
Force-pushed from 224e839 to 84882d1.
@mdosanjh Any chance we could get an initial review from you on this PR?
I’ll look into it today.
On Tue, Jan 28, 2025 at 09:15 Tommy Janjusic wrote:
> @mdosanjh Any chance we could get an initial review from you on this PR?
I think someone needs to look at the mpi4py failure. It looks specific to the changes in this PR.
Just an assumption on my side from looking at the error message, without diving into the mpi4py code: it could be related to the extension of fields in the Request handle, which is mapped to a Python object in mpi4py.
I doubt that. It is probably failing only with mpi4py because it has the most extensive test suite in our CI infrastructure. This PR should not be merged until mpi4py passes.
If it's any help, here's where the segfault is occurring: at line 533. The problem is that req->flags is NULL.
Thank you very much, I will look into it. That error never occurred in my tests.
Force-pushed from 83a4131 to 55007af.
Signed-off-by: Axel Schneewind <axel.schneewind@hlrs.de>
This aggregation scheme is intended to let Open MPI transfer larger messages when the user-reported partitions are too small or too numerous. It uses an internal partitioning in which each internal (transfer) partition corresponds to one or more user-reported partitions. The implementation provides an interface for inserting user partitions that optionally outputs a transfer partition that has become ready. Each transfer partition is associated with an atomic counter that tracks the number of corresponding Pready calls. As soon as a counter reaches the number of corresponding user partitions, the insertion call that completed it returns the transfer partition. This implementation is thread-safe.
Signed-off-by: Axel Schneewind <axel.schneewind@hlrs.de>
Force-pushed from 55007af to da974db.
Hello! The Git Commit Checker CI bot found a few problems with this PR: da974db: use MCA_BASE_COMPONENT_INIT
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from da974db to 5816374.
Hello! The Git Commit Checker CI bot found a few problems with this PR: 7e95464: add aggregation scheme header to local_sources
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from 7e95464 to 67aecd6.
The bug that was found by the mpi4py-tests pipeline should now be resolved. I solved it by using separate request lists for each component. This might also be relevant for PR #12796, which also introduces a new part component.
The segfault can occur when a request object is used where
Could anyone take a look at this PR? Also, the pipeline should be rerun, as the timeout seems unrelated to the changes in this PR.
Created a component for partitioned communication that supports message aggregation, based on the existing part-persist component. The goal is to prevent drops in effective bandwidth when using too fine-grained partitionings.
This component allows enforcing hard limits on the size and count of transferred partitions, regardless of the partitioning required by the application. These limits can be specified using MCA parameters (min_message_size, max_message_count). Their default values might require revision.
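To make the two limits concrete, here is a minimal sketch of how a grouping factor could be derived from them. It is not taken from the PR: the helper names (`user_parts_per_transfer`, `transfer_count`, `transfer_index`) are hypothetical, and it assumes equally sized user partitions, contiguous grouping, and that a `max_message_count` of 0 means "no limit". Only the parameter names `min_message_size` and `max_message_count` come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of user partitions grouped into one transfer
 * partition so that both limits hold. Assumes all user partitions have the
 * same size (part_bytes). */
static size_t user_parts_per_transfer(size_t user_parts, size_t part_bytes,
                                      size_t min_message_size,
                                      size_t max_message_count)
{
    assert(user_parts > 0);
    size_t k = 1;

    /* Size limit: group enough user partitions to reach min_message_size. */
    if (part_bytes > 0 && min_message_size > part_bytes) {
        size_t k_size = (min_message_size + part_bytes - 1) / part_bytes;
        if (k_size > k) k = k_size;
    }

    /* Count limit: group enough user partitions to stay within
     * max_message_count transfers (0 is treated as "no limit" here). */
    if (max_message_count > 0 && user_parts > max_message_count) {
        size_t k_count = (user_parts + max_message_count - 1) / max_message_count;
        if (k_count > k) k = k_count;
    }

    /* A group cannot contain more user partitions than exist. */
    if (k > user_parts) k = user_parts;
    return k;
}

/* Number of transfer partitions for a grouping factor k. The last one may
 * hold fewer than k user partitions and can fall below the size limit. */
static size_t transfer_count(size_t user_parts, size_t k)
{
    return (user_parts + k - 1) / k;
}

/* Transfer partition that a given user partition belongs to. */
static size_t transfer_index(size_t user_index, size_t k)
{
    return user_index / k;
}
```

For example, under these assumptions 1000 user partitions of 4096 bytes with min_message_size = 1024 and max_message_count = 64 would be grouped 16 at a time into 63 transfer partitions. The component's actual rounding and remainder handling may differ.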
If a user-provided partitioning violates the constraints, a coarser-grained partitioning is selected, in which multiple user partitions are mapped to one internal (transfer) partition. Transfer of an internal partition is started as soon as Pready has been called on all corresponding user partitions. Each transfer partition is associated with an atomic counter that tracks the number of corresponding user partitions that have been marked ready.
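The counter mechanism described above can be sketched with C11 atomics as follows. This is an illustration only, not the code in this PR: `agg_state_t`, `agg_insert`, `agg_expected`, and `NO_TRANSFER_READY` are hypothetical names, and the contiguous mapping with a fixed grouping factor is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sentinel returned when an insertion does not complete a transfer partition. */
#define NO_TRANSFER_READY ((size_t)-1)

/* Hypothetical per-request aggregation state (not the PR's actual struct). */
typedef struct {
    size_t user_parts;          /* number of user partitions */
    size_t parts_per_transfer;  /* user partitions per transfer partition */
    atomic_size_t *ready;       /* one counter per transfer partition */
} agg_state_t;

/* Number of user partitions mapped to transfer partition t; the last
 * transfer partition may hold fewer than parts_per_transfer. */
static size_t agg_expected(const agg_state_t *s, size_t t)
{
    size_t first = t * s->parts_per_transfer;
    size_t left = s->user_parts - first;
    return left < s->parts_per_transfer ? left : s->parts_per_transfer;
}

/* Called once per Pready on a user partition. Returns the index of the
 * transfer partition that just became complete, or NO_TRANSFER_READY. */
static size_t agg_insert(agg_state_t *s, size_t user_index)
{
    assert(user_index < s->user_parts);
    size_t t = user_index / s->parts_per_transfer;

    /* fetch_add returns the previous value, so exactly one caller observes
     * the final count, even if several threads insert concurrently. */
    size_t seen = atomic_fetch_add(&s->ready[t], 1) + 1;
    return seen == agg_expected(s, t) ? t : NO_TRANSFER_READY;
}
```

Because only the thread whose increment completes the count gets the index back, each transfer partition would be handed out once per round without a lock. In a persistent request the counters would presumably have to be reset before each new round; how the component does that is not shown here.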
The implementation could also be extended to use more advanced aggregation algorithms.
This implementation is a result of "Benchmarking the State of MPI Partitioned Communication in Open MPI" presented at EuroMPI 2024 (https://events.vsc.ac.at/event/123/page/341-program).