
part-persist: implement message aggregation#13039

Open
AxelSchneewind wants to merge 14 commits into open-mpi:main from AxelSchneewind:part-persist-aggregated
Conversation

AxelSchneewind (Contributor) commented Jan 15, 2025

Created a component for partitioned communication that supports message aggregation, based on the existing part-persist component. The goal is to prevent drops in effective bandwidth when the application's partitioning is too fine-grained.

This component allows enforcing hard limits on the size and count of transferred partitions, regardless of the partitioning required by the application. These limits can be specified via MCA parameters (min_message_size, max_message_count). Their default values might require revision.

If a user-provided partitioning violates these constraints, a coarser partitioning is selected internally, mapping multiple user partitions onto one internal (transfer) partition. Transfer of an internal partition is started as soon as Pready has been called on all of its corresponding user partitions. To track this, each transfer partition is associated with an atomic counter counting the corresponding user partitions that have been marked ready.

The implementation could also be extended to use more advanced aggregation algorithms.

This implementation is a result of "Benchmarking the State of MPI Partitioned Communication in Open MPI" presented at EuroMPI 2024 (https://events.vsc.ac.at/event/123/page/341-program).

@AxelSchneewind AxelSchneewind marked this pull request as ready for review January 15, 2025 15:00
@AxelSchneewind AxelSchneewind force-pushed the part-persist-aggregated branch from 224e839 to 84882d1 Compare January 17, 2025 13:54
@janjust janjust requested a review from mdosanjh January 28, 2025 16:14
janjust (Contributor) commented Jan 28, 2025

@mdosanjh Any chance we could get an initial review from you on this PR?

mdosanjh (Contributor) commented Jan 28, 2025 via email

hppritcha (Member)

I think someone needs to look at the mpi4py failure. It looks specific to the changes in this PR.

cniethammer (Contributor)

Just an assumption from my side, from looking at the error message without diving into the mpi4py code: it could be related to the extension of fields in the Request handle, which is mapped to a Python object in mpi4py.

hppritcha (Member)

I doubt that. It is probably failing only with mpi4py because it has the most extensive test suite in our CI infrastructure. This PR should not be merged until mpi4py passes.

hppritcha (Member)

If it's any help, here's where the segfault is occurring:

514	mca_part_persist_aggregated_start(size_t count, ompi_request_t** requests)
515	{
516	    int err = OMPI_SUCCESS;
517	    size_t _count = count;
518	
519	    for(size_t i = 0; i < _count && OMPI_SUCCESS == err; i++) {
520	        mca_part_persist_aggregated_request_t *req = (mca_part_persist_aggregated_request_t *)(requests[i]);
521	
522	        // reset aggregation state here
523	        if (MCA_PART_PERSIST_AGGREGATED_REQUEST_PSEND == req->req_type) {
524	            mca_part_persist_aggregated_psend_request_t *sendreq = (mca_part_persist_aggregated_psend_request_t *)(req);
525	            part_persist_aggregate_regular_reset(&sendreq->aggregation_state);
526	        }
527	
528	        /* First use is a special case, to support lazy initialization */
529	        if(false == req->first_send)
530	        {
531	            if(MCA_PART_PERSIST_AGGREGATED_REQUEST_PSEND == req->req_type) {
532	                req->done_count = 0;
533	                memset((void*)req->flags,0,sizeof(int32_t)*req->real_parts);
534	            } else {
535	                req->done_count = 0;
536	                err = req->persist_reqs[0]->req_start(req->real_parts, req->persist_reqs);
537	                memset((void*)req->flags,0,sizeof(int32_t)*req->real_parts);
538	            }
539	        } else {
540	            if(MCA_PART_PERSIST_AGGREGATED_REQUEST_PSEND == req->req_type) {
541	                req->done_count = 0;
542	                for(size_t j = 0; j < req->real_parts && OMPI_SUCCESS == err; j++) {
543	                    req->flags[j] = -1;
544	                }
545	            } else {
546	                req->done_count = 0;
547	            }
548	        }
549	        req->req_ompi.req_state = OMPI_REQUEST_ACTIVE;
550	        req->req_ompi.req_status.MPI_TAG = MPI_ANY_TAG;

at line 533. The problem is that req->flags is NULL there.

AxelSchneewind (Contributor, Author) commented Feb 6, 2025

Thank you very much, I will look into it. That error never occurred in my tests.

@AxelSchneewind AxelSchneewind force-pushed the part-persist-aggregated branch 2 times, most recently from 83a4131 to 55007af Compare December 17, 2025 12:17
@AxelSchneewind AxelSchneewind marked this pull request as draft December 17, 2025 12:17
Signed-off-by: Axel Schneewind <axel.schneewind@hlrs.de>
This aggregation scheme is intended to allow Open MPI to transfer larger messages
when the user-reported partitions are too small or too numerous.
This is achieved by using an internal partitioning in which each internal (transfer)
partition corresponds to one or more user-reported partitions.

The implementation provides an interface for inserting user partitions;
each insertion optionally outputs a transfer partition that has become ready.

This is achieved by associating each transfer partition with an atomic counter
tracking the number of corresponding Pready calls. As soon as a counter reaches
the number of corresponding user partitions, the transfer partition is returned from the respective insertion call.

This implementation is thread-safe.

Signed-off-by: Axel Schneewind <axel.schneewind@hlrs.de>
@AxelSchneewind AxelSchneewind force-pushed the part-persist-aggregated branch from 55007af to da974db Compare December 17, 2025 12:39
github-actions (bot)

Hello! The Git Commit Checker CI bot found a few problems with this PR:

da974db: use MCA_BASE_COMPONENT_INIT

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@AxelSchneewind AxelSchneewind force-pushed the part-persist-aggregated branch from da974db to 5816374 Compare December 17, 2025 12:40
github-actions (bot)

Hello! The Git Commit Checker CI bot found a few problems with this PR:

7e95464: add aggregation scheme header to local_sources

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@AxelSchneewind AxelSchneewind force-pushed the part-persist-aggregated branch from 7e95464 to 67aecd6 Compare December 19, 2025 15:19
AxelSchneewind (Contributor, Author)

The bug found by the mpi4py test pipeline should now be resolved.
The error was caused by a mismatch between the function pointers for the implementations of MPI_Start and MPI_P[send|recv]_Init (for more detail, see below).

I solved it by using separate request lists for each component.

This might also be relevant for PR #12796, which also introduces a new part component.
I understand the issue as follows:

  • The function pointer for MPI_Psend_Init is stored in the mca_part object and set by mca_part_base_select.
  • The function pointer for MPI_Start is stored in the request objects (the req_start field), which are preallocated when a component is opened. As I understand it, all components (i.e. both part_persist and part_persist_aggregated) are opened during MPI_Init, and each initializes some request objects.

The segfault can occur when a request object is used whose req_start points to the implementation from a component that has not been selected.

@AxelSchneewind AxelSchneewind marked this pull request as ready for review January 5, 2026 13:08
AxelSchneewind (Contributor, Author)

Could anyone take a look at this PR?

Also, the pipeline should be rerun, as the timeout seems unrelated to the changes in this PR.
