
MPI communicator split failures #125

Open
DJDavies2 opened this issue Jun 6, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@DJDavies2
Contributor

DJDavies2 commented Jun 6, 2024

What happened?

I am getting failures of this type:

Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.
Completed case 0: Test MPI Communicator Split
0 tests failed out of 1.

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 16084 RUNNING AT expspicesrv053
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Tests that produce this error include eckit_test_mpi_splitcomm, eckit_test_mpi_group and eckit_test_mpi_internal_access.

What are the steps to reproduce the bug?

Build eckit and run the ctests. The problems seem to occur with MPICH but not with OpenMPI.

Version

develop

Platform (OS and architecture)

Linux

Relevant log output

No response

Accompanying data

No response

Organisation

Met Office

@DJDavies2 DJDavies2 added the bug Something isn't working label Jun 6, 2024
@wdeconinck
Member

wdeconinck commented Jun 17, 2024

Probably also related to ecmwf/fckit#41
In that issue there's mention of explicit warnings like:

[WARNING] yaksa: 2 leaked handle pool objects

Yaksa is the datatype engine used internally by MPICH, and the leaked handle pool objects come from it.
My hunch is that the eckit approach of calling MPI_Finalize during the destruction of static objects (after main returns) does not play nicely with MPICH. @tlmquintino do you have any suggestions?
