Sort output for test_iProbe. #119

Draft. Wants to merge 2 commits into base: develop

Conversation

DJDavies2
Contributor

Would this be a potential fix for #47? It seems that sometimes the output is not in the correct order for the ASSERT. I'm not sure whether the output is required to be in a particular order here. If not, then sorting the output might be a solution.

@@ -1000,6 +1001,8 @@ CASE("test_iProbe") {
--count;
}

std::sort(data.begin(), data.begin() + nproc);
Contributor

Given that we know the size is nproc, we may as well write

std::sort(data.begin(), data.end());

or better still

std::sort(begin(data), end(data));

Also, I'd expect the same fix to be needed for test_probe, although I'm too surprised we hit this first with test_iProbe.
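
For illustration, here is a minimal sketch of the two spellings, assuming `data` is a `std::vector<int>` that holds exactly `nproc` elements (an assumption about the test, not confirmed here). The helper names `matches_rank_order`, `sort_prefix` and `sort_all` are hypothetical and not part of eckit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper (not part of eckit): true when data[i] == i for every i.
bool matches_rank_order(const std::vector<int>& data) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] != static_cast<int>(i)) {
            return false;
        }
    }
    return true;
}

// Sort only the first nproc elements, as in the original patch.
void sort_prefix(std::vector<int>& data, std::size_t nproc) {
    assert(nproc <= data.size());  // a larger nproc would run past the end
    std::sort(data.begin(), data.begin() + static_cast<std::ptrdiff_t>(nproc));
}

// Sort the whole container, as suggested in the review.
void sort_all(std::vector<int>& data) {
    using std::begin;
    using std::end;
    std::sort(begin(data), end(data));
}
```

When `data.size() == nproc` the two functions do the same work. The whole-container form simply removes the need to keep `nproc` and the container size in agreement.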

Contributor Author

Thanks, I have updated as you say.

The failure is transient, so maybe it does affect test_probe as well and I just haven't noticed. I have made the same change for test_probe.

@wdeconinck
Member

wdeconinck commented May 3, 2024

Is this transient failure still happening?

The test should not fail without the sorting either, though. I'm puzzled why sorting should be needed. Also, "test_probe" was mentioned as intermittently failing too, and this sorting is not applied there.

We can also try to make the test more verbose in case of failure, e.g. by having each MPI rank log the content of "data".
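
As a rough sketch of what such a per-rank message could look like, assuming `data` is a `std::vector<int>` and the check is `data[i] == i`: the function name `describe_data` and the message layout below are hypothetical, not what was later added to eckit.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not eckit API): format one rank's view of `data`
// as a single line, flagging the first index where data[i] != i.
std::string describe_data(int rank, const std::vector<int>& data) {
    assert(rank >= 0);
    std::ostringstream out;
    out << "rank " << rank << ": data = [";
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (i > 0) {
            out << ", ";
        }
        out << data[i];
    }
    out << "]";
    // Report only the first mismatch to keep the line short.
    for (std::size_t i = 0; i < data.size(); ++i) {
        if (data[i] != static_cast<int>(i)) {
            out << ", first mismatch at i=" << i << " (data[i]=" << data[i] << ")";
            break;
        }
    }
    return out.str();
}
```

For example, `describe_data(1, {0, 2, 1, 3})` returns `rank 1: data = [0, 2, 1, 3], first mismatch at i=1 (data[i]=2)`. Building the whole line in memory first means each rank can emit it with a single write.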

@DJDavies2
Contributor Author

More verbosity upon failure would be helpful. It isn't trivial, though, as steps would need to be taken to ensure the output from different processors is presented in a readable way; just printing things out results in garbled output.
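
One possible way to keep such output readable, sketched under assumptions: each rank formats its message as one complete line, the lines are collected on a single rank, and that rank prints them in rank order. The communication step (for example a gather to rank 0) is deliberately left out. `RankLine` and `ordered_report` are hypothetical names, not eckit API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One formatted line per rank, tagged with the rank that produced it.
using RankLine = std::pair<int, std::string>;

// Hypothetical helper: given lines in arrival order, build a single
// report with one line per rank, ordered by rank number.
std::string ordered_report(std::vector<RankLine> lines) {
    std::stable_sort(lines.begin(), lines.end(),
                     [](const RankLine& a, const RankLine& b) { return a.first < b.first; });
    std::string report;
    for (const RankLine& line : lines) {
        assert(line.first >= 0);  // ranks are non-negative
        report += "[" + std::to_string(line.first) + "] " + line.second + "\n";
    }
    return report;
}
```

Writing the returned string in one call from a single rank avoids the character-level interleaving that appears when several ranks write to the same stream at once.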

@wdeconinck
Member

I have added more output to this test in the develop branch in case of failures...

@wdeconinck marked this pull request as draft May 6, 2024 14:38
@DJDavies2
Contributor Author

> I have added more output to this test in the develop branch in case of failures...

I ran the test with your updates and the extra output, and this is what it produced:

76: Test command: /opt/cray/snplauncher/7.6.0/bin/mpiexec "-n" "4" "/home/d03/frwd/cylc-run/eckit-47/share/make_mo__cray_gnu/eckit/tests/mpi/eckit_test_mpi_parallel"
76: Working Directory: /home/d03/frwd/cylc-run/eckit-47/share/make_mo__cray_gnu/eckit/tests/mpi
76: Environment variables:
76: OMP_NUM_THREADS=1
76: Test timeout computed to be: 10000000
76: Running 23 tests:
76: Running 23 tests:
76: Running 23 tests:
76: Running 23 tests:
76: Running case 0: test_rank_size ...
76: Running case 0: test_rank_size ...
76: Running case 0: test_rank_size ...
76: Running case 0: test_rank_size ...
76: Completed case 0: test_rank_size
76: Running case 1: test_broadcast ...
76: Completed case 0: test_rank_size
76: Completed case 0: test_rank_size
76: Running case 1: test_broadcast ...
76: Completed case 0: test_rank_size
76: Running case 1: test_broadcast ...
76: Test value
76: Running case 1: test_broadcast ...
76: Test value
76: Test value
76: Test value
76: Test vector
76: Test raw data
76: Test vector<pair<int,int>>
76: Test vector<pair<long,long>>
76: Test vector
76: Test raw data
76: Test vector<pair<int,int>>
76: Test vector
76: Test raw data
76: Test vector<pair<int,int>>
76: Test vector
76: Test raw data
76: Test vector<pair<int,int>>
76: Test vector<pair<long,long>>
76: Test vector<pair<long,long>>
76: Test vector<pair<long long,long long>>
76: Test vector<pair<long long,long long>>
76: Completed case 1: test_broadcast
76: Running case 2: test_gather_scalar ...
76: Test vector<pair<long,long>>
76: Test vector<pair<long long,long long>>
76: Completed case 1: test_broadcast
76: Running case 2: test_gather_scalar ...
76: Test vector<pair<long long,long long>>
76: Completed case 1: test_broadcast
76: Running case 2: test_gather_scalar ...
76: Completed case 1: test_broadcast
76: Running case 2: test_gather_scalar ...
76: Completed case 2: test_gather_scalar
76: Completed case 2: test_gather_scalar
76: Running case 3: test_gather_nscalars ...
76: Completed case 2: test_gather_scalar
76: Running case 3: test_gather_nscalars ...
76: Completed case 2: test_gather_scalar
76: Running case 3: test_gather_nscalars ...
76: Running case 3: test_gather_nscalars ...
76: Completed case 3: test_gather_nscalars
76: Running case 4: test_gatherv_equal_stride ...
76: Completed case 3: test_gather_nscalars
76: Running case 4: test_gatherv_equal_stride ...
76: Completed case 3: test_gather_nscalars
76: Running case 4: test_gatherv_equal_stride ...
76: Completed case 4: test_gatherv_equal_stride
76: Running case 5: test_gatherv_unequal_stride ...
76: Completed case 4: test_gatherv_equal_stride
76: Running case 5: test_gatherv_unequal_stride ...
76: Completed case 5: test_gatherv_unequal_stride
76: Running case 6: test_scatter_scalar ...
76: Completed case 5: test_gatherv_unequal_stride
76: Running case 6: test_scatter_scalar ...
76: Completed case 4: test_gatherv_equal_stride
76: Running case 5: test_gatherv_unequal_stride ...
76: Completed case 5: test_gatherv_unequal_stride
76: Running case 6: test_scatter_scalar ...
76: Completed case 3: test_gather_nscalars
76: Running case 4: test_gatherv_equal_stride ...
76: Completed case 4: test_gatherv_equal_stride
76: Running case 5: test_gatherv_unequal_stride ...
76: Completed case 5: test_gatherv_unequal_stride
76: Running case 6: test_scatter_scalar ...
76: Completed case 6: test_scatter_scalar
76: Running case 7: test_scatter_nscalars ...
76: Completed case 6: test_scatter_scalar
76: Running case 7: test_scatter_nscalars ...
76: Completed case 6: test_scatter_scalar
76: Running case 7: test_scatter_nscalars ...
76: Completed case 6: test_scatter_scalar
76: Running case 7: test_scatter_nscalars ...
76: Completed case 7: test_scatter_nscalars
76: Running case 8: test_scatterv ...
76: Completed case 8: test_scatterv
76: Running case 9: sendReceiveReplace ...
76: Completed case 7: test_scatter_nscalars
76: Running case 8: test_scatterv ...
76: Completed case 8: test_scatterv
76: Running case 9: sendReceiveReplace ...
76: Completed case 7: test_scatter_nscalars
76: Running case 8: test_scatterv ...
76: Completed case 8: test_scatterv
76: Running case 9: sendReceiveReplace ...
76: Completed case 7: test_scatter_nscalars
76: Running case 8: test_scatterv ...
76: Completed case 8: test_scatterv
76: Running case 9: sendReceiveReplace ...
76: Testing swap
76: Testing swap
76: Testing open end shift
76: Testing swap
76: Testing open end shift
76: Testing swap
76: Testing open end shift
76: Testing circular shift
76: Completed case 9: sendReceiveReplace
76: Testing circular shift
76: Testing open end shift
76: Testing circular shift
76: Completed case 9: sendReceiveReplace
76: Running case 10: test_reduce ...
76: v : <Running case 10: test_reduce ...
76: v : <Completed case 9: sendReceiveReplace
76: Running case 10: test_reduce ...
76: v : <Testing circular shift
76: Completed case 9: sendReceiveReplace
76: Running case 10: test_reduce ...
76: v : <-1,0>-4,3>
76:
76: -2,1>
76: -3,2>
76: Testing reduce
76: Testing reduce
76: Testing reduce
76: Testing reduce
76: arr : [arr : [5arr : [51]
76: 52]
76: Testing reduce inplace
76: arr : [5
3]
76: Testing reduce inplace
76: 4]
76: Testing reduce inplace
76: Testing reduce inplace
76: Completed case 10: test_reduce
76: Running case 11: test_allReduce ...
76: Completed case 10: test_reduce
76: Running case 11: test_allReduce ...
76: Completed case 10: test_reduce
76: Running case 11: test_allReduce ...
76: Completed case 10: test_reduce
76: Running case 11: test_allReduce ...
76: v : <v : <-1,0>
76: v : <-3,-4,3>
76: 2>
76: v : <-2,1>Testing all_reduce
76:
76: Testing all_reduce
76: Testing all_reduce
76: Testing all_reduce
76: arr : [51]
76: Testing all_reduce inplace
76: Completed case 11: test_allReduce
76: Running case 12: test_allGather ...
76: arr : [5
2]
76: Testing all_reduce inplace
76: Completed case 11: test_allReduce
76: Running case 12: test_allGather ...
76: arr : [53]
76: Testing all_reduce inplace
76: Completed case 11: test_allReduce
76: Running case 12: test_allGather ...
76: Completed case 12: test_allGather
76: Running case 13: test_allGatherv ...
76: arr : [5
4]
76: Testing all_reduce inplace
76: Completed case 11: test_allReduce
76: Running case 12: test_allGather ...
76: Completed case 12: test_allGather
76: Running case 13: test_allGatherv ...
76: Completed case 12: test_allGather
76: Running case 13: test_allGatherv ...
76: Completed case 12: test_allGather
76: Running case 13: test_allGatherv ...
76: Completed case 13: test_allGatherv
76: Running case 14: test_allToAll ...
76: Completed case 13: test_allGatherv
76: Running case 14: test_allToAll ...
76: Completed case 13: test_allGatherv
76: Running case 14: test_allToAll ...
76: Completed case 13: test_allGatherv
76: Running case 14: test_allToAll ...
76: Completed case 14: test_allToAll
76: Running case 15: test_nonblocking_send_receive ...
76: Completed case 14: test_allToAll
76: Running case 15: test_nonblocking_send_receive ...
76: Completed case 14: test_allToAll
76: Running case 15: test_nonblocking_send_receive ...
76: Completed case 15: test_nonblocking_send_receive
76: Running case 16: test_blocking_send_nonblocking_receive ...
76: Completed case 14: test_allToAll
76: Running case 15: test_nonblocking_send_receive ...
76: Completed case 15: test_nonblocking_send_receive
76: Running case 16: test_blocking_send_nonblocking_receive ...
76: Completed case 16: test_blocking_send_nonblocking_receive
76: Running case 17: test_blocking_send_receive ...
76: Completed case 17: test_blocking_send_receive
76: Running case 18: test_broadcastFile ...
76: Completed case 16: test_blocking_send_nonblocking_receive
76: Running case 17: test_blocking_send_receive ...
76: Completed case 17: test_blocking_send_receive
76: Running case 18: test_broadcastFile ...
76: Completed case 15: test_nonblocking_send_receive
76: Running case 16: test_blocking_send_nonblocking_receive ...
76: Completed case 15: test_nonblocking_send_receive
76: Running case 16: test_blocking_send_nonblocking_receive ...
76: Completed case 16: test_blocking_send_nonblocking_receive
76: Running case 17: test_blocking_send_receive ...
76: Completed case 16: test_blocking_send_nonblocking_receive
76: Running case 17: test_blocking_send_receive ...
76: Completed case 17: test_blocking_send_receive
76: Running case 18: test_broadcastFile ...
76: Completed case 17: test_blocking_send_receive
76: Running case 18: test_broadcastFile ...
76: Completed case 18: test_broadcastFile
76: Running case 19: test_waitAll ...
76: Completed case 18: test_broadcastFile
76: Running case 19: test_waitAll ...
76: Completed case 18: test_broadcastFile
76: Running case 19: test_waitAll ...
76: Completed case 18: test_broadcastFile
76: Running case 19: test_waitAll ...
76: Completed case 19: test_waitAll
76: Running case 20: test_waitAny ...
76: Completed case 19: test_waitAll
76: Running case 20: test_waitAny ...
76: Completed case 19: test_waitAll
76: Running case 20: test_waitAny ...
76: Completed case 19: test_waitAll
76: Running case 20: test_waitAny ...
76: Completed case 20: test_waitAny
76: Running case 21: test_probe ...
76: Completed case 20: test_waitAny
76: Running case 21: test_probe ...
76: Completed case 20: test_waitAny
76: Running case 21: test_probe ...
76: Completed case 20: test_waitAny
76: Running case 21: test_probe ...
76: Completed case 21: test_probe
76: Running case 22: test_iProbe ...
76: Completed case 21: test_probe
76: Running case 22: test_iProbe ...
76: Completed case 21: test_probe
76: Running case 22: test_iProbe ...
76: Test "test_probe" failed: Condition failed: i == data[i] @ (/home/d03/frwd/cylc-run/eckit-47/share/mo-bundle/eckit/tests/mpi/eckit_test_mpi.cc +962 test_930)
76: Completed case 21: test_probe
76: Running case 22: test_iProbe ...
76: Completed case 22: test_iProbe
76: 0 tests failed out of 23.
76: Completed case 22: test_iProbe
76: 0 tests failed out of 23.
76: Completed case 22: test_iProbe
76: 0 tests failed out of 23.
76: Test "test_iProbe" failed: Condition failed: i == data[i] @ (/home/d03/frwd/cylc-run/eckit-47/share/mo-bundle/eckit/tests/mpi/eckit_test_mpi.cc +1009 test_972)
76: Completed case 22: test_iProbe
76: FAILED: test_probe
76: FAILED: test_iProbe
76: 2 tests failed out of 23.

@wdeconinck
Member

Are you sure you ran the "develop" branch? The mentioned line 1009, for example, is not one of the lines that would write output:
https://github.com/ecmwf/eckit/blob/develop/tests/mpi/eckit_test_mpi.cc#L1009C1-L1010C1

@DJDavies2
Contributor Author

I have run tests with the develop branch. Unfortunately I don't seem to be able to recreate the failure now. Do you think you might have fixed the issue?

@wdeconinck
Member

I did not fundamentally change anything except adding some more diagnostic output. If it occurs again, sporadically, please update the issue with the diagnostic output.
