Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
Open MPI v5.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from source distribution using GCC devtoolset-9
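For completeness, the build was roughly along these lines (the install prefix and the absence of extra configure options here are my shorthand, not a verbatim transcript of our build script):

scl enable devtoolset-9 bash        # pick up the GCC 9 toolchain from Software Collections
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5
./configure --prefix=/opt/openmpi-5.0.5
make -j $(nproc)
make install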
Please describe the system on which you are running
- Operating system/version: RHEL 8
- Computer hardware: AWS EC2 c5.9xlarge, c6i.8xlarge
- Network type: standard AWS VPC with SSH/TCP
Details of the problem
We've used Open MPI to farm work across AWS EC2 instances with good success for a few years and are currently working towards running Open MPI workloads across clusters made up of different instance types. We're currently testing a mix of c5.9xlarge and c6i.8xlarge instances; c5.9xlarges are slightly larger (18 cores, 72 GB RAM vs. 16 cores, 64 GB RAM) and slightly older. We run with one process per node and let our software use native threading.
With this mix of hardware we've seen Open MPI intermittently but frequently hang before launching our executables (roughly 50% of the time when using mixed hardware). All of our hardware runs the same Amazon Machine Image, so software/path differences are unlikely to be the cause, and if I edit our hostfile to include only c5.9xlarges or only c6i.8xlarges the software reliably works as expected. Unfortunately our test environment is air-gapped, so getting full logs out is a challenge, but I will include what I can. If you guys want to see more, let me know and I'll see what I can transcribe.
I've been able to reproduce this issue with the following command:
mpirun --mca plm_ssh_args "-v" --debug-daemons --mca plm_base_verbose 100 --merge-stderr-to-stdout -verbose -N 1 --bind-to none --display-map --hostfile /path/to/hostfile.txt ls
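For reference, the hostfile is laid out roughly like this (the hostnames here are placeholders; the real file lists our internal addresses):

# six c6i.8xlarge hosts; the first one is the host running mpirun
ip-10-0-0-10
ip-10-0-0-11
ip-10-0-0-12
ip-10-0-0-13
ip-10-0-0-14
ip-10-0-0-15
# four c5.9xlarge hosts
ip-10-0-0-20
ip-10-0-0-21
ip-10-0-0-22
ip-10-0-0-23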
The command hangs prior to running ls and prior to displaying the topo map. The hostfile for this test consists of 6 c6i.8xlarges (including the host that's calling mpirun) and 4 c5.9xlarges. All the hosts are accessible over SSH, and I can log in to each host and see the prted and ssh processes running. With logging turned up, the final log statement is as follows (occasionally only 7 or 8 daemons report, and of course it does sometimes proceed all the way to completion):
[...] [...] plm:base:orted_report_launch job prterun-ip-[...] recvd 9 of 10 reported daemons
What immediately jumps out at me from these logs is the following error:
[one of the remote hosts] [local prterun] prted:comm:process_commands() Unknown command!
On one hand this seems bad, but on the other hand I see the same error message even when the launch does not hang, so it might be a red herring.
One other data point I'll leave here as well: if I arrange the hostfile so that rank 0 lands on the larger instance type, the frequency of the hang drops to maybe ~1% (from ~50%). It seems whatever race condition is at play is still there, just less likely to fire. I was very hopeful that reordering would resolve the issue, but I still saw our tests hang twice over the course of about a week.
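Concretely, the ordering that makes the hang rare just moves the c5.9xlarge entries to the top (placeholders again), so that, as far as I can tell, rank 0 maps onto one of them:

# four c5.9xlarge hosts first, so rank 0 lands on the larger instance type
ip-10-0-0-20
ip-10-0-0-21
ip-10-0-0-22
ip-10-0-0-23
# six c6i.8xlarge hosts; mpirun still runs from the first of these
ip-10-0-0-10
ip-10-0-0-11
ip-10-0-0-12
ip-10-0-0-13
ip-10-0-0-14
ip-10-0-0-15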
I realize this is not a lot to go on, but do you guys have any insights into what the issue might be? Can you provide any guidance on additional logging I could enable, extra tests to run, or extra diagnostics I could compile in if I rebuilt Open MPI? I tracked down where prted:comm:process_commands() Unknown command! is getting logged, but unfortunately I'm not very familiar with the Open MPI/PRRTE source, so I'm not sure where to look for what's sending that command. If we can't figure out a cause here, our next step is likely to be running the same tests with an Open MPI 4 build, which we'd been using until recently.
Thanks for your help; I appreciate any clarity you guys can bring.