WeeklyTelcon_20190716
Geoffrey Paulsen edited this page Jul 16, 2019
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh (nVidia)
- Artem Polyakov (Mellanox)
- Brendan Cunningham (Intel)
- Brian Barrett (Amazon)
- Edgar Gabriel (UH)
- Geoff Paulsen (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Josh Hursey (IBM)
- Michael Heinz (Intel)
- Ralph Castain (Intel)
- Todd Kordenbrock
- Aravind Gopalakrishnan (Intel)
- Arm (UTK)
- Brandon Yates (Intel)
- Dan Topa (LANL)
- David Bernhold
- Geoffroy Vallee
- George Bosilca (UTK)
- Jake Hemstad
- Joshua Ladd (Mellanox)
- Mark Allen (IBM)
- Matias Cabral
- Matthew Dosanjh (Sandia)
- Nathan Hjelm
- Noah Evans (Sandia)
- Peter Gottesman (Cisco)
- Thomas Naughton
- Xin Zhao (Mellanox)
- mohan
-
Git submodules
- As a first step, Jeff created:
- PR 6821 "hwloc201 use a submodule"
- This PR is in progress. Requires CI owners to add --recursive to their Jenkins git clone commands.
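For CI jobs that assemble their clone command in a script, the --recursive requirement for the submodule PR could look like the sketch below. The helper name `clone_command` and the destination directory are hypothetical, chosen only for illustration. The sketch builds the argument list and does not run git.

```python
def clone_command(repo_url, dest, recursive=True):
    """Build a git clone argument list (hypothetical helper, not part of any CI tool)."""
    cmd = ["git", "clone"]
    if recursive:
        # --recursive also checks out submodules (e.g. an hwloc submodule)
        cmd.append("--recursive")
    cmd += [repo_url, dest]
    return cmd


# Assumed repository URL and destination directory, for illustration only.
ompi_clone = clone_command("https://github.com/open-mpi/ompi.git", "ompi")
```

The resulting list can be passed to `subprocess.run` by a CI script that wants to execute it.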
-
What to do with OFI BTL and OFI MTL
- Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
- mail archive: https://www.mail-archive.com/[email protected]/msg20736.html
- ofi/BTL and MTL components can step on each other.
- PSM2 - when a user of PSM2 calls PSM2_Finalize (as long as there's a PSM2 provider), PSM2's refcounting is only observed when initializing, not when finalizing, meaning the first finalize was finalizing the entire job.
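The refcounting problem described for PSM2 can be illustrated with a small sketch of symmetric reference counting, where teardown only happens on the last finalize. This is not PSM2's API: the class `RefCountedResource` and its methods are hypothetical and only model the behavior the notes say is missing on the finalize side.

```python
class RefCountedResource:
    """Hypothetical shared library handle with symmetric init/finalize counting."""

    def __init__(self):
        self.refcount = 0
        self.active = False

    def init(self):
        # Only the first init actually brings the resource up.
        if self.refcount == 0:
            self.active = True
        self.refcount += 1

    def finalize(self):
        if self.refcount == 0:
            raise RuntimeError("finalize called without a matching init")
        self.refcount -= 1
        # Only the last finalize tears the resource down.
        if self.refcount == 0:
            self.active = False


shared = RefCountedResource()
shared.init()      # e.g. first component using the library
shared.init()      # e.g. second component using the same library
shared.finalize()  # first finalize must not tear down the shared state
still_active_after_first_finalize = shared.active
shared.finalize()  # last finalize releases it
active_after_last_finalize = shared.active
```

With the count honored on both sides, one component finalizing does not pull the library out from under the other.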
-
Status of Scale testing
- No update
- Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
- Issue 6198 "SSH launch fails when host file has more than 64 hosts"
- IBM is also working on something like this (for ssh launch)
- Prefer this every night, instead of each PR.
-
Issue 6799 "UVM buffers failing in cuIpcGetMemHandle ?"
- No update
- Complete
- No update
- Suggest just doing hwloc (stable and not too much development) first
- No update
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- Tested new PMIx
- Exposed a few new test suite issues in "ibm", but fixed
Review v4.0.x Milestones v4.0.2
- PR6806 - Want to wait until CI is back. Do we have any tests to test this?
- Howard will reproduce and add to ibm suite
- 2nd Put issue PR 6568 (Vader deadlocking with 4MB transfers)
- waiting on George to return (end of the month)
- New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)
- Want for v4.0.2
- Now approved for master.
- waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
-
https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.
- Right now only shows in vader, because all others prefer get protocol.
- Vader generates a bunch of 32K frags, so a 4MB transfer overwhelms vader.
- Does NOT occur with single copy like CMA or KNEM.
- Issue 6789 - OMPI crashes when configured with ucx version
- Issue with PML UCX conflicting with btl_uct - memory hooks
- New this week: Howard not convinced it's memory hooks.
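For the put-protocol issue (6568) above, a quick calculation shows how many fragments a single 4MB transfer produces at a 32K fragment size. This is a sketch that assumes binary units (1K = 1024 bytes) and uses a hypothetical helper `fragment_count`; the actual vader fragment sizing may differ.

```python
KIB = 1024
MIB = 1024 * KIB


def fragment_count(message_bytes, fragment_bytes):
    """Number of fragments needed to cover a message (ceiling division)."""
    return -(-message_bytes // fragment_bytes)


# 4 MiB message split into 32 KiB fragments (sizes taken from the notes,
# binary units are an assumption).
frags_for_4mb = fragment_count(4 * MIB, 32 * KIB)
```

Under these assumptions one 4MB put becomes 128 fragments, which without pipelining are all queued at once.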
Review Master Master Pull Requests
- PR6556 and 6621 should go to the release branches.
- no update
- Good reminder that we now need to be careful about OPAL's ABI.
- When do we get rid of 32bit?
- Still don't have any release manager.
- Need to identify someone in next few months.
- PMIx v3.1.3 is ready to release.
- Two issues around MPIR attach
- 5501 - IBM need to investigate.
- 5115 - Community OpenMPI Possibly still PMIx
- Howard will try to reproduce
- Still Open MPIR attach issue in v3.1.x
- Neither of these issues should block v4.0.2
- MPIR - We emit a warning saying we've deprecated MPIR
- Need a wiki page describing how to get MPIR to work.
- What is the answer?
- DDT is about 90% ready.
- PMIx v2.2 update could be ready soon after that.
- Take a look at Gilles' PRRTE work. He may have done SOME of that. He should have done all of that in the PRRTE layer; maybe just some MPI layer work remains.
- PR6339 - he's closed, and re-opened a new branch to look at.
- Howard reviewed PR6339, and likes everything that Gilles did, so abandoned his branch
- This is a good approach, and gets something running, but it's not complete
- Need people to react and do things.
- Fall Face to face is canceled due to lack of agenda
- PRRTE transition still requires dedicated discussion
- Might meet in New Mexico, University of Tennessee, or Dallas (IBM)
- Should make a meeting prep page
- Jeff will make doodle.
- Two days
- IBM has to triage some failures on master and v4.0.x