WeeklyTelcon_20170307
Geoffrey Paulsen edited this page Jan 9, 2018
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Artem Polyakov
- Brian Barrett
- David Bernholdt
- Edgar Gabriel
- Geoffroy Vallee
- Howard
- Josh Hursey
- Ralph
- Joshua Ladd
- Nathan Hjelm
- Thomas Naughton
- Todd Kordenbrock
- Ryan Grant
Review All Open Blockers
Review Milestones v1.10.6
- No plans for a v1.10.7.
Review Milestones v2.1.0
- rcache 3013 - rcache broken in all versions.
- Big-hammer workaround for v2.0 and v2.1: don't hook madvise.
- For 3.0 need to
- Registration cache is a bottleneck, and will always hit issues like this if we hook madvise.
- PR 3013 - might not be necessary if create new PR that does NOT hook madvise (remove a function pointer).
- PR 3013 is an optimization, not bad, but we really need to remove the madvise hook. Nathan will merge PR 3013 now.
- PR 3013 is not NEEDED for v2.1, so just going into master / v3.0.
- Ask Nathan to create v2.x, v2.0.x, and master PRs that remove the madvise hook.
- Does this necessitate a v2.0.3 release? Have to be doing malloc and free in threads.
- Failure mode is threads deadlocking, not silent memory corruption. Does not affect 1.10 (using pmalloc hooks).
- 3 issues on BSD and various flavors.
- 3 are PMIx related. Include file missing; Josh already put up a PR for it.
- oob 3115: dlopen failing to find files, but happening 20% of the time. OpenBSD on i386. getcwd() is missing.
- NetBSD on AMD64 - not sure how common this is.
- Artem and Paul are looking into Issue 3117 - Waiting to hear if it's easy to fix.
- Taking one from Edgar: https://github.com/open-mpi/ompi/pull/3105 since we're doing another RC (for rcache fix).
- Only Blockers for v2.1: Issue 3117, and unhooking madvise.
- PMIx - reason we're doing an accelerated v3.0
- Whitelist Issue 3107
- UCX has its own multithreading API that needs to be enabled. UCX is thread safe. Inside UCX PML
- allocator will be inside of OSHMEM.
- Sounds reasonable (component level stuff).
- DELAYED TO v3.1 - Info Keys - IBM Do an Audit with what was posted last week from MPI Forum and rebase.
- PR 2941.
- Open-MPI currently doesn't implement.
- Concern: don't want to implement something if it's NOT going to be solid.
- Nathan would like to have it into v3.0, but not necessary.
- Sounds like everyone is okay with delaying this to v3.1, but want to get it into Master soon.
- Updated internal hwloc DELAYED to v3.1. Still support latest hwloc via external.
- Big elephant in the room is PMIx v2.0: it's not released yet, but it's being whitelisted, and we need to branch v3.0 soon to make the June 15 release.
- 43 issues against v3.0 out there. Feature/enhancement-based ones should just be punted to v3.1.
- Schedule for branching v3.0 after next week's meeting.
- Planning on doing a v2.1.2 release in next week or so.
- Don't want to hold up v2.1, would go into v2.1.1
- PMIx v2.0 - looking good for early April release
Review Master Pull Requests
- Don't break the build!!!
Review Master MTT testing
- OSHMEM testing in MTT via Cisco is greatly improved. (Look at what Cisco did if you're interested in OSHMEM.)
- Gone from many thousands of false failures to a few hundred.
- We should begin thinking about scheduling our next face to face.
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu