WeeklyTelcon_20170117
Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Ralph
- Howard
- Josh Hursey
- Josh Ladd
- Nathan Hjelm
- Sylvain Jeaugey
- Todd Kordenbrock
- David Bernholdt
- Geoffroy Vallee
Review All Open Blockers
Review Milestones v1.10.6
- 1.10.6 will be needed.
- Still 5 PRs that need review (Jeff and Gilles)
- Estimated schedule: RC this week, check the issues, want by end of the month.
- From last week: Want to check that 2678 doesn't impact 1.10, but think it might. -O3 optimization.
- Ralph already merged in.
Review Milestones v2.0.2
- want to verify that 1654 was fixed by Nathan's Pull Request.
- Nathan: we were freeing deleted VMAs in the memory path. Nathan now puts them on a list and cleans it up from next instant, which will never be called from a memory hook.
- Closing 2666 - Paul Hargrove found issues with his install (OSX), when he fixed, it went away.
- Jeff needs to review 2728.
- Pull 2730 and related on 2.x
- Curious about -hostfile change. Is this a change in our behavior?
- Yes, if you specify -host on 2.0 today, and no -n, it will launch only 1 process.
- With the change, it will auto-detect how many slots for that host.
- We decided last face to face not to make changes like this in a minor change.
- The problem that surfaces is that there is then a fundamental difference between -host foo and putting foo in a hostfile.
- Decision: make -host and -hostfile behave the same.
- Make the user specify if you're in a non-resource managed environment.
- If you put foo in a hostfile, we auto-detect slots.
- Don't like the idea of changing behavior in minor update.
- We went with user having to specify in non-managed cases.
- Master does not do this today.
- All in notes from last face 2 face.
- Hash this out at face2face next week.
- 2724 - porting for additional signals
- Minor change in behavior: it'd be surprising if anyone is relying on us NOT forwarding a signal.
- In 1.10 the child processes were in the same process group as orted, so people were relying on hitting an orted with a signal and all children getting it. But now we need to trap and forward.
- There are signals that people have come to rely on. We probably want this.
- agreed to merge in.
- Put in Memheap refactoring.
- Otherwise, 2.0.2 will probably do one more RC (before -hostfile / -host). Howard
- Artem will start on it in next few hours or so.
- If we bring PR in, we can begin PMIx testing with 1.2.0, and then ship with 1.2.1. Alternatively we can do it in one giant PR.
- Does the community want it in 1 PR or 2?
- Depends on extra amount of work.
- The fix we're waiting on is a code-path Open MPI should never go down... An MPI application might go down that road, but Open MPI shouldn't hit it.
- PMIx folks are in favor of earlier testing with PMIx 1.2.0, and then picking up PMIx 1.2.1 for code-correctness.
- Some reports of folks having issues on Titan. Does that come into play here?
- No, it's a known issue. But don't think we should react to this yet, since their use-case is very different from Open MPI.
- Someone should file a bug (blocker) to pick up PMIx 1.2.1 (Josh).
- How should we test to stress?
- just launching in general for now.
- Does the community want it in 1 PR or 2?
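The trap-and-forward behavior discussed under 2724 above can be illustrated with a small sketch. This is not Open MPI's implementation (orted is written in C); it only assumes the situation described in the notes, where children no longer share the launcher's process group, so a signal sent to the launcher reaches them only if the launcher forwards it. The names `launch_child`, `install_forwarding`, `FORWARDED`, and `demo` are hypothetical, and the choice of SIGUSR1/SIGUSR2 is an assumption for illustration.

```python
import os
import signal
import subprocess
import sys

# Assumption: the set of signals worth forwarding; chosen for illustration only.
FORWARDED = (signal.SIGUSR1, signal.SIGUSR2)


def launch_child(argv, **popen_kwargs):
    """Start a child in its own session, and therefore its own process group.

    A signal delivered to the launcher's pid is then NOT seen by the child
    unless the launcher explicitly forwards it.
    """
    return subprocess.Popen(argv, start_new_session=True, **popen_kwargs)


def install_forwarding(children, signals=FORWARDED):
    """Trap each signal in the launcher and re-send it to every live child."""
    def handler(signum, frame):
        for child in children:
            if child.poll() is None:  # skip children that already exited
                os.kill(child.pid, signum)

    for sig in signals:
        signal.signal(sig, handler)


def demo():
    """Signal only the launcher and return the child's exit status."""
    child_code = (
        "import signal, sys, time\n"
        "signal.signal(signal.SIGUSR1, lambda s, f: sys.exit(42))\n"
        "print('ready', flush=True)\n"
        "time.sleep(30)\n"
    )
    child = launch_child([sys.executable, "-c", child_code],
                         stdout=subprocess.PIPE)
    install_forwarding([child])
    child.stdout.readline()               # wait until the child's handler is installed
    os.kill(os.getpid(), signal.SIGUSR1)  # signal the launcher, not the child
    status = child.wait(timeout=10)
    child.stdout.close()
    return status


if __name__ == "__main__":
    print(demo())
```

In this sketch the child exits through its own SIGUSR1 handler only because the launcher forwarded the signal; without `install_forwarding`, the default action for SIGUSR1 would terminate the launcher and the child would never see it.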
Review Milestones v2.1.0
- PMIx is biggest thing.
- Ralph has a fix that we need before 2.1, otherwise we'll have problems like Trinity.
- mpirun is running on login node (different than compute node).
- PMIx sees this (different than mpirun node), and so ranks send their topology nodes back to mpirun. This consumes a lot of time!
- New MCA parameter to request that only the first rank send topology strings.
- It won't solve the eventual Trinity problem (compute nodes of different types).
- What is the OMPI v2.1 schedule?
- Let's talk about this next week. Depends on how PMIx works.
Review Master Pull Requests
Review Master MTT testing
- Mellanox cluster is dying somehow. Will look into.
- MTT and OpenSHMEM
- Gilles said you're just running all OSHMEM tests, and expecting them all to pass.
- Jeff looked at latest
- Jeff sent them a patch, to help us run OSHMEM tests. This should help us run OSHMEM, and if others could turn this on after we get this back that would help.
- Python stuff - Ralph sees a bunch of mtt failures he needs to look at.
- Should restart a telcon on this... haven't had one since December.
- Will get MTT telcon going again.
- Got MTT emails restored.
- Face 2 Face
- did not expect 15 people, so location will probably change. Jeff will send out a note.
- Please add agenda items to face2face. Those are not in order, so just add to bottom.
- Anything with SPI?
- http://www.spi-inc.org/projects/open-mpi/
- All good to go. We're official now. Jeff and Ralph just need to close the loop on this.
- Lobbying GitHub to change us to a non-profit - going back and forth.
- Ralph and Jeff need to finish on-boarding with SPI.
- Mellanox - doing a lot with UCX - Deploying to customer sites.
- had some performance analysis - dashboard monitoring proposal for face2face.
- Something on the dev-list in last 48 hours. KNEM + Yalla on 2.0.1
- Ideas: Mellanox HPCX Open MPI - is compiled with KNEM, but KNEM not activated on cluster.
- Shouldn't vader just run without it if not there?
- MXM also barking because it can't find KNEM.
- Set BTL vader copy mechanism to CMA or none if kernel 2.x
- KNEM might be a little faster than CMA.
- Sandia
- Before holiday, implemented some non-contig atomic
- MTT broke - hope to fix this weekend.
- Intel fixed some bugs recently (DBM mostly)
- most of time on PMIx.
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu