WeeklyTelcon_20200204

Geoffrey Paulsen edited this page Feb 5, 2020 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Todd Kordenbrock (Sandia)
  • Jeff Squyres (Cisco)
  • Artem Polyakov (Mellanox)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Ralph Castain (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)

not there today (I keep this for easy cut-n-paste for future notes)

  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Joshua Ladd (Mellanox)
  • Thomas Naughton (ORNL)
  • Brian Barrett (AWS)
  • Michael Heinz (Intel)
  • William Zhang (AWS)
  • Edgar Gabriel (UH)
  • Charles Shereda (LLNL)
  • David Bernhold (ORNL)
  • George Bosilca (UTK)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Xin Zhao (Mellanox)
  • mohan (AWS)
  • Akshay Venkatesh (NVIDIA)

New Business

  • Coverity coverage for PRRTE

    • Brian is working on it, but it needs a current copy of PMIX as well.
  • Anything to do to make Cray CI more stable?

    • Some discussion last week.
    • Brian tried to update the Cray CI to use a shallow clone, but the shallow clone ran into an issue with the submodule reference, so he abandoned the effort.
  • Josh and Ralph have been working on PRRTE in general, stabilizing, etc.

    • finding the remaining issues.
    • Ralph is working on existing direct-modex issue.
    • Not blocking on OMPI info integration.
    • Only thing holding us back on committing.
  • Josh has been adding some PRRTE CI (PR 49 in PMIX-TEST) and wants to get it in this week.

    • Adding some PRRTE tests (non-mpi)
    • Ralph mentioned adding a double get test.
  • In the future we can have some PRRTE tests that run in Open-MPI

    • Perhaps only run these when the PRRTE or PMIX submodule reference updates?
    • Those could be MPI based tests, and could do more.
  • PRRTE/PMIX additional testing in Open-MPI project

    • It'd be convenient for a bot to label a PR if someone updates PMIX/PRRTE code. Then this could trigger some additional testing.
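
The labeling idea above could be sketched roughly as follows. This is only an illustration, not an existing bot: the path prefixes, the label name, and both function names are hypothetical, since the notes do not say where the PMIX and PRRTE code lives in the tree. A submodule reference update normally shows up in a PR as a change to the submodule path itself, which is why an exact match on the prefix is checked as well.

```python
# Hypothetical path prefixes; the real locations of the PMIX and PRRTE
# code in the Open MPI tree are not given in these notes.
WATCHED_PREFIXES = ("path/to/pmix", "path/to/prrte")


def touches_runtime(changed_paths, prefixes=WATCHED_PREFIXES):
    """Return True if any changed path is a watched prefix or lies under one."""
    for path in changed_paths:
        normalized = path.strip("/")
        for prefix in prefixes:
            # Exact match covers a submodule reference bump; the "/" check
            # avoids matching unrelated siblings such as "path/to/pmix-extra".
            if normalized == prefix or normalized.startswith(prefix + "/"):
                return True
    return False


def labels_for(changed_paths, label="runtime-update"):
    """Return the (hypothetical) labels a bot would add for this change set."""
    return [label] if touches_runtime(changed_paths) else []
```

A CI system could then run the additional PRRTE/PMIX tests only on PRs that carry the label.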

Release Branches

Review v3.0.x Milestones v3.0.6

Review v3.1.x Milestones v3.1.6

  • Jeff has another PR to put in and do another rc for both
  • Jeff filed 7361, a compilation issue.

Review v4.0.x Milestones v4.0.3

  • v4.0.3 in the works.

    • Put out a
    • Schedule: End of January.
    • Try to get rc1 built this Friday
  • Howard PRed #7321 to v4.0.x

    • xpmem worked on v3.x, so don't think it needs cherry-picking back.
    • Nathan to see if these fixes are relevant on 3.0.x and 3.1.x
  • Issue 7220 - vader not cleaning up properly (vader backing files).

    • in v3.x series, uses pmix 2.x (can't register cleanup files)
      • Nathan: old workaround after add-procs all processes unlink?
      • No longer doing this because the files moved from /tmp to /dev/shm (v3.0?)
        • This would bring up more bugs for users with very small /tmp.
    • in v4.0.x, (uses pmix 3.x, and CAN register files for cleanup)
      • sigterm forgets to call pmix interface to cleanup registered files.
      • The session directory is always cleaned up, but the files in /dev/shm are not.
  • Issue 6960 (closed) had something cherry-picked to release branch, but it's still not fixed.

    • Configuring --enable-ipv6 shouldn't preclude ipv4.
    • Do we need to cherry-pick 6964 back into v4.0.x ?
    • Fix this in PRRTE.
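
The old workaround mentioned under Issue 7220 (every process unlinks the backing file once all peers have attached) relies on POSIX behavior: a file that is already mapped stays usable after its name is removed, and the storage is reclaimed when the last mapping goes away. The sketch below only illustrates that behavior in a single process; it is not the vader code, and the function name and file prefix are made up for the example. It assumes a POSIX system.

```python
import mmap
import os
import tempfile


def map_then_unlink(size=4096, directory=None):
    """Create a backing file, map it, then remove its name while it is mapped."""
    fd, path = tempfile.mkstemp(prefix="backing_", dir=directory)
    try:
        os.ftruncate(fd, size)          # give the file its final size
        mapping = mmap.mmap(fd, size)   # shared read/write mapping on Unix
    finally:
        os.close(fd)                    # the mapping keeps its own descriptor
        os.unlink(path)                 # nothing is left behind on abnormal exit
    return mapping, path


if __name__ == "__main__":
    mapping, path = map_then_unlink()
    mapping[:5] = b"hello"              # still writable after the unlink
    print(os.path.exists(path), bytes(mapping[:5]))
    mapping.close()
```

The trade-off is that no new process can attach once the name is gone, so the unlink has to happen after every peer has mapped the file.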

v5.0.0

  • Schedule: No real schedule yet.

Face to face

  • Portland, Oregon, Feb 17, 2020.
  • Please register on Wiki page, since Jeff has to register you.
  • Date looks good: Feb 17th, right before the MPI Forum.
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • About a 20-30 minute drive from the MPI Forum; attendees will probably need a car.

Infrastructure

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • PMIx v3.1.5 is probably NOT in January.
    • Continuing to do some work on bugs that were found.
    • Need to talk about what will go back into v3.1.x
  • CI testing only checks that the build succeeds and that it ran, but doesn't test HOW it ran.
    • Environment setup can be a bit different.
    • For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
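
One way to make this kind of environment difference visible is a preflight probe that a CI job runs before the real tests. The sketch below is an assumption about how such a probe might look, not part of any existing CI setup, and the function name is hypothetical. It tries to create and write a temporary file rather than inspecting permission bits, because that is what the runtime will actually do.

```python
import os
import tempfile


def can_write_tempfiles(directory):
    """Return True if a file can actually be created and written in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, missing permissions, a full disk, etc.
        return False


if __name__ == "__main__":
    print("/tmp usable:", can_write_tempfiles("/tmp"))
```

Recording the result alongside the test output would help tell a real regression from a machine-specific setup problem.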

ORTE/PRRTE

  • Was passing yesterday, but was then rebased yesterday and is now failing again.

    • Some changes had been done since last rebase.
    • opal_gethostname() being one of them.
    • Needed more initialization code to take into consideration hostnames assigned top-down from runtime system.
  • Strange issue: we pull libevent and hwloc into opal statically, but PMIx links against libopal to get access to these components. Even with name shifting (under opal names), PMIx can call down into opal. A call to pmix_error_log ended up in opal_output with an uninitialized hostname, which segfaults.

    • Need to find a way to link directly to pmix, hwloc,
    • This happens even with disable-dlopen set.
    • Problem: we want one process (separate from the MPI process, i.e. prrte) that calls prrte_init, and it ends up linking in opal because that is the embedded code.
    • How should we split these out?
      • Make libtool convenience libraries of them.
      • prrte, rather than linking against libtool, links against the convenience libraries.
      • The convenience libraries then just get pulled into the code.
      • Where this fails is that you can't link against both these convenience libraries and libopal?
    • configury? doesn't prrte need to know if we're linking embedded or external?
    • Brian will write up some thoughts on this on Friday.
  • Still a bunch of things to do after this PR goes in.

  • Singleton comm-spawn... how do we make this work? - PMIx understands it.

    • Do we need to support singleton comm-spawn starting the PRRTEs?
    • Now that we will support a persistent infrastructure, maybe we just require users to start it first.
  • Address comm-spawn issues that have been raised.

MTT


Back to 2019 WeeklyTelcon-2019
