
WeeklyTelcon_20221017

Geoffrey Paulsen edited this page Oct 19, 2022 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UoH)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jan (Sandia)
  • Joseph Schuchart
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tommy Janjusic (nVidia)
  • William Zhang (AWS)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Brian Barrett (AWS)
  • David Bernhold (ORNL)
  • Josh Fisher (Cornelis Networks)
  • Artem Polyakov (nVidia)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • George Bosilca (UTK)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Jingyin Tang
  • Josh Hursey (IBM)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia)

v4.1.x

  • libevent CVE was a red herring.
  • v4.1.5
    • Schedule: targeting ~6 months (targeting November; looking at an RC in the next week or two)

v5.0.x

  • New v5.0.0 blocker came in: LSF issue 10943.

  • NEW - Discuss CUDA + OFI on v5.0.x

    • main runs fine, but in OMPI v5.0.0rc8 the problem was in OFI MTL code that was removed as part of the accelerator framework work.
      • Was this a bug that was fixed as part of the new framework?
        • The bug: if configured with CUDA support on v5.0.x before the new framework, but run on a system without GPUs, the CUDA part of the OFI MTL thought it had to deregister buffers not related to the CUDA device.
          • Saw this in IMB's Alltoall.
      • Yes, it did fix some issues in the OFI MTL code, along with proper device-to-host copy.
  • Jenkins - make tarball issue.

    • RPM builds don't work in Jenkins on v5.0.x
      • Doesn't block RC, but DOES block
    • Updated PMIx / PRTE submodule pointers on v5 yesterday.
  • HAN/Adapt -

    • Joseph was able to dump trees that they use.
    • He will open a ticket on performance differences, need some help understanding.
    • He'd like to enable HAN if detect
    • If --rank-by or something similar means the ranks are not in a linear sequence over the nodes,
      • Then bump the priority of HAN.
    • If it turns out the verification of ranks in the communicator is required, might want to do it at communicator creation time and set a flag.
      • Yes this is how Joseph is trying to do it.
    • Any benefit in
  • Adapt is supposed to be for noisy clusters/applications.

  • Symbol Pollution - PRs posted against main, and will PR once merged.

  • Docs - Remaining blocking issue (besides above) for v5.0.0
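
A rough sketch of the HAN idea discussed above: check once, at communicator creation time, whether the ranks form a linear sequence over the nodes, cache the result as a flag, and bump HAN's priority when they do not. This is not Open MPI code. The names (`Communicator`, `ranks_are_linear_over_nodes`, `han_priority`) and the priority values are hypothetical, and "linear" is assumed here to mean that each node's ranks form one contiguous block in rank order.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical priority values, for illustration only.
BASE_HAN_PRIORITY = 35
BUMPED_HAN_PRIORITY = 60


def ranks_are_linear_over_nodes(node_of_rank: List[int]) -> bool:
    """Return True if every node's ranks form one contiguous block.

    node_of_rank[r] is the (hypothetical) node index hosting rank r.
    """
    seen = set()
    previous = None
    for node in node_of_rank:
        if node != previous:
            # Returning to a node we already left means the layout
            # is interleaved, i.e. not linear over the nodes.
            if node in seen:
                return False
            seen.add(node)
            previous = node
    return True


@dataclass
class Communicator:
    node_of_rank: List[int]
    # Flag computed once at creation time and reused afterwards.
    linear_layout: bool = field(init=False)

    def __post_init__(self) -> None:
        self.linear_layout = ranks_are_linear_over_nodes(self.node_of_rank)


def han_priority(comm: Communicator) -> int:
    """Bump HAN's priority only when the cached flag says 'not linear'."""
    return BASE_HAN_PRIORITY if comm.linear_layout else BUMPED_HAN_PRIORITY
```

Doing the scan in the constructor keeps the cost off the collective path: later calls only read the cached flag.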

Main branch

Accelerator framework

  • Merged to main, and to v5.0.x
    • Will put out a new v5.0.0rc9 to include this.
  • Now that this is merged, what do we expect to see in the configury summary info?
    • Howard saw: CUDA support NO, ROCM support NO.
    • Howard will file an issue. Probably the configure variables this is based on have just changed.
    • CUDA and ROCM components now directly link against those libraries.
    • Perhaps the configure macros need to be updated.
    • will file an issue.

MTT

Administrative tasks

Face-to-face

Supercomputing?

  • Open MPI missed submitting a BoF request this year.
  • MPI Forum will be presenting.