Let's collect notes on what's needed to support hardware collectives on El Cap.
I'll summarize what I got from talking to @trws just now - I may be misremembering so please correct me.
HPE arranged a test recently and demonstrated a clear benefit from hardware collectives on job sizes above about 4000 nodes; below that, the benefit largely disappears. Since the resources required to enable collectives are somewhat limited, it may be best to enable them only on large jobs, and even then perhaps only when requested. Performance can actually degrade when collectives are used on smaller jobs.
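As a rough illustration, the gating policy might look something like the sketch below. The 4000-node threshold comes from the HPE test above, but the function name, option semantics, and exact cutoff are placeholders, not decided values:

```python
# Hypothetical gating logic: enable hardware collectives only when the job
# is large enough, or when the user explicitly opted in. NNODE_THRESHOLD
# and the opt-in flag are assumptions, not established Flux conventions.
NNODE_THRESHOLD = 4000

def should_enable_hw_collectives(nnodes: int, user_requested: bool) -> bool:
    """Return True if hardware collective trees should be set up for this job."""
    if user_requested:
        return True
    return nnodes >= NNODE_THRESHOLD
```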
The process involves making one request to the fabric manager per job to set up (up to 24?) "trees". A handle is returned, and that handle must be distributed to all the MPI tasks via environment variables. (The actual consumer is libfabric, which is used by Cray MPI.) At the end of the job, the fabric manager is told to destroy the handle. The fabric manager operations could be performed by the job shell, authenticated as the end user, so a job shell plugin could handle making these requests, distributing the handle to other parts of the job, and cleaning up afterwards, all self-contained within the job. A sketch of that lifecycle follows.
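Here is a minimal sketch of that per-job lifecycle, assuming a hypothetical fabric manager REST API. The endpoint paths, payload fields, and environment variable name are all placeholders (the real variable names are defined by libfabric/Cray MPI and aren't pinned down in these notes); in a real job shell plugin the export step would go through the shell's task-environment API rather than `os.environ`:

```python
# Sketch only: hypothetical fabric manager endpoints and payloads.
import os
import requests

FM_URL = "https://fabric-manager.example.com/api/v1"  # placeholder URL

def setup_collectives(jobid: str, nodelist: list[str], ntrees: int = 24) -> str:
    """Ask the fabric manager to set up collective trees; return the handle."""
    resp = requests.post(
        f"{FM_URL}/collectives",
        json={"jobid": jobid, "nodes": nodelist, "trees": ntrees},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["handle"]

def export_handle(handle: str) -> None:
    """Distribute the handle to MPI tasks via the environment.
    HW_COLL_HANDLE is a stand-in; libfabric defines the real variable(s)."""
    os.environ["HW_COLL_HANDLE"] = handle

def teardown_collectives(handle: str) -> None:
    """Tell the fabric manager to destroy the handle at job exit."""
    resp = requests.delete(f"{FM_URL}/collectives/{handle}", timeout=30)
    resp.raise_for_status()
```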
The details of the authentication need to be understood. Two options were mentioned: a pre-shared key, which presumably implies that Flux would need to do key management for users, and an OAuth-like mechanism.
Fabric manager communications are simple REST calls.
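For concreteness, here is how the two authentication options might appear on the wire. The header names and token acquisition are assumptions; the fabric manager's actual scheme is one of the open questions above:

```python
# Sketch of the two candidate auth styles on a fabric manager REST call.
import requests

def fm_request_psk(url: str, psk: str) -> requests.Response:
    """Pre-shared key: Flux would have to manage a per-user key.
    The header name is a placeholder."""
    return requests.get(url, headers={"X-Auth-Key": psk}, timeout=30)

def fm_request_oauth(url: str, token: str) -> requests.Response:
    """OAuth-like: a bearer token obtained from some token endpoint."""
    return requests.get(
        url, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
```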