Let's collect notes on what's needed to support hardware collectives on El Cap.
I'll summarize what I got from talking to @trws just now - I may be misremembering so please correct me.
HPE arranged a test recently and demonstrated a clear benefit from hardware collectives on job sizes above about 4000 nodes; below that, the benefit largely disappears. Since the resources required to enable collectives are somewhat limited, it may be best to enable them only on large jobs, and even then perhaps only when requested. Performance can actually degrade when collectives are used on smaller jobs.
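As a rough illustration, the gating policy might look something like the sketch below. The 4000-node threshold comes from the HPE test above, but the function name, option semantics, and exact cutoff are placeholders, not decided values:

```python
# Hypothetical gating logic: enable hardware collectives only when the job
# is large enough, or when the user explicitly opted in. NNODE_THRESHOLD
# and the opt-in flag are assumptions, not established Flux conventions.
NNODE_THRESHOLD = 4000

def should_enable_hw_collectives(nnodes: int, user_requested: bool) -> bool:
    """Return True if hardware collective trees should be set up for this job."""
    if user_requested:
        return True
    return nnodes >= NNODE_THRESHOLD
```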
The process involves making one request to the fabric manager per job to set up (up to 24?) "trees". A handle is returned, and that handle must be distributed to all the MPI tasks via environment variables. (The actual consumer is libfabric, which is used by Cray MPI.) At the end of the job, the fabric manager is told to destroy the handle. The fabric manager operations could be performed by the job shell, authenticated as the end user, so a job shell plugin could handle making these requests, distributing the handle to other parts of the job, and cleaning up afterwards, all self-contained within the job. A sketch of that lifecycle follows.
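Here is a minimal sketch of that per-job lifecycle, assuming a hypothetical fabric manager REST API. The endpoint paths, payload fields, and environment variable name are all placeholders (the real variable names are defined by libfabric/Cray MPI and aren't pinned down in these notes); in a real job shell plugin the export step would go through the shell's task-environment API rather than `os.environ`:

```python
# Sketch only: hypothetical fabric manager endpoints and payloads.
import os
import requests

FM_URL = "https://fabric-manager.example.com/api/v1"  # placeholder URL

def setup_collectives(jobid: str, nodelist: list[str], ntrees: int = 24) -> str:
    """Ask the fabric manager to set up collective trees; return the handle."""
    resp = requests.post(
        f"{FM_URL}/collectives",
        json={"jobid": jobid, "nodes": nodelist, "trees": ntrees},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["handle"]

def export_handle(handle: str) -> None:
    """Distribute the handle to MPI tasks via the environment.
    HW_COLL_HANDLE is a stand-in; libfabric defines the real variable(s)."""
    os.environ["HW_COLL_HANDLE"] = handle

def teardown_collectives(handle: str) -> None:
    """Tell the fabric manager to destroy the handle at job exit."""
    resp = requests.delete(f"{FM_URL}/collectives/{handle}", timeout=30)
    resp.raise_for_status()
```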
The details of the authentication need to be understood. Two options were mentioned: a pre-shared key, which presumably implies that Flux would need to do key management for users, and an OAuth-like mechanism.
Fabric manager communications are simple REST calls.
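For concreteness, here is how the two authentication options might appear on the wire. The header names and token acquisition are assumptions; the fabric manager's actual scheme is one of the open questions above:

```python
# Sketch of the two candidate auth styles on a fabric manager REST call.
import requests

def fm_request_psk(url: str, psk: str) -> requests.Response:
    """Pre-shared key: Flux would have to manage a per-user key.
    The header name is a placeholder."""
    return requests.get(url, headers={"X-Auth-Key": psk}, timeout=30)

def fm_request_oauth(url: str, token: str) -> requests.Response:
    """OAuth-like: a bearer token obtained from some token endpoint."""
    return requests.get(
        url, headers={"Authorization": f"Bearer {token}"}, timeout=30
    )
```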