Added PjRtClient::UpdateGlobalProcessInfo method.
#28011
Merged
Overview
Recall that a multi-controller JAX program involves multiple PjRt
clients running across multiple processes. These clients perform collective
operations, like AllReduce and AllGather, to execute a distributed program.
This commit adds an `UpdateGlobalProcessInfo` method that updates a client with
information about all processes. For example, if there are four processes, we
might call `UpdateGlobalProcessInfo` on process 0 with the information that
processes 0, 1, and 2 are healthy but process 3 is dead.
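
As a rough illustration of the four-process example above, here is a minimal, self-contained sketch. The `ProcessState`, `ProcessInfo`, and `SketchClient` names are hypothetical stand-ins invented for this example; they are not the real PjRt types, and the actual method signature may carry different information.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical types for illustration; these are not the real PjRt types.
enum class ProcessState { kHealthy, kDead };

struct ProcessInfo {
  std::int32_t process_id;
  ProcessState state;
};

// Stand-in for a client; the real PjRtClient interface is much larger.
class SketchClient {
 public:
  // Replaces this client's view of every process with `infos`.
  void UpdateGlobalProcessInfo(const std::vector<ProcessInfo>& infos) {
    dead_.clear();
    for (const ProcessInfo& info : infos) {
      assert(info.process_id >= 0);
      if (info.state == ProcessState::kDead) dead_.insert(info.process_id);
    }
  }

  // True if the most recent update reported `process_id` as dead.
  bool IsDead(std::int32_t process_id) const {
    return dead_.count(process_id) > 0;
  }

 private:
  std::set<std::int32_t> dead_;
};

// The four-process example, as seen from process 0.
SketchClient FourProcessExample() {
  SketchClient client;
  client.UpdateGlobalProcessInfo({
      {0, ProcessState::kHealthy},
      {1, ProcessState::kHealthy},
      {2, ProcessState::kHealthy},
      {3, ProcessState::kDead},
  });
  return client;
}
```

In this sketch each call replaces the previous view wholesale, so the caller always passes the state of every process rather than a delta. That choice is an assumption of the example.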
Motivation
I am currently working on making multi-controller JAX fault tolerant. Part of
this work involves cancelling collectives where one of the participants of the
collective has failed. The `UpdateGlobalProcessInfo` method will allow a PjRt
client to notice when a peer process has failed and abort any collectives it is
performing with this failed peer.
Previously, I was using the coordination service to
determine when processes failed, but PjRt clients executed via C plugins do not
have access to the coordination service.
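
The cancellation decision described above could look roughly like the following sketch. The `Collective` struct and `ShouldAbort` function are hypothetical names made up for this example; this commit does not implement any such logic, and a real implementation may be structured quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical description of an in-flight collective.
struct Collective {
  std::vector<std::int32_t> participants;  // ids of the processes taking part
};

// Returns true if any participant is in `dead`. Such a collective cannot
// complete, so the client would abort it rather than wait on the failed peer.
bool ShouldAbort(const Collective& collective,
                 const std::set<std::int32_t>& dead) {
  assert(!collective.participants.empty());
  for (std::int32_t id : collective.participants) {
    if (dead.count(id) > 0) return true;
  }
  return false;
}
```

With process 3 reported dead, a collective among processes 0 and 3 would be aborted, while one among processes 0, 1, and 2 would be left to run.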
Future Work
This commit introduces the new `UpdateGlobalProcessInfo` method and pipes it
through the C++ sandwich, but it doesn't actually implement it yet. Nothing is
calling the new method either. These things will come in future changes.
Alternatives
Rather than introducing a new `PjRtClient` method, I could have piped a
coordination service client through the C plugin API into `PjRtClient`s.
However, this would be very complicated. The code to pipe a key-value store
client is complicated, and the API for the coordination service client is
significantly more complex.
I could also have shoehorned the new API into the existing key-value store. For
example, I could establish a convention that the state of every process i is
stored in some special key `process_{i}` in the key-value store. This felt
roundabout.
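
To make the rejected key-value convention concrete, here is a minimal sketch. An in-memory `std::map` stands in for the key-value store, and the `"healthy"`/`"dead"` value encoding and the helper names are assumptions of this example, not part of any existing API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the key-value store; the real store is a shared service.
using KeyValueStore = std::map<std::string, std::string>;

// Key under which the state of process `i` would be stored.
std::string ProcessKey(int i) {
  assert(i >= 0);
  return "process_" + std::to_string(i);
}

// Assumed encoding: the value is the literal string "healthy" or "dead".
void SetProcessState(KeyValueStore& store, int i, bool healthy) {
  store[ProcessKey(i)] = healthy ? "healthy" : "dead";
}

// A missing key is treated as "not known to be healthy".
bool IsHealthy(const KeyValueStore& store, int i) {
  auto it = store.find(ProcessKey(i));
  return it != store.end() && it->second == "healthy";
}
```

Even in this toy form, process state ends up encoded as strings behind a naming convention that every reader and writer has to agree on, which gives a sense of why the approach felt roundabout.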