Thanks @ishant162 @scngupta-dsp for proposing this. I have a few comments and suggestions:
In general, I have seen resiliency handled by a third service that monitors the health of the different components in the deployment. It will be interesting to see how we can achieve resiliency without the need for a third service.
Hey @ishant162 and @scngupta-dsp , thanks for the detailed proposal! The discussion between Option 1 and Option 2 for treating collaborator failures is an interesting one... Although Option 1 takes the "fail-fast" approach, it may make federations brittle - especially in cases where the errors are non-uniform (think of a bug in one of the local datasets that crashes the training process). For those circumstances Option 2 would be better suited IMO. Option 1, on the other hand, could be used for the first version of the implementation, followed by an Option 2-based approach once straggler handling has been introduced. @rahulga1 makes a good point that we expect the failed component to catch and report the error, which may not always be possible (think fatal memory access issues or system outages). While the proposed mechanism does contribute to the overall FederatedRuntime resilience, a complementary one may be needed, based for example on regular health checks that the Director and Envoys would perform resp. on the Aggregator and Collaborator processes. PS: I think
The proposal to enhance exception handling in the SUMMARY Exception Handling in Aggregator and Collaborators:
Exception Handling Mechanism:
Suggestions / Further Enhancements:
NEXT STEPS:
SUMMARY
This proposal aims to enhance the FederatedRuntime components (Aggregator and Collaborators) to effectively handle exceptions encountered during flow execution, thereby improving system robustness and user experience. The specific objectives include:
SCOPE
This proposal focuses on addressing exceptions encountered during flow execution in Aggregator and Collaborator components within FederatedRuntime (a distributed infrastructure). Exception handling is not applicable to LocalRuntime, as it is a simulation where any encountered exceptions are immediately visible to the user, enabling them to correct the issue and rerun the flow.
Handling connectivity issues between participants is outside the scope of this proposal. In the current implementation, an existing health check mechanism between the Envoys and the Director ensures that the Director is aware of the status of participants (online / offline) in the Federation.
MOTIVATION
Current Design & Limitations
If an exception occurs during the execution of an aggregator step (due to bugs, misconfigurations, or logical inconsistencies in the Flow), collaborators are not notified and remain waiting for tasks from the experiment.
Example:
In the following code snippet:
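A plain-Python stand-in for such a flow is sketched here. The class name `ExampleFlow` and the collaborator names are assumptions for illustration, and the OpenFL base class and placement decorators are deliberately left out so the snippet runs on its own.

```python
class ExampleFlow:
    """Hypothetical stand-in for a user-defined flow (not the OpenFL API)."""

    def start(self):
        # `self.collaborators` has not been assigned yet, so this lookup fails.
        print(self.collaborators)
        self.collaborators = ["envoy_one", "envoy_two"]


try:
    ExampleFlow().start()
except AttributeError as exc:
    # In FederatedRuntime this error stays on the aggregator node today.
    error_name = type(exc).__name__
    print(f"start step failed with {error_name}: {exc}")
```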
In the `start` step, the line `print(self.collaborators)` raises an `AttributeError` because `self.collaborators` is accessed before initialization.
If an exception occurs during the execution of a collaborator step (due to bugs, misconfigurations, or logical inconsistencies in the Flow), the aggregator is not informed and continues waiting for responses from collaborator(s).
Example:
In the following code snippet:
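A plain-Python stand-in for such a collaborator step is sketched here. The class name `ExampleFlow`, the step name `local_inference`, and the sample values are assumptions for illustration; no OpenFL imports are used.

```python
class ExampleFlow:
    """Hypothetical stand-in for a user-defined flow (not the OpenFL API)."""

    def local_inference(self):
        # `inference` is never defined or imported, so this call fails.
        self.predictions = inference(self.samples)


flow = ExampleFlow()
flow.samples = [0.1, 0.2]

try:
    flow.local_inference()
except NameError as exc:
    # In FederatedRuntime this error stays on the collaborator node today.
    error_message = str(exc)
    print(f"collaborator step failed with NameError: {error_message}")
```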
If the `inference` function is not defined, it will raise `NameError: name 'inference' is not defined`.
PROPOSAL
If an exception is encountered during flow execution (in an aggregator or collaborator step):
Handling Exceptions in Aggregator Steps:
As the Aggregator is the central component in a Federated Learning experiment, an exception in its steps would stall the ongoing experiment. Therefore, the following handling is proposed:
Handling Exceptions in Collaborator Steps:
An exception in a collaborator step should result in the termination of the ongoing experiment on the specific node. While the lack of participation from an individual collaborator may slow down convergence or reduce diversity, it may not halt the overall training process. Therefore, the following options are considered for handling this scenario:
Scenario Analysis: An analysis of both options in different scenarios is shown in the table below.
In Vertical FL: Continuation of training despite loss of participants would be preferable
In Vertical FL: Not preferred (leads to brittle federations)
In Vertical FL: Preferred (leads to resilient federation)
Based on the scenarios above, Option 1 is preferable for exceptions observed in Horizontal FL and for exceptions caused by enforcement of a security policy. Option 2 offers resilient federations, which could be preferred for handling exceptions in Vertical FL, and enables uniform and scalable handling of participant issues and delays (i.e., stragglers).
Recommendation [To be discussed]:
Because Option 2 enables Federations that are resilient to errors and allows uniform handling of participants that are unable to respond, it is recommended for implementation in the first phase. In subsequent phases this framework could be extended.
Note: For now, collaborator exceptions will not trigger any specific actions. Enhanced handling for such cases will be addressed separately in the Straggler Handling proposal.
PROPOSED APPROACH
In a distributed infrastructure, the Director and Envoys are long-lived components that maintain their state across experiments. The Aggregator and Collaborators are short-lived components whose scope is limited to a single experiment. Since the Director and Envoys are not directly involved in experiment-specific logic, experiment-specific error handling is delegated to the Aggregator and Collaborators. This keeps error handling close to where the issues actually occur and results in a cleaner design.
Step 1: Director & Aggregator Resilience
Enhance the Director and Aggregator to handle exceptions encountered while executing aggregator steps.
TECHNICAL DETAILS
Aggregator:
`time_to_quit` indication to ALL the collaborators.
`time_to_quit` indication to ALL the collaborators (via `get_tasks` response), causing collaborators to stop, and Envoys to revert to the `Wait for experiment` state.
Director:
`Wait for experiment` state.
User:
Note: This is a prototype of the error and may evolve during development.
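The aggregator-side behaviour in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the OpenFL implementation: the class `AggregatorSketch`, the method `run_step`, the dictionary shapes, and the task name `"train"` are assumptions. Only `time_to_quit`, `get_tasks`, `get_flow_state`, the error traceback, and its origin come from the proposal.

```python
import traceback


class AggregatorSketch:
    """Hypothetical stand-in for the Aggregator (not the OpenFL class)."""

    def __init__(self, collaborators):
        self.collaborators = list(collaborators)
        self.time_to_quit = False
        self.error_traceback = None  # kept so it can be surfaced to the user

    def run_step(self, step):
        """Run one aggregator step; record a failure instead of stalling."""
        try:
            step()
        except Exception:
            self.error_traceback = traceback.format_exc()
            self.time_to_quit = True

    def get_tasks(self, collaborator_name):
        """Answer a collaborator poll; all collaborators see the same flag."""
        tasks = [] if self.time_to_quit else ["train"]
        return {"tasks": tasks, "time_to_quit": self.time_to_quit}

    def get_flow_state(self):
        """Assumed shape of the state reported back to the user."""
        origin = "aggregator" if self.error_traceback else None
        return {"error_traceback": self.error_traceback, "origin": origin}


def faulty_start():
    raise AttributeError("collaborators accessed before initialization")


agg = AggregatorSketch(["col_a", "col_b"])
agg.run_step(faulty_start)
replies = [agg.get_tasks(name) for name in agg.collaborators]
state = agg.get_flow_state()
```

After the failed step, every `get_tasks` reply carries `time_to_quit`, so collaborators can stop instead of waiting for tasks that will never arrive.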
Step 2: Envoy & Collaborator Resilience
TECHNICAL DETAILS
Collaborator:
`send_task_results` RPC (for Option 1)
Aggregator: (only for Option 1)
`time_to_quit` indication to ALL participants
Director: (only for Option 1)
`Wait for experiment` state.
User:
Note: This is a prototype of the error and may evolve during development.
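The difference between Option 1 and Option 2 on the collaborator path can be sketched as follows. This is an illustration under stated assumptions: `FailurePolicy`, `AggregatorStub`, and `run_collaborator_step` are hypothetical names, and reporting the traceback under Option 2 is a choice made for the sketch, since the proposal ties the `send_task_results` update to Option 1.

```python
import traceback
from enum import Enum


class FailurePolicy(Enum):
    FAIL_FAST = 1  # Option 1: one collaborator error ends the experiment
    CONTINUE = 2   # Option 2: the experiment carries on without that node


class AggregatorStub:
    """Hypothetical aggregator endpoint (not the OpenFL class)."""

    def __init__(self, policy):
        self.policy = policy
        self.time_to_quit = False
        self.errors = {}  # collaborator name -> traceback text

    def send_task_results(self, name, results=None, error_traceback=None):
        if error_traceback is None:
            return
        self.errors[name] = error_traceback
        if self.policy is FailurePolicy.FAIL_FAST:
            self.time_to_quit = True  # would reach ALL participants


def run_collaborator_step(name, step, aggregator):
    """Run a step and report either its results or its traceback."""
    try:
        results = step()
    except Exception:
        aggregator.send_task_results(name, error_traceback=traceback.format_exc())
        return False
    aggregator.send_task_results(name, results=results)
    return True


def broken_step():
    return inference([0.1, 0.2])  # undefined on purpose, raises NameError


fail_fast = AggregatorStub(FailurePolicy.FAIL_FAST)
run_collaborator_step("col_a", broken_step, fail_fast)

resilient = AggregatorStub(FailurePolicy.CONTINUE)
run_collaborator_step("col_a", broken_step, resilient)
```

With `FAIL_FAST` the stub sets `time_to_quit` after the first reported error; with `CONTINUE` it only records the traceback and the flag stays unset.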
Updates to RPC Communication:
AggregatorClient(Collaborator) -> AggregatorServer(Aggregator):
`send_task_results`: Update to include the error_traceback.
RuntimeDirectorClient(User) -> DirectorServer(Director):
`get_flow_state`: Update to include the error_traceback along with its origin (source of the error).
KEY BENEFITS
CONCERNS
MITIGATION
NOT IN SCOPE
Straggler Handling: To be covered in a follow-up proposal.
NEXT STEPS