[Workflow API] Enhancing Fault Tolerance and Resilience in Federated Runtime #1521

ishant162 · 2025-04-08T12:50:20Z

ishant162
Apr 8, 2025
Collaborator

SUMMARY
This proposal aims to enhance the FederatedRuntime components (Aggregator and Collaborators) to effectively handle exceptions encountered during flow execution, thereby improving system robustness and user experience. The specific objectives include:

Implementing exception handling during flow execution for Aggregator and Collaborator components.
Ensuring proper cleanup of experiments in participating nodes.
(Optional) Notifying users of encountered failures.

SCOPE
This proposal focuses on addressing exceptions encountered during flow execution in Aggregator and Collaborator components within FederatedRuntime (a distributed infrastructure). Exception handling is not applicable to LocalRuntime, as it is a simulation where any encountered exceptions are immediately visible to the user, enabling them to correct the issue and rerun the flow.

Handling connectivity issues between participants is outside the scope of this proposal. In the current implementation, an existing health check mechanism between the Envoys and the Director ensures that the Director is aware of the status of participants (online / offline) in the Federation.

MOTIVATION
Current Design & Limitations

Exceptions in Aggregator Steps:
If an exception occurs during the execution of an aggregator step (due to bugs, misconfigurations, or logical inconsistencies in the Flow), collaborators are not notified and remain waiting for tasks from the experiment.
Example:
In the following code snippet:

@aggregator
def start(self):
    """
    This is the start of the Flow.
    """
    print(self.collaborators)
    self.collaborators = self.runtime.collaborators
    self.current_round = 0
    self.next(self.aggregated_model_validation, foreach="collaborators")

In the start step, the line print(self.collaborators) raises an AttributeError because self.collaborators is accessed before initialization.

Exceptions in Collaborator Steps:
If an exception occurs during the execution of a collaborator step (due to bugs, misconfigurations, or logical inconsistencies in the Flow), the aggregator is not informed and continues waiting for responses from collaborator(s).
Example:
In the following code snippet:

@collaborator
def local_model_validation(self):
    self.local_validation_score = inference(self.model, self.test_loader)  # Not Defined
    print(
        f'Doing local model validation for collaborator {self.input}:'
        f' {self.local_validation_score}')
    self.next(self.join, exclude=['training_completed'])

If the inference function is not defined, it will raise NameError: name 'inference' is not defined.

A real-world example is the functionality introduced in PR #1429, "Support for the exclusion/inclusion of specific data types to be passed through the network".

PROPOSAL
If an exception is encountered during flow execution (in an aggregator or collaborator step):

The framework should catch the exception and report it to the user.
The state of participants in the Federation should be restored to a deterministic state, enabling the user to correct the error and resubmit the updated flow.

Handling Exceptions in Aggregator Steps:
As the Aggregator is the central component in a Federated Learning experiment, an exception in its steps would stall the ongoing experiment. Therefore, the following handling is proposed:
- The framework should catch the exception and terminate the experiment on all participants.
- The failure should be reported back to the user with the last known state of the flow object at the Aggregator.
- All participating nodes (Director and Envoys) should revert to the initial state (i.e., wait for experiment state).
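The aggregator-side handling above can be sketched as a wrapper that runs one step, catches any exception, and returns the traceback together with the last known flow attributes. This is a minimal sketch under stated assumptions: ExperimentStatus, run_aggregator_step and DemoFlow are hypothetical names for illustration and are not part of the OpenFL API.

```python
import traceback
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class ExperimentStatus:
    """Hypothetical status record handed back to the Director."""
    succeeded: bool
    origin: str
    error_traceback: Optional[str] = None
    last_flow_state: Dict[str, Any] = field(default_factory=dict)


def run_aggregator_step(flow: Any, step: Callable[[Any], None]) -> ExperimentStatus:
    """Run one aggregator step and turn any exception into a status record."""
    try:
        step(flow)
    except Exception:
        # Keep the traceback as text so it can be reported to the user,
        # along with the attributes the flow held when the step failed.
        return ExperimentStatus(
            succeeded=False,
            origin="aggregator",
            error_traceback=traceback.format_exc(),
            last_flow_state=dict(vars(flow)),
        )
    return ExperimentStatus(
        succeeded=True, origin="aggregator", last_flow_state=dict(vars(flow))
    )


class DemoFlow:
    """Stand-in for a user flow (assumption, not an OpenFL class)."""

    def __init__(self) -> None:
        self.current_round = 0


def start(flow: DemoFlow) -> None:
    # Mirrors the earlier example: attribute read before it is initialized.
    print(flow.collaborators)


status = run_aggregator_step(DemoFlow(), start)
```

Here status.succeeded is False, status.error_traceback contains the AttributeError, and status.last_flow_state holds the attributes set before the failure.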
Handling Exceptions in Collaborator Steps:
An exception in a collaborator step should result in the termination of the ongoing experiment on the specific node. While the lack of participation from an individual collaborator may slow down convergence or reduce diversity, it may not halt the overall training process. Therefore, the following options are considered for handling this scenario:
- Option 1: Terminate the ongoing experiment on all participating nodes.
- Option 2: Continue running the experiment on remaining participants and handle the particular collaborator as a straggler.
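The difference between the two options can be illustrated with a small policy function that returns which collaborators keep running after one of them fails. This is only a sketch: CollaboratorFailurePolicy and handle_collaborator_failure are hypothetical names, and the collaborator names in the usage lines are placeholders.

```python
from enum import Enum
from typing import Set


class CollaboratorFailurePolicy(Enum):
    TERMINATE_ALL = "option_1"      # stop the experiment everywhere
    CONTINUE_WITHOUT = "option_2"   # drop the failed node, keep the rest


def handle_collaborator_failure(
    active: Set[str], failed: str, policy: CollaboratorFailurePolicy
) -> Set[str]:
    """Return the collaborators that keep running after `failed` raised."""
    if failed not in active:
        raise ValueError(f"unknown collaborator: {failed}")
    if policy is CollaboratorFailurePolicy.TERMINATE_ALL:
        return set()
    # Option 2: the failed collaborator is treated like a straggler.
    return active - {failed}


participants = {"collab_a", "collab_b", "collab_c"}  # placeholder names
after_option_1 = handle_collaborator_failure(
    participants, "collab_b", CollaboratorFailurePolicy.TERMINATE_ALL
)
after_option_2 = handle_collaborator_failure(
    participants, "collab_b", CollaboratorFailurePolicy.CONTINUE_WITHOUT
)
```

With Option 1 the returned set is empty; with Option 2 it contains the two remaining collaborators.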

Scenario Analysis: The table below analyzes both options in different scenarios.

Scenario	Desirable Behavior	Option 1	Option 2
Exception due to flow logic (e.g. missing attribute)	In Horizontal FL: Fail-fast approach would be preferred for easier debugging. In Vertical FL: Continuation of training despite loss of participants would be preferable.	In Horizontal FL: Preferred (fail fast). In Vertical FL: Not preferred (leads to brittle federations).	In Horizontal FL: Not preferred (error could be detected with delay). In Vertical FL: Preferred (leads to resilient federation).
Exception due to collaborator’s dataset crash	Continue training despite participant loss	Not preferred (leads to brittle federation)	Preferred (leads to resilient federation)
Exception due to security policy (e.g. PR 1429)	Detection at first occurrence (i.e. Fail fast) would be preferred	Preferred (fail fast)	Not preferred (error could be detected with delay)
Straggler handling	Handle all slow/unresponsive participants uniformly and at scale	Requires special handling of collaborators that are not responding due to connectivity issues	Scales seamlessly; unified handling of all unresponsive / slow collaborators

Based on the scenarios above, Option 1 is preferable for exceptions observed in Horizontal FL and for exceptions due to enforcement of a security policy. Option 2 offers resilient federations, which could be preferred for handling exceptions in Vertical FL, and enables uniform and scalable handling of participant issues and delays (i.e., stragglers).

Recommendation [To be discussed]:
Because Option 2 enables Federations that are resilient to errors and allows uniform handling of participants that are unable to respond, it is recommended for implementation in the first phase. In subsequent phases this framework could be seamlessly extended.

Note: For now, collaborator exceptions will not trigger any specific actions. Enhanced handling for such cases will be addressed separately in the Straggler Handling proposal.

PROPOSED APPROACH
In a distributed infrastructure, the Director and Envoys are long-lived components that maintain their state across experiments. The Aggregator and Collaborator are short-lived components with scope limited to a single experiment. Since the Director and Envoys are not directly involved in experiment-specific logic, experiment-specific error handling responsibilities are delegated to the Aggregator and Collaborator. This approach ensures error handling occurs closer to where the issues actually occur and results in a cleaner design.

Step 1: Director & Aggregator Resilience
Enhance the Director and Aggregator to handle exceptions encountered while executing aggregator steps.

TECHNICAL DETAILS
Aggregator:

Catch the exception and transition to wait state to send time_to_quit indication to ALL the collaborators.
Send time_to_quit indication to ALL the collaborators (via get_tasks response), causing collaborators to stop, and Envoys to revert to the Wait for experiment state.
Provide the status of experiment (including exception information) and return the control to Director.
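The aggregator behavior described above (stay in a wait state and answer every get_tasks call with time_to_quit until all collaborators have been told) can be sketched as follows. The names get_tasks and time_to_quit come from the proposal; TasksResponse, QuittingAggregator and their fields are assumptions made for illustration and do not reflect the real message layout.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class TasksResponse:
    """Hypothetical shape of a get_tasks reply (the real one has more fields)."""
    tasks: List[str]
    time_to_quit: bool


class QuittingAggregator:
    """Tracks which collaborators have been told to quit after a failure."""

    def __init__(self, collaborators: Set[str]) -> None:
        self.collaborators = set(collaborators)
        self.failed = False
        self.notified: Set[str] = set()

    def mark_failed(self) -> None:
        # Called once an aggregator step has raised an exception.
        self.failed = True

    def get_tasks(self, collaborator: str) -> TasksResponse:
        if self.failed:
            # Wait state: no more work, only the quit indication.
            self.notified.add(collaborator)
            return TasksResponse(tasks=[], time_to_quit=True)
        return TasksResponse(tasks=["aggregated_model_validation"], time_to_quit=False)

    def all_notified(self) -> bool:
        # Only after this is True can control return to the Director.
        return self.failed and self.notified >= self.collaborators
```

In this sketch the aggregator would return control to the Director only once all_notified() is True. Collaborators that never call get_tasks again would keep it waiting, which is one reason a timeout or straggler policy may be needed in practice.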

Director:

Send detailed failure notification to the Experiment Manager.
Transition to the Wait for experiment state.

User:

The user will encounter the error in the following manner. User is expected to correct the flow and resubmit the experiment.

Origin: [aggregator]
Experiment could not run due to error: Traceback (most recent call last):
File "/home/ishant/miniforge3/envs/dir_res/lib/python3.10/site-packages/openfl/experimental/workflow/component/aggregator/aggregator.py", line 342, in run_flow
next_step = self.do_task(f_name)
File "/home/ishant/miniforge3/envs/dir_res/lib/python3.10/site-packages/openfl/experimental/workflow/component/aggregator/aggregator.py", line 533, in do_task
f(*selected_clones)
File "/home/ishant/miniforge3/envs/dir_res/lib/python3.10/site-packages/openfl/experimental/workflow/placement/placement.py", line 45, in wrapper
f(*args, **kwargs)
File "/home/ishant/director_res/openfl/openfl-tutorials/experimental/workflow/FederatedRuntime/301_MNIST_Watermarking/director/FederatedFlow_MNIST_Watermarking/src/experiment.py", line 155, in start
print(self.test_var)
AttributeError: 'FederatedFlow_MNIST_Watermarking' object has no attribute 'test_var'.

Note: This is a prototype of the error and may evolve during development.

Step 2: Envoy & Collaborator Resilience

TECHNICAL DETAILS
Collaborator:

If a collaborator encounters an exception during the experiment, the Aggregator receives a detailed failure traceback via the existing send_task_results RPC (for Option 1).
The particular Envoy shall transition to Wait for experiment state.

Aggregator: (only for option 1)

Analyze the results reported by the collaborator.
If an exception is reported, stop the experiment and transition to a state where it waits to send the time_to_quit indication to ALL participants.
Provide the status of experiment (incl. exception information) and return the control to Director.

Director: (only for option 1)

Send detailed failure notification to the Experiment Manager.
Transition to the Wait for experiment state.

User:

The user will encounter the error in the following manner. User is expected to correct the flow and resubmit the experiment.

Origin: [Chandler]
Experiment could not run due to error: Traceback (most recent call last):
File "/home/ishant/miniforge3/envs/envoy-res/lib/python3.10/site-packages/openfl/experimental/workflow/component/collaborator/collaborator.py", line 148, in run
f_name, ctx = self.do_task(next_step, clone)
File "/home/ishant/miniforge3/envs/envoy-res/lib/python3.10/site-packages/openfl/experimental/workflow/component/collaborator/collaborator.py", line 216, in do_task
f()
File "/home/ishant/miniforge3/envs/envoy-res/lib/python3.10/site-packages/openfl/experimental/workflow/placement/placement.py", line 96, in wrapper
f(*args, **kwargs)
File "/home/ishant/envoy_res/openfl/openfl-tutorials/experimental/workflow/FederatedRuntime/301_MNIST_Watermarking/Chandler/Chandler_FederatedFlow_MNIST_Watermarking/src/experiment.py", line 199, in aggregated_model_validation
print(self.test_var)
AttributeError: 'FederatedFlow_MNIST_Watermarking' object has no attribute 'test_var'

Note: This is a prototype of the error and may evolve during development.

Updates to RPC Communication:

AggregatorClient(Collaborator) -> AggregatorServer(Aggregator)
- send_task_results: Update to include the error_traceback.
RuntimeDirectorClient(User) -> DirectorServer(Director):
- get_flow_state: Update to include the error_traceback along with its origin (source of the error).
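The two RPC updates above could carry the error roughly as sketched below. The names send_task_results, get_flow_state and error_traceback come from the proposal; the TaskResults and FlowState classes, their other fields, and the helper functions are hypothetical stand-ins, since the actual message definitions are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskResults:
    """Hypothetical payload of send_task_results (Collaborator -> Aggregator)."""
    collaborator_name: str
    step_name: str
    error_traceback: Optional[str] = None


@dataclass
class FlowState:
    """Hypothetical payload of get_flow_state (Director -> User)."""
    failed: bool
    error_origin: Optional[str] = None
    error_traceback: Optional[str] = None


def to_flow_state(result: TaskResults) -> FlowState:
    """Translate a collaborator result into the state reported to the user."""
    if result.error_traceback is None:
        return FlowState(failed=False)
    return FlowState(
        failed=True,
        error_origin=result.collaborator_name,
        error_traceback=result.error_traceback,
    )


def format_for_user(state: FlowState) -> str:
    """Render the state in the style of the prototype error message."""
    if not state.failed:
        return "Experiment completed."
    return (
        f"Origin: [{state.error_origin}]\n"
        f"Experiment could not run due to error: {state.error_traceback}"
    )
```

Keeping the origin next to the traceback lets the user see which participant raised the error without parsing file paths.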

KEY BENEFITS

Improves overall Robustness of the system by catching exceptions and transitioning the federation to a deterministic state.
Improves User Experience by informing the user of the exception encountered during the experiment.

CONCERNS

Testing and Validation: Thorough testing is required to validate error-handling paths, especially in distributed setups.

MITIGATION

Documentation and Logging: Clearly document new behaviors and log meaningful messages to help maintain transparency and traceability during failures.
Extensive Unit & Integration Testing: Create test cases to simulate failure conditions across all components, ensuring coverage of edge cases.

NOT IN SCOPE
Straggler Handling: To be covered in a follow-up proposal.

NEXT STEPS

Implementation

payalcha · 2025-04-08T14:55:11Z

payalcha
Apr 8, 2025
Collaborator

If a collaborator is not able to connect to the aggregator, will a retry with some timeout be added? Currently there is no retry in the collaborator-aggregator connection.
There are two proposed options. Option 1 is to revert to the initial state, so if the model has been trained for 5 rounds and a failure happens in a collaborator in the 6th round, what is expected to happen?

1 reply

ishant162 Apr 23, 2025
Collaborator Author

Thanks @payalcha for the feedback.

Assuming the comment refers to Envoys and the Director, it is acknowledged and will be addressed in a separate PR. This proposal specifically focuses on handling errors encountered during flow execution.
We propose that the last known state of the flow object (in this case from the 5th round) will be returned to the user by the aggregator.

rahulga1 · 2025-04-10T09:00:01Z

rahulga1
Apr 10, 2025
Maintainer

Thanks @ishant162 @scngupta-dsp for proposing this. I have a few comments and suggestions:

Why is exception handling limited to FederatedRuntime experiments? Having it for LocalRuntime or even SecuredFederatedRuntime would be useful, so will it be generic enough?
Can we add some real examples of Fault Tolerance/Resiliency here?
Sometimes exceptions happen because corner cases are missed. For example, one of the components loses network connectivity briefly and cannot respond in time. In these scenarios a certain number of retries improves user experience and saves time/resources.
Why are we relying on the failed component itself to propagate the error to other components? If a collaborator has stopped, how will it send the exception back? Also, what if send_task_results itself has an issue on the collaborator side?

In general, I have seen resiliency handled by a third service which monitors the health of the different components in the deployment. It will be interesting to see how we can have resiliency without the need for a third service.

1 reply

ishant162 Apr 23, 2025
Collaborator Author

Thanks @rahulga1 for the feedback.

This proposal is applicable to Federated Runtime for the following reasons:
- LocalRuntime is a simulation, meaning any exceptions encountered during flow.run() are immediately visible to the user, allowing them to correct the issue and rerun the flow. Proposal is updated to clarify this aspect.
- SecureFederatedRuntime does not currently exist, but the same principles can be applied once implemented.
Accepted and proposal is updated with examples of errors. A real-world example of this is functionality introduced in PR Support for the exclusion/inclusion of specific data types to be passed through the network. #1429.
The proposal has been updated to clarify that it focuses on addressing exceptions encountered during the execution of the flow. Handling connectivity issues between participants is outside the scope of this proposal. Our understanding is that components which are not able to respond in time due to network connectivity issues will need to be handled as stragglers, and this will be covered in a separate proposal.
We had two approaches to handle experiment failure:
1. Initial Approach:
  The Director catches exceptions from the Aggregator and notifies the respective Envoys to stop the Collaborators.
  Envoys, in turn, catch exceptions from the Collaborators and report failures back to the Director. The concern with this approach was that it involved long-lived components to monitor execution of flow and due to this an alternate approach was evaluated.
2. Proposed Approach:
  Aggregator and Collaborator handle their own exceptions during experiment execution. Since they are directly involved in running the flow, it makes sense for them to manage failure locally. This keeps error handling modular, clean, and closer to where the error actually occurs.

Let us discuss this offline and evaluate the relative merits of both approaches (similar comment from Teo).

Note: send_task_results is an RPC, and if a collaborator has issues in sending task results, our understanding is that it would get treated as a straggler (to be handled separately).

teoparvanov · 2025-04-10T13:24:57Z

teoparvanov
Apr 10, 2025
Maintainer

Hey @ishant162 and @scngupta-dsp , thanks for the detailed proposal!

The discussion between Option 1 and Option 2 for treating collaborator failures is an interesting one... Although Option 1 takes the "fail-fast" approach, it may make federations brittle - especially in cases where the errors are non-uniform (think of a bug in one of the local datasets that crashes the training process). For those circumstances Option 2 would be better suited IMO. Option 1, on the other hand, could be used for the first version of the implementation, followed by an Option 2-based approach once straggler handling has been introduced.

@rahulga1 makes a good point that we expect the failed component to catch and report the error, which may not always be possible (think fatal memory access issues or system outages). While the proposed mechanism does contribute to the overall FederatedRuntime resilience, a complementary one may be needed, based for example on regular health checks that the Director and Envoys would perform resp. on the Aggregator and Collaborator processes.

PS: I think LocalRuntime can be left out of scope, as it is a single process anyway, and any uncaught exception would result in an exit. If so, this should be clearly stated in the design proposal.

1 reply

ishant162 Apr 23, 2025
Collaborator Author

Thanks @teoparvanov for the feedback.

The behavior of LocalRuntime in case of exceptions has been described in the proposal.
We’ve outlined a comparison of both options for handling collaborator failures. We’d like to discuss this further to determine which approach should be followed.
As clarified in the proposal, the current scope is limited to scenarios where an exception is raised and can be caught. For handling unexpected or fatal errors, we would need to evaluate alternate approaches. We propose to discuss and align on this offline.

scngupta-dsp · 2025-05-08T08:54:17Z

scngupta-dsp
May 8, 2025

The proposal to enhance exception handling in the FederatedRuntime components was discussed in an offline review with key stakeholders

SUMMARY

Exception Handling in Aggregator and Collaborators:

Aggregator Exceptions: If an exception or failure occurs in the Aggregator, the experiment should be terminated across all participants
Collaborator Exceptions: Exceptions or failures in Collaborators should not lead to the termination of the entire experiment. Instead, the preferred approach is to continue the experiment with the remaining participants (i.e. Option-2), treating the affected Collaborator as a straggler. This approach enables resilient federations and allows for unified handling of participants unable to respond due to exceptions, failures, connectivity issues, or other reasons

Exception Handling Mechanism:

The current proposal for catching exceptions should be enhanced to monitor short-lived components (Aggregator and Collaborators) for any errors, whether or not they result in exceptions.
Since the Aggregator runs as an async-coroutine within the Director process and the Collaborator runs as a function call within the Envoy process, the framework should be enhanced to:
- Introduce a heartbeat mechanism to enable long-lived components (Director and Envoys) to monitor the health of short-lived components (Aggregator and Collaborators).
- Enhance Director to monitor the health of the Aggregator async-coroutine (within the same process).
- Refactor the Collaborator implementation to run as an async-coroutine (within the same process), and enhance Envoy to monitor the health of the Collaborator async-coroutine.
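A long-lived component monitoring a short-lived coroutine in the same process could look roughly like the asyncio sketch below. It is an illustration under assumptions: aggregator_coroutine, monitor and director_run are hypothetical names, and polling is used only to keep the example short.

```python
import asyncio
from typing import Optional


async def aggregator_coroutine(fail: bool) -> str:
    """Stand-in for the Aggregator coroutine (not the OpenFL implementation)."""
    await asyncio.sleep(0.01)
    if fail:
        raise RuntimeError("aggregator step failed")
    return "finished"


async def monitor(task: asyncio.Task, interval: float = 0.005) -> Optional[BaseException]:
    """Poll the task until it is done, then return its exception, if any."""
    while not task.done():
        await asyncio.sleep(interval)
    if task.cancelled():
        return asyncio.CancelledError()
    return task.exception()


async def director_run(fail: bool) -> str:
    """Start the coroutine, watch it, and report the state the Director ends in."""
    task = asyncio.create_task(aggregator_coroutine(fail))
    error = await monitor(task)
    if error is not None:
        return f"wait_for_experiment (after error: {error})"
    return "wait_for_experiment"
```

Calling asyncio.run(director_run(True)) returns the wait state together with the error text. Note that a monitor sharing the event loop can only observe a task that finishes or fails; a coroutine that blocks the loop would also block the monitor, so detecting hangs would need a separate thread or process.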

Suggestions / Further Enhancements:

The flflow.run() API is currently a blocking call, preventing users from stopping the experiment. Future releases should provide an API, such as flflow.stop(), to allow users to halt experiments. This will require refactoring FederatedRuntime APIs to introduce an asynchronous mode of execution (e.g. flflow.run(mode=async))
It would be beneficial to launch Director and Envoys as agents or bots, for example, within Kubernetes, to support SecureFederatedRuntime.
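The non-blocking run and stop idea from the first suggestion can be sketched with a background thread and a stop flag. Everything here is hypothetical: StoppableFlow is not an OpenFL class, and the mode argument is passed as the string "async" because async is a reserved word in Python.

```python
import threading
import time
from typing import Optional


class StoppableFlow:
    """Hypothetical wrapper illustrating run(mode=...) and stop()."""

    def __init__(self, rounds: int) -> None:
        self.rounds = rounds
        self.completed_rounds = 0
        self._stop_requested = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def _loop(self) -> None:
        for _ in range(self.rounds):
            if self._stop_requested.is_set():
                break  # stop between rounds, never in the middle of one
            time.sleep(0.01)  # placeholder for one round of work
            self.completed_rounds += 1

    def run(self, mode: str = "sync") -> None:
        if mode == "async":
            # Return immediately and keep working in the background.
            self._thread = threading.Thread(target=self._loop, daemon=True)
            self._thread.start()
        else:
            self._loop()

    def stop(self) -> None:
        self._stop_requested.set()
        if self._thread is not None:
            self._thread.join()


flow = StoppableFlow(rounds=1000)
flow.run(mode="async")  # returns immediately
flow.stop()             # halts after the round in progress
```

In synchronous mode run() blocks until all rounds finish, which matches the current behavior described above.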

NEXT STEPS:
The proposal will be implemented in the following phases:

Step 1: Enhance Aggregator and Collaborators to gracefully catch exceptions, allowing the respective Director and Envoys to return to the wait_for_experiment state.
Step 2: Introduce health monitoring between the Director and Aggregator, and between Envoys and Collaborators.
Step 3: Refactor FederatedRuntime APIs to introduce an asynchronous mode for executing flows and additional APIs to allow users to stop experiments. This step shall be addressed by a separate design proposal to introduce the API enhancements

1 reply

teoparvanov May 12, 2025
Maintainer

Thanks for summarizing, @scngupta-dsp!

If we are going with Option 2 for collaborator exceptions, it means that the (experimental) Aggregator should be enhanced with at least a basic straggler handler strategy (similar to the non-experimental Aggregator), right?
