- Proposal: [MP-0004](0004-FaultTolerance.md)
- Authors: [Emily Jiang](https://github.com/Emily-Jiang), [Jonathan Halterman](https://github.com/jhalterman/), [Antoine Sabot-Durand](https://github.com/antoinesd), [John Ament](https://github.com/johnament)
- Status: v1.0 released

During the review process, add the following fields as needed:

- Decision Notes: [Discussion thread topic covering the Rationale](https://groups.google.com/forum/#!topic/microprofile/ezFC1TLGozU), [Discussion thread topic with additional Commentary](https://groups.google.com/forum/#!forum/microprofile)

It is increasingly important to build fault-tolerant microservices. Fault tolerance is about leveraging different strategies to guide the execution and result of some logic. Retry policies, bulkheads, and circuit breakers are popular concepts in this area. They dictate whether and when executions should take place, and fallbacks offer an alternative result when an execution does not complete successfully.
As mentioned above, the Fault Tolerance proposal focuses on the following aspects: TimeOut, RetryPolicy, Fallback, Bulkhead and CircuitBreaker.
- TimeOut: define a duration for the timeout.
- RetryPolicy: define the criteria for when to retry.
- Fallback: provide an alternative solution for a failed execution.
- Bulkhead: isolate failures in one part of the system so that the rest of the system can still function.
- CircuitBreaker: offer a way to fail fast by automatically failing executions, which prevents system overload and indefinite waits or timeouts on the client side.

The main design is to separate execution logic from execution. The execution can be configured with fault tolerance policies, such as RetryPolicy, fallback, Bulkhead and CircuitBreaker.
[Hystrix](https://github.com/Netflix/Hystrix) and [Failsafe](https://github.com/jhalterman/failsafe) are two popular libraries for handling failures. This proposal defines a standard API and approach for applications to follow in order to achieve fault tolerance.
The requirements are as follows:
- Loose coupling: execution logic should not know anything about the execution status or fault tolerance.
- The failure handling strategy should be configured when the execution takes place.
- Support for synchronous and asynchronous execution.
- Integration with 3rd party asynchronous APIs. This is necessary to handle executions that are completed at some time in the future, where retries will need to be explicitly scheduled from within the asynchronous execution. This is common when working with various 3rd party asynchronous tools such as Netty, RxJava, Vert.x, etc.
- Require immutable failure handling policy configuration.
- Some failure policy configurations, e.g. CircuitBreaker and RetryPolicy, can be used standalone. For example, it has been very useful for circuit breakers to be standalone constructs which can be plugged into and intentionally shared across multiple executions. The same applies to retry policies. Additionally, an Execution construct can be offered that allows retry policies to be applied to some logic in a standalone, manually controlled way.

Mailing list thread: [Discussion thread topic for that proposal](https://groups.google.com/forum/#!topic/microprofile/ezFC1TLGozU)
Currently there are at least two libraries that provide fault tolerance. It is best to unify these technologies and define a standard that microservice applications can adopt, so that the implementation of fault tolerance can be provided by the containers where possible.
Separate the responsibility of executing logic (Runnables/Callables/etc) from guiding when execution should take place (through retry policies, bulkheads, circuit breakers). In this way, failure handling strategies become configuration that can influence executions, and the execution API itself is just responsible for receiving some configuration and performing executions.
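
As a rough illustration of this separation, the sketch below keeps the retry configuration in an immutable value object and lets a separate executor apply it to arbitrary logic. The names `SimpleRetryPolicy` and `PolicyExecutor` are hypothetical and are not part of the API this proposal defines; the sketch also assumes that every exception counts as a failure.

```java
import java.util.concurrent.Callable;

// Hypothetical immutable policy: pure configuration, no execution logic.
final class SimpleRetryPolicy {
    private final int maxRetries;

    SimpleRetryPolicy(int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        this.maxRetries = maxRetries;
    }

    int maxRetries() {
        return maxRetries;
    }
}

// Hypothetical executor: receives a policy and performs executions.
final class PolicyExecutor {
    private final SimpleRetryPolicy policy;

    PolicyExecutor(SimpleRetryPolicy policy) {
        this.policy = policy;
    }

    // Runs the logic once, then retries up to maxRetries times on any exception.
    <T> T call(Callable<T> logic) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= policy.maxRetries(); attempt++) {
            try {
                return logic.call();
            } catch (Exception e) {
                last = e; // assumption: every exception is treated as a failure
            }
        }
        throw last; // all attempts failed; rethrow the most recent failure
    }
}
```

The logic passed to `call` knows nothing about retries, and the same `SimpleRetryPolicy` instance could be shared by several executors because it cannot change after construction.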
By default, a failure handling strategy could assume, for example, that any exception is a failure. This is what the RetryPolicy's `retryOn` and `abortOn` clauses are about: defining a failure.
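
A minimal sketch of how such clauses could define a failure is shown here. It assumes that `abortOn` takes precedence over `retryOn`, and that an empty `retryOn` list means any exception is a failure. `FailureRules` is a hypothetical name and not the API of this proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: decides whether a thrown exception should trigger a retry.
final class FailureRules {
    private final List<Class<? extends Throwable>> retryOn;
    private final List<Class<? extends Throwable>> abortOn;

    FailureRules(List<Class<? extends Throwable>> retryOn,
                 List<Class<? extends Throwable>> abortOn) {
        // Defensive copies keep the configuration immutable.
        this.retryOn = Collections.unmodifiableList(new ArrayList<>(retryOn));
        this.abortOn = Collections.unmodifiableList(new ArrayList<>(abortOn));
    }

    boolean shouldRetry(Throwable t) {
        if (matches(abortOn, t)) {
            return false; // assumption: abortOn wins over retryOn
        }
        // assumption: an empty retryOn list means "retry on any exception"
        return retryOn.isEmpty() || matches(retryOn, t);
    }

    private static boolean matches(List<Class<? extends Throwable>> types, Throwable t) {
        for (Class<? extends Throwable> type : types) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```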
Standardise the Fallback, Bulkhead and CircuitBreaker APIs and provide implementations.
- CDI-first approach to apply RetryPolicy, Fallback, Bulkhead and CircuitBreaker using annotations

This specification utilises CDI to simplify the programming model.
Use interceptor bindings to specify the execution and policy configuration. The `Asynchronous` annotation has to be specified for any asynchronous call. Otherwise, synchronous execution is assumed.
`@Retry`: an annotation to specify the maximum number of retries, delays, maxDuration, duration unit, jitter, retryOn, etc.
`@CircuitBreaker`: an annotation to specify when to open the circuit, when to half-open it and when to close it.
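
The open, half-open and closed transitions can be sketched as a small state machine. This is only an illustration under stated assumptions: the circuit opens after a number of consecutive failures, a single successful trial call closes it again, and the class is not thread-safe. `SimpleCircuitBreaker` and its parameters `failureThreshold` and `openMillis` are hypothetical names, not the annotation's actual parameters.

```java
import java.util.function.LongSupplier;

// Hypothetical, single-threaded circuit breaker sketch (not thread-safe).
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // how long the circuit stays open before a trial call
    private final LongSupplier clock;   // time source in milliseconds, injectable for tests

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    // Returns true if a call may proceed; moves OPEN to HALF_OPEN once the delay has elapsed.
    boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    // A success resets the failure count and closes the circuit.
    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    // A failure in HALF_OPEN, or too many failures in a row, opens the circuit.
    void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    State state() {
        return state;
    }
}
```

A caller would check `allowRequest()` before each execution and fail fast when it returns `false`, then report the outcome through `recordSuccess()` or `recordFailure()`.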
`@Timeout`: an annotation to specify the maximum time allowed for a particular execution.
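
One way to bound the duration of an execution in plain Java is to run it on a separate thread and wait for the result with a deadline. The sketch below assumes this approach; `TimeLimiter` is a hypothetical helper and does not describe how an implementation of this proposal must enforce timeouts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper that bounds how long a piece of logic may run.
final class TimeLimiter {
    static <T> T callWithTimeout(Callable<T> logic, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(logic);
            try {
                // Wait for the result, but no longer than the configured duration.
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the logic that is still running
                throw e;
            }
        } finally {
            executor.shutdownNow(); // release the worker thread in every case
        }
    }
}
```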
`@Bulkhead`: use this annotation without the `Asynchronous` annotation for a semaphore-style bulkhead. When used with `Asynchronous`, it means a thread-pool style bulkhead.
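
The semaphore style can be pictured as a counter of permits that limits how many calls run concurrently on the callers' own threads. The sketch below is an illustration only: `SemaphoreBulkhead` is a hypothetical name, and rejecting a call with `IllegalStateException` when no permit is free is an assumption rather than the behaviour this proposal specifies.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Hypothetical semaphore-style bulkhead: limits concurrent calls on the caller's thread.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Callable<T> logic) throws Exception {
        // Reject immediately instead of queueing when the bulkhead is full.
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead is full");
        }
        try {
            return logic.call();
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A thread-pool style bulkhead would instead hand the logic to a bounded pool of worker threads, which is why it is paired with asynchronous execution.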
## Usage
The annotations can be applied to a bean or to methods. They can be used together. For instance, `@Retry` can be used with `@Fallback` in order to trigger the fallback when the `Retry` policy fails.
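```java
import javax.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.faulttolerance.Retry;

@ApplicationScoped
public class FaultToleranceBean {
    int i = 0;

    // Retry up to two times on failure.
    @Retry(maxRetries = 2)
    public Runnable doWork() {
        // This unreliable service sometimes succeeds but
        // sometimes throws a RuntimeException.
        // serviceA() is assumed to be defined elsewhere in this bean.
        Runnable mainService = () -> serviceA();
        return mainService;
    }
}
```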
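
The combination of retry and fallback described in this section can be sketched in plain Java without any container. The helper below is a hypothetical illustration, not the interceptor an implementation would provide: it assumes any exception is a failure and that the fallback is consulted only after the initial attempt and all retries have failed.

```java
import java.util.concurrent.Callable;
import java.util.function.Supplier;

// Hypothetical helper: retry first, fall back only when every attempt has failed.
final class RetryWithFallback {
    static <T> T call(Callable<T> logic, int maxRetries, Supplier<T> fallback) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return logic.call();
            } catch (Exception e) {
                // Failed attempt; the loop retries until the retry budget is used up.
            }
        }
        return fallback.get(); // the retry policy failed, so use the alternative result
    }
}
```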
The annotation parameters can be configured via MicroProfile Config. In order to configure `maxRetries` to be `6` for the following `Retry` policy, define the property `org.microprofile.readme.FaultToleranceBean/doWork/Retry/maxRetries=6`. Alternatively, if `maxRetries` is to be configured to `6` for all occurrences of `Retry`, just specify the property `Retry/maxRetries=6`.
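```java
package org.microprofile.readme;

import javax.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.faulttolerance.Retry;

@ApplicationScoped
public class FaultToleranceBean {
    int i = 0;

    // maxRetries can be overridden via MicroProfile Config.
    @Retry(maxRetries = 2)
    public Runnable doWork() {
        // This unreliable service sometimes succeeds but
        // sometimes throws a RuntimeException.
        // serviceA() is assumed to be defined elsewhere in this bean.
        Runnable mainService = () -> serviceA();
        return mainService;
    }
}
```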
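
To make the two property forms concrete, the following sketch resolves `maxRetries` from a plain map of properties. It assumes that the method-specific key takes precedence over the global `Retry/maxRetries` key, and that the annotation value is used when neither is set. `RetryConfigLookup` is a hypothetical helper and does not use the MicroProfile Config API.

```java
import java.util.Map;

// Hypothetical lookup of the effective maxRetries value from configuration properties.
final class RetryConfigLookup {
    static int maxRetries(Map<String, String> config, String className,
                          String methodName, int annotationValue) {
        String methodKey = className + "/" + methodName + "/Retry/maxRetries";
        String globalKey = "Retry/maxRetries";
        if (config.containsKey(methodKey)) {
            return Integer.parseInt(config.get(methodKey)); // assumption: most specific key wins
        }
        if (config.containsKey(globalKey)) {
            return Integer.parseInt(config.get(globalKey));
        }
        return annotationValue; // nothing configured: keep the value from the annotation
    }
}
```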