
Exponential backoff to prevent cascading failures #20717

@elias-dbx

What would you like to be added?

Exponential backoff with jitter in client RPC calls.

Why is this needed?

In most code paths in the etcd v3 client, the backoff after an error is a flat duration, sometimes with a small amount of jitter. During some failures, a flat backoff does not shed enough load from the etcd servers, which leads to cascading failures. A more adaptive approach is exponential backoff with jitter: it reduces the RPC/second load more effectively, and the larger jitter further de-correlates thundering herds.

I believe this can be done in three changes:

  1. Add a new field, BackoffExponent, to the etcd client config that sets the exponential factor for backoff. For example, BackoffExponent=2 would double the backoff duration after each failure, whereas BackoffExponent=1 would leave it unchanged. The default can be BackoffExponent=1 to preserve current behavior.
  2. Add a new field, BackoffWaitBetweenMax, to the etcd client config that caps the backoff duration when BackoffExponent > 1. The default can be BackoffWaitBetweenMax=5 seconds.
  3. Implement backoff in lease streams. There is currently no backoff or jitter when a lease stream fails, which can cause cascading failures.
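
To make the proposal concrete, here is a minimal sketch of how the wait before each retry could be computed. It is not the client's actual code: the `backoffConfig` type, its field names, and the `ceiling`, `jittered`, and `wait` helpers are hypothetical and only mirror the fields proposed above. It also assumes that an exponent below 1 is treated as 1, that an unset maximum falls back to 5 seconds, and that jitter is applied symmetrically around the capped value.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

// backoffConfig is a hypothetical stand-in for the proposed client config
// fields; it is not part of the etcd client API.
type backoffConfig struct {
	WaitBetween    time.Duration // base wait before the first retry
	Exponent       float64       // proposed BackoffExponent; 1 keeps the wait flat
	WaitBetweenMax time.Duration // proposed BackoffWaitBetweenMax; caps the wait
	JitterFraction float64       // fraction of the wait to randomize, in [0, 1]
}

// ceiling returns the un-jittered wait before retry number attempt (0-based):
// WaitBetween * Exponent^attempt, capped at WaitBetweenMax.
func (c backoffConfig) ceiling(attempt uint) time.Duration {
	if c.Exponent <= 1 {
		// Assumption: exponents at or below 1 preserve the flat behavior.
		return c.WaitBetween
	}
	limit := c.WaitBetweenMax
	if limit <= 0 {
		// Assumption: fall back to the 5 second default suggested in the issue.
		limit = 5 * time.Second
	}
	d := float64(c.WaitBetween) * math.Pow(c.Exponent, float64(attempt))
	if d > float64(limit) {
		// Also covers overflow to +Inf for very large attempt counts.
		return limit
	}
	return time.Duration(d)
}

// jittered spreads the wait over [w*(1-f), w*(1+f)), where w is the capped
// wait, f is JitterFraction, and u is a sample in [0, 1). Because jitter is
// applied after the cap, the result can exceed WaitBetweenMax by up to f.
func (c backoffConfig) jittered(attempt uint, u float64) time.Duration {
	w := c.ceiling(attempt)
	if c.JitterFraction <= 0 {
		return w
	}
	delta := (2*u - 1) * c.JitterFraction
	return time.Duration(float64(w) * (1 + delta))
}

// wait draws the random sample from math/rand and returns the next wait.
func (c backoffConfig) wait(attempt uint) time.Duration {
	return c.jittered(attempt, rand.Float64())
}
```

With a base wait of 100ms, an exponent of 2, and a 5 second maximum, the un-jittered waits would be 100ms, 200ms, 400ms, 800ms, and so on until they reach the cap. A retry loop, including one around a failed lease stream, could call `wait(attempt)` before each reconnect and reset `attempt` to 0 after a success.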
