Lameducking period for graceful shutdown / serve_with_shutdown #1940

Open
vandry opened this issue Sep 14, 2024 · 2 comments
Comments


vandry commented Sep 14, 2024

Feature Request

Crates

tonic

Motivation

In tonic right now, serve_with_shutdown does not implement lameducking the way I am used to seeing it. What I was expecting is that, on SIGTERM, idle keepalive connections would be disconnected and readiness health checks would start returning negative, but new connections would continue to be accepted (albeit with no keepalive) during a grace period, in order to give clients time to notice that we are lameducking and select different backends. Then, after a delay, we would eventually close the listening socket.

What actually happens is that we close the listening socket (or at least we stop calling accept() on it) immediately and then drain requests from existing connections.
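
For reference, a minimal sketch of the status quo (MyGreeterServer / MyGreeter stand in for a hypothetical generated service): the moment the future passed to serve_with_shutdown resolves, tonic stops accepting new connections and only drains the ones that already exist.

use std::net::SocketAddr;

use tokio::signal::unix::{signal, SignalKind};
use tonic::transport::Server;

// `MyGreeterServer` / `MyGreeter` are hypothetical generated service types.
async fn run(addr: SocketAddr) -> Result<(), Box<dyn std::error::Error>> {
    let mut sigterm = signal(SignalKind::terminate())?;
    Server::builder()
        .add_service(MyGreeterServer::new(MyGreeter::default()))
        .serve_with_shutdown(addr, async move {
            // As soon as this future resolves, tonic stops accepting new
            // connections and only drains requests on the existing ones.
            sigterm.recv().await;
        })
        .await?;
    Ok(())
}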

Is that okay? I don't know; it depends on whether clients can be counted on to promptly and transparently try a different backend when they get either a refused or a reset connection (depending on the timing) during the short interval after we have started shutting down but before they have had a chance to update their backend list to exclude us. I feel like most gRPC clients might be all right there, but there are stories of 502s from nginx floating about...

vandry/td-archive-server@7e202e5

Proposal

First, I would like to solicit opinions about whether the behaviour I am looking for is needed or whether the status quo is good enough. We definitely implement the lameducking delay as I describe it, at very large scale, at $DAYJOB, but there might be other considerations here.

If the feature is deemed desirable, then I propose the following (a rough approximation using today's APIs is sketched after the list):

  • Allow a lameducking delay to be configured, defaulting to zero, which preserves the present behaviour of serve_with_shutdown.
  • If nonzero, first begin draining all existing connections and set max_connection_age to zero on all new ones. Then sleep for the delay while still accepting new connections. Then drop the listener and return.
  • It is the caller's responsibility (e.g. using the tonic-health crate) to flip health checks to bad when the shutdown signal is first sent.
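
For concreteness, here is a rough approximation of that flow that is possible with today's APIs. This is a sketch only: MyGreeterServer / MyGreeter are hypothetical generated names and the 15-second grace period is arbitrary. tonic-health is flipped to NOT_SERVING on SIGTERM, we keep serving for the grace period, and only then resolve the shutdown future. What this cannot do is force max_connection_age to zero on connections accepted during the grace period; that part would need support inside tonic.

use std::{net::SocketAddr, time::Duration};

use tokio::signal::unix::{signal, SignalKind};
use tonic::transport::Server;

// `MyGreeterServer` / `MyGreeter` are hypothetical generated service types;
// the 15-second grace period is an arbitrary example value.
async fn run_with_lameduck(addr: SocketAddr) -> Result<(), Box<dyn std::error::Error>> {
    let (mut health_reporter, health_service) = tonic_health::server::health_reporter();
    health_reporter.set_serving::<MyGreeterServer<MyGreeter>>().await;

    let shutdown = async move {
        let mut sigterm = signal(SignalKind::terminate()).expect("install SIGTERM handler");
        sigterm.recv().await;
        // Start lameducking: readiness goes negative, but we keep serving.
        health_reporter.set_not_serving::<MyGreeterServer<MyGreeter>>().await;
        tokio::time::sleep(Duration::from_secs(15)).await;
        // Only when this future resolves does tonic drop the listener and drain.
    };

    Server::builder()
        .add_service(health_service)
        .add_service(MyGreeterServer::new(MyGreeter::default()))
        .serve_with_shutdown(addr, shutdown)
        .await?;
    Ok(())
}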

Alternatives

See above; it is possible that the status quo is fine.

vandry added a commit to vandry/comprehensive that referenced this issue Oct 5, 2024
Note that graceful shutdown is not lame ducking. With graceful shutdown
existing requests are drained but the listening socket is dropped
immediately upon signal, so we do not accommodate clients that need time
to notice that we are going away.

However, in Kubernetes lame ducking may not be necessary as we are
probably removed from backend pools already before we get SIGTERM.
Still, might some clients be lagging behind and still try talking to
us, fail and not retry, and thus benefit if we lame duck? I don't
know, see hyperium/tonic#1940

Dietr1ch commented Oct 24, 2024

> What actually happens is that we close the listening socket (or at least we stop calling accept() on it) immediately and then drain requests from existing connections.
> Is that okay?

It's not reasonable behaviour, even though, as you mentioned, it works pretty much fine in stateless services behind a load balancer where clients know how to retry.

Implementations that are battle-tested in production perform a graceful shutdown.
The Tokio docs talk about graceful shutdown, and it's not hard to find people trying to do the same in other stacks (C++, Java, Go, from a quick search), and in Rust too (#1820).

A reference C++ implementation defines a Shutdown with a deadline argument to degrade into lame-duck mode before shutting down:

class ServerInterface : public internal::CallHook {
 public:
  ~ServerInterface() override {}

  template <class T>
  void Shutdown(const T& deadline);
  // ...
};

Docs for Shutdown(const T& deadline).

Granted, they offer a Shutdown() overload too; I don't know when you'd prefer it for anything other than convenience/sloppiness.


I guess the more interesting question is how to implement this. Tokio's article already shows that the service not only needs to stop serving new requests and finish the outstanding ones, but may also need to perform some application-specific steps. That is fine, but it suggests that the interface for announcing and executing shutdown might need some thought.

Similarly, we may want to temporarily enter lame-duck mode as a way to ease operations on services downstream.
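
A sketch of what that could look like with tonic-health alone (MyGreeterServer<MyGreeter> and do_maintenance are hypothetical stand-ins): it does not touch tonic's listener, it only steers health-checking clients away for a while.

use tonic_health::server::HealthReporter;

// A temporary lame-duck window: advertise NOT_SERVING so health-checking
// clients and load balancers drain us, do the disruptive work, then flip back.
// `MyGreeterServer<MyGreeter>` and `do_maintenance` are hypothetical.
async fn lameduck_for_maintenance(reporter: &mut HealthReporter) {
    reporter.set_not_serving::<MyGreeterServer<MyGreeter>>().await;
    do_maintenance().await;
    reporter.set_serving::<MyGreeterServer<MyGreeter>>().await;
}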

Dietr1ch commented

Well, it's also sort of blocked by the instability of the underlying ecosystem:

#1820 (comment)

I guess some issue de-duping is needed to keep things easy to find.
