[Questions] RabbitMQ broker/node connection draining instantly closes all connections without grace period. #14574

oskar-wicht · 2025-09-19T11:50:00Z


Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.0.9

Erlang version used

27.3.x

Operating system (distribution) used

debian-12-r1

How is RabbitMQ deployed?

Community Docker image

Logs from node 1 (with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

2025-09-19 10:23:19.208367+02:00 [warning] <0.2819.0> This node is being put into maintenance (drain) mode
2025-09-19 10:23:19.208525+02:00 [info] <0.2819.0> Marked this node as undergoing maintenance
2025-09-19 10:23:19.208553+02:00 [info] <0.2819.0> Asked to suspend 2 client connection listeners. No new client connections will be accepted until these listeners are resumed!
2025-09-19 10:23:19.208858+02:00 [warning] <0.2819.0> Suspended all listeners and will no longer accept client connections
2025-09-19 10:23:19.208914+02:00 [info] <0.2819.0> Closing connection <0.2736.0> because "Node was put into maintenance mode"
2025-09-19 10:23:19.209026+02:00 [error] <0.2736.0> Error on AMQP connection <0.2736.0> ([::1]:52913 -> [::1]:5672 - my-connection-name, vhost: '/', user: 'guest', state: running), channel 0:
2025-09-19 10:23:19.209026+02:00 [error] <0.2736.0>  operation none caused a connection exception connection_forced: "Node was put into maintenance mode"
2025-09-19 10:23:19.209417+02:00 [warning] <0.2819.0> Closed 1 local client connections
2025-09-19 10:23:19.209500+02:00 [warning] <0.2819.0> Skipping leadership transfer of quorum queues: no candidate (online, not under maintenance) nodes to transfer to!
2025-09-19 10:23:19.209565+02:00 [info] <0.2819.0> Will stop local follower replicas of 0 quorum queues on this node
2025-09-19 10:23:19.209596+02:00 [info] <0.2819.0> Stopped all local replicas of quorum queues hosted on this node
2025-09-19 10:23:19.209621+02:00 [info] <0.2819.0> Will transfer leadership of metadata store with current leader on this node
2025-09-19 10:23:19.209652+02:00 [warning] <0.2819.0> Skipping leadership transfer of metadata store: no candidate (online, not under maintenance) nodes to transfer to!
2025-09-19 10:23:19.209709+02:00 [warning] <0.2819.0> Skipping leadership transfer of metadata store: ok
2025-09-19 10:23:19.209756+02:00 [info] <0.2819.0> Node is ready to be shut down for maintenance or upgrade
2025-09-19 10:23:19.219887+02:00 [info] <0.2736.0> closing AMQP connection ([::1]:52913 -> [::1]:5672 - my-connection-name, vhost: '/', user: 'guest', duration: '18s')

Steps to deploy RabbitMQ cluster

not applicable

Steps to reproduce the behavior in question

1. Start the server
2. Set up a queue
3. Set up a client
4. Start consuming from the queue
5. Put the node into maintenance mode with rabbitmq-upgrade drain

Application code

I'm using the Java client 5.25.0. These are the logs from the client side of our custom Java application.

2025-09-19 10:32:27.595 UTC INFO [consumer-2] MyConsumer: MyConsumer was shutdown, consumerTag: 'amq.ctag-123456', message: 'connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - Node was put into maintenance mode, class-id=0, method-id=0)'
2025-09-19 10:32:27.600 UTC WARN [AMQP Connection 0:0:0:0:0:0:0:1:5672] ForgivingExceptionHandler: An unexpected connection driver error occurred (Exception message: Connection reset)
2025-09-19 10:32:27.602 UTC WARN [AMQP Connection 0:0:0:0:0:0:0:1:5672] LoggingShutdownListener: RabbitMQ connection closed by broker. Connection name: my-connection:localhost/0:0:0:0:0:0:0:1. Server: 5672.
2025-09-19 10:32:32.606 UTC WARN [AMQP Connection 0:0:0:0:0:0:0:1:5672] RabbitMqConfiguration: handleRecoveryStarted: com.rabbitmq.client.impl.recovery.AutorecoveringConnection

What problem are you trying to solve?

My assumption about connection draining, and more generally about upgrading server nodes in a clustered scenario, is the following.
When a server node (in this case the RabbitMQ node a client is connected to) is about to undergo maintenance, it should:

1. Stop accepting new incoming connections
2. Send a message to all connected clients that it is about to undergo maintenance
3. Wait for a certain period so that all clients have time to connect to another node
4. Close all connections (gracefully, TCP RST only as a last resort)

Judging by the server logs and by what I can see in our application, step 3 (waiting for the clients to reconnect) basically does not exist. Once the consumer has been cancelled, the connection is forcefully closed with a TCP RST almost instantly.

My questions:

Shouldn't there be a grace period for clients to reconnect to other nodes before the consumers are cancelled and/or the connection is forcefully closed?
The official documentation on Failure Detection and Recovery Limitation states:

When a connection is in the recovering state, any publishes attempted on its channels will be rejected with an exception. The client currently does not perform any internal buffering of such outgoing messages. It is an application developer's responsibility to keep track of such messages and republish them when recovery succeeds.

Shouldn't the client have some mechanism to buffer messages while a node undergoes graceful/planned maintenance, until the connection is recovered?

michaelklishin · 2025-09-19T16:25:22Z

Maintainer

@oskar-wicht "connection draining" is not a term used anywhere in our docs. I assume you mean Maintenance mode.

None of the protocols RabbitMQ supports has a provision for such "shutdown notifications". A connection to a node can fail at any moment, which is why explicit confirmations for both consumers and (less commonly) publishers exist in every protocol. All outstanding deliveries are automatically requeued.
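
To make the role of publisher confirms concrete, here is a minimal Java sketch of the bookkeeping an application might keep for messages that were published but not yet confirmed. `OutstandingConfirms` and its methods are hypothetical names, not part of the Java client. In a real application the sequence numbers would presumably come from `Channel.getNextPublishSeqNo()` and the acknowledgements from a confirm listener registered on the channel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Tracks published-but-unconfirmed messages by sequence number (hypothetical helper). */
final class OutstandingConfirms {
    private final ConcurrentNavigableMap<Long, byte[]> pending = new ConcurrentSkipListMap<>();

    /** Record a message right before it is published. */
    void track(long seqNo, byte[] body) {
        pending.put(seqNo, body);
    }

    /** Handle a confirm; with multiple=true it covers every sequence number up to and including seqNo. */
    void ack(long seqNo, boolean multiple) {
        if (multiple) {
            pending.headMap(seqNo, true).clear();
        } else {
            pending.remove(seqNo);
        }
    }

    /** After a connection loss: remove and return everything still unconfirmed, oldest first. */
    List<byte[]> drainForRepublish() {
        List<byte[]> out = new ArrayList<>();
        Map.Entry<Long, byte[]> entry;
        while ((entry = pending.pollFirstEntry()) != null) {
            out.add(entry.getValue());
        }
        return out;
    }
}
```

Whatever `drainForRepublish()` returns after recovery is a candidate for republishing. A message may have reached the broker before its confirm was lost, so consumers should tolerate duplicates.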

If a node is stopped, the assumption is that a client reconnects to another node. So what such a "shutdown advisory" would really add is less than obvious to me.

In any case, you are welcome to go ahead and try to contribute a solution to the following:

When a connection is in the recovering state, any publishes attempted on its channels will be
rejected with an exception. The client currently does not perform any internal buffering of such
outgoing messages. It is an application developer's responsibility to keep track of such messages
and republish them when recovery succeeds.

What you will find out, as our team did many years ago, is that your client will need local on-disk storage, which is not always an option these days (many applications get around 50 MiB of disk space, if not less).

So such an "accumulating publisher" will easily run out of memory or disk space, and may not even have a reasonable amount of disk space to begin with. You can develop a specialized library with certain assumptions about the deployment environment (it will be a fair amount of effort), but not a general solution in a client library for everyone to use.
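
The resource concern can be illustrated with a small in-memory sketch. It assumes a fixed byte budget chosen by the application. `BoundedPublishBuffer` is a hypothetical helper, not something the Java client provides, and it deliberately refuses messages once the budget is exhausted instead of growing without bound.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory buffer for messages produced while the connection is recovering (hypothetical helper). */
final class BoundedPublishBuffer {
    private final long maxBytes;
    private final Deque<byte[]> queue = new ArrayDeque<>();
    private long usedBytes;

    BoundedPublishBuffer(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns false when accepting the message would exceed the byte budget. */
    synchronized boolean offer(byte[] body) {
        if (usedBytes + body.length > maxBytes) {
            return false; // the caller must decide: drop, block, or fail the publish
        }
        queue.addLast(body);
        usedBytes += body.length;
        return true;
    }

    /** Removes and returns the oldest buffered message, or null if the buffer is empty. */
    synchronized byte[] poll() {
        byte[] body = queue.pollFirst();
        if (body != null) {
            usedBytes -= body.length;
        }
        return body;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

The `false` branch of `offer` is where the difficulty sits: once the budget is used up, the application still has to choose between dropping, blocking, or failing, and no single choice suits every deployment.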

That said, if someone wants to investigate what a solution even in just one client (out of dozens for AMQP 0-9-1 and AMQP 1.0 alone) might look like, they are welcome to do it.

Asking "should there be…" is not how change happens in open source software.

1 reply

oskar-wicht Sep 22, 2025
Author

@michaelklishin First of all, many thanks for the comprehensive and fast reply. I really appreciate it.
Secondly, please excuse the "Shouldn't there be..." wording; I didn't mean to offend or to push directly for change. I'll make sure to rephrase in the future. The questions were meant as a way to find out whether the feature I'm looking for doesn't exist or whether I just haven't found the corresponding documentation yet.

I assume you mean Maintenance mode.

Yes, this is what I meant. With "draining" I was referring to the rabbitmq-upgrade drain CLI command.

If a node is stopped, the assumption is that a client reconnects to another node. So what such a "shutdown advisory" would really add is less than obvious to me.

My train of thought was that once the client receives an "I'm about to enter maintenance mode" signal from the server, it opens an additional connection to another node. Once that connection has been established, it flips a switch and starts publishing over the new connection. From my point of view, the benefit would be an improved quality of service during scheduled/graceful node shutdowns without implementing some sort of internal buffering. This would only apply to graceful shutdowns and would only make sense if the node allows for a certain grace period between sending this signal and actually closing the connections.
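
The "flip a switch" idea can be sketched on the client side in a few lines of Java. This is an illustration only: as noted in the reply above, none of the protocols RabbitMQ supports carries such a signal, so whatever would trigger `switchTo` is an assumption, and `Publisher` and `SwitchablePublisher` are hypothetical types rather than Java client classes.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical abstraction over "something that can publish"; not a Java client type. */
interface Publisher {
    void publish(byte[] body);
}

/** Delegates to whichever publisher is currently active and allows an atomic switch-over. */
final class SwitchablePublisher implements Publisher {
    private final AtomicReference<Publisher> active;

    SwitchablePublisher(Publisher initial) {
        this.active = new AtomicReference<>(initial);
    }

    /**
     * Call once a connection to another node is established.
     * Returns the previous publisher so the caller can close its connection.
     */
    Publisher switchTo(Publisher replacement) {
        return active.getAndSet(replacement);
    }

    @Override
    public void publish(byte[] body) {
        active.get().publish(body);
    }
}
```

The switch itself is cheap. What the sketch leaves open is the part discussed in this thread: something has to tell the client, early enough, that the current node is about to go away.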

So such an "accumulating publisher" will easily run out of memory or disk space [...]

That thought also crossed my mind, but maybe I underestimated the additional disk space requirements and the complexity it adds. Thanks for explaining the reasons behind it.

That said, if someone wants to investigate what a solution even in just one client (out of dozens for AMQP 0-9-1 and AMQP 1.0 alone) might look like, they are welcome to do it.

While I appreciate the invitation, as of now, I have enough on my plate ;)
