DLX logs "Cannot forward any dead-letter messages from source quorum queue" #12626

timbuchwaldt · 2024-10-31T12:15:46Z

timbuchwaldt
Oct 31, 2024

Describe the bug

We see the following warning upon broker restart:
2024-10-31 10:43:15.411289+00:00 [warning] <0.3158.0> Cannot forward any dead-letter messages from source quorum queue 'input-pending-retry' in vhost 'my-vhost' with configured dead-letter-exchange exchange '' in vhost 'my-vhost' and configured dead-letter-routing-key 'input-pending'. This can happen either if the dead-letter routing topology is misconfigured (for example no queue bound to dead-letter-exchange or wrong dead-letter-routing-key configured) or if non-mirrored classic queues are bound whose host node is down. Fix this issue to prevent dead-lettered messages from piling up in the source quorum queue. This message will not be logged again.

This occurs as soon as a message gets stuck in that queue. Removing the policies that configure this behavior drops the previously dead-lettered messages (which only show up in the total count, not as ready or unacked), but the problem then recurs.

Reproduction steps

I was unable to reproduce this outside of the clusters that show this behaviour. I suspect this is connected to those clusters having lived quite a few days and seen a few upgrades.

Expected behavior

Dead-lettering works across upgrades

Additional context

Clusters of 3/5 nodes show this behavior, running the current 4.0.3 release.

timbuchwaldt · 2024-10-31T12:16:33Z

timbuchwaldt
Oct 31, 2024
Author

Potentially related: in some of the clusters, node restarts (with 3-node quorum queues and 5 brokers) don't go through, and the logs say that queues would shut down if certain nodes were taken offline.

0 replies

michaelklishin · 2024-10-31T15:02:38Z

michaelklishin
Oct 31, 2024
Maintainer

@timbuchwaldt we cannot help given the amount of information provided. We do not guess in this community and certainly won't guess what your topology looks like, roughly what your clients do and what else may be in the logs.

3 replies

timbuchwaldt Oct 31, 2024
Author

Sure. So what I can give you from here (non-work-phone etc):
It’s 2 quorum queues. The input-pending one dead-letters to input-pending-retry, and input-pending-retry dead-letters back to input-pending with a TTL. They go via the default exchange, so no complexities involved.
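
For readers following along, the topology described above can be sketched as two policy definitions. This is an illustration, not the reporter's actual configuration: the TTL value is made up, the helper names (`dlx_policy`, `unroutable_targets`) are hypothetical, and `dead-letter-strategy: at-least-once` with `overflow: reject-publish` is an assumption, made because the warning appears to come from the at-least-once dead-lettering path.

```python
import json

# Assumed value; the thread does not state the actual TTL.
RETRY_TTL_MS = 30_000


def dlx_policy(target_queue, ttl_ms=None):
    """Build a policy definition that dead-letters via the default exchange.

    With the default exchange (""), the routing key is the target queue name.
    """
    definition = {
        "dead-letter-exchange": "",
        "dead-letter-routing-key": target_queue,
        # Assumption: at-least-once dead lettering, which also needs
        # overflow=reject-publish on the source quorum queue.
        "dead-letter-strategy": "at-least-once",
        "overflow": "reject-publish",
    }
    if ttl_ms is not None:
        definition["message-ttl"] = ttl_ms
    return definition


# Source queue name -> policy definition applied to it.
policies = {
    "input-pending": dlx_policy("input-pending-retry"),
    "input-pending-retry": dlx_policy("input-pending", ttl_ms=RETRY_TTL_MS),
}


def unroutable_targets(policies, existing_queues):
    """Return source queues whose dead-letter target queue does not exist."""
    return [
        source
        for source, definition in policies.items()
        if definition["dead-letter-routing-key"] not in existing_queues
    ]


if __name__ == "__main__":
    print(json.dumps(policies, indent=2))
    print(unroutable_targets(policies, {"input-pending", "input-pending-retry"}))
```

The check at the end only covers the "no queue matches the dead-letter-routing-key" case named in the warning. It says nothing about whether the target queue is actually able to accept a publish.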

michaelklishin Oct 31, 2024
Maintainer

Logs from all nodes from the moment the upgrade was initiated.

kjnilsson Nov 1, 2024
Maintainer

at debug level ideally

michaelklishin · 2024-10-31T15:06:31Z

michaelklishin
Oct 31, 2024
Maintainer

The error message mentions a few possible scenarios. DLX could not do its job, for one reason or another. Maybe the topology was changed concurrently. Maybe a node was restarted and it was one of the effects of what #12412 seeks to address w.r.t. policy applications.

Dead lettering is not magic, it depends on routing topologies as any other publisher. And topologies can change during upgrades, in particular when classic queues are involved or policies are changed concurrently with node restarts.

4 replies

michaelklishin Oct 31, 2024
Maintainer

Another core team member specifically suspects "non-mirrored classic queues are bound whose host node is down".

In which case, the solution is to use a replicated queue type (a quorum queue) because non-replicated CQs can lose availability during upgrades by definition.

timbuchwaldt Oct 31, 2024
Author

There are no classic queues involved. The issue persists across multiple restarts and messages pile up during stable operations as well. The only oddity I notice is that some nodes refuse to shut down due to loss of quorum in a planned restart (we run on k8s), even though according to all visible output (management GUI / console), 2 servers would remain online, which should be enough to form a quorum on a queue with 3 nodes.

michaelklishin Oct 31, 2024
Maintainer

Any condition that prevented a quorum queue from accepting a publish would result in this message. DLX is not a client but it can face many of the same scenarios that can prevent publishers from publishing. In fact, in 3.12 or 3.13 it was reworked substantially specifically to make it possible to handle at least some errors vs. quietly failing.

kjnilsson Nov 1, 2024
Maintainer

> There are no classic queues involved. The issue persists across multiple restarts and messages pile up during stable operations as well. The only oddity I notice is that some nodes refuse to shut down due to loss of quorum in a planned restart (we run on k8s), even though according to all visible output (management GUI / console), 2 servers would remain online, which should be enough to form a quorum on a queue with 3 nodes.

Please share the output of rabbitmq-queues quorum_status for any quorum queues on the system. It's not enough to have enough rabbit nodes; you also need to ensure that each queue has enough members to span at least 3 nodes. It could be that some queues were declared before the cluster was fully formed and only have 2 members.
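
The point about member counts comes down to majority arithmetic. A minimal sketch, assuming a quorum queue needs a strict majority of its members online; the node names and member lists below are hypothetical stand-ins for what `rabbitmq-queues quorum_status <queue>` would report:

```python
def majority(member_count):
    """Smallest number of members that forms a strict majority."""
    return member_count // 2 + 1


def keeps_quorum(members, online_nodes, node_to_stop):
    """Check whether a queue still has a majority after stopping one node."""
    remaining = [m for m in members if m in online_nodes and m != node_to_stop]
    return len(remaining) >= majority(len(members))


# Hypothetical 5-node cluster with every node online.
online = {f"rabbit@node-{i}" for i in range(5)}

# A queue declared once the cluster was fully formed: 3 members.
three_members = ["rabbit@node-0", "rabbit@node-1", "rabbit@node-2"]
# A queue declared too early: only 2 members.
two_members = ["rabbit@node-0", "rabbit@node-1"]

if __name__ == "__main__":
    print(keeps_quorum(three_members, online, "rabbit@node-0"))
    print(keeps_quorum(two_members, online, "rabbit@node-0"))
    print(keeps_quorum(two_members, online, "rabbit@node-4"))
```

With 3 members, stopping one leaves 2, which is still a majority. With 2 members the majority is also 2, so stopping either member loses quorum even though four other cluster nodes stay online. That would be consistent with a node refusing to shut down while the cluster as a whole looks healthy.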
