RabbitMQ 3.8 cluster restart problem #2416
-
RabbitMQ server: 3.8.5. If we shut down the Kubernetes nodes (one by one) and bring the cluster back (again one by one), the RabbitMQ 3.8.5 cluster (of 3 nodes) doesn't start, failing with the error below:
But the same exercise done with RabbitMQ 3.7.13 doesn't have any issues: the cluster comes up fine after some 2-3 automatic restarts by Kubernetes on failure. As per one of the mailing lists, using force_boot does bring the cluster back, but we were expecting the cluster to come up without any intervention, as it does in 3.7.13. Let me know if more info is needed. Thanks.
Replies: 2 comments 2 replies
-
Use one of the basic health checks available today for `readinessProbe`, and never `rabbitmq-diagnostics node_health_check`.
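For illustration, a minimal sketch of what such a probe could look like on the RabbitMQ container, assuming `rabbitmq-diagnostics -q ping` as the basic check; the probe shape and timing values below are illustrative assumptions, not taken from this thread:

```yaml
# Sketch of a readinessProbe for the RabbitMQ container in a StatefulSet.
# `rabbitmq-diagnostics -q ping` is one of the basic checks: it only verifies
# that the node's runtime is up and responding, so it does not require a
# fully booted node and will not block the deployment of the remaining pods.
readinessProbe:
  exec:
    command: ["rabbitmq-diagnostics", "-q", "ping"]
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
```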
-
This has been discussed several times before. In our peer discovery example we use a check that does not require a node to be fully booted, which means it will not block the deployment of other pods.
3.7.13 does not have feature flags, so those nodes cannot hit what is effectively a timeout exception. `failed_waiting_for_tables` means that no peers came online within a given window of time. On Kubernetes specifically, there is a well-known scenario where a `readinessProbe` that assumes a fully booted node, such as the deprecated `rabbitmq-diagnostics node_health_check`, will deadlock the deployment: other nodes will never come online until the currently deployed one finishes booting, while the current node waits for peers to be around before it considers its boot successful.
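As a rough illustration of the "given window of time" mentioned above: the table-loading wait is controlled by a retry limit and a per-attempt timeout in rabbitmq.conf. The sketch below shows one hypothetical way to surface them via a Kubernetes ConfigMap; the key names and default values shown should be verified against the clustering guide for your RabbitMQ version.

```yaml
# Hypothetical ConfigMap (name and mount details are illustrative) carrying a
# rabbitmq.conf that makes the table-loading window explicit. With values like
# these, a restarting node retries waiting for its peers' tables 10 times,
# 30 seconds per attempt, before failing with failed_waiting_for_tables.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  rabbitmq.conf: |
    # number of attempts to wait for peer tables during boot
    mnesia_table_loading_retry_limit = 10
    # how long each attempt waits, in milliseconds
    mnesia_table_loading_retry_timeout = 30000
```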