RabbitMQ 3.8 cluster restart problem #2416
-
RabbitMQ server: 3.8.5. If we shut down the Kubernetes nodes (one by one) and bring the cluster back (again one by one), the RabbitMQ 3.8.5 cluster (of 3 nodes) doesn't start, failing with the error below:
But the same exercise done with RabbitMQ 3.7.13 doesn't have any issues: the cluster comes up fine after some 2-3 automatic restarts by Kubernetes on failure. As per one of the mailing lists, using force_boot does bring the cluster back, but we were expecting the cluster to come up without any intervention, as it does in 3.7.13. Let me know if more info is needed. Thanks.
Replies: 2 comments 2 replies
-
Use one of the basic health checks available today for `readinessProbe`, and never `rabbitmq-diagnostics node_health_check`.
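For illustration, a minimal sketch of what such a probe could look like on the RabbitMQ container, assuming `rabbitmq-diagnostics -q ping` as the basic check; the probe shape and timing values below are illustrative assumptions, not taken from this thread:

```yaml
# Sketch of a readinessProbe for the RabbitMQ container in a StatefulSet.
# `rabbitmq-diagnostics -q ping` is one of the basic checks: it only verifies
# that the node's runtime is up and responding, so it does not require a
# fully booted node and will not block the deployment of the remaining pods.
readinessProbe:
  exec:
    command: ["rabbitmq-diagnostics", "-q", "ping"]
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
```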
-
This has been discussed several times before. In our peer discovery example we use a check that does not require a node to be fully booted, which means it will not block the deployment of other pods.
3.7.13 does not have feature flags, so those nodes cannot hit what is effectively a timeout exception. `failed_waiting_for_tables` means that no peers came online within a given window of time. On Kubernetes specifically, there is a well-known scenario where a `readinessProbe` that assumes a fully booted node, such as the deprecated `rabbitmq-diagnostics node_health_check`, will deadlock the deployment: other nodes will never come online until the currently deployed one finishes booting, while the current node waits for peers to be around before it considers its boot successful.
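As a rough illustration of the "given window of time" mentioned above: the table-loading wait is controlled by a retry limit and a per-attempt timeout in rabbitmq.conf. The sketch below shows one hypothetical way to surface them via a Kubernetes ConfigMap; the key names and default values shown should be verified against the clustering guide for your RabbitMQ version.

```yaml
# Hypothetical ConfigMap (name and mount details are illustrative) carrying a
# rabbitmq.conf that makes the table-loading window explicit. With values like
# these, a restarting node retries waiting for its peers' tables 10 times,
# 30 seconds per attempt, before failing with failed_waiting_for_tables.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
data:
  rabbitmq.conf: |
    # number of attempts to wait for peer tables during boot
    mnesia_table_loading_retry_limit = 10
    # how long each attempt waits, in milliseconds
    mnesia_table_loading_retry_timeout = 30000
```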