-
Notifications
You must be signed in to change notification settings - Fork 155
Feat[MQB]: Thread safe Cluster Quorum Manager and ITs #1001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feat[MQB]: Thread safe Cluster Quorum Manager and ITs #1001
Conversation
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
…ed quorum variable Signed-off-by: Dmitrii Petukhov <[email protected]>
… quorum=1 and then start other nodes Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
2c31d64 to
28ecbc3
Compare
kaikulimu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reviewed all the non-Python files. Please fix the IT which is failing.
Also, please take a look at #1006 which you can use in this PR.
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]> Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
|
@bbpetukhov ITs failing. Please fix. |
…f intentionally Signed-off-by: Dmitrii Petukhov <[email protected]>
8d16674 to
f58e04b
Compare
…e-node scenarios Signed-off-by: Dmitrii Petukhov <[email protected]>
f58e04b to
488bd6d
Compare
Fixed |
Signed-off-by: Dmitrii Petukhov <[email protected]>
kaikulimu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is very close to being mergeable. Great job!
| # (restart_as_fsm_mode, restart_as_legacy_mode), | ||
| # (restart_as_legacy_mode, restart_as_fsm_mode), | ||
| (restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode), | ||
| # ( | ||
| # restart_to_fsm_single_node_with_quorum_one_and_start_others, | ||
| # restart_as_legacy_mode, | ||
| # ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # (restart_as_fsm_mode, restart_as_legacy_mode), | |
| # (restart_as_legacy_mode, restart_as_fsm_mode), | |
| (restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode), | |
| # ( | |
| # restart_to_fsm_single_node_with_quorum_one_and_start_others, | |
| # restart_as_legacy_mode, | |
| # ), | |
| (restart_as_fsm_mode, restart_as_legacy_mode), | |
| (restart_as_legacy_mode, restart_as_fsm_mode), | |
| (restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode), | |
| (restart_to_fsm_single_node_with_quorum_one_and_start_others, restart_as_legacy_mode), |
We do need to test all these scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, now in the GitHub Actions UI, the test name shows as test_restart_between_modes.py::test_restart_between_Legacy_and_FSM[single_node_fsm-strong_consistency-switch_cluster_mode0]. switch_cluster_mode0 is not very descriptive as we do not know what 0 means. Is there a way to name the test as test_restart_between_modes.py::test_restart_between_Legacy_and_FSM[single_node_fsm-strong_consistency-restart_to_fsm_single_node_with_quorum_one] for example?
You can ask @jll63 for help if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting the switches was obviously an oversight!
After I enabled the switches, I noticed an interesting behavior in the last switch: "AssertionError: Node east2 does not have 2 messages for queue bmq://bmq.test.mmap.priority/qqq0 in partition 0 at storage layer"
The check failed which verifies that all new messages are stored in the partition of all nodes. It fails both in Eventual and Strong consistency modes. https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L409
It happens in the scenario where cluster started in FSM mode with only single node up at first, set quorum to 1, wait until cluster gets ready, and only then start other nodes.
https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L172C5-L172C64
Maybe this is a bug in broker?
| # 4. POST MORE MESSAGES | ||
| for queue in [priority_queue, fanout_queue]: | ||
| postFewMessages(producer, queue, ["msg4"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we should also add/remove appIds in between posting messages, similar to what we did in the first half.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented add/remove more appIds become switching back to backup mode: https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L801
3 switches out of 4 passed, but there is one which fails - "FSM mode with single node and quorum 1. Restart back to Legacy mode". On restarting cluster back in Legacy mode, "east2" node fails to start with error:
INFO mqba_application.cpp:538 STARTING bmqbrkr [pid: 165205]
INFO mqba_application.cpp:538 Reading broker configuration from etc/bmqbrkrcfg.json
INFO m_bmqbrkr_task.cpp:314 Memory allocation limit set to 32.00 GB
INFO bmqproc.py:130 PANIC [STARTUP] (-1): Failed to initialize task [rc: -4, reason: 'failed to start pipe control channel (rc: 1)']
ERROR balb_pipecontrolchannel.cpp:455 Named pipe './/bmqbrkr.ctl' is already in use by another process
INFO bmqtsk_logcleaner.cpp:88 Cleaning files matching 'logs/logs.*.*' older than 04NOV2025_18:06:28.964093.
ERROR balb_pipecontrolchannel.cpp:608 Unable to create named pipe './/bmqbrkr.ctl'
INFO bmqtsk_logcleaner.cpp:122 Cleaning files matching 'logs/logs.*.*' completed, 0 file(s) deleted.
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Describe your changes
Initial intention was to add a new switch to ITs which restarts cluster (having multiple nodes) in FSM mode running only single node, then set quorum to 1 to get cluster into heathy state having only one node up.
After code inspection it was found out that there are 4 quorums in the code which are not in sync. Thus this PR proposes a new class Cluster Quorum Manager which stores and provides thread safe Quorum variable for other components.
There is a small change in behavior how quorum from config file is treated. Previously even if in config a big value is set for quorum, it is reset to number of nodes. However quorum can be set to any value via admin commands. I refactored the behavior to be universal in both cases -- there is no more restriction on max quorum values.
Testing performed