Skip to content

Conversation

@bbpetukhov
Copy link
Collaborator

@bbpetukhov bbpetukhov commented Oct 24, 2025

Describe your changes
Initial intention was to add a new switch to ITs which restarts cluster (having multiple nodes) in FSM mode running only single node, then set quorum to 1 to get cluster into heathy state having only one node up.

After code inspection it was found out that there are 4 quorums in the code which are not in sync. Thus this PR proposes a new class Cluster Quorum Manager which stores and provides thread safe Quorum variable for other components.

There is a small change in behavior how quorum from config file is treated. Previously even if in config a big value is set for quorum, it is reset to number of nodes. However quorum can be set to any value via admin commands. I refactored the behavior to be universal in both cases -- there is no more restriction on max quorum values.

Testing performed

  • UTs passed
  • ITs passed

@bbpetukhov bbpetukhov marked this pull request as ready for review October 24, 2025 16:22
@bbpetukhov bbpetukhov requested a review from a team as a code owner October 24, 2025 16:22
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Copy link
Collaborator

@kaikulimu kaikulimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed all the non-Python files. Please fix the IT which is failing.

Also, please take a look at #1006 which you can use in this PR.

bbpetukhov and others added 11 commits November 3, 2025 14:44
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Co-authored-by: Yuan Jing Vincent Yan <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
Signed-off-by: Dmitrii Petukhov <[email protected]>
@bbpetukhov bbpetukhov assigned kaikulimu and unassigned bbpetukhov Nov 7, 2025
@kaikulimu kaikulimu assigned bbpetukhov and unassigned kaikulimu Nov 7, 2025
@kaikulimu
Copy link
Collaborator

@bbpetukhov ITs failing. Please fix.

@bbpetukhov bbpetukhov force-pushed the it-legacy-to-fsm-switch-via-quorum1 branch 2 times, most recently from 8d16674 to f58e04b Compare November 13, 2025 09:17
@bbpetukhov bbpetukhov force-pushed the it-legacy-to-fsm-switch-via-quorum1 branch from f58e04b to 488bd6d Compare November 13, 2025 10:30
@bbpetukhov
Copy link
Collaborator Author

@bbpetukhov ITs failing. Please fix.

Fixed

@bbpetukhov bbpetukhov assigned kaikulimu and unassigned bbpetukhov Nov 13, 2025
Signed-off-by: Dmitrii Petukhov <[email protected]>
Copy link
Collaborator

@kaikulimu kaikulimu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is very close to being mergeable. Great job!

Comment on lines 663 to 669
# (restart_as_fsm_mode, restart_as_legacy_mode),
# (restart_as_legacy_mode, restart_as_fsm_mode),
(restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode),
# (
# restart_to_fsm_single_node_with_quorum_one_and_start_others,
# restart_as_legacy_mode,
# ),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# (restart_as_fsm_mode, restart_as_legacy_mode),
# (restart_as_legacy_mode, restart_as_fsm_mode),
(restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode),
# (
# restart_to_fsm_single_node_with_quorum_one_and_start_others,
# restart_as_legacy_mode,
# ),
(restart_as_fsm_mode, restart_as_legacy_mode),
(restart_as_legacy_mode, restart_as_fsm_mode),
(restart_to_fsm_single_node_with_quorum_one, restart_as_legacy_mode),
(restart_to_fsm_single_node_with_quorum_one_and_start_others, restart_as_legacy_mode),

We do need to test all these scenarios

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, now in the GitHub Actions UI, the test name shows as test_restart_between_modes.py::test_restart_between_Legacy_and_FSM[single_node_fsm-strong_consistency-switch_cluster_mode0]. switch_cluster_mode0 is not very descriptive as we do not know what 0 means. Is there a way to name the test as test_restart_between_modes.py::test_restart_between_Legacy_and_FSM[single_node_fsm-strong_consistency-restart_to_fsm_single_node_with_quorum_one] for example?

You can ask @jll63 for help if needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting the switches was obviously an oversight!

After I enabled the switches, I noticed an interesting behavior in the last switch: "AssertionError: Node east2 does not have 2 messages for queue bmq://bmq.test.mmap.priority/qqq0 in partition 0 at storage layer"

The check failed which verifies that all new messages are stored in the partition of all nodes. It fails both in Eventual and Strong consistency modes. https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L409

It happens in the scenario where cluster started in FSM mode with only single node up at first, set quorum to 1, wait until cluster gets ready, and only then start other nodes.
https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L172C5-L172C64

Maybe this is a bug in broker?

Comment on lines 795 to 797
# 4. POST MORE MESSAGES
for queue in [priority_queue, fanout_queue]:
postFewMessages(producer, queue, ["msg4"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we should also add/remove appIds in between posting messages, similar to what we did in the first half.

Copy link
Collaborator Author

@bbpetukhov bbpetukhov Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented add/remove more appIds become switching back to backup mode: https://github.com/bbpetukhov/blazingmq/blob/it-legacy-to-fsm-switch-via-quorum1/src/integration-tests/test_restart_between_modes.py#L801

3 switches out of 4 passed, but there is one which fails - "FSM mode with single node and quorum 1. Restart back to Legacy mode". On restarting cluster back in Legacy mode, "east2" node fails to start with error:

INFO  mqba_application.cpp:538 STARTING bmqbrkr [pid: 165205]
INFO  mqba_application.cpp:538 Reading broker configuration from etc/bmqbrkrcfg.json
INFO  m_bmqbrkr_task.cpp:314 Memory allocation limit set to 32.00 GB
INFO  bmqproc.py:130 PANIC [STARTUP] (-1): Failed to initialize task [rc: -4, reason: 'failed to start pipe control channel (rc: 1)']
ERROR balb_pipecontrolchannel.cpp:455 Named pipe './/bmqbrkr.ctl' is already in use by another process
INFO  bmqtsk_logcleaner.cpp:88  Cleaning files matching 'logs/logs.*.*' older than 04NOV2025_18:06:28.964093.
ERROR balb_pipecontrolchannel.cpp:608 Unable to create named pipe './/bmqbrkr.ctl'
INFO  bmqtsk_logcleaner.cpp:122 Cleaning files matching 'logs/logs.*.*' completed, 0 file(s) deleted.

@kaikulimu kaikulimu assigned bbpetukhov and unassigned kaikulimu Nov 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants