Manual failover vote is not limited by two times the node timeout #1305

enjoy-binbin · 2024-11-14T17:06:09Z

The limit of two times the node timeout between votes should not restrict
manual failover; otherwise, in some scenarios, manual failover will time out.

For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
are lost during a manual failover, a node that has already voted cannot
vote in the second manual failover. Or, in a mixed scenario of plain
failover and manual failover, it cannot vote for the subsequent manual failover.
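
As a rough illustration of the rule described above, here is a minimal sketch of the voter-side check. The struct, field names, and function name are hypothetical and are not taken from src/cluster_legacy.c; the sketch models only the two-times-node-timeout limit and ignores the other conditions a node checks before granting a vote.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified voter-side state (names are assumptions,
 * not the real cluster structures). */
typedef struct {
    int64_t last_vote_ms;    /* when this node last voted for a replica of the primary */
    int64_t node_timeout_ms; /* configured node timeout, in milliseconds */
} voter_state;

/* Returns true if the rate limit allows a vote at time now_ms.
 * A manual failover request bypasses the limit; a plain failover
 * request must wait until two node timeouts have elapsed. */
static bool may_vote(const voter_state *v, int64_t now_ms, bool manual_failover) {
    if (manual_failover) return true;
    return now_ms - v->last_vote_ms >= 2 * v->node_timeout_ms;
}
```

With a 15000 ms node timeout and a vote cast at t = 1000 ms, a plain failover request at t = 2000 ms would be refused under this sketch, while a manual failover request at the same time would be allowed.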

The problem with retrying a manual failover is that each manual failover
pauses clients on the primary side for 5s. So retrying every time a manual
failover times out is a bad move.

Signed-off-by: Binbin <[email protected]>

codecov · 2024-11-14T17:21:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.68%. Comparing base (32f7541) to head (d0bd282).
Report is 6 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1305      +/-   ##
============================================
- Coverage     70.69%   70.68%   -0.01%     
============================================
  Files           115      115              
  Lines         63153    63158       +5     
============================================
+ Hits          44643    44646       +3     
- Misses        18510    18512       +2

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.39% <100.00%> (+0.20%)`	⬆️

... and 13 files with indirect coverage changes

madolson · 2024-11-14T18:12:36Z

> For example, if some FAILOVER_AUTH_REQUESTs or some FAILOVER_AUTH_ACKs
> are lost during a manual failover, it cannot vote in the second manual
> failover. Or in a mixed scenario of plain failover and manual failover,
> it cannot vote for the subsequent manual failover.

I'm not sure I agree with this. I think there should be some timeout built into the system, and you should retry.

enjoy-binbin · 2024-11-15T00:21:05Z

The problem with retrying a manual failover is that each manual failover pauses clients on the primary side for 5s. So retrying every time a manual failover times out is a bad move.

zuiderkwast

> The problem with the manual failover retry is that the mf will pause the client 5s in the primary side. So every retry every manual failover timed out is a bad move

Yes, pausing clients for a long time is a bad move. ♟️ ❌

We already had this comment:

/* We did not voted for a replica about this primary for two
 * times the node timeout. This is not strictly needed for correctness
 * of the algorithm but makes the base case more linear. */

Hm, "not strictly needed for correctness" means that it's OK to change it. It doesn't affect correctness. You added this to the same comment:

 * This limitation does not restrict manual failover. If a user initiates
 * a manual failover, we need to allow it to vote, otherwise the manual
 * failover may time out. */

I think it's safe. I like the fix. The test cases look good too. Just some nits.

tests/unit/cluster/manual-failover.tcl

Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Binbin <[email protected]>

zuiderkwast

Looks good. I added a comment about a log message, because you touched it. :) We don't have to fix it though.

src/cluster_legacy.c

Signed-off-by: Binbin <[email protected]>

enjoy-binbin requested review from madolson, zuiderkwast and PingXie November 14, 2024 17:07

zuiderkwast reviewed Nov 15, 2024

Apply suggestions from code review

3503d11

Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Binbin <[email protected]>

zuiderkwast approved these changes Nov 16, 2024

update the log message to mention the replicaof

d0bd282

Signed-off-by: Binbin <[email protected]>

enjoy-binbin added the run-extra-tests label (Run extra tests on this PR; runs all tests from daily except valgrind and RESP) Nov 17, 2024
