Make manual failover reset the on-going election to promote failover #1274

enjoy-binbin · 2024-11-08T06:02:06Z

If a manual failover times out, for example because the election does
not get enough votes, a new manual failover cannot proceed on the
replica side, because of auth_timeout and auth_retry_time.

For example, if we initiate a new manual failover after an election has
timed out, the primary is paused, but on the replica side, because of
auth_retry_time, the replica does not trigger a new election, and the
manual failover eventually times out.

With this change, if a manual failover is initiated again while there
is an ongoing election, we reset the election so that the replica can
start a new election at the manual failover's request.

Signed-off-by: Binbin <[email protected]>
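
The stall described above can be modeled in a few lines of C. This is only a sketch under stated assumptions, not the code in src/cluster_legacy.c: `replica_state`, `can_schedule_election` and `manual_failover_ready` are hypothetical names, and the rule "a previous election blocks a new one until auth_retry_time has passed" is a simplification of the replica-side behavior the description refers to.

```c
#include <stdbool.h>

/* Hypothetical, simplified replica election state (not the real struct). */
typedef struct {
    long long failover_auth_time; /* ms timestamp of the last election, 0 = none */
    bool mf_can_start;            /* manual failover is ready to start an election */
} replica_state;

/* Assumed gate: a fresh election may be scheduled only if there is no
 * recorded election, or the recorded one is older than auth_retry_ms. */
static bool can_schedule_election(const replica_state *s, long long now_ms,
                                  long long auth_retry_ms) {
    if (s->failover_auth_time == 0) return true;
    return (now_ms - s->failover_auth_time) > auth_retry_ms;
}

/* Idea of the change: when a new manual failover becomes ready, discard
 * the recorded election so the gate above no longer blocks it. */
static void manual_failover_ready(replica_state *s) {
    s->failover_auth_time = 0;
    s->mf_can_start = true;
}
```

With an election recorded at t=1000 ms and an assumed retry time of 60000 ms, `can_schedule_election` is false at t=6000 ms; after `manual_failover_ready` it becomes true, which is the behavior the description asks for.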

codecov · 2024-11-08T06:21:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.70%. Comparing base (2df56d8) to head (97487bf).
Report is 4 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1274      +/-   ##
============================================
+ Coverage     70.69%   70.70%   +0.01%     
============================================
  Files           114      115       +1     
  Lines         63161    63160       -1     
============================================
+ Hits          44650    44656       +6     
+ Misses        18511    18504       -7

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.47% <100.00%> (+0.24%)`	⬆️

... and 21 files with indirect coverage changes

enjoy-binbin · 2024-11-08T06:46:33Z

A log demo from the test case (before the fix).

replica:

28295:S 08 Nov 2024 14:37:20.208 * Manual failover user request accepted (user request from 'id=4 addr=127.0.0.1:59705 laddr=127.0.0.1:21111 fd=16 name= age=11 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=15 multi-mem=0 rbs=1024 rbp=518 obl=0 oll=0 omem=0 tot-mem=1951 events=r cmd=cluster|failover user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=318 tot-net-out=3163 tot-cmds=7').
28295:S 08 Nov 2024 14:37:20.209 * Received replication offset for paused primary manual failover: 14
28295:S 08 Nov 2024 14:37:20.209 * All primary replication stream processed, manual failover can start.
28295:S 08 Nov 2024 14:37:20.209 * Start of election delayed for 0 milliseconds (rank #0, offset 14).
28295:S 08 Nov 2024 14:37:20.209 * Starting a failover election for epoch 4.
28295:S 08 Nov 2024 14:37:25.096 * Currently unable to failover: Waiting for votes, but majority still not reached.
28295:S 08 Nov 2024 14:37:25.096 * Needed quorum: 2. Number of votes received so far: 1
28295:S 08 Nov 2024 14:37:25.298 # Manual failover timed out.

# The second cluster failover: it timed out as well, because of auth_timeout and the need to wait for auth_retry_time
28295:S 08 Nov 2024 14:37:25.345 * Manual failover user request accepted (user request from 'id=4 addr=127.0.0.1:59705 laddr=127.0.0.1:21111 fd=16 name= age=16 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=15 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1951 events=r cmd=cluster|failover user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=349 tot-net-out=3168 tot-cmds=8').
28295:S 08 Nov 2024 14:37:25.346 * Received replication offset for paused primary manual failover: 14
28295:S 08 Nov 2024 14:37:25.346 * All primary replication stream processed, manual failover can start.
28295:S 08 Nov 2024 14:37:30.046 * Currently unable to failover: Waiting for votes, but majority still not reached.
28295:S 08 Nov 2024 14:37:30.046 * Needed quorum: 2. Number of votes received so far: 1
28295:S 08 Nov 2024 14:37:30.349 # Manual failover timed out.

primary:

28385:M 08 Nov 2024 14:37:20.208 * Manual failover requested by replica a31915be22368c4df57d2f17d58cc03f578e3149 ().
28385:M 08 Nov 2024 14:37:20.209 * Failover auth granted to a31915be22368c4df57d2f17d58cc03f578e3149 () for epoch 4
28385:M 08 Nov 2024 14:37:25.221 # Manual failover timed out.
28385:M 08 Nov 2024 14:37:25.346 * Manual failover requested by replica a31915be22368c4df57d2f17d58cc03f578e3149 ().
28385:M 08 Nov 2024 14:37:30.376 # Manual failover timed out.
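
The timestamps in the replica log can be checked against the retry window with a small calculation. The sketch below assumes the usual timing rules (auth_timeout is twice the node timeout with a 2000 ms floor, and auth_retry_time is twice auth_timeout) and a node timeout of 15000 ms; the node timeout actually used by the test is not shown in this thread, so treat that value as an assumption. The helper names are hypothetical.

```c
#include <stdbool.h>

/* Convert a log time of day (HH:MM:SS.mmm) into milliseconds. */
static long long tod_ms(int h, int m, int s, int ms) {
    return ((h * 60LL + m) * 60LL + s) * 1000LL + ms;
}

/* Assumed rule: auth_timeout = 2 * node_timeout, but at least 2000 ms. */
static long long auth_timeout_ms(long long node_timeout_ms) {
    long long t = node_timeout_ms * 2;
    return t < 2000 ? 2000 : t;
}

/* Assumed rule: auth_retry_time = 2 * auth_timeout. */
static long long auth_retry_time_ms(long long node_timeout_ms) {
    return auth_timeout_ms(node_timeout_ms) * 2;
}

/* True once the previous election is old enough for a retry. */
static bool retry_window_open(long long election_start_ms, long long now_ms,
                              long long node_timeout_ms) {
    return (now_ms - election_start_ms) > auth_retry_time_ms(node_timeout_ms);
}
```

The first election starts at 14:37:20.209 and the second manual failover is ready at 14:37:25.346, so the previous election is 5137 ms old. Under the assumed 15000 ms node timeout the retry time is 60000 ms, so `retry_window_open` is false and no "Starting a failover election" line appears for the second request.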

hpatro · 2024-11-08T19:16:43Z

src/cluster_legacy.c

-        server.cluster->mf_can_start = 1;
+        manualFailoverCanStart();


Should we rather invoke resetManualFailover and clean up failover_auth_time in that period?

We should anyway cleanup failover_auth_time in the resetManualFailover method.

I have thought about putting it in resetManualFailover, but I was worried about introducing other problems since resetManualFailover is called in many places.
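
To make the trade-off in this exchange concrete, here is a minimal sketch of the narrower option: clearing failover_auth_time only at the point where the manual failover becomes ready to start, and leaving the shared reset path untouched. The function names mirror the ones mentioned in the thread, but the struct and the function bodies are assumptions written for illustration, not the code in src/cluster_legacy.c.

```c
#include <stdbool.h>

/* Hypothetical subset of the cluster state touched by manual failover. */
typedef struct {
    long long failover_auth_time; /* ms timestamp of the last election, 0 = none */
    bool mf_can_start;            /* replica may start the manual election */
    long long mf_end;             /* manual failover deadline, 0 = none in progress */
} cluster_state;

/* Shared reset path, called from many places in this model.
 * Assumption: it deliberately leaves failover_auth_time alone. */
static void reset_manual_failover(cluster_state *c) {
    c->mf_end = 0;
    c->mf_can_start = false;
}

/* Narrow helper used only when a manual failover becomes ready.
 * Returns true if an ongoing election was discarded. */
static bool manual_failover_can_start(cluster_state *c) {
    bool discarded = c->failover_auth_time != 0;
    c->failover_auth_time = 0; /* let the replica start a fresh election */
    c->mf_can_start = true;
    return discarded;
}
```

In this model the election timestamp changes only on the manual failover path, so the other callers of the reset function keep their existing behavior, which is the concern raised in the reply above.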

enjoy-binbin · 2024-11-11T14:45:55Z

tests/unit/cluster/manual-failover.tcl

+        R 3 cluster failover
+
+        # Waiting for primary and replica to confirm manual failover timeout.
+        wait_for_log_messages 0 {"*Manual failover timed out*"} 0 1000 50


I forgot to mention here: this wait takes at least 5 seconds, since the manual failover timeout is 5s. If needed, I can replace it with a check on another field, such as the epoch.

madolson

Seems reasonable to me.

madolson · 2024-11-14T05:55:20Z

tests/unit/cluster/manual-failover.tcl

+        R 3 cluster failover
+
+        # Waiting for primary and replica to confirm manual failover timeout.
+        wait_for_log_messages 0 {"*Manual failover timed out*"} 0 1000 50


Using log messages should be OK.

madolson · 2024-11-14T05:55:48Z

tests/unit/cluster/manual-failover.tcl

+        R 1 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_NONE
+        R 2 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_NONE


I really like the usage of these constants, we should probably do it more to improve readability in tests.

zuiderkwast

Not a full review. The idea looks good.

enjoy-binbin requested a review from PingXie November 8, 2024 06:02

fix typo

0595cd0

Signed-off-by: Binbin <[email protected]>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Nov 8, 2024

hpatro reviewed Nov 8, 2024

enjoy-binbin requested review from zuiderkwast and madolson November 11, 2024 05:27

enjoy-binbin added 2 commits November 11, 2024 22:38

Merge remote-tracking branch 'upstream/unstable' into manual_failover…

de98db8

…_reset Signed-off-by: Binbin <[email protected]>

Fix format

6c37c58

Signed-off-by: Binbin <[email protected]>

enjoy-binbin commented Nov 11, 2024

madolson approved these changes Nov 14, 2024

update name

97487bf

Signed-off-by: Binbin <[email protected]>

zuiderkwast approved these changes Nov 17, 2024
