Description
I have a cluster with Kubegres deployed, running a primary and 2 replicas. Unfortunately our network is rather bad, so we see the following problem a lot, and it takes a decent amount of manual intervention to fix (I can fix it via methods not specified here). I'd like to know the right way, or an easier way, to fix it. Unfortunately I cannot provide logs.
Scenario: There is a network or NFS outage, and the outage continues for a while. The primary dies in some capacity, the database rolls over a few times, and eventually we wind up in a state where we have a dead primary complaining about a bad timeline segment and a replica that is available. I can confirm the replica has sufficient information, so I'd prefer to just forget about the old primary and make the replica the new primary. I don't mind a small amount of data loss.
Currently, in this sort of scenario, Kubegres logs that it has basically gone hands-off on the cluster "until we fix it manually", which I guess is fine. The steps I take to attempt to restore the replica are:
- Promote it using pg_ctl
- Set the promotePod in the kubegres config
- Label the statefulset and the pod to be primary
- Delete the statefulset and backing storage of the old failing primary instance
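As a rough sketch, the steps above look something like the script below. All resource names (`mypostgres`, `mypostgres-1`, `mypostgres-2`, the PVC name) are hypothetical placeholders, and the `replicationRole` label key and the `spec.failover.promotePod` path are my understanding of what Kubegres uses, so treat them as assumptions and check them against your own cluster. It only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the manual recovery steps. Every name below is a hypothetical
# placeholder; adjust to your own cluster before running for real.
NAMESPACE="${NAMESPACE:-default}"
CLUSTER="${CLUSTER:-mypostgres}"                    # Kubegres resource name (assumed)
NEW_PRIMARY_STS="${NEW_PRIMARY_STS:-mypostgres-2}"  # StatefulSet of the healthy replica (assumed)
OLD_PRIMARY_STS="${OLD_PRIMARY_STS:-mypostgres-1}"  # StatefulSet of the dead primary (assumed)
OLD_PRIMARY_PVC="${OLD_PRIMARY_PVC:-postgres-db-mypostgres-1-0}"  # PVC name is an assumption
DRY_RUN="${DRY_RUN:-1}"                             # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_plan() {
  new_pod="${NEW_PRIMARY_STS}-0"

  # 1. Promote the replica with pg_ctl (assumes PGDATA is set in the container).
  run kubectl exec -n "$NAMESPACE" "$new_pod" -- sh -c 'pg_ctl promote -D "$PGDATA"'

  # 2. Set promotePod in the Kubegres resource (field path is an assumption).
  run kubectl patch kubegres "$CLUSTER" -n "$NAMESPACE" --type merge \
    -p "{\"spec\":{\"failover\":{\"promotePod\":\"$new_pod\"}}}"

  # 3. Label the StatefulSet and the pod as primary (label key is an assumption).
  run kubectl label statefulset "$NEW_PRIMARY_STS" -n "$NAMESPACE" replicationRole=primary --overwrite
  run kubectl label pod "$new_pod" -n "$NAMESPACE" replicationRole=primary --overwrite

  # 4. Delete the StatefulSet and backing storage of the old failing primary.
  run kubectl delete statefulset "$OLD_PRIMARY_STS" -n "$NAMESPACE"
  run kubectl delete pvc "$OLD_PRIMARY_PVC" -n "$NAMESPACE"
}

recover_plan
```

Run it as-is to review the plan, then with `DRY_RUN=0` once the names match the cluster.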
At this point I'd expect Kubegres to simply take over, use the replica as the new primary, and create 2 new replicas. It doesn't do that. Instead, it keeps complaining about the dead primary, which is not even in Kubernetes anymore, and completely ignores the one I promoted and labeled.
So I have two questions:
- What is the correct way to resolve this scenario, so that I am working with Kubegres and not against it?
- Is there an easy way to tell Kubegres via promotePod "I'm the boss, I said promote this pod, ignore the other pods, use this one and move on with life"? Like a `forcePromote: true` or something? Or a way to redeploy and simply tell it to use an old PVC/PV that I know of for the first instance?