Old primary blocking Kubegres from proceeding #191

@zac-market

I have a cluster with Kubegres deployed, running a primary and 2 replicas. Unfortunately our network is rather bad, so we see the following problem a lot, and fixing it takes a decent amount of manual intervention (I can fix it via methods not specified here). I'd like to know the right way, or an easier way, to fix it. Unfortunately I cannot provide logs.

Scenario: Either there is a network or NFS outage, the primary fails, and the outage continues for a bit. The primary dies in some capacity, the database rolls over a few times, and eventually we wind up in a state where we have a dead primary complaining about a bad timeline segment and a replica that is available. I can confirm the replica has sufficient information, so I'd prefer to just forget about the old primary and simply make the replica the new primary. I don't care if there is minimal data loss.

Currently, in this sort of scenario, Kubegres logs that it has basically gone hands-off on the cluster "until we fix it manually," which I guess is fine. The steps I take to attempt to restore the replica are:

  1. Promote it using pg_ctl
  2. Set the promotePod in the kubegres config
  3. Label the statefulset and the pod to be primary
  4. Delete the statefulset and backing storage of the old failing primary instance
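
Roughly, those four steps correspond to commands like the ones built below. This is only a sketch to make the steps concrete: the resource names, the `replicationRole` label key, the `spec.failover.promotePod` path, the data directory, and the idea that the StatefulSet name is the pod name minus its ordinal are all assumptions that should be checked against the actual cluster. The helper `recovery_commands` is hypothetical and only prints the commands; it does not run anything. Depending on the image, `pg_ctl` may also need to be run as the postgres OS user.

```python
import json
import shlex


def recovery_commands(cluster, replica_pod, old_primary_sts, old_primary_pvc,
                      namespace="default",
                      pgdata="/var/lib/postgresql/data/pgdata",  # assumed data dir
                      role_label="replicationRole"):             # assumed label key
    """Return the kubectl invocations for the four manual steps, in order."""
    kubectl = ["kubectl", "-n", namespace]
    # Assumption: the StatefulSet is named like the pod without its ordinal suffix.
    replica_sts = replica_pod.rsplit("-", 1)[0]
    # Assumption: promotePod lives under spec.failover in the Kubegres resource.
    patch = json.dumps({"spec": {"failover": {"promotePod": replica_pod}}})
    return [
        # 1. Promote the replica using pg_ctl
        kubectl + ["exec", replica_pod, "--", "pg_ctl", "promote", "-D", pgdata],
        # 2. Set promotePod in the Kubegres resource
        kubectl + ["patch", "kubegres", cluster, "--type", "merge", "-p", patch],
        # 3. Label the StatefulSet and the pod as primary
        kubectl + ["label", "statefulset", replica_sts,
                   f"{role_label}=primary", "--overwrite"],
        kubectl + ["label", "pod", replica_pod,
                   f"{role_label}=primary", "--overwrite"],
        # 4. Delete the old primary's StatefulSet and its backing storage
        kubectl + ["delete", "statefulset", old_primary_sts],
        kubectl + ["delete", "pvc", old_primary_pvc],
    ]


if __name__ == "__main__":
    # All names here are placeholders, not taken from a real cluster.
    for cmd in recovery_commands("mypostgres", "mypostgres-2-0",
                                 "mypostgres-1", "postgres-db-mypostgres-1-0"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere and makes it easy to review each step before touching the cluster.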

At this point I'd expect Kubegres to simply take over, use the replica as the new primary, and create 2 new replicas. It doesn't do that. Instead it keeps complaining about the dead primary, which is not even in Kubernetes anymore, and completely ignores the one I promoted and labeled.

So I have two questions:

  1. What is the correct way to resolve this scenario so that I am working with Kubegres and not against it?
  2. Is there an easy way to tell Kubegres, alongside promotePod, "I'm the boss, I said promote this pod, ignore the other pods, use this one and move on with life"? Like a forcePromote: true or something? Or a way to redeploy and simply tell it to use an old PVC/PV that I know of for the first instance?