Learner promotion not being persisted into v3store may be propagated across multiple upgrades #20793

@ahrtr


What happened?

Issue and analysis

When users upgrade from 1.32.3 (etcd 3.5.19) --> 1.33.1 (etcd 3.5.21) --> 1.34.1 (etcd 3.6.4), the 1.32.3 --> 1.33.1 upgrade always succeeds, but the 1.33.1 --> 1.34.1 upgrade may fail.

Previously we fixed the issue of learner promotions not being persisted into v3store in #19563, and published a formal blog post: https://etcd.io/blog/2025/upgrade_from_3.5_to_3.6_issue/

However, we missed one scenario: during a rolling upgrade, a newly added member (learner) may receive a snapshot from a member running an old version (<= 3.5.19), so the wrong membership data may be propagated to the new member even though it runs a new version (>= 3.5.20). The cluster keeps working nonetheless, because v2store is still the source of truth in 3.5.

When users upgrade to 1.34.x (etcd 3.6.x), the wrong membership data may be propagated to the new members again. Since v3store is the source of truth in 3.6, cluster-api may eventually fail to add a new learner with the error `etcdserver: too many learner members in cluster`.
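The failure mode above can be modeled with a short, stdlib-only Go sketch. This is not actual etcd code; the `Membership` and `Snapshot` types and all names are illustrative, and the stores are reduced to name -> IsLearner maps:

```go
package main

import "fmt"

// Membership maps member name -> IsLearner flag (illustrative model only).
type Membership map[string]bool

// Snapshot carries both stores, as an etcd snapshot effectively does.
type Snapshot struct {
	V2, V3 Membership
}

// learnerCount returns how many members are flagged as learners.
func learnerCount(m Membership) int {
	n := 0
	for _, isLearner := range m {
		if isLearner {
			n++
		}
	}
	return n
}

func main() {
	// Old member (<= 3.5.19): member "A" was added as a learner...
	v2 := Membership{"A": true}
	v3 := Membership{"A": true}

	// ...and then promoted, but the buggy version only updates v2store.
	v2["A"] = false // promotion persisted into v2store
	// v3["A"] stays true: promotion NOT persisted into v3store.

	// A new member (>= 3.5.20) receives this snapshot and applies it
	// verbatim, inheriting the stale v3store data.
	snap := Snapshot{V2: v2, V3: v3}

	// In 3.5, v2store is the source of truth, so the cluster still works.
	fmt.Println("3.5 learners (v2store):", learnerCount(snap.V2)) // 0

	// In 3.6, v3store becomes the source of truth: "A" wrongly counts as a
	// learner, so adding one more learner exceeds the default limit of 1
	// and fails with "etcdserver: too many learner members in cluster".
	const maxLearners = 1
	fmt.Println("3.6 learners (v3store):", learnerCount(snap.V3)) // 1
	fmt.Println("can add another learner:", learnerCount(snap.V3)+1 <= maxLearners)
}
```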

I developed a tool, upgradeDemo, which follows exactly the same procedure cluster-api uses to upgrade a Kubernetes cluster (add a learner first, promote it later, and remove an old node last; repeat these steps until all nodes are replaced).

I successfully reproduced the issue using the tool. All the related data and logs are saved in upgradeDemo/reproduction.

Solution

We need to deliver a patch for release-3.5 only, to handle the applySnapshot case: correct the v3store's membership data when applying a snapshot.

We need to fix this in 3.5.24.
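The idea of the fix can be sketched with another stdlib-only Go model (this is not the actual patch; `reconcileV3` and the `Membership` map are hypothetical names): when applying a snapshot in 3.5, overwrite v3store's learner flags with v2store's, since v2store is the source of truth there.

```go
package main

import "fmt"

// Membership maps member name -> IsLearner flag (illustrative model only).
type Membership map[string]bool

// reconcileV3 corrects v3store's learner flags from v2store, which is the
// source of truth in 3.5. It returns the names of corrected members.
func reconcileV3(v2, v3 Membership) []string {
	var fixed []string
	for name, isLearner := range v2 {
		if v3[name] != isLearner {
			v3[name] = isLearner
			fixed = append(fixed, name)
		}
	}
	return fixed
}

func main() {
	// Snapshot from an old (<= 3.5.19) member: "A" was promoted in v2store,
	// but the promotion was never persisted into v3store.
	v2 := Membership{"A": false, "B": true}
	v3 := Membership{"A": true, "B": true}

	fixed := reconcileV3(v2, v3)
	fmt.Println("corrected members:", fixed)
	fmt.Println("A is learner in v3store:", v3["A"])
}
```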

Workaround

Workaround 1

Upgrade at least twice with etcd versions >= 3.5.20 before upgrading to 1.34.x (3.6.x).

For example,

  • upgrade from 1.32.3 (3.5.19) to 1.33.1 (3.5.21)
  • upgrade from 1.33.1 to 1.33.3 (after this upgrade, all the membership data will be automatically corrected by each member's self-publish)

If users are still on 1.31.x, then please

  • upgrade to 1.32.x (ensure it ships etcd 3.5.20+)
  • upgrade to 1.33.x (ensure it ships etcd 3.5.20+)

Workaround 2

After successfully upgrading to 1.33.x (with etcd 3.5.20+), restart all etcd pods/instances one by one; each member will automatically publish itself, so the wrong learner info will be auto-corrected.

cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation

What did you expect to happen?

.

How can we reproduce it (as minimally and precisely as possible)?

Use https://github.com/ahrtr/etcd-issues/tree/master/etcd/upgradeDemo

See configuration https://github.com/ahrtr/etcd-issues/blob/master/etcd/upgradeDemo/config.json

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
# paste output here

$ etcdctl version
# paste output here

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output


Labels: priority/important-soon, release/v3.5, type/bug
