Description
Bug report criteria
- This bug report is not security related; security issues should be disclosed privately via [email protected].
- This is not a support request or question; support requests or questions should be raised in the etcd discussion forums.
- You have read the etcd bug reporting guidelines.
- Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
Issue and analysis
When users upgrade from 1.32.3 (etcd 3.5.19) --> 1.33.1 (etcd 3.5.21) --> 1.34.1 (etcd 3.6.4), the 1.32.3 --> 1.33.1 upgrade always succeeds, but the 1.33.1 --> 1.34.1 upgrade may fail.
Previously we fixed the issue of a learner promotion not being persisted into v3store in #19563, and published a formal blog post: https://etcd.io/blog/2025/upgrade_from_3.5_to_3.6_issue/
However, we missed one scenario: during a rolling upgrade, the newly added member (learner) may receive a snapshot from a member running an old version (<= 3.5.19), so the wrong membership data may be propagated to the new member even though it runs a new version (>= 3.5.20). The cluster keeps working at this point, because v2store is still the source of truth in 3.5.
When users then upgrade to 1.34.x (etcd 3.6.x), the wrong membership data may be propagated to the new member again. Now v3store is the source of truth, so eventually cluster-api may fail to add a new learner with the `etcdserver: too many learner members in cluster` error.
I developed a tool, upgradeDemo, which follows exactly the same steps that cluster-api uses to upgrade a Kubernetes cluster (add a learner first, promote it later, and finally remove an old node; repeat until all nodes are replaced).
I successfully reproduced the issue using the tool. All the related data and logs are saved in upgradeDemo/reproduction.
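For reference, below is a minimal Go sketch of the add-learner / promote / remove flow described above, using the clientv3 API. It is not the actual upgradeDemo or cluster-api code; the endpoints, peer URLs and member IDs are placeholders. With the corrupted v3store membership, the learner-add step is where the failure shows up on 3.6.x.

```go
// Illustrative sketch of the per-node replacement flow that cluster-api
// (and the upgradeDemo tool) performs: add learner, promote, remove old member.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// 1. Add the new node as a learner. With the corrupted v3store membership,
	//    this is the call that fails on 3.6.x with
	//    "etcdserver: too many learner members in cluster".
	addResp, err := cli.MemberAddAsLearner(ctx, []string{"https://10.0.0.4:2380"}) // placeholder peer URL
	if err != nil {
		log.Fatalf("add learner: %v", err)
	}
	learnerID := addResp.Member.ID

	// 2. Promote the learner; in practice cluster-api retries this until the
	//    learner has caught up with the leader.
	if _, err := cli.MemberPromote(ctx, learnerID); err != nil {
		log.Fatalf("promote learner: %v", err)
	}

	// 3. Remove the old member that is being replaced (ID is a placeholder).
	oldMemberID := uint64(0xdeadbeef)
	if _, err := cli.MemberRemove(ctx, oldMemberID); err != nil {
		log.Fatalf("remove old member: %v", err)
	}
}
```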
Solution
We need to deliver a patch for release-3.5 only, to handle the applySnapshot case: the v3store membership data must be corrected when applying a snapshot.
We need to fix this in 3.5.24.
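Conceptually, the fix needs to overwrite whatever membership is recorded in v3store with the membership recovered from the snapshot's v2store, which is still the source of truth in 3.5. The sketch below only illustrates that reconciliation idea with simplified stand-in types; it is not the real etcd server code.

```go
// Illustrative only: simplified types, not the actual etcd server internals.
package membership

// Member is a simplified stand-in for etcd's membership.Member.
type Member struct {
	ID        uint64
	IsLearner bool
	PeerURLs  []string
}

// Store is a simplified stand-in for the v3store (bbolt) membership backend.
type Store interface {
	Members() map[uint64]*Member
	SaveMember(*Member)
	DeleteMember(id uint64)
}

// reconcileV3Membership makes the v3store membership match the membership
// recovered from the snapshot's v2store, which is authoritative in 3.5.
// Conceptually, this is what applySnapshot would need to do in release-3.5.
func reconcileV3Membership(v2Members map[uint64]*Member, v3 Store) {
	// Drop members that exist only in v3store.
	for id := range v3.Members() {
		if _, ok := v2Members[id]; !ok {
			v3.DeleteMember(id)
		}
	}
	// Add or overwrite members so the learner flag and peer URLs match v2store.
	for _, m := range v2Members {
		v3.SaveMember(m)
	}
}
```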
Workaround
Workaround 1
Upgrade at least twice with an etcd version >= 3.5.20 before upgrading to 1.34.x (3.6.x). A pre-flight version check is sketched after the examples below.
For example,
- upgrade from 1.32.3 (3.5.19) to 1.33.1 (3.5.21)
- upgrade from 1.33.1 to 1.33.3 (after this upgrade, all the membership data will be automatically corrected by each member's self-publish)
If users are still on 1.31.x, then please
- upgrade to 1.32.x (ensure with etcd 3.5.20+)
- upgrade to 1.33.x (ensure with etcd 3.5.20+)
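Before the final upgrade to 1.34.x (3.6.x), it is worth verifying that every member already reports a patched 3.5 version. Below is a minimal Go sketch using the clientv3 MemberList/Status APIs; the endpoint is a placeholder, and golang.org/x/mod/semver is only used for the version comparison. `etcdctl endpoint status -w table` shows the same version information.

```go
// Illustrative pre-flight check: confirm every member runs etcd >= 3.5.20
// before starting the 3.6.x upgrade.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"golang.org/x/mod/semver"
	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://10.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		if len(m.ClientURLs) == 0 {
			log.Fatalf("member %x has not published its client URLs yet", m.ID)
		}
		// Query each member's server version via the maintenance Status API.
		st, err := cli.Status(ctx, m.ClientURLs[0])
		if err != nil {
			log.Fatalf("status %s: %v", m.Name, err)
		}
		if semver.Compare("v"+st.Version, "v3.5.20") < 0 {
			log.Fatalf("member %s runs %s; upgrade it to >= 3.5.20 first", m.Name, st.Version)
		}
		fmt.Printf("member %s: %s OK\n", m.Name, st.Version)
	}
}
```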
Workaround 2
After successfully upgrading to 1.33.x (with etcd 3.5.20+), restart all etcd pods/instances one by one; each member will automatically publish itself, so the wrong learner info will be corrected automatically.
cc @fuweid @ivanvc @jmhbnz @serathius @siyuanfoundation
What did you expect to happen?
.
How can we reproduce it (as minimally and precisely as possible)?
Use https://github.com/ahrtr/etcd-issues/tree/master/etcd/upgradeDemo
See configuration https://github.com/ahrtr/etcd-issues/blob/master/etcd/upgradeDemo/config.json
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
# paste output here
$ etcdctl version
# paste output here
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here