@elderapo commented Oct 16, 2025

For the sake of simplicity, I will refer to spec.replicas as instances. Renaming replicas to instances is not part of this PR (so there are no breaking changes requiring a possibly complex operator upgrade for users).


Scope:

  • Allow scaling instances down.
    • When scaling down a StatefulSet it is not possible to customize which pods get deleted first; the highest ordinals are always removed first (for example, with 5 instances, scaling down to 3 deletes pods 4 and 3, keeping 0, 1, and 2). To ensure the primary is not accidentally deleted during scale-down, it must not be in the set of pods scheduled for deletion. If it is, the instance count update is rejected with the following error message (see the validation sketch after this list):
    The MySQLCluster "mysql-database-cluster" is invalid: spec.replicas: Forbidden: scale-down 2->1 would delete the current primary pod moco-mysql-database-cluster-1. Perform a switchover so the primary's ordinal is < 1, then retry.
    • When scaling down from multiple instances to a single one (just the primary), semi-sync replication gets disabled on it, because there are no longer any replicas to ACK events from the primary.
  • Sequentially add instances to the cluster on scale-up
    • Initially, I thought it would be nice to create new instances one by one, let them clone, sync up, and join the cluster. However, it appears that moco heavily relies on all pod instances being deployed right after scale-up.
  • Prevent stalling writes during scale-up.
    • Freshly added replicas do not get taken into account for primary semi-sync ack calculation until they successfully bootstrap for the first time.
  • Allow even instance counts
    • Allowing an even instance count does not increase fault tolerance compared to the previous odd size, but it gives the cluster operator more control.
    • Safety is maintained by requiring a true majority for semi-sync replication: ceil((instances - 1) / 2) replicas must ACK before the primary commits (see the sketch after this list).
  • Prevent stalling writes during scale-down.
    • On scale-down there is a brief write stall (~5 seconds, from what I've observed). Ideally this would be prevented and the primary's ACK configuration updated immediately.
    • This might not be easily achievable because reconciliation only seems to run once the actual live pod count matches the desired count; the primary's configuration is only updated after the terminating pods are gone.
  • Update docs to document these changes.
  • Update CHANGELOG.md
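
As a minimal sketch of the scale-down guard referenced above (this is not the actual MOCO webhook code; the function name, signature, and the way the primary's ordinal is obtained are assumptions for illustration), the check only needs the old and new replica counts and the current primary's ordinal:

```go
package main

import "fmt"

// validateScaleDown sketches the admission check: StatefulSets always delete
// the highest ordinals first, so any pod with ordinal >= newReplicas will be
// removed. If that set contains the current primary, the update is rejected.
func validateScaleDown(clusterName string, oldReplicas, newReplicas, primaryIndex int32) error {
	if newReplicas >= oldReplicas {
		return nil // not a scale-down, nothing to check
	}
	if primaryIndex >= newReplicas {
		return fmt.Errorf(
			"scale-down %d->%d would delete the current primary pod moco-%s-%d; "+
				"perform a switchover so the primary's ordinal is < %d, then retry",
			oldReplicas, newReplicas, clusterName, primaryIndex, newReplicas)
	}
	return nil
}

func main() {
	// Mirrors the example from the error message above: scaling 2 -> 1 while the primary is at ordinal 1.
	if err := validateScaleDown("mysql-database-cluster", 2, 1, 1); err != nil {
		fmt.Println(err)
	}
}
```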
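And a small sketch of the majority rule used for semi-sync ACKs (the function name and the standalone program are illustrative assumptions; the point is just the ceil((instances - 1) / 2) arithmetic):

```go
package main

import "fmt"

// requiredAckCount returns how many replicas must ACK a transaction before
// the primary commits: ceil((instances - 1) / 2), a true majority of the
// cluster excluding the primary. With a single instance there are no
// replicas, so semi-sync replication is disabled entirely.
func requiredAckCount(instances int) int {
	if instances <= 1 {
		return 0
	}
	// For positive instances, ceil((instances-1)/2) == instances/2 in integer division.
	return instances / 2
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5, 6} {
		fmt.Printf("instances=%d -> required ACKs=%d\n", n, requiredAckCount(n))
	}
	// instances=1 -> 0 (semi-sync off), 2 -> 1, 3 -> 1, 4 -> 2, 5 -> 2, 6 -> 3
}
```

Note that an even count adds one more replica to keep in sync without raising the ACK requirement beyond the next odd size, which is why it gives the operator more flexibility without improving fault tolerance.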
