Skip to content

add minDomain for cluster topologySpread #32695

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jubrad
Copy link
Contributor

@jubrad jubrad commented Jun 9, 2025

Without a minDomain a topology spread constraint may place pods on only the subset of topology keys
where an existing node is provisioned. If we want to spread equally across all support availability zones rather than all availability zones that currently have nodes, then we need to set a minDomain value equal to the number of supported availability zones.

Perhaps a more appropriate way to handle this would be use the availability_zone's provided to environmentd. This was easier to throw together, it's more flexible, but less intuitive and easier to mess up.

There's some added trickiness to this one could imagine, where, for self-managed, this is not a global static, but a value that is dictated by the AZs available on the node_groups servicing a particular cluster or cluster size.

Motivation

We don't quite spread evenly.

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@jubrad jubrad force-pushed the feature/min-domains branch from f6ca9c0 to 934a12b Compare June 9, 2025 19:31
Without a minDomain a topology spread constraint may
place pods on only the subset of topology keys
where an existing node is provisioned. If we want to spread equally
across all support availability zones rather than all availability
zones that currently have nodes, then we need to set a minDomain value
equal to the number of supported availability zones.
@jubrad jubrad force-pushed the feature/min-domains branch from 934a12b to 678f4e4 Compare June 10, 2025 03:21
@jubrad jubrad marked this pull request as ready for review June 10, 2025 15:19
@jubrad jubrad requested review from a team as code owners June 10, 2025 15:19
@jubrad jubrad requested a review from aljoscha June 10, 2025 15:19
pub static CLUSTER_TOPOLOGY_SPREAD_MIN_DOMAINS: VarDefinition = VarDefinition::new(
"cluster_topology_spread_min_domains",
value!(Option<i32>; None),
"`minDomains` for replica topology spread constraints (Materialize).",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add to this comment that this should be equal to the number of AZs karpenter can launch materialize-instance nodes into for that region?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants