KEP-3243: Respect PodTopologySpread after rolling upgrades

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The pod topology spread feature allows users to define the group of pods over which spreading is applied using a LabelSelector. This means the user should know the exact label key and value when defining the pod spec.

This KEP proposes a complementary field to LabelSelector named MatchLabelKeys in TopologySpreadConstraint, which represents a set of label keys only. At pod creation, kube-apiserver will use those keys to look up label values from the incoming pod, and those key-value labels will be merged with the existing LabelSelector to identify the group of existing pods over which the spreading skew will be calculated. Note that if MatchLabelKeys is supported in the cluster-level default constraints (see kubernetes/kubernetes#129198), kube-scheduler will also handle it separately.

The main use case this new way of identifying pods enables is constraining the spreading skew calculation to the revision level in Deployments during rolling upgrades.

Motivation

PodTopologySpread is widely used in production environments, especially in service-type workloads that employ Deployments. However, it currently has a limitation that manifests during rolling updates and causes the deployment to end up out of balance (98215, 105661, k8s-pod-topology spread is not respected after rollout).

The root cause is that PodTopologySpread constraints allow defining a key-value label selector, which applies to all pods in a Deployment irrespective of their owning ReplicaSet. As a result, when a new revision is rolled out, spreading applies across pods from both the old and new ReplicaSets, and so by the time the new ReplicaSet is completely rolled out and the old one is scaled down, the actual spreading we are left with may not match expectations, because the deleted pods from the older ReplicaSet leave a skewed distribution for the remaining pods.

Currently, users are given two solutions to this problem. The first is to add a revision label to the Deployment and update it manually at each rolling upgrade (both the label on the podTemplate and the selector in the podTopologySpread constraint); the second is to deploy a descheduler to re-balance the pod distribution. The former solution isn't user-friendly and requires manual tuning, which is error prone, while the latter requires installing and maintaining an extra controller. This KEP proposes a native way to maintain pod balance after a rolling upgrade in Deployments that use PodTopologySpread.

Goals

  • Allow users to define PodTopologySpread constraints such that they apply only within the boundaries of a Deployment revision during rolling upgrades.

Non-Goals

Proposal

User Stories (Optional)

Story 1

When users apply a rolling update to a deployment that uses PodTopologySpread, the spread should be respected only within the new revision, not across all revisions of the deployment.

Notes/Constraints/Caveats (Optional)

In most scenarios, users can use the label keyed with pod-template-hash, added automatically by the Deployment controller, to distinguish between different revisions in a single Deployment. But for more complex scenarios (e.g., a topology spread constraint associating two Deployments at the same time), users are responsible for providing common labels to identify which pods should be grouped.
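As an illustration, here is a minimal sketch of such a constraint using the Go API types shown later in this KEP. It is not part of the proposal itself; the app: sample label and the hypothetical group key mentioned in the comments are assumed examples.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	constraint := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "kubernetes.io/hostname",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		// LabelSelector still scopes spreading to the workload's pods as before.
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "sample"},
		},
		// pod-template-hash limits spreading to pods of the same Deployment
		// revision. For grouping across Deployments, a common user-provided
		// label key (e.g. "group") would be listed here instead.
		MatchLabelKeys: []string{"pod-template-hash"},
	}
	fmt.Printf("%+v\n", constraint)
}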

Risks and Mitigations

Possible misuse

In addition to using pod-template-hash added by the Deployment controller, users can also provide a custom key in MatchLabelKeys to identify which pods should be grouped. If so, the user needs to ensure that the key is correct and does not overlap with keys used by other, unrelated workloads.

Updates to labels specified in matchLabelKeys aren't supported

MatchLabelKeys is handled and merged into LabelSelector at pod creation. This means the feature doesn't support label updates: a user could update a label specified in matchLabelKeys after the pod is created, but the update wouldn't be reflected in the merged LabelSelector, even though users might expect it to be. In the documentation, we'll state that using matchLabelKeys with labels that might be updated is not recommended.

Also, we assume the risk is acceptably low because:

  1. It's fairly unlikely to happen because pods are usually managed by another resource (e.g., a Deployment), and updating the pod template's labels on a Deployment recreates pods instead of directly updating the labels on existing pods. Even if users somehow use bare pods (which is not recommended in the first place), there is usually only a tiny window between pod creation and the pod getting scheduled, which makes this risk even rarer, unless many pods often get stuck unschedulable for a long time in the cluster (which is not recommended) or the labels specified in matchLabelKeys are frequently updated (which we'll declare as not recommended).
  2. If it happens, selfMatchNum will be 0 while matchNum and minMatchNum are retained. Consequently, depending on the current number of matching pods in the domain, matchNum - minMatchNum might be bigger than maxSkew and the pod(s) could be unschedulable (see the simplified sketch below). But it does not mean that the affected pods would stay unschedulable forever.
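To make point 2 concrete, here is a simplified sketch of the skew check with hypothetical numbers; it illustrates the relationship described above and is not the plugin's actual code.

package main

import "fmt"

// fits mirrors the simplified form of the PodTopologySpread filter check:
// a domain is feasible only if matchNum+selfMatchNum-minMatchNum <= maxSkew.
func fits(matchNum, selfMatchNum, minMatchNum, maxSkew int) bool {
	return matchNum+selfMatchNum-minMatchNum <= maxSkew
}

func main() {
	// Normal case: the incoming pod still matches the merged selector
	// (selfMatchNum=1), so a domain with 2 matching pods against a global
	// minimum of 2 stays within maxSkew=1.
	fmt.Println(fits(2, 1, 2, 1)) // true

	// Risk case: the label referenced by matchLabelKeys was updated after the
	// selector was merged, so the pod no longer matches itself (selfMatchNum=0)
	// while matchNum and minMatchNum are retained; 4-2 > 1, so the pod is
	// unschedulable in this domain for now.
	fmt.Println(fits(4, 0, 2, 1)) // false
}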

Design Details

A new optional field named MatchLabelKeys will be introduced to TopologySpreadConstraint. Currently, when scheduling a pod, the LabelSelector defined in the pod is used to identify the group of pods over which spreading will be calculated. MatchLabelKeys adds another constraint to how this group of pods is identified.

type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable UnsatisfiableConstraintAction
	LabelSelector     *metav1.LabelSelector

	// MatchLabelKeys is a set of pod label keys to select the pods over which 
	// spreading will be calculated. The keys are used to lookup values from the
	// incoming pod labels, those key-value labels are ANDed with `LabelSelector`
	// to select the group of existing pods over which spreading will be calculated
	// for the incoming pod. Keys that don't exist in the incoming pod labels will
	// be ignored.
	MatchLabelKeys []string
}

When a Pod is created, kube-apiserver looks up the values of the keys listed in matchLabelKeys from the pod's labels and merges the resulting key-value pairs into the LabelSelector of the TopologySpreadConstraint.

For example, when this sample Pod is created,

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys: # ADDED
    - app

kube-apiserver modifies the labelSelector like the following:

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
+     matchExpressions:
+     - key: app
+       operator: In
+       values:
+       - sample
    matchLabelKeys:
    - app
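The merge can be sketched as follows. This is a simplified outline of the behavior described above, not the actual kube-apiserver implementation; the helper name mergeMatchLabelKeys is hypothetical.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// mergeMatchLabelKeys appends, for every key in matchLabelKeys that exists in
// the incoming pod's labels, an `In` requirement with the pod's value to the
// constraint's labelSelector. Keys missing from the pod's labels are ignored.
func mergeMatchLabelKeys(pod *corev1.Pod) {
	for i := range pod.Spec.TopologySpreadConstraints {
		c := &pod.Spec.TopologySpreadConstraints[i]
		if len(c.MatchLabelKeys) == 0 {
			continue
		}
		if c.LabelSelector == nil {
			c.LabelSelector = &metav1.LabelSelector{}
		}
		for _, key := range c.MatchLabelKeys {
			value, ok := pod.Labels[key]
			if !ok {
				continue
			}
			c.LabelSelector.MatchExpressions = append(c.LabelSelector.MatchExpressions,
				metav1.LabelSelectorRequirement{
					Key:      key,
					Operator: metav1.LabelSelectorOpIn,
					Values:   []string{value},
				})
		}
	}
}

func main() {
	// The sample pod from above: labels {app: sample}, an empty labelSelector,
	// and matchLabelKeys: [app].
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sample", Labels: map[string]string{"app": "sample"}},
		Spec: corev1.PodSpec{
			TopologySpreadConstraints: []corev1.TopologySpreadConstraint{{
				MaxSkew:           1,
				TopologyKey:       "kubernetes.io/hostname",
				WhenUnsatisfiable: corev1.DoNotSchedule,
				LabelSelector:     &metav1.LabelSelector{},
				MatchLabelKeys:    []string{"app"},
			}},
		},
	}
	mergeMatchLabelKeys(pod)
	// Prints a selector equivalent to: app in (sample).
	fmt.Printf("%+v\n", pod.Spec.TopologySpreadConstraints[0].LabelSelector)
}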

In addition, kube-scheduler will handle matchLabelKeys within the cluster-level default constraints in the scheduler configuration in the future (see kubernetes/kubernetes#129198).

Finally, the feature will be guarded by a new feature flag MatchLabelKeysInPodTopologySpread. If the feature is disabled, the matchLabelKeys field and the corresponding labelSelector are preserved if they were already set in the persisted Pod object; otherwise, the creation of new Pods with the field will be rejected by kube-apiserver. kube-scheduler will also ignore matchLabelKeys in the cluster-level default constraints configuration.

[v1.34] design change and a safe upgrade path

Previously, kube-scheduler handled matchLabelKeys internally, before calculating scheduling results. We changed the implementation to the current design to align it with PodAffinity's matchLabelKeys. (See the detailed discussion in the Alternatives section.)

However, this implementation change could break matchLabelKeys for unscheduled pods created before the upgrade, because kube-apiserver only handles matchLabelKeys at pod creation; that is, it doesn't handle matchLabelKeys on existing unscheduled pods. So, for a safe upgrade path from v1.33 to v1.34, kube-scheduler handles matchLabelKeys not only from the default constraints but also from all incoming pods during v1.34. In v1.35, we're going to change kube-scheduler to only handle matchLabelKeys from the default constraints, for efficiency, assuming kube-apiserver handles matchLabelKeys for all incoming pods.

Also, in case of bugs in this new design, users can disable this feature through a new feature flag, MatchLabelKeysInPodTopologySpreadSelectorMerge (enabled by default). (See more details in Feature Enablement and Rollback)

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 87.5%
  • k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 84.8%
  • k8s.io/kubernetes/pkg/registry/core/pod/strategy.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 65%
Integration tests
e2e tests

Graduation Criteria

Alpha

  • Feature implemented behind feature gate.
  • Unit and integration tests passed as designed in TestPlan.

Beta

  • Feature is enabled by default
  • Benchmark tests passed, and there is no performance degradation.
  • Update documents to reflect the changes.

GA

  • No negative feedback.
  • Update documents to reflect the changes.

Upgrade / Downgrade Strategy

In the event of an upgrade, kube-apiserver will start to accept and store the field MatchLabelKeys.

In the event of a downgrade, kube-apiserver will reject pod creation with matchLabelKeys in TopologySpreadConstraint. But for existing pods, we keep matchLabelKeys and the generated LabelSelector even after the downgrade. kube-scheduler will ignore MatchLabelKeys if it was set in the cluster-level default constraints configuration.

Version Skew Strategy

There's no version skew issue.

We changed the implementation design in v1.34 (with a follow-up kube-scheduler change planned for v1.35), but the change was designed not to involve any version skew issue, as described in [v1.34] design change and a safe upgrade path.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

  • MatchLabelKeysInPodTopologySpread feature flag enables the MatchLabelKeys feature in TopologySpreadConstraint.
  • MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag enables the new design described at [v1.34] design change and a safe upgrade path.
    • If MatchLabelKeysInPodTopologySpreadSelectorMerge is disabled while MatchLabelKeysInPodTopologySpread is enabled, Kubernetes handles MatchLabelKeys with the classic design, in which kube-scheduler handles it internally. However, this is not recommended unless you encounter a bug in the new design's behavior.
    • This flag cannot be enabled on its own, and has to be enabled together with MatchLabelKeysInPodTopologySpread. Enabling MatchLabelKeysInPodTopologySpreadSelectorMerge alone has no effect, and matchLabelKeys will be ignored.

The MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag was added in v1.34 and is enabled by default. This flag can be disabled to revert the v1.34 implementation design change and go back to the previous behavior in case of bugs.

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: MatchLabelKeysInPodTopologySpread
    • Components depending on the feature gate: kube-scheduler, kube-apiserver
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: MatchLabelKeysInPodTopologySpreadSelectorMerge
    • Components depending on the feature gate: kube-apiserver
Does enabling the feature change any default behavior?

No.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate off. One caveat is that pods that used the feature will continue to have the MatchLabelKeys field and the corresponding LabelSelector set even after the feature gate is disabled. In Stable versions, users can choose to opt out by not setting the matchLabelKeys field.

What happens if we reenable the feature if it was previously rolled back?

Newly created pods will be scheduled following this policy. Old pods will not be affected.

Are there any tests for feature enablement/disablement?

No. Unit tests exercising the switch of the feature gate itself will be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

It won't impact already running workloads because it is an opt-in feature in kube-apiserver and kube-scheduler. But during a rolling upgrade, if some apiservers have not enabled the feature, they will not be able to accept and store the MatchLabelKeys field, and pods created through those apiservers will not be able to use this feature. As a result, pods belonging to the same deployment may have different scheduling outcomes.

What specific metrics should inform a rollback?
  • If the metric schedule_attempts_total{result="error|unschedulable"} increases significantly after pods using this feature are added.
  • If the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} exceeds 100ms at the 90th percentile after pods using this feature are added.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes, it was tested manually by following the steps below, and it worked as intended.

  1. create a Kubernetes v1.26 cluster with 3 nodes where the MatchLabelKeysInPodTopologySpread feature is disabled.
  2. deploy a Deployment with this YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 12 
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar
    spec:
      restartPolicy: Always
      containers:
      - name: nginx
        image: nginx:1.14.2
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              foo: bar
          matchLabelKeys:
            - pod-template-hash
  3. pods spread across nodes as 4/4/4
  4. update the deployment nginx image to nginx:1.15.0
  5. pods spread across nodes as 5/4/3
  6. delete deployment nginx
  7. upgrade the Kubernetes cluster to v1.27 (at master branch) with MatchLabelKeysInPodTopologySpread enabled.
  8. deploy a deployment nginx like in step 2
  9. pods spread across nodes as 4/4/4
  10. update the deployment nginx image to nginx:1.15.0
  11. pods spread across nodes as 4/4/4
  12. delete deployment nginx
  13. downgrade the Kubernetes cluster to v1.26 where the MatchLabelKeysInPodTopologySpread feature is enabled.
  14. deploy a deployment nginx like in step 2
  15. pods spread across nodes as 4/4/4
  16. update the deployment nginx image to nginx:1.15.0
  17. pods spread across nodes as 4/4/4
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Operators can query pods that have the pod.spec.topologySpreadConstraints.matchLabelKeys field set to determine if the feature is in use by workloads.
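For example, a small client-go sketch like the following could list the pods that set the field. This is an illustration only (it assumes a kubeconfig at the default location); any standard way of listing pods and inspecting the field works equally well.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; error handling kept minimal.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Report every pod with at least one constraint that uses matchLabelKeys.
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.TopologySpreadConstraints {
			if len(c.MatchLabelKeys) > 0 {
				fmt.Printf("%s/%s uses matchLabelKeys: %v\n", pod.Namespace, pod.Name, c.MatchLabelKeys)
				break
			}
		}
	}
}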

How can someone using this feature know that it is working for their instance?
  • Other (treat as last resort)
    • Details: We can determine if this feature is being used by checking pods that have MatchLabelKeys set in TopologySpreadConstraint.
What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms on 90-percentile.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Component exposing the metric: kube-scheduler
      • Metric name: plugin_execution_duration_seconds{plugin="PodTopologySpread"}
      • Metric name: schedule_attempts_total{result="error|unschedulable"}
Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes. It would be helpful to have metrics showing which plugins affect the scheduler's decisions in the Filter/Score phases. There is a related issue: kubernetes/kubernetes#110643. It's a large effort and still in progress.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes. There is additional work: kube-apiserver uses the keys in matchLabelKeys to look up label values from the pod and changes the LabelSelector accordingly. kube-scheduler also handles matchLabelKeys if the cluster-level default constraints include it. The impact on the latency of pod creation requests in kube-apiserver and on scheduling latency should be negligible.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

If the API server and/or etcd is not available, this feature will not be available. This is because kube-scheduler needs to write the scheduling results back to the pod via the API server/etcd.

What are other known failure modes?

N/A

What steps should be taken if SLOs are not being met to determine the problem?
  • Check the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} to determine if the latency increased. If it did, this feature may have increased scheduling latency; you can disable the MatchLabelKeysInPodTopologySpread feature to see if it's the cause.
  • Check the metric schedule_attempts_total{result="error|unschedulable"} to determine if the number of failed attempts increased. If it did, determine the cause of the failure from the pod's events. If it's caused by the PodTopologySpread plugin, you can analyze the problem further by looking at the kube-scheduler logs.

Implementation History

  • 2022-03-17: Initial KEP
  • 2022-06-08: KEP merged
  • 2023-01-16: Graduate to Beta
  • 2025-01-23: Change the implementation design to be aligned with PodAffinity's matchLabelKeys
  • 2025-04-07: Add a new feature flag MatchLabelKeysInPodTopologySpreadSelectorMerge and update milestone

Drawbacks

Alternatives

use pod generateName

Use pod.generateName to distinguish new/old pods that belong to different revisions of the same workload in the scheduler plugin. We decided not to support this for the following reason: the scheduler needs to stay general-purpose, and scheduler plugins shouldn't give special treatment to any particular labels/fields.

implement MatchLabelKeys in only either the scheduler plugin or kube-apiserver

Technically, we could implement this feature within the PodTopologySpread plugin only, merging the key-value labels corresponding to MatchLabelKeys into LabelSelector internally within the plugin before calculating the scheduling results. This is the actual implementation up to v1.33. But it may confuse users because this behavior would differ from PodAffinity's MatchLabelKeys.

Also, we cannot implement this feature only within kube-apiserver because it'd make it impossible to handle MatchLabelKeys within the cluster-level default constraints in the scheduler configuration in the future (see kubernetes/kubernetes#129198).

So we decided to go with the design that implements this feature within both the PodTopologySpread plugin and kube-apiserver. Although the final design has the downside of requiring us to maintain two implementations of MatchLabelKeys handling, each implementation is simple, and we regard the risk of increased maintenance overhead as fairly low.

Infrastructure Needed (Optional)