- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The pod topology spread feature allows users to define the group of pods over which spreading is applied using a LabelSelector. This means the user should know the exact label key and value when defining the pod spec.
This KEP proposes a complementary field to LabelSelector named MatchLabelKeys in TopologySpreadConstraint, which represents a set of label keys only. At pod creation, kube-apiserver will use those keys to look up label values from the incoming pod, and those key-value labels will be merged with the existing LabelSelector to identify the group of existing pods over which the spreading skew will be calculated.
Note that once MatchLabelKeys is supported in the cluster-level default constraints (see kubernetes/kubernetes#129198), kube-scheduler will also handle it separately.
The main use case this new way of identifying pods enables is constraining the spreading skew calculation to the revision level in Deployments during rolling upgrades.
PodTopologySpread is widely used in production environments, especially in service-type workloads that employ Deployments. However, it currently has a limitation that manifests during rolling updates and causes the Deployment to end up out of balance (#98215, #105661, "k8s-pod-topology spread is not respected after rollout").
The root cause is that the PodTopologySpread constraint's label selector matches all pods in a Deployment irrespective of their owning ReplicaSet. As a result, when a new revision is rolled out, spreading applies across pods from both the old and new ReplicaSets, so by the time the new ReplicaSet is fully rolled out and the old one is scaled down, the actual spreading may not match expectations, because the deleted pods of the old ReplicaSet leave a skewed distribution among the remaining pods.
Currently, users are given two solutions to this problem. The first is to add a revision label to the Deployment and update it manually at each rolling upgrade (both the label on the pod template and the selector in the podTopologySpread constraint); the second is to deploy a descheduler to re-balance the pod distribution. The former isn't user friendly and requires manual tuning, which is error prone; the latter requires installing and maintaining an extra controller. This KEP proposes a native way to maintain pod balance after a rolling upgrade in Deployments that use PodTopologySpread.
- Allow users to define PodTopologySpread constraints such that they apply only within the boundaries of a Deployment revision during rolling upgrades.
When users apply a rolling update to a deployment that uses PodTopologySpread, the spread should be respected only within the new revision, not across all revisions of the deployment.
In most scenarios, users can use the pod-template-hash label added automatically by the Deployment controller to distinguish between different revisions in a single Deployment. But for more complex scenarios (e.g., a topology spread associating two Deployments at the same time), users are responsible for providing common labels to identify which pods should be grouped.
In addition to pod-template-hash added by the Deployment controller, users can also provide a customized key in MatchLabelKeys to identify which pods should be grouped. If so, the user needs to ensure that the key is correct and not duplicated with other unrelated workloads.
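For illustration, such a constraint might look like the following sketch. This is not part of the proposal itself: the app selector and the extra group key are hypothetical user-provided labels used only for this example.
```yaml
# Illustrative sketch only: a pod template's topology spread constraint that
# scopes spreading to a single revision via the Deployment-managed
# pod-template-hash label. The "app" selector and the "group" key are
# hypothetical examples of user-provided labels.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sample
    matchLabelKeys:
      - pod-template-hash   # added automatically by the Deployment controller
      - group               # hypothetical key shared by pods that should be grouped together
```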
MatchLabelKeys is handled and merged into LabelSelector at a pod's creation. This means the feature doesn't support label updates: even if a user updates a label specified in matchLabelKeys after the pod's creation, the update isn't reflected in the merged LabelSelector, even though users might expect it to be. In the documentation, we'll state that using matchLabelKeys with labels that might be updated is not recommended.
Also, we assume the risk is acceptably low because:
- It's fairly unlikely to happen, because pods are usually managed by another resource (e.g., a Deployment), and an update to the pod template's labels on a Deployment recreates the pods instead of updating the labels on existing pods directly. Even if users somehow use bare pods (which is not recommended in the first place), there's usually only a tiny window between a pod's creation and it getting scheduled, which makes this risk even rarer, unless many pods often get stuck unschedulable for a long time in the cluster (which is not recommended) or the labels specified in matchLabelKeys are frequently updated (which we'll declare as not recommended).
- If it happens, selfMatchNum will be 0 while matchNum and minMatchNum are retained. Consequently, depending on the current number of matching pods in the domain, matchNum - minMatchNum might be bigger than maxSkew, and the pod(s) could be unschedulable. But it does not mean that the unfortunate pods would be unschedulable forever.
A new optional field named MatchLabelKeys will be introduced to TopologySpreadConstraint. Currently, when scheduling a pod, the LabelSelector defined in the pod is used to identify the group of pods over which spreading will be calculated. MatchLabelKeys adds another constraint to how this group of pods is identified.
type TopologySpreadConstraint struct {
MaxSkew int32
TopologyKey string
WhenUnsatisfiable UnsatisfiableConstraintAction
LabelSelector *metav1.LabelSelector
// MatchLabelKeys is a set of pod label keys to select the pods over which
// spreading will be calculated. The keys are used to look up values from the
// incoming pod labels, those key-value labels are ANDed with `LabelSelector`
// to select the group of existing pods over which spreading will be calculated
// for the incoming pod. Keys that don't exist in the incoming pod labels will
// be ignored.
MatchLabelKeys []string
}
When a Pod is created, kube-apiserver will obtain the label values from the pod for the keys in matchLabelKeys, and merge the resulting key-value labels into the LabelSelector of the TopologySpreadConstraint.
For example, when this sample Pod is created,
apiVersion: v1
kind: Pod
metadata:
name: sample
labels:
app: sample
...
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector: {}
matchLabelKeys: # ADDED
- app
kube-apiserver modifies the labelSelector like the following:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
+ matchExpressions:
+ - key: app
+ operator: In
+ values:
+ - sample
matchLabelKeys:
- app
In addition, kube-scheduler will handle matchLabelKeys within the cluster-level default constraints in the scheduler configuration in the future (see kubernetes/kubernetes#129198).
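For reference, once that support lands, a cluster-level default constraint using matchLabelKeys could look roughly like the sketch below. This is a hypothetical configuration: it reuses the existing PodTopologySpread plugin args shape (defaultConstraints/defaultingType), and whether matchLabelKeys is accepted there depends on kubernetes/kubernetes#129198.
```yaml
# Hypothetical sketch of cluster-level default constraints with matchLabelKeys,
# assuming kubernetes/kubernetes#129198 lands; the defaultConstraints and
# defaultingType fields follow the existing PodTopologySpread plugin args.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultingType: List
          defaultConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
              matchLabelKeys:
                - pod-template-hash
```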
Finally, the feature will be guarded by a new feature flag, MatchLabelKeysInPodTopologySpread. If the feature is disabled, the field matchLabelKeys and the corresponding labelSelector are preserved if they were already set in a persisted Pod object; otherwise, creation of new Pods with the field will be rejected by kube-apiserver. Also, kube-scheduler will ignore matchLabelKeys in the cluster-level default constraints configuration.
Previously, kube-scheduler just handled matchLabelKeys internally before calculating scheduling results. But we changed the implementation to the current design to align with PodAffinity's matchLabelKeys. (See the detailed discussion in the Alternatives section.)
However, this implementation change could break matchLabelKeys of unscheduled pods created before the upgrade, because kube-apiserver only handles matchLabelKeys at pod creation; that is, it doesn't handle matchLabelKeys of existing unscheduled pods. So, for a safe upgrade path from v1.33 to v1.34, kube-scheduler handles not only matchLabelKeys from the default constraints, but also matchLabelKeys of all incoming pods during v1.34. We're going to change kube-scheduler to only handle matchLabelKeys from the default constraints in v1.35 for efficiency, assuming kube-apiserver handles matchLabelKeys of all incoming pods.
Also, in case of bugs in this new design, users can disable this feature through a new feature flag, MatchLabelKeysInPodTopologySpreadSelectorMerge (enabled by default). (See more details in Feature Enablement and Rollback.)
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 87.5%
- k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread/plugin.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 84.8%
- k8s.io/kubernetes/pkg/registry/core/pod/strategy.go: 2025-01-14 JST (The commit hash: ccd2b4e8a719dabe8605b1e6b2e74bb5352696e1) - 65%
These cases will be added to the existing integration tests:
- Feature gate enable/disable tests
- MatchLabelKeys in TopologySpreadConstraint works as expected
- Verify no significant performance degradation
The existing integration tests:
- k8s.io/kubernetes/test/integration/scheduler/filters/filters_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadFilter
- k8s.io/kubernetes/test/integration/scheduler/scoring/priorities_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=TestPodTopologySpreadScoring
- k8s.io/kubernetes/test/integration/scheduler_perf/scheduler_perf_test.go: https://storage.googleapis.com/k8s-triage/index.html?test=BenchmarkPerfScheduling
These cases will be added to the existing e2e tests:
- Feature gate enable/disable tests
- MatchLabelKeys in TopologySpreadConstraint works as expected
The existing e2e tests:
- k8s.io/kubernetes/test/e2e/scheduling/predicates.go: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- k8s.io/kubernetes/test/e2e/scheduling/priorities.go: https://storage.googleapis.com/k8s-triage/index.html?sig=scheduling
- Feature implemented behind feature gate.
- Unit and integration tests passed as designed in the test plan.
- Feature is enabled by default
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- No negative feedback.
- Update documents to reflect the changes.
In the event of an upgrade, kube-apiserver will start to accept and store the field MatchLabelKeys.
In the event of a downgrade, kube-apiserver will reject pod creation with matchLabelKeys in TopologySpreadConstraint. However, for existing pods, we keep matchLabelKeys and the generated LabelSelector even after the downgrade. kube-scheduler will ignore MatchLabelKeys if it was set in the cluster-level default constraints configuration.
There's no version skew issue.
We changed the implementation design in v1.34, but we designed the change not to involve any version skew issue, as described in [v1.34] design change and a safe upgrade path.
- The MatchLabelKeysInPodTopologySpread feature flag enables the MatchLabelKeys feature in TopologySpreadConstraint.
- The MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag enables the new design described in [v1.34] design change and a safe upgrade path.
  - If MatchLabelKeysInPodTopologySpreadSelectorMerge is disabled while MatchLabelKeysInPodTopologySpread is enabled, Kubernetes handles MatchLabelKeys with the classic design, in which kube-scheduler handles it. However, this is not recommended unless you encounter a bug in the new design's behavior.
  - This flag cannot be enabled on its own and has to be enabled together with MatchLabelKeysInPodTopologySpread. Enabling MatchLabelKeysInPodTopologySpreadSelectorMerge alone has no effect, and matchLabelKeys will be ignored.
The MatchLabelKeysInPodTopologySpreadSelectorMerge feature flag was added in v1.34 and is enabled by default. It can be disabled to revert the v1.34 implementation design change and go back to the previous behavior in case of a bug.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: MatchLabelKeysInPodTopologySpread
  - Components depending on the feature gate: kube-scheduler, kube-apiserver
- Feature gate (also fill in values in kep.yaml; see the sketch below)
  - Feature gate name: MatchLabelKeysInPodTopologySpreadSelectorMerge
  - Components depending on the feature gate: kube-apiserver
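As a reference for the kep.yaml values mentioned above, the feature-gates stanza would look roughly like the sketch below; the field names follow the kubernetes/enhancements template and should be confirmed against the actual kep.yaml of this KEP.
```yaml
# Sketch of the kep.yaml feature-gates stanza; verify against the actual file
# in kubernetes/enhancements before relying on it.
feature-gates:
  - name: MatchLabelKeysInPodTopologySpread
    components:
      - kube-apiserver
      - kube-scheduler
  - name: MatchLabelKeysInPodTopologySpreadSelectorMerge
    components:
      - kube-apiserver
```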
No.
The feature can be disabled in Alpha and Beta versions by restarting kube-apiserver and kube-scheduler with the feature gate turned off. One caveat is that pods that used the feature will continue to have the MatchLabelKeys field and the corresponding merged LabelSelector set even after the feature gate is disabled. For Stable versions, users can choose to opt out by not setting the matchLabelKeys field.
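As an illustration, on a kubeadm-style control plane this typically means editing the static Pod manifests; the file path and surrounding fields in the sketch below are assumptions and differ across distributions.
```yaml
# Illustrative snippet of /etc/kubernetes/manifests/kube-apiserver.yaml on a
# kubeadm-style cluster (apply the analogous change to kube-scheduler.yaml);
# only the --feature-gates flag is relevant here.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --feature-gates=MatchLabelKeysInPodTopologySpread=false
        # ...other flags unchanged...
```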
Newly created pods need to follow this policy when scheduling. Old pods will not be affected.
No. Unit tests exercising the feature gate switch itself will be added.
It won't impact already running workloads because it is an opt-in feature in kube-apiserver and kube-scheduler. But during a rolling upgrade, if some apiservers have not enabled the feature, they will not be able to accept and store the field "MatchLabelKeys" and the pods associated with these apiservers will not be able to use this feature. As a result, pods belonging to the same deployment may have different scheduling outcomes.
- If the metric schedule_attempts_total{result="error|unschedulable"} increased significantly after pods using this feature are added.
- If the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} increased to higher than 100ms at the 90th percentile after pods using this feature are added (see the alerting sketch below).
Yes, it was tested manually by following the steps below, and it worked as intended.
- create a Kubernetes cluster v1.26 with 3 nodes where the MatchLabelKeysInPodTopologySpread feature is disabled.
- deploy a Deployment with this YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
spec:
replicas: 12
selector:
matchLabels:
foo: bar
template:
metadata:
labels:
foo: bar
spec:
restartPolicy: Always
containers:
- name: nginx
image: nginx:1.14.2
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
matchLabelKeys:
- pod-template-hash
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 5/4/3
- delete deployment nginx
- upgrade the Kubernetes cluster to v1.27 (at master branch) while MatchLabelKeysInPodTopologySpread is enabled.
- deploy a deployment nginx as in step 2
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 4/4/4
- delete deployment nginx
- downgrade the Kubernetes cluster to v1.26 where the MatchLabelKeysInPodTopologySpread feature is enabled.
- deploy a deployment nginx as in step 2
- pods spread across nodes as 4/4/4
- update the deployment nginx image to nginx:1.15.0
- pods spread across nodes as 4/4/4
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
Operators can query pods that have the pod.spec.topologySpreadConstraints.matchLabelKeys field set to determine if the feature is in use by workloads.
- Other (treat as last resort)
  - Details: We can determine if this feature is being used by checking pods that have only MatchLabelKeys set in TopologySpreadConstraint.
Metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} <= 100ms at the 90th percentile.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: plugin_execution_duration_seconds{plugin="PodTopologySpread"}
  - Metric name: schedule_attempts_total{result="error|unschedulable"}
  - Component exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
Yes. It would be helpful to have metrics showing which plugins affect the scheduler's decisions in the Filter/Score phases. There is a related issue: kubernetes/kubernetes#110643. It's very big and still in progress.
No.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes, there is additional work: kube-apiserver uses the keys in matchLabelKeys to look up label values from the pod and changes the LabelSelector accordingly. kube-scheduler also handles matchLabelKeys if the cluster-level default constraints have it.
The impact on the latency of pod creation requests in kube-apiserver and on scheduling latency should be negligible.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
If the API server and/or etcd is not available, this feature will not be available. This is because the kube-scheduler needs to update the scheduling results to the pod via the API server/etcd.
N/A
- Check the metric plugin_execution_duration_seconds{plugin="PodTopologySpread"} to determine whether the latency increased. If it did, this feature may have increased scheduling latency; you can disable the feature MatchLabelKeysInPodTopologySpread to see if it is the cause.
- Check the metric schedule_attempts_total{result="error|unschedulable"} to determine whether the number of failed attempts increased. If it did, determine the cause of the failure from the pod's events. If it is caused by the PodTopologySpread plugin, you can analyze the problem further by looking at the kube-scheduler logs.
- 2022-03-17: Initial KEP
- 2022-06-08: KEP merged
- 2023-01-16: Graduate to Beta
- 2025-01-23: Change the implementation design to be aligned with PodAffinity's matchLabelKeys
- 2025-04-07: Add a new feature flag MatchLabelKeysInPodTopologySpreadSelectorMerge and update milestone
Use pod.generateName to distinguish new/old pods that belong to different revisions of the same workload in the scheduler plugin. It was decided not to support this for the following reason: the scheduler needs to stay general-purpose, and a scheduler plugin shouldn't give special treatment to any particular labels/fields.
Technically, we could implement this feature within the PodTopologySpread plugin only, merging the key-value labels corresponding to MatchLabelKeys into LabelSelector internally within the plugin before calculating the scheduling results. This was the actual implementation up to v1.33. But it may confuse users because this behavior would differ from PodAffinity's MatchLabelKeys.
Also, we cannot implement this feature only within kube-apiserver because that would make it impossible to handle MatchLabelKeys within the cluster-level default constraints in the scheduler configuration in the future (see kubernetes/kubernetes#129198). So we decided to go with the design that implements this feature within both the PodTopologySpread plugin and kube-apiserver.
Although the final design has the downside of requiring us to maintain two implementations that handle MatchLabelKeys, each implementation is simple, and we regard the risk of increased maintenance overhead as fairly low.