KEP-5710: Workload-aware preemption
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Release Signoff Checklist
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests within one minor version of promotion to GA
- (R) Production readiness review completed
- (R) Production readiness review approved
- “Implementation History” section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website , for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Summary
This KEP describes the changes to kube-scheduler needed to support workload-aware preemption. We focus on the API, the framework and the building blocks, not on the ideal algorithm, which can come as a follow-up. We start with a simple implementation that is heavily based on the existing pod preemption algorithm.
The Workload and PodGroup API introduced in KEP-4671: Gang Scheduling using Workload Object
is extended to
allow expressing the concept of pod group priority and to define the preemption unit. With those
extensions we make the next step towards our workload-aware scheduling north star.
Motivation
Tightly-coupled workloads can require ongoing communication between multiple pods to make progress. While such usecases have always existed (e.g. MPI jobs), in the current AI era the number of such workloads is much higher as both AI Training and Multihost Inference belong to this category.
With KEP-4671: Gang Scheduling using Workload Object , we’re making the first step towards better handling this category of workloads. However, that KEP is focused only on the aspect of initial scheduling of the workload. As mentioned above, the tightly-coupled workload usually requires ongoing communication across many pods not only on startup, but also across its whole lifetime. This means that disrupting a single pod of such workload effectively disrupts the whole workload - even if the rest of the pods are still running, they are not able to make any progress.
In this KEP, we’re proposing a solution for the first (and in many cases the primary) cause of such disruptions: preemption. Given the supply shortages of the newest and most efficient accelerators as well as their economics, maximizing their utilization is one of the primary goals for the majority of users. They achieve it by mixing different kinds of workloads in the same Kubernetes cluster and properly prioritizing these workloads against each other to satisfy the business requirements. In such capacity-constrained environments, preemption is a critical feature allowing users to balance their business needs (often meaning satisfying certain SLOs for serving workloads) with maximizing utilization of their hardware. However, the existing preemption mechanism doesn’t address those needs due to its pod-centric nature.
Workload-aware preemption is not a novel idea and has already been implemented in other ecosystem projects. However, similarly to gang scheduling, we believe that workload awareness (which includes workload-aware preemption) is critical enough to deserve standardization in core Kubernetes. Only this standardization would allow tighter integration with other features (managing other types of disruptions, autoscaling and many others) and bring true value to every Kubernetes user.
Goals
- Define API for describing units of preemption within a workload
- Define API for describing priority of a preemption unit
- Describe the principles and semantics of workload-aware preemption
- Define the base preemption policies
- Define the scheduler changes needed to implement workload-aware preemption
- Provide full backward compatibility for all existing scheduling features
Non-Goals
- Change the way how individual pods (not being part of workloads) are preempted
- Provide the most optimal preemption algorithm from day 1
- Address arbitrary preemption policies (more preemption policies will be needed, but these should be added in follow-up KEPs)
- Introduce workload-awareness for handling different kinds of disruptions (e.g. caused by hardware failures) including kubelet eviction
- Design rescheduling for workloads that will be preempted (rescheduling will be addressed in a separate dedicated KEP)
- Change the preemption principle of avoiding preemption if a workload/pod can be scheduled without it. If we decide to change that it will be addressed in a dedicated KEP.
- Propose any tradeoff between preemption and cluster scale-up.
- Design workload-level preemption triggered by external schedulers
- Handling non-uniform priorities across PodGroups
- Full support for Topology Aware Scheduling. This will be provided in TAS KEP.
Proposal
This KEP is tightly coupled with KEP-4671: Gang Scheduling using Workload Object. It builds on the foundations introduced there and assumes knowledge of the concepts introduced there.
We believe that providing a Gang Scheduling capability without a dedicated preemption mechanism for it is of no use for the high performance batch workloads for which this feature is primarily designed.
Principles
Preemption is critical, but it is causing disruption to workloads that are being preempted. Disruption itself is never desired and this defines our core principles.
- We should always minimize (the cost of) preemptions.
- The cost of preempting a workload should include also the side effects of this preemption. In particular, if a preempted workload will immediately be recreated by a controller and will result in preempting another workload, the cost of that second preemption should be included in the cost of the original preemption.
While cascading preemptions are inevitable in some cases (e.g. if a high priority preemptor pod group has very strict placement requirements), in general if there are multiple options for scheduling a higher priority pod group with preemptions, with some of them expected to cause cascading preemptions and others not, we will try to choose the latter. However, due to the computational cost of searching the potential space, especially in big clusters, this is not a hard rule, but rather a goal that we will try to optimize for.
High-level approach
We start with a relatively simple design focusing on extensible APIs and semantics without targeting very sophisticated algorithms. In this section we just introduce the individual pieces of the solution and discuss them in more detail in the following sections.
- We piggy-back on the existing PriorityClass API to avoid reinventing the concept of priority from scratch.
- We extend the Workload and PodGroup API to allow for defining the preemption unit. Here we also start simple and allow the preemption unit to only correspond to a PodGroup in gang mode.
- We extend the Workload and PodGroup API to allow for defining the priority of a pod group.
- We start simple by just defining a single static priority used for scheduling and preemption. While we envision both splitting them into two in the future or making it mutable, both of these can be achieved later in a backward-compatible way.
- We start with a simple sub-optimal preemption algorithm that is based on the existing pod preemption algorithm used by kube-scheduler.
The rest of this KEP explains those pieces in more detail.
User Stories
Preemption of AI Training job
When running an AI Training job, I want to ensure that it will not be partially preempted. If at least one of my pods is not running, the others are not making progress anyway and are just wasting resources in the cluster.
AI Training Job as preemptor
When scheduling an AI Training Job, I want preemption to ensure that the whole Training Job can fit on to a cluster. I want to avoid partial preemptions for single pods from my training job if they cannot guarantee that the whole Job will become schedulable.
Preemption of Multihost Inference
When running multihost inference using LeaderWorkerSet, I want to ensure that its single replica (one leader and N workers) will not be partially preempted. If at least one of such pods is not running, the others are not able to serve and are just wasting resources in the cluster.
Preemption of Multihost Inference that can run in a degraded mode
When running multihost inference using LeaderWorkerSet that can run in a degraded mode, I don’t want to preempt the whole replica (one leader and N workers) if a single worker would be preempted, because I prefer to serve in a degraded mode than be completely disrupted.
Notes/Constraints/Caveats
For alpha we defined Workload Aware Preemption as a separate feature that can be disabled
independently of Gang Scheduling. We acknowledge that for Beta, releasing Gang Scheduling
without Workload Aware Preemption does not provide enough value for the end users. That’s why in
Beta we merge those two features and progress them together under a single GenericWorkload feature gate.
Risks and Mitigations
Extensibility - it’s obvious that what is proposed in this KEP will not be the final step and we will keep evolving it. How can we ensure that we will not paint ourselves into a corner?
Mitigation: We enumerate potential extensions after the detailed design and briefly sketch how the proposed design can be extended to accommodate these.
Incompatible scheduler profiles - different scheduling profiles may enable different sets of plugins and if only a subset of profiles enables the GangScheduling plugin (responsible also for workload-aware preemption), we may break the expectations.
Mitigation: We will document that the GangScheduling plugin has to be enabled in all profiles or the logic will need to be reimplemented by other custom plugins. Eventually we may consider built-in validation, but we make it out of scope for this KEP.
Scalability - finding the optimal set of pod groups/pods to preempt is a computationally expensive problem, however we need to ensure it can be used even in the largest Kubernetes clusters.
Mitigation: We propose a simplified algorithm that is computationally feasible at the cost of providing “reasonably good” preemption victim candidates.
Design Details
Preemption unit
We start with defining what is a unit of preemption. While we can imagine usecases with preemption unit being an arbitrary group of pods, in majority of real world usecases this is actually aligned with the scheduling unit. In other words, the group of pods that should be preempted together matches a group that was initially scheduled together as a gang.
Trying to formalize it, we define WorkloadPortion as one of {all pods in a PodGroup replica or
a single pod}. With that definition both scheduling unit and preemption units can only be
WorkloadPortions.
In the future, we may want to support usecases when a single scheduling unit consists of multiple preemption groups, but we leave that usecase as a future extension (it can be addressed when we decide to extend Workload API with CompositePodGroup concept - for more details see CompositePodGroup API ). However, we never expect preemption unit to be larger than scheduling unit.
Based on that, we will extend the existing GangSchedulingPolicy as follows:
// DisruptionMode defines how individual entities within a group can be disrupted.
// Exactly one mode can be set.
//
// +union
type DisruptionMode struct {
// Single specifies that children can be disrupted independently from each other.
//
// +optional
Single *SingleDisruptionMode
// All specifies that all children can only be disrupted together.
//
// +optional
All *AllDisruptionMode
}
// SingleDisruptionMode specifies that children can be disrupted independently.
type SingleDisruptionMode struct {
// Intentionally empty now.
}
// AllDisruptionMode specifies that children can only be disrupted together.
type AllDisruptionMode struct {
// Intentionally empty now.
}
type PodGroupSpec struct {
// Existing field(s).
// DisruptionMode defines the mode in which a given PodGroup can be disrupted.
// Controllers are expected to fill this field by copying it from a PodGroupTemplate.
// One of Single, All. Defaults to Single if unset.
// This field is immutable.
DisruptionMode *DisruptionMode
}
Given that the preemption unit shouldn’t be larger than the scheduling unit, additional validation
will be added to prevent the All disruption mode for PodGroups with BasicSchedulingPolicy.
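The validation rules stated so far (exactly one union member set, and All rejected for the basic scheduling policy) can be sketched as below. This is only an illustration: `validateDisruptionMode`, `SchedulingPolicyKind` and its two constants are hypothetical names introduced here, not part of the proposed API, and the error strings are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// Types mirroring the DisruptionMode union proposed in this KEP.
type SingleDisruptionMode struct{}
type AllDisruptionMode struct{}

type DisruptionMode struct {
	Single *SingleDisruptionMode
	All    *AllDisruptionMode
}

// SchedulingPolicyKind is a hypothetical stand-in for the scheduling policy
// configured on a PodGroup (basic vs. gang).
type SchedulingPolicyKind string

const (
	BasicPolicy SchedulingPolicyKind = "Basic"
	GangPolicy  SchedulingPolicyKind = "Gang"
)

// validateDisruptionMode is a hypothetical helper that applies the two rules
// from the text: the union must have exactly one member set, and the
// preemption unit must not be larger than the scheduling unit.
func validateDisruptionMode(mode *DisruptionMode, policy SchedulingPolicyKind) error {
	if mode == nil {
		// Unset defaults to Single, which is valid for every policy.
		return nil
	}
	set := 0
	if mode.Single != nil {
		set++
	}
	if mode.All != nil {
		set++
	}
	if set != 1 {
		return errors.New("exactly one of Single, All must be set")
	}
	if mode.All != nil && policy == BasicPolicy {
		return errors.New("disruption mode All is not allowed with the basic scheduling policy")
	}
	return nil
}

func main() {
	all := &DisruptionMode{All: &AllDisruptionMode{}}
	fmt.Println("basic + All:", validateDisruptionMode(all, BasicPolicy))
	fmt.Println("gang + All: ", validateDisruptionMode(all, GangPolicy))
}
```

In the real API server this logic would live in the PodGroup validation code and report field errors rather than plain errors.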
The DisruptionMode is modeled as a struct to allow future extensibility, especially with the upcoming CompositePodGroup. Approval of this KEP does not mean that these plans are approved, but we see potential future use cases, like a custom DisruptionMode with “preempt single units until X, then preempt all” semantics, which require DisruptionMode to be expandable.
While the PreemptionMode might seem the more natural name here, we envision that the same
concept can be later used in Eviction API and other usecases, so we already start with a more
generic name to avoid future confusion.
Pod Group priorities
Prioritizing pod groups across each other requires answering the question: “What is pod group priority?”.
Up until now, only individual pods have had an assigned priority. For a homogeneous PodGroup the priority of the PodGroup
should be the same as the priority of the individual pods that form the group.
We can also think about a heterogeneous PodGroup where individual pods have different priorities. However, that
case raises the following questions:
- what should be the priority of the PodGroup in the scheduling queue.
- what should be the priority of the PodGroup when it is considered as a preemption victim.
For the first question the natural answer is that each Pod should become a separate scheduling unit, with its own
priority. With that, the pods from the PodGroup should also become separate preemption units. In that case,
one can ask why we should join such pods in a PodGroup in the first place. As we do not find a use case for
that, we will continue with the assumption that all pods within a PodGroup, even a heterogeneous one, should have the same priority.
Additionally, as described in user stories above, a simple static priority doesn’t seem to be enough. Arguably it is not even a single priority because a priority used for scheduling can be different than the priority that should be used for preemption. So in the ideal world a workload owner should be able to:
- define priority used for scheduling a PodGroup
- define priority used for preemption of a PodGroup
- mutate preemption priority during the whole lifecycle of the workload to reflect the importance of that workload at a given moment
However, while we expect the need for separate mutable priorities, we leave it as a possible further extension out of the scope of this KEP.
In this KEP we introduce:
- a single priority for scheduling and preemption.
- static preemption priority (mutability brings additional complexity that is purely additive and thus should be added in a follow-up KEP)
In KEP-4671: Gang Scheduling using Workload Object we already decided that PodGroup is the scheduling unit for workload-aware scheduling. Different PodGroups (even if part of the same Workload) are scheduled independently. As a result, we continue this path and define the priority also at the level of a scheduling unit.
The proposed PodGroup API extensions look as follows.
type PodGroupTemplate struct {
// Existing field(s).
// PriorityClassName, if specified, indicates the priority that should be
// considered when scheduling this pod group. "system-node-critical"
// and "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
// If not specified, the priority will be default or zero if there is no
// default.
//
// The authoritative priority for this pod group is expressed via the
// 'priority' field.
//
// This field is immutable.
PriorityClassName *string
}
type PodGroupSpec struct {
// Existing field(s).
// PriorityClassName, if specified, indicates the priority that
// should be considered when scheduling this pod group. "system-node-critical"
// and "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
// If not specified, the priority will be default or zero if there is no
// default.
//
// The authoritative priority for this pod group is expressed via the
// 'priority' field.
//
// This field is immutable.
PriorityClassName *string
}
With that change, when scheduling or preempting a pod that is part of a Pod Group, the
priority defined in the PodGroup object will be used (and priority defined in the Pod
itself will be ignored, thus not reflecting the actual pod priority).
We acknowledge that it might be misleading to users. For Alpha, we described the possible divergence in the documentation.
For Beta and GA, we will disallow the divergence between the priority of the Pod and the PodGroup.
This will be done by the scheduler, which will fail scheduling of the PodGroup, once it observes such divergence.
This mechanism will follow a similar mechanism already implemented in scheduler that disallows PodGroup with pods having different spec.schedulerName.
We will check only the numerical value of the priority, not the priority class from which the value was derived.
This information will be visible to the user in the description of the PodScheduled Conditions in the Pod.Status and
PodGroupInitiallyScheduled in PodGroup.Status:
Pod:
  status:
    conditions:
    - type: PodScheduled
      status: "False"
      reason: SchedulerError
      message: 'all pods in a single pod group should match the priority of the pod group, got: 1 and 2'
PodGroup:
  status:
    conditions:
    - type: PodGroupInitiallyScheduled
      status: "False"
      reason: SchedulerError
      message: 'all pods in a single pod group should match the priority of the pod group, got: 1 and 2'
and via FailedScheduling event:
{
"apiVersion": "v1",
"kind": "Event",
"metadata": {
...
},
"involvedObject": {
...
},
"reason": "FailedScheduling",
"message": "all pods in a single pod group should have the same .spec.schedulerName set, got: \"custom-scheduler-1\" and \"custom-scheduler-2\"",
"source": {
"component": "default-scheduler"
},
"type": "Warning",
"action": "Scheduling",
"reportingComponent": "default-scheduler"
}
We might relax this restriction in the future if there are strong use cases that justify it.
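The divergence check described above compares only numerical priorities. A minimal sketch of such a check follows; `checkPodGroupPriority` is a hypothetical helper, and the order of the two values in the message (pod group first, offending pod second) is an assumption, since the example message does not say which is which.

```go
package main

import "fmt"

// checkPodGroupPriority is a hypothetical helper. It compares only the
// numerical priority values, not the PriorityClass names they came from,
// and fails on the first pod that diverges from the pod group.
func checkPodGroupPriority(groupPriority int32, podPriorities []int32) error {
	for _, p := range podPriorities {
		if p != groupPriority {
			// Assumption: the pod group's priority is reported first.
			return fmt.Errorf(
				"all pods in a single pod group should match the priority of the pod group, got: %d and %d",
				groupPriority, p)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkPodGroupPriority(1, []int32{1, 1, 1}))
	fmt.Println(checkPodGroupPriority(1, []int32{1, 2, 1}))
}
```

Because two different PriorityClass objects can carry the same value, pods referencing different class names would still pass this check as long as the resolved integers match.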
It’s worth mentioning here, that we want to introduce the same defaulting rules for
PodGroup.Spec.PriorityClassName that we have for pods. Namely, if PriorityClassName is unset
and there exists PriorityClass marked as globalDefault, we default it to that value.
This consistency will allow us to properly handle cases when users set neither pods
nor PodGroup priorities.
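The defaulting rule can be sketched as a small resolution function. This is an illustration under simplifying assumptions: `PriorityClass` here is a trimmed-down stand-in for the real API type, `resolvePodGroupPriority` is a hypothetical name, and the built-in system classes are treated like any other entry in the list.

```go
package main

import "fmt"

// PriorityClass is a trimmed-down stand-in for the real PriorityClass object.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// resolvePodGroupPriority is a hypothetical helper mirroring the pod rules:
// an explicit name must resolve to an existing class; an unset name falls
// back to the globalDefault class if one exists, and to zero otherwise.
func resolvePodGroupPriority(name *string, classes []PriorityClass) (string, int32, error) {
	if name != nil {
		for _, pc := range classes {
			if pc.Name == *name {
				return pc.Name, pc.Value, nil
			}
		}
		return "", 0, fmt.Errorf("no PriorityClass with name %q was found", *name)
	}
	for _, pc := range classes {
		if pc.GlobalDefault {
			return pc.Name, pc.Value, nil
		}
	}
	return "", 0, nil
}

func main() {
	classes := []PriorityClass{
		{Name: "high", Value: 1000},
		{Name: "standard", Value: 100, GlobalDefault: true},
	}
	name, value, err := resolvePodGroupPriority(nil, classes)
	fmt.Println(name, value, err)
}
```

Applying the same function to pods and pod groups is what keeps the two consistent when users set neither.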
Note that for workload-aware preemption we will support the preemptionPolicy being part
of the requested PriorityClass - namely both currently existing modes: PreemptLowerPriority
and Never.
As the preemptionPolicy is also a field of the Pod, we will apply the same constraints as
for the priority. Namely, all pods within PodGroup will have to share the same preemptionPolicy.
This will be enforced on the scheduler level.
Given that components operate on integer priorities, we will introduce a corresponding field
that reflects the priority of a PodGroup (similarly to how it’s done in the Pod API).
Since it is effectively a derivative of the field introduced above it would be tempting to
put that into PodGroup.Status. However, for the consistency with the Pod API we actually
will put that next to the PriorityClassName in the spec:
type PodGroupTemplate struct {
// Existing field(s).
// PriorityClassName, if specified, indicates the priority that should be
// considered when scheduling this pod group. "system-node-critical"
// and "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
// If not specified, the priority will be default or zero if there is no
// default.
//
// The authoritative priority for this pod group is expressed via the
// 'priority' field.
//
// This field is immutable.
PriorityClassName *string
// Priority reflects the priority of the pod group.
// The higher value, the higher the priority.
// This field is populated from the PriorityClassName.
Priority *int32
}
type PodGroupSpec struct {
// Existing field(s).
// PriorityClassName, if specified, indicates the priority that should be
// considered when scheduling this pod group. "system-node-critical"
// and "system-cluster-critical" are two special keywords which indicate the
// highest priorities with the former being the highest priority. Any other
// name must be defined by creating a PriorityClass object with that name.
// If not specified, the priority will be default or zero if there is no
// default.
//
// The authoritative priority for this pod group is expressed via the
// 'priority' field.
//
// This field is immutable.
PriorityClassName *string
// Priority reflects the priority of the pod group.
// The higher value, the higher the priority.
// This field is populated from the PriorityClassName.
Priority *int32
}
If that turns out not to be enough, we will similarly extend the CompositePodGroup API in a follow-up.
In such case, we expect that for any pod, the priority of a pod from the perspective of being a preemption victim
would be the priority of a smallest unit encompassing it in the CompositePodGroup tree.
Preemption algorithm
We start with describing at the high-level how existing pod-level preemption algorithm works. Below, we will show how to generalize it to workloads.
If a pod P can be scheduled without triggering preemption, we don’t consider preemption at all. To check if a pod P can be scheduled on a given node with preemption we:
Identify the list of potential victims - all running pods with priority lower than the new pod P.
If removing all these victims would not make the node feasible, the node is infeasible.
From the list of potential victims, we try to reprieve (remove from the victims list) any pods whose eviction would violate PodDisruptionBudget.
From the remaining potential victims, we reprieve pods starting from the highest priority and working down, keeping each reprieved pod only if the node stays feasible with it in place.
Once we find enough nodes feasible for preemption and list of victims for them, we score that and choose the best option.
The above algorithm achieves our principles, as by reprieving the highest priority pods first, it effectively tries to minimize the cascading preemptions later.
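The per-node steps above can be sketched on a toy model. This is not the kube-scheduler implementation: feasibility is reduced to a single CPU capacity check, the PodDisruptionBudget step is left out, and `Pod` and `selectVictims` are names introduced only for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a toy model: a priority plus a CPU request in millicores.
type Pod struct {
	Name     string
	Priority int32
	CPU      int64
}

// selectVictims returns the pods that must be preempted on one node so that
// the preemptor fits, or false if the node is infeasible even after removing
// every lower-priority pod.
func selectVictims(capacity int64, running []Pod, preemptor Pod) ([]Pod, bool) {
	var used int64
	var candidates []Pod
	for _, p := range running {
		if p.Priority < preemptor.Priority {
			candidates = append(candidates, p) // potential victim
		} else {
			used += p.CPU // can never be preempted by this preemptor
		}
	}
	// Feasibility check with all potential victims removed.
	if used+preemptor.CPU > capacity {
		return nil, false
	}
	used += preemptor.CPU

	// Reprieve from the highest priority down, as long as the node stays feasible.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority > candidates[j].Priority
	})
	var victims []Pod
	for _, c := range candidates {
		if used+c.CPU <= capacity {
			used += c.CPU // reprieved: the pod keeps running
		} else {
			victims = append(victims, c)
		}
	}
	return victims, true
}

func main() {
	running := []Pod{{"a", 10, 2000}, {"b", 5, 1000}, {"c", 1, 1000}}
	victims, ok := selectVictims(4000, running, Pod{"new", 20, 2000})
	fmt.Println(ok, victims)
}
```

In this example pod a is reprieved because it still fits next to the preemptor, while b and c become victims.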
We want to generalize the same algorithm to the workload case. However, the difference is not only
moving to the level of PodGroup, but also no longer operating at the level of individual nodes.
We need to look at the cluster as a whole. With that in mind, keeping the algorithm efficient
becomes a challenge, so we modify it to the approach below.
At the same time, we need to support four cases:
- individual pod as preemptor, individual pod(s) as victim(s)
- individual pod as preemptor, pod group(s) (and individual pod(s)) as victim(s)
- pod group as preemptor, individual pod(s) as victim(s)
- pod group as preemptor, pod group(s) (and individual pod(s)) as victim(s)
To achieve that, we don’t want to multiply preemption algorithms and rather want to have a unified high-level approach (with potential minor tweaks per option).
To check if a given preemptor (either (gang) PodGroup G or an individual pod P) can be scheduled with preemption:
Split the cluster into mutually-exclusive domains where a preemptor will be put:
- for pod P, it will always be individual nodes
- for pod group G, we will start with just one “whole cluster”;
For every domain computed above run the following points:
Identify the list of all potential victims in that domain:
- all running pod groups with (preemption) priority lower than preemptor priority; note that some pods from that pod group may be running outside of currently considered domain D - they need to contribute to scoring, but they won’t contribute to feasibility of domain D.
- all individual pods with priority lower than preemptor priority
If removing all potential victims would not make the preemptor schedulable, the preemptor is unschedulable with preemption in currently considered domain D.
For Topology-Aware Scheduling (TAS), “checking whether the preemptor becomes schedulable” yields a list of potential placements. For the alpha support of TAS, we will only consider the best placement (based on TAS Placement Scoring plugins). With the progression of TAS to Beta, we might change the algorithm to consider the top N placements and adjust their scores based on the number of preemption victims each placement yields, and only after that select the best one. For cases where TAS is rerun with some of the PodGroup pods already scheduled, the current placement of those pods will be selected by TAS as the only placement. We can also see a future where, if the algorithm is performant enough, even for the non-TAS case we will generate multiple potential placements and consider them, to select the placement that yields the fewest disruptions.
For each placement computed above run the following points:
Sort all the potential victims to reflect their “importance” (from the most important to the least ones). Tentatively, the function will sort first by their priority, and within a single priority prioritizing pod groups over individual pods.
Temporarily schedule the preemptor on the chosen placement (assuming that all potential victims are removed).
Perform best-effort reprieval of pod groups and pods violating PodDisruptionBudgets. We achieve it by iterating over potential victims that would violate PodDisruptionBudget to check if these can be placed in the exact same place they are running now. If they can we simply leave them where they are running now and remove them from the potential victims list.
Perform best-effort reprieval of remaining victims. We achieve it in the same manner as in the step above.
For the current pod-based preemption, the above algorithm works identically to the current algorithm. For larger domains, different placements of a preemptor are potentially possible and may result in potentially different sets of victims violating PodDisruptionBudgets to remain feasible. This means that the proposed algorithm is not necessarily minimizing the number of victims that would violate their PodDisruptionBudgets. However, optimizing for it would be extremely expensive computationally so to not significantly hurt performance we propose to accept this limitation (if needed a better algorithm may be proposed as a separate KEP).
We acknowledge the fact that the above algorithm is not optimal, but it (a) is compatible with the current pod-based one, (b) is computationally feasible, and (c) is simple to reason about. We will proceed with it and may consider improvements in follow-up KEPs in the future.
We score scheduling decisions for a selected number of the domains/placements and choose the best one. For cluster-wide preemption without TAS there will be only one decision. For multiple placements we will reuse (and generalize) the preemption scoring functions. During promotion of Topology Aware Scheduling to beta we will consider the exact scoring criteria.
We acknowledge that some users might favor more disruptive preemptions if they allow the workload to be placed in a more optimal way, especially when the Topology-Aware Scheduling comes into the picture. This can be addressed by custom scoring criteria for the placements in the future.
It’s worth noting that as structured, this algorithm addresses all four cases mentioned above that we want to support and is compatible with the current pod-based preemption algorithm. This means we will be able to achieve in-place replacement with relatively localized changes.
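The tentative victim ordering from the algorithm above (first by priority, then pod groups before individual pods within the same priority) can be sketched as a comparison function. `Victim` and `sortByImportance` are hypothetical names used only for this illustration, and the final sorting function may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Victim is a toy representation of a potential victim: either a whole
// pod group or an individual pod.
type Victim struct {
	Name       string
	Priority   int32
	IsPodGroup bool
}

// sortByImportance orders victims from the most to the least important:
// higher priority first, and pod groups ahead of individual pods within
// the same priority. The sort is stable, so ties keep their input order.
func sortByImportance(victims []Victim) {
	sort.SliceStable(victims, func(i, j int) bool {
		if victims[i].Priority != victims[j].Priority {
			return victims[i].Priority > victims[j].Priority
		}
		return victims[i].IsPodGroup && !victims[j].IsPodGroup
	})
}

func main() {
	victims := []Victim{
		{"pod-mid", 5, false},
		{"pod-low", 1, false},
		{"group-high", 9, true},
		{"group-mid", 5, true},
	}
	sortByImportance(victims)
	for _, v := range victims {
		fmt.Println(v.Name)
	}
}
```

Reprieval then walks this list from the front, so the most important victims get the first chance to stay in place.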
In place victim reprieval
The algorithm described above states that at one point we need to check whether a victim can be placed back in the place
they are currently running with preemptor assumed to run on selected nodes. This check will be achieved by running Filter
plugins for each of the schedulable preemptor pods on selected nodes with the preemptor pod CycleState.
We will keep the CycleState objects of each of the schedulable preemptor pods from the scheduling algorithm run within
WAP and update them as we reprieve victims.
We will make sure that the Filter plugins in the WAP will mimic the behavior of Filter plugins from the scheduling cycle. Mainly it means that when running Filter plugins for Nth Pod in the PodGroup, the CycleState/NodeInfo should not contain information about further pods from the Pod Group.
Note: DynamicResources, NodeVolumeLimits and VolumeBinding plugins do not work well with preemption right now. They do not support PreFilterExtension interface and they work on a data provided by informer based managers which does not reflect changes in NodeInfo snapshot. As a result they produce false negatives. We do not expect to improve on this when introducing WAP.
When trying to reprieve Pods belonging to a PodGroup with DisruptionMode=All, the preemption logic will be responsible
for making sure that either all of the Pods are reprieved or none of them are. This will be achieved
by reprieving all pods from PodGroup one by one. After each victim is reprieved, it will be added to the state
of the cluster and kept in a list of pods to rollback. If any of the pods from PodGroup fails the Reprieve call
all pods from rollback list will be removed from cluster representation.
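The all-or-nothing reprieval with a rollback list can be sketched as follows. The cluster model is a deliberate simplification: a map of free CPU per node stands in for the snapshot, and a capacity comparison stands in for running the Filter plugins. All names (`clusterState`, `victimPod`, `tryReprieve`, `reprieveGroup`) are hypothetical.

```go
package main

import "fmt"

// clusterState is a toy stand-in for the scheduler snapshot: free CPU per
// node, with the preemptor already assumed on its selected nodes.
type clusterState struct {
	free map[string]int64
}

type victimPod struct {
	Name string
	Node string
	CPU  int64
}

// tryReprieve puts the pod back on the node it currently runs on if it
// still fits. In the real scheduler this is where Filter plugins would run.
func (c *clusterState) tryReprieve(p victimPod) bool {
	if c.free[p.Node] < p.CPU {
		return false
	}
	c.free[p.Node] -= p.CPU
	return true
}

// remove undoes a successful tryReprieve.
func (c *clusterState) remove(p victimPod) {
	c.free[p.Node] += p.CPU
}

// reprieveGroup reprieves the pods of a group one by one. If any pod fails,
// every pod reprieved so far is rolled back, so the group is reprieved
// either completely or not at all.
func reprieveGroup(c *clusterState, pods []victimPod) bool {
	var rollback []victimPod
	for _, p := range pods {
		if !c.tryReprieve(p) {
			for _, r := range rollback {
				c.remove(r)
			}
			return false
		}
		rollback = append(rollback, p)
	}
	return true
}

func main() {
	state := &clusterState{free: map[string]int64{"n1": 2, "n2": 1}}
	group := []victimPod{{"a", "n1", 2}, {"b", "n2", 2}}
	fmt.Println(reprieveGroup(state, group), state.free)
}
```

Here pod a fits on n1 but pod b does not fit on n2, so the group stays in the victim list and the free capacity of n1 is restored.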
TopologyAwareScheduling
Even though we do not aim to fully support TAS as part of this KEP, it is worth considering whether this approach will block, or make it hard to implement, support for WAP in TAS in the future. As we will rerun the whole podGroupSchedulingAlgorithm in WAP, we will get as output the best placement with the proposed pod assignments. If we also get the CycleStates for those Pods, we will be able to perform an in-place reprieve as described above.
CompositePodGroup
As in the TopologyAwareScheduling case, we do not aim to support CPG as part of this KEP, but it is also worth considering whether this approach will block WAP for CPGs in the future. This case is analogous to TopologyAwareScheduling: the whole CPG logic will be hidden within the podGroupSchedulingAlgorithm rerun in WAP.
Pod Group Post Filter
As part of the goal of minimizing preemptions, arguably the most important thing to do is to avoid unnecessary preemptions. However, a model in which preemption is triggered immediately after the victims are decided, and PostFilter is run per Pod in the PodGroup, does not achieve this goal. The reason is that the proposed placement (nomination) can turn out to be invalid and be abandoned if the whole PodGroup fails to schedule. In that case we will not even proceed to binding, and the preemption will be a completely unnecessary disruption.
For alpha, to avoid triggering unnecessary preemptions, we disabled the default preemption plugin in PostFilter for pods from PodGroup if the workload aware preemption was enabled.
For beta, we acknowledge that the default preemption is not the only PostFilter plugin out there. Other PostFilter plugins can also perform disruptive actions. In the current model, those plugins works only on the outcome of single pod scheduling cycle, within a PodGroup cycle. With that they do not have a full picture of the pod group scheduling outcome and can perform actions that are either not optimal or in the worst case will not make the PodGroup schedulable anwyay.
For beta and GA we propose to disable all PostFilter plugins in the pod group scheduling cycle in favor of the newly added PodGroupPostFilter extension point. This extension point will be called only once after the whole pod group fails to schedule. It will provide data about the outcome of whole PodGroup scheduling cycle and allow users to define actions that can be taken to make the PodGroup schedulable. Workload Aware Preemption will be one of the implementations of this extension point. As part of the beta promotion we will provide the implementation for other in tree plugin that implements the PostFilter interface (namely the DRA Plugin). We expect owners of out of tree PostFilters to follow with their own implementations.
We will introduce additional safety check during the initialization, that will output an error
log line when we detect plugins that implement PostFilter interface without implementing
PodGroupPostFilter interface. This check will be only enabled when GenericWorkload feature gate
is enabled.
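A minimal sketch of how such an initialization check could look, using Go type assertions. The interfaces below are simplified, hypothetical stand-ins for the framework ones (their real method signatures differ), and the function name is an assumption made for illustration.

```go
package main

// Plugin, PostFilterPlugin and PodGroupPostFilterPlugin are simplified,
// hypothetical stand-ins for the scheduler framework interfaces.
type Plugin interface{ Name() string }

type PostFilterPlugin interface {
	Plugin
	PostFilter()
}

type PodGroupPostFilterPlugin interface {
	Plugin
	PodGroupPostFilter()
}

// pluginsMissingPodGroupPostFilter returns the names of plugins that
// implement PostFilter but not PodGroupPostFilter. The check is skipped
// entirely when the feature gate is disabled.
func pluginsMissingPodGroupPostFilter(plugins []Plugin, gateEnabled bool) []string {
	if !gateEnabled {
		return nil
	}
	var missing []string
	for _, p := range plugins {
		if _, ok := p.(PostFilterPlugin); !ok {
			continue // not a PostFilter plugin, nothing to check
		}
		if _, ok := p.(PodGroupPostFilterPlugin); !ok {
			missing = append(missing, p.Name())
		}
	}
	return missing
}

// legacyPlugin implements only PostFilter.
type legacyPlugin struct{}

func (legacyPlugin) Name() string { return "Legacy" }
func (legacyPlugin) PostFilter()  {}

// awarePlugin implements both extension points.
type awarePlugin struct{}

func (awarePlugin) Name() string        { return "Aware" }
func (awarePlugin) PostFilter()         {}
func (awarePlugin) PodGroupPostFilter() {}
```

The caller would emit one error log line per returned name; the sketch leaves logging out so that it stays independent of any logging library.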
The evaluation of PodGroupPostFilter plugins will reuse the same logic as the PostFilter one. Namely, the plugins will be executed in the order in which they appear in the scheduler configuration, and the first plugin that returns a Success result will stop the evaluation of any further plugins.
// PodGroupPostFilterResult stores information about nominated nodes for a pod group.
type PodGroupPostFilterResult struct {
	NominatingInfos map[*v1.Pod]*fwk.NominatingInfo
}

// PodGroupPostFilterPlugin is an interface for "PodGroupPostFilter" plugins. These plugins are called
// after a PodGroup cannot be scheduled.
type PodGroupPostFilterPlugin interface {
	fwk.Plugin
	// PodGroupPostFilter is called by the scheduling framework
	// when the pod group scheduling cycle failed.
	//
	// A PodGroupPostFilter plugin should return one of the following statuses:
	// - Unschedulable: the plugin gets executed successfully but the PodGroup cannot be made schedulable.
	// - Success: the plugin gets executed successfully and the PodGroup can be made schedulable.
	// - Error: the plugin aborts due to some internal error.
	//
	// Informational plugins should be configured ahead of other ones, and always return Unschedulable status.
	// Optionally, a non-nil PodGroupPostFilterResult may be returned along with a Success status. For example,
	// a preemption plugin may choose to return nominatedNodeName, so that framework can reuse that to update the
	// preemptor pod's status.nominatedNodeName field.
	PodGroupPostFilter(ctx context.Context, pg *v1alpha2.PodGroup, pods []*v1.Pod, podGroupCycleState *framework.CycleState, pgSchedulingFunc PodGroupSchedulingFunc) (*PodGroupPostFilterResult, *fwk.Status)
}
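The evaluation order described for PodGroupPostFilter plugins can be sketched as follows. All types here (Code, Result, plugin) are simplified, hypothetical stand-ins for the framework types, and aborting on Error is an assumption modeled on how PostFilter plugins are run today, not something this KEP specifies.

```go
package main

// Code is a simplified, hypothetical stand-in for the framework status code.
type Code int

const (
	Success Code = iota
	Unschedulable
	Error
)

// Result is a simplified stand-in for PodGroupPostFilterResult: it maps a
// pod name to its nominated node name.
type Result struct {
	NominatedNodes map[string]string
}

// plugin is a hypothetical, reduced form of a PodGroupPostFilter plugin.
type plugin struct {
	name string
	run  func(pods []string) (*Result, Code)
}

// runPodGroupPostFilterPlugins calls plugins in configuration order and
// stops at the first Success. It also returns the names of the plugins
// that were actually invoked.
func runPodGroupPostFilterPlugins(plugins []plugin, pods []string) (*Result, Code, []string) {
	var invoked []string
	for _, p := range plugins {
		invoked = append(invoked, p.name)
		res, code := p.run(pods)
		switch code {
		case Success:
			return res, Success, invoked
		case Error:
			// Assumption: an internal error aborts the evaluation.
			return nil, Error, invoked
		}
		// Unschedulable: continue with the next plugin.
	}
	return nil, Unschedulable, invoked
}
```

With this ordering, an informational plugin that always returns Unschedulable and is configured first is guaranteed to run before a preemption plugin can end the evaluation with Success.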
Potential future extensions
Here we discuss a couple of extensions that we envision just to ensure that we can build them in an additive and backward-compatible way. The approval of this KEP doesn’t mean an approval for any of those and proceeding with any of these will require dedicated KEP(s) in the future.
Improved preemption algorithm.
We expect that in the future the algorithm proposed in this KEP can be improved to limit the disruptions it causes. For example, instead of considering a single placement of a preemptor for a given set of victims, we may consider multiple different placements. This will have a much bigger impact once kube-scheduler supports topology-aware scheduling. As a result, we’re leaving it as a future extension - the algorithm can always be improved, and doing so will result in fairly local code changes.
Binary search during preemption
One potential optimization for the workload-aware preemption algorithm presented in this KEP is a special step that would use a binary search across victim priorities to find the minimal priority N for which we can schedule the preemptor PodGroup without preempting any victims with priority higher than N. This could limit the number of potential victims to check, at the cost of additional scheduling feasibility checks.
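A minimal sketch of that search, assuming feasibility is monotonic in N (if the preemptor fits when victims up to priority N may be preempted, it also fits for any higher N). The function name and the feasible callback are hypothetical; in the scheduler the callback would correspond to a scheduling feasibility check.

```go
package main

import "sort"

// minimalVictimPriority returns the smallest priority N from priorities
// (sorted ascending, distinct) for which feasible(N) is true, where
// feasible(N) reports whether the preemptor PodGroup can be scheduled
// when only victims with priority <= N may be preempted.
// The second return value is false when no priority is feasible.
func minimalVictimPriority(priorities []int32, feasible func(maxVictimPriority int32) bool) (int32, bool) {
	i := sort.Search(len(priorities), func(i int) bool {
		return feasible(priorities[i])
	})
	if i == len(priorities) {
		return 0, false
	}
	return priorities[i], true
}
```

The search performs a logarithmic number of feasibility checks in the number of distinct victim priorities, which is the trade-off mentioned above: fewer victims to consider afterwards, in exchange for a few extra scheduling simulations.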
Non-uniform priority across CompositePodGroups.
The CompositePodGroup concept is also being introduced to the Workload API. We can envision a case where different CompositePodGroups will require different preemption priorities. To achieve that, we could introduce the PriorityClassName field also at the CompositePodGroup level, with the semantic that a lower-level structure overwrites the higher-level one (e.g. the priority set for a PodGroup overwrites the priority for its CompositePodGroup). So the API and semantics proposed in this KEP would allow for achieving it in a backward-compatible way.
Non-uniform (Composite)PodGroups
In addition to non-uniform priorities, we may expect other non-uniform behaviors. As an example, consider LeaderWorkerSet and a use case where we allow preempting individual workers (with a given unit working in a degraded mode), but don’t allow preempting a leader. The struct-based DisruptionMode allows for introducing more sophisticated policies (e.g. only a subset of CompositePodGroups can be preempted).
Dynamic preemption priority
As described above, the preemption priority of a running workload may actually vary over time. In such a case, the controller owning a given workload may want to adjust its priority over time to reflect its importance and cost of preemption. There are two primary extensions that we can do to achieve that:
- Make PriorityClassName mutable over time
- Add a new PreemptionPriorityClassName field that will be used when considering a given PodGroup for preemption (potentially also making it mutable).
We believe that at least one of these (potentially both) will be needed in the future, but both can be achieved in a purely additive way. Mutability is about relaxing validation and defining the semantics for how the mutations are consumed. A PreemptionPriority can also be added in a backward-compatible way - if unset it just defaults to the scheduling priority, but a user now has the ability to overwrite it.
In the latter case, we will also need to avoid preemption cycles, which can be achieved by an additional constraint that the preemption priority cannot be lower than the scheduling priority. This ensures that if a given workload X was preempted by workload Y (scheduling(Y) > preemption(X)), it will not be able to preempt back workload Y because preemption(Y) >= scheduling(Y) > preemption(X) >= scheduling(X). This will work fine even if we make preemption priority mutable.
However, given the ability to achieve both of these in a backward-compatible way later, we leave them for future extensions. We expect this improvement to come up in a dedicated follow-up KEP.
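The constraint that prevents preemption cycles can be made concrete with a small sketch. The Workload struct and its PreemptionPriority field are hypothetical (no such field exists in the API proposed by this KEP); the sketch only encodes the inequality chain from the paragraph above.

```go
package main

// Workload is a hypothetical, simplified view of a workload's priorities.
type Workload struct {
	SchedulingPriority int32
	// PreemptionPriority is the hypothetical future field; nil means unset.
	PreemptionPriority *int32
}

// preemptionPriority defaults to the scheduling priority when unset.
func preemptionPriority(w Workload) int32 {
	if w.PreemptionPriority == nil {
		return w.SchedulingPriority
	}
	return *w.PreemptionPriority
}

// valid enforces the proposed constraint:
// preemption priority must not be lower than scheduling priority.
func valid(w Workload) bool {
	return preemptionPriority(w) >= w.SchedulingPriority
}

// canPreempt reports whether preemptor may preempt victim:
// scheduling(preemptor) > preemption(victim).
func canPreempt(preemptor, victim Workload) bool {
	return preemptor.SchedulingPriority > preemptionPriority(victim)
}
```

For any two valid workloads, canPreempt(y, x) and canPreempt(x, y) cannot both hold, because the first implies preemption(y) >= scheduling(y) > preemption(x) >= scheduling(x).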
PodGroupPostFilter extension
The PodGroupPostFilter interface proposed in this KEP contains the most important information about a PodGroup that failed its scheduling cycle. We can extend this interface in the future to give plugins more insight into why a given pod group cannot be scheduled. One example would be an extension similar to the NodeToStatusReader object passed to PostFilter plugins.
Custom Scoring functions
During Workload Aware Preemption we remove all potential victims and assume the preemptor. After that we try to reprieve the victims. With custom scoring functions we could influence the placement of the preemptor in order to reduce the preemption cost. Such a change can be designed and implemented in a dedicated follow-up KEP.
Specific in place reprieval method
The in-place reprieval proposed in this KEP works by reusing the Filter plugins on preemptor pods. However, this can lead to superfluous checks, especially for scheduling constraints of preemptor pods that cannot be influenced by the “reprieval” of a victim. We can see a future where, for improved performance, we introduce a new method that allows a more focused in-place reprieval check.
Other in place reprieval improvements
We can also improve the in place reprieval proposed in this KEP by adding a special check that would allow plugins to decide whether a given pod contains any inter pod constraints. If the Pod does not have any such constraints, we would know that it’s safe to reprieve all victims on the nodes that were not selected for the preemptor pods. This can provide a significant performance gain in large scale clusters for workloads without inter pod constraints.
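A sketch of how the result of such a check could be used to shortcut reprieval. The Victim type, the splitVictims function and the hasInterPodConstraints flag are hypothetical names introduced only for illustration; how plugins would actually report inter-pod constraints is left open by this KEP.

```go
package main

// Victim is a hypothetical, simplified record of a candidate victim pod.
type Victim struct {
	Name string
	Node string
}

// splitVictims partitions victims into those that can be reprieved without
// rerunning Filter plugins and those that still need the regular in-place
// reprieval check. selectedNodes holds the nodes chosen for the preemptor
// pods; hasInterPodConstraints is the answer of the hypothetical plugin check.
func splitVictims(victims []Victim, selectedNodes map[string]bool, hasInterPodConstraints bool) (reprieveNow, needCheck []Victim) {
	for _, v := range victims {
		if !hasInterPodConstraints && !selectedNodes[v.Node] {
			// The victim's node hosts no preemptor pod and nothing ties
			// the preemptor to other pods, so the victim can stay.
			reprieveNow = append(reprieveNow, v)
			continue
		}
		needCheck = append(needCheck, v)
	}
	return reprieveNow, needCheck
}
```

In a large cluster where the preemptor lands on a handful of nodes, most victims would fall into the first group, which is where the expected performance gain comes from.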
Test Plan
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
Prerequisite testing updates
N/A
Unit tests
- pkg/apis/scheduling/v1alpha1: 2026-01-29 - 83.3%
- pkg/registry/scheduling/workload: 2026-01-29 - 76.5%
- pkg/registry/scheduling/workload/storage: 2026-01-29 - 83.3%
- pkg/scheduler/framework/plugins/defaultpreemption: 2026-01-29 - 84.9%
- pkg/scheduler/framework/runtime: 2026-01-29 - 81.5%
Integration tests
For Alpha we implemented integration tests to ensure basic functionalities of workload preemption:
- Pods from a single PodGroup with DisruptionMode=Single can be preempted individually by the higher priority PodGroup.
- Pods from a single PodGroup with DisruptionMode=All are preempted all together even when preempting a single pod would be enough to free up the space for the higher priority PodGroup.
- Pods from a single PodGroup with DisruptionMode=Single can be preempted individually by the higher priority individual pod.
- Pods from a single PodGroup with DisruptionMode=All are preempted all together even when preempting a single pod would be enough to free up the space for the higher priority individual pod.
Those tests are located at podgrouppreemption_test.go.
For Beta we will expand the set of integration tests to cover the in-place reprieval logic. Namely, we will make sure that we have scenarios that cover the use of all in-tree Filter plugins during workload-aware preemption. We also aim for parity with all the existing test scenarios for the default preemption.
e2e tests
For alpha, given the new functionality is limited to kube-scheduler change and API extensions, we will rely on integration tests described above (as easier and faster to run and debug).
For beta we will add new e2e tests to the SIG Scheduling tests defined in test/e2e/scheduling/preemption.go. Those tests will cover the four basic functionalities described in the previous section.
For GA we will promote those tests to conformance.
Graduation Criteria
Alpha
- The API & feature is implemented behind the feature flag
- Base integration test showing preemption of whole PodGroup in the PodGroup mode
Beta
- Decision about additional sorting/scoring preemption victims to minimize preemption cost
- Decision about additional mechanisms for detecting/preventing divergence of priorities between Workload and its Pods.
- Decision whether we support mutability of Priority for Beta
- Extended performance benchmarks to ensure satisfying scalability & performance
- E2E test that can then be promoted to conformance
- All known issues resolved
GA
- E2E test promoted to conformance
- Performance benchmarks have well defined thresholds and are run as part of the scheduler-perf of sig-scalability-benchmarks
- All known issues resolved
Upgrade / Downgrade Strategy
Standard procedures for features introducing new API fields should be used:
- on upgrade, kube-apiservers should be upgraded first before kube-scheduler can use the new fields to opt-in for different preemption mode
- on downgrade, kube-schedulers should be downgraded first (to stop using the new fields) before kube-apiservers are downgraded; note that downgrade of kube-apiserver(s) and/or disabling the new API fields will not clear their contents for objects already stored in the storage (etcd)
Version Skew Strategy
Only kube-apiserver and kube-scheduler are involved in the feature. The new API fields are needed to configure preemption behavior, thus kube-apiserver must run a version no older than kube-scheduler.
However, the new preemption algorithm itself is purely in-memory and version skew is not relevant for it.
Production Readiness Review Questionnaire
Feature Enablement and Rollback
How can this feature be enabled / disabled in a live cluster?
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: GenericWorkload
  - Components depending on the feature gate: kube-apiserver, kube-scheduler
Note that for Alpha this feature was using WorkloadAwarePreemption feature gate.
For Beta and GA we decided to merge it together with the GenericWorkload feature gate,
with the rationale provided in the rest of the KEP.
Does enabling the feature change any default behavior?
Yes - the preemption victims chosen when scheduling a pod group will be chosen using a slightly modified version of the algorithm. Thus the exact set of victims may slightly differ.
The bigger changes in preemption victims may appear when pod groups start using PodGroup
disruption mode, however that requires an explicit opt-in from the user (or controller)
creating the PodGroup object.
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
Yes, the preemption algorithm changes can be disabled by simply disabling the GenericWorkload feature gate
in kube-scheduler. However, this will also disable the Gang Scheduling feature.
The new API changes and admission can also be disabled by disabling the feature gate in kube-apiserver. However keep in mind that it doesn’t result in clearing the new fields for objects that already have them set in the storage.
What happens if we reenable the feature if it was previously rolled back?
The feature starts working again.
Are there any tests for feature enablement/disablement?
The scheduler algorithm changes are purely in-memory and don’t require any dedicated enablement/disablement tests - the logic will be covered by regular feature tests.
The API fields related to Workload Aware Preemption are no longer hidden behind a separate feature gate and will be promoted with the whole API to beta. There is no need for the dedicated enablement/disablement tests at the kube-apiserver registry layer.
Rollout, Upgrade and Rollback Planning
How can a rollout or rollback fail? Can it impact already running workloads?
Workloads that do not use the Workload and PodGroup APIs should not be impacted, since the functionality remains unchanged
for them. During a rolling upgrade, if the active scheduler instance has the feature disabled, it will schedule pods using the
standard pod-by-pod method, falling back to the default PostFilter plugins. The default preemption algorithm will treat all pods
as single preemption units, even when they are part of a PodGroup with the All disruption mode.
This results in a fallback to the status quo behavior, meaning that pods will still be scheduled, but PodGroup-level scheduling constraints and preemption behavior won’t be applied.
What specific metrics should inform a rollback?
- scheduler_podgroup_preemption_attempts_total{result="error"}: A sudden spike indicates internal errors or panics within the workload aware preemption logic.
- scheduler_podgroup_preemption_attempt_duration_seconds: A significant increase in P99 latency would indicate that the performance of the new logic is unacceptable.
- plugin_execution_duration_seconds{plugin="DefaultPreemption", extension_point="PostFilter"}: A sudden increase in the latency of default preemption, especially if there are no PodGroup objects in the cluster, indicates an issue with the workload aware default preemption.
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
We’ll perform manual testing of the upgrade -> downgrade -> upgrade path using the following sequences to verify workload aware preemption where the Pod Group is either the preemptor or a victim:
For the case where the Pod Group is the preemptor, we will use the following sequence:
1. Start a local Kubernetes v1.36 cluster with GenericWorkload feature gate disabled (default behavior).
2. Fill the cluster with low priority idle pods so there is no room for new pods.
3. Attempt to create a Pod with spec.schedulingGroup set.
4. The spec.schedulingGroup field is dropped by the API server. The pod is created successfully but without the schedulingGroup reference, resulting in immediate standard scheduling (one-by-one).
5. Restart/Upgrade API Server and Scheduler to v1.37 with feature gate enabled.
6. Create two PodGroup objects: gang-test-A and gang-test-B (both with minCount=2).
7. Create a Pod test-pod-1 with spec.schedulingGroup pointing to gang-test-A.
8. The Pod stays in Pending state (waiting for the gang). Verify that the scheduler_pending_entities{type="podgroup", queue="gated"} metric is incremented.
9. Create a Pod test-pod-2 pointing to the same pod group.
10. Both pods are scheduled successfully in the same cycle (Gang Scheduling with Workload Aware Preemption works).
11. Verify that the scheduler_podgroup_preemption_attempts_total metric is incremented.
12. Downgrade API Server and Scheduler to v1.36 with feature gate disabled.
13. Create test-pod-3 pointing to gang-test-B. Note: We use a pod group created in step 6 because creating new PodGroup objects is disabled.
14. The pod is scheduled immediately (PodGroup logic is ignored because the schedulingGroup field is dropped by the v1.36 API server). If Gang Scheduling were active, this pod would hang pending, waiting for a second member.
15. Verify that preemption_attempts_total was increased and the scheduler_podgroup_preemption_attempts_total metric did not increase.
16. Upgrade API Server and Scheduler back to v1.37 with feature gate enabled.
17. Create test-pod-4 and test-pod-5 pointing to gang-test-B, verifying that Gang Scheduling functionality is restored (these pods wait for minCount=2 before scheduling).
18. Verify that the scheduler_podgroup_preemption_attempts_total metric was increased and preemption_attempts_total was not increased.
For the case where the Pod Group is a victim, we will use the following sequence:
This sequence assumes that the pods and nodes are defined in a way that ensures that only one pod can fit onto a node.
1. Start a local Kubernetes v1.36 cluster with GenericWorkload feature gate disabled (default behavior).
2. Attempt to create two low priority Pods with spec.schedulingGroup set.
3. The spec.schedulingGroup field is dropped by the API server. The pods are created successfully but without the schedulingGroup reference.
4. Create a high priority pod with NodeName set to the node of one of the low priority Pods. The pod is scheduled successfully and it preempts one of the low priority Pods.
5. Restart/Upgrade API Server and Scheduler to v1.37 with feature gate enabled.
6. Create two PodGroup objects: gang-test-A and gang-test-B (both with minCount=2, PreemptionMode: All and Priority: Low).
7. Create two low priority pods test-pod-1 and test-pod-2 with spec.schedulingGroup pointing to gang-test-A.
8. Create a high priority pod with NodeName set to the node name of test-pod-1. Verify that preemption preempted both test-pod-1 and test-pod-2.
9. Downgrade API Server and Scheduler to v1.36 with feature gate disabled.
10. Create test-pod-3 and test-pod-4 pointing to gang-test-B. Note: We use a pod group created in step 6 because creating new PodGroup objects is disabled.
11. Create a high priority pod with NodeName set to the node name of test-pod-3. Verify that preemption preempted only test-pod-3 and not test-pod-4.
12. Upgrade API Server and Scheduler back to v1.37 with feature gate enabled.
13. Create low priority pods test-pod-5 and test-pod-6 pointing to gang-test-B.
14. Create a high priority pod with NodeName set to the node name of test-pod-5. Verify that preemption preempted test-pod-5 and test-pod-6.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
During promotion to Beta, the WorkloadAwarePreemption feature gate will be removed and
Workload Aware Preemption features will be moved behind the GenericWorkload feature gate.
Monitoring Requirements
How can an operator determine if the feature is in use by workloads?
Operators can check the new scheduler_podgroup_preemption_attempts_total metric. A value greater than zero indicates that the
scheduler is processing Workload Aware Preemption.
How can someone using this feature know that it is working for their instance?
- Events
{
  "kind": "Event",
  "involvedObject": {
    "kind": "Pod",
  },
  "related": {
    "kind": "PodGroup",
  },
  "reason": "Preempted",
  "message": "Preempted by podgroup ... on node cluster",
  "source": {
    "component": "default-scheduler"
  },
  "type": "Normal",
  "action": "Preempting",
  "reportingComponent": "default-scheduler"
}
- API .status
  - Object: PodGroup
  - Condition Name: PodGroupScheduled
  - reason: Unschedulable
  - message: "pod group is waiting for podgroup preemption to complete"
- API .status
  - Object: Pod
  - Condition Name: DisruptionTarget
  - reason: PreemptionByScheduler
  - message: "default-scheduler: preempting to accommodate a higher priority podgroup"
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
Since there are no formal SLOs for the kube-scheduler apart from scalability SLOs, we define the objectives for this feature primarily in terms of non-regression, to ensure that workload aware preemption does not degrade the performance of workload scheduling, which in turn would degrade the performance of the standard scheduling loop.
- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s) when scheduling pods attached to a PodGroup that requires preemption, compared to scheduling an equivalent number of individual pods that would also require preemption. This can be measured by the number of Pod binding API calls arriving at the API server (apiserver_request_total{resource="pods", subresource="binding"}).
- Scheduling Throughput: There should be no significant regression in the system-wide scheduling throughput (pods/s) when scheduling pods requires preemption of pods grouped in PodGroups (DisruptionMode = All), compared to scheduling an equivalent number of pods in PodGroups that would require preemption of a similar number of pods not grouped in a PodGroup. This can be measured by the number of Pod binding API calls arriving at the API server (apiserver_request_total{resource="pods", subresource="binding"}).
- Default Preemption Performance: There should be no significant regression in the default preemption performance, especially when there are no PodGroups in the cluster. This can be measured by the plugin_execution_duration_seconds{plugin="DefaultPreemption", extension_point="PostFilter"} metric.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
    - scheduler_podgroup_preemption_attempts_total
    - scheduler_podgroup_preemption_attempt_duration_seconds
    - scheduler_podgroup_preemption_victims
    - scheduler_podgroup_preemption_reprieved_total
    - plugin_execution_duration_seconds{plugin="DefaultPreemption", extension_point="PostFilter"}
  - Components exposing the metric: kube-scheduler
Are there any missing metrics that would be useful to have to improve observability of this feature?
No.
Dependencies
Does this feature depend on any specific services running in the cluster?
No dependencies other than the components where the feature is implemented (kube-apiserver and kube-scheduler).
Scalability
Will enabling / using this feature result in any new API calls?
Not directly. However, with workload-aware preemption, more pods potentially need to be preempted (in PodGroup mode) to free up space for new workloads to be scheduled.
Will enabling / using this feature result in introducing new API types?
No
Will enabling / using this feature result in any new calls to the cloud provider?
No
Will enabling / using this feature result in increasing size or count of the existing API objects?
Yes - new fields are added to the Workload API.
For PriorityClassName, Priority, DisruptionMode expected increase is O(130B) per PodGroupTemplate object in Workload object.
For PriorityClassName, Priority, DisruptionMode expected increase is O(130B) per PodGroup object.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Although we designed preemption with performance in mind, the scheduling latency (being part of Pod Startup SLO) may potentially increase. We will measure the exact impact using performance benchmarks and scalability tests and update the section based on the results. The complexity of a single preemption cycle is O(#pods), which is comparable to the current algorithm, so the benchmarks are primarily to validate the potential inefficiencies of the implementation.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, …) in any components?
We don’t expect non-negligible CPU increase for kube-scheduler, but it will be confirmed by tests.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
Troubleshooting
How does this feature react if the API server and/or etcd is unavailable?
The behavior is consistent with the status quo. The removal of the victims selected by the Workload-Aware Preemption uses the same code as the standard preemption. Since the scheduler cannot remove pods or update them, any pod removal attempts that are an outcome of preemption will fail, not making space for the preemptor pods. Preemptor pods will not be marked with the NominatedNodeName.
When the call to delete victim pods fails, the preemptor is moved to active queue (with async preemption) or backoff/unschedulable queue (without async preemption).
Calls to update NominatedNodeNames for preemptor pods are using PatchPodStatus function which implements a retry mechanism and is shared by all occurrences that require updating pod status from scheduler.
What are other known failure modes?
- WorkloadAwarePreemption takes too long, halting the scheduling loop
  - Detection: High values for metric: scheduler_podgroup_preemption_attempt_duration_seconds
  - Mitigations: If acceptable, delete the PodGroup object and recreate the pods without schedulingGroup to disable gang scheduling and workload aware preemption (fallback to best-effort scheduling and default preemption).
  - Diagnostics: Scheduler logs at V=6, searching for logs from the podgrouppreemption.go file to trace where preemption slows down.
  - Testing: The scheduler performance benchmarks should catch potential issues with poor performance of the workload aware preemption
- WorkloadAwarePreemption does not remove lower-priority pods to make room for the preemptor
  - Detection: Check Pod Events/Status. Expected reason: a message indicating why preemption failed
  - Metrics: scheduler_podgroup_preemption_attempts_total{result="error"}
  - Mitigations:
    - Scale up the cluster (add nodes) or delete other workloads to free up space.
    - If acceptable, delete the PodGroup object and recreate the pods without schedulingGroup to disable gang scheduling (fallback to best-effort scheduling).
  - Diagnostics:
    - Scheduler logs at V=6, searching for logs from the podgrouppreemption.go file to see detailed reasons why the workload aware preemption failed.
  - Testing:
    - Covered by integration tests
- WorkloadAwarePreemption removes more pods than necessary
  - Detection: The number of pods with status "preempted by podgroup X" is higher than expected for a given pod group
  - Mitigations: If acceptable, delete the PodGroup object and recreate the pods without schedulingGroup to disable gang scheduling and workload aware preemption (fallback to best-effort scheduling and default preemption).
  - Diagnostics: Search for the log line (V6) "Pods are potential preemption victims on domain". This line is output after each failed victim reprieval.
  - Testing: The scheduler performance benchmarks should catch potential issues with poor performance of the workload aware preemption
What steps should be taken if SLOs are not being met to determine the problem?
If workload aware preemption latency is the problem:
- Increase Log Verbosity: Set the default scheduler log level to -v=6 (or -v=10 for deep tracing) to capture internal scheduling steps.
- Examine Scheduler Logs: Filter logs for source file podgrouppreemption.go and look for execution duration messages in the preemption evaluation flow.
- Identify Expensive Filter Plugins: Determine which specific filter plugins are taking too long during victim reprieval. Plugins that maintain complex topological indexing or search trees (such as InterPodAffinity or PodTopologySpread) may be executing expensive logic repeatedly in their Filter methods over candidate domains.
- Disable Feature: If the regression is critical and impacting cluster health, disable the GenericWorkload feature gate. This will revert the scheduler to the standard pod-by-pod logic, restoring baseline performance (at the cost of losing gang semantics together with workload aware preemption).
If preemptor workloads are stuck and preemption attempts fail:
- Inspect Preemptor Pod Status and Events:
  - Run kubectl describe pod <preemptor-pod> to inspect scheduler warnings or event logs. Look for errors with the message "pod group preemption".
  - Check status conditions of the preemptor Pods and their parent PodGroup:
    - Preemptor Pods: Check for {type: PodScheduled, status: False, reason: Unschedulable} with a detailed descriptive message.
    - Preemptor PodGroup: Check for the PodGroupScheduled status condition with reason: Unschedulable and message (e.g., "pod group is waiting for podgroup preemption to complete").
- Examine Scheduler Logs: Filter logs for source file podgrouppreemption.go and look for messages indicating why the preemption attempts are failing.
- Disable Feature: Instead of using PodGroup, create pods as separate pods and rely on the default preemption algorithm.
Implementation History
- 2025-11: Initial KEP-5710 proposal.
- 2026-02: KEP-5710 created for WAP alpha release.
- 2026-02: KEP-5710 updated to sync with decoupling of PodGroup/Workload API.
- 2026-05: KEP updated to promote to beta in v1.37.
Drawbacks
There are already multiple implementations of Gang Scheduling with Gang Preemption in the Kubernetes ecosystem. However, we believe that workload awareness is critical enough that it deserves standardizing in core Kubernetes.
Alternatives
One alternative considered as a short-term workaround for unnecessary preemption was the introduction of “delayed preemption”. The proposed delayed preemption mechanism was structured as follows:
- Modify the DefaultPreemption plugin to just compute preemptions, without actuating them.
- Extend the PostFilterResult to include a set of victims (in addition to the existing NominationInfo). This will allow us to clearly decouple the computation from actuation.
- For individual pods (not being part of a workload), adjust the scheduling framework implementation of schedulingCycle to actuate preemptions of returned victims if calling PostFilter plugins resulted in finding a feasible placement.
- For pods being part of a workload, rely on the Workload Scheduling Cycle. There are two subcases here:
  - In the legacy case (without workload-aware preemption), PostFilter is called individually for every pod from a PodGroup. However, the victims computed for the already processed pods may affect placement decisions for the next pods. To accommodate that, if a set of victims was returned from a PostFilter, then in addition to being kept for further actuation, they are also stored in CycleState. More precisely, the CycleState stores a new entry containing a map from a nodeName to a list of victims that were already chosen. With that, the DefaultPreemption plugin is extended to remove all already chosen victims from a given node before processing that node.
  - In the target case (with workload-aware preemption), there is no longer a need to process pods individually, so the additional mutations of CycleState are not needed.
In both of the above cases, an additional step is introduced at the end of the scheduling algorithm. If a feasible placement for the PodGroup is found, all the victims are taken and their preemption is actuated. If a feasible placement was not found, the victims are dropped. In both cases, the scheduling of the whole PodGroup (all its pods) is marked as unschedulable and the PodGroup goes back to the scheduling queue.
This alternative was dropped as we decided that workload aware preemption is crucial for the Gang Scheduling effort. As we tied those two efforts together, there is no need for an additional alternative approach for minimizing disruptions in Gang Scheduling without WAP.
Infrastructure Needed (Optional)
N/A