Version: Next

Preemption

Preemption is an essential feature found in most schedulers, and it plays a crucial role in enabling key system functionalities like DaemonSets in K8s, as well as SLA and prioritization-based features.

This document provides a brief introduction to the concepts and configuration methods of preemption in YuniKorn. For a more comprehensive understanding of YuniKorn's design and practical ideas related to preemption, please refer to the design document.

Kubernetes Preemption

Preemption in Kubernetes operates based on priorities. Starting from Kubernetes 1.14, you can configure preemption by adding a preemptionPolicy to the PriorityClass. However, it is important to note that preemption in Kubernetes is solely based on the priority of the pod during scheduling. The full details are in the Kubernetes documentation on pod priority and preemption.

While Kubernetes does support preemption, it does have some limitations. Preemption in Kubernetes only occurs during the scheduling cycle and does not change once the scheduling is complete. However, when considering batch or data processing workloads, it becomes necessary to account for the possibility of opting out at runtime.

YuniKorn Preemption

In YuniKorn, we offer two preemption types: generic and DaemonSet. DaemonSet preemption is the more straightforward of the two, as it ensures that pods which must run on a particular node are allowed to do so. The rest of this document covers only generic preemption. For a comprehensive explanation of DaemonSet preemption, please consult the design document.

YuniKorn's generic preemption is based on a hierarchical queue model and lets pods opt out of being preempted. Preemption is triggered after a specified delay and works to bring each queue's resource usage up to at least its guaranteed amount of resources. To configure the delay before preemption is triggered, set the preemption.delay property in the configuration.

To prevent the occurrence of preemption storms or loops, where subsequent preemption tasks trigger additional preemption tasks, we have designed seven preemption laws. These laws are as follows:

  1. Preemption policies are strong suggestions, not guarantees
  2. Preemption can never leave a queue lower than its guaranteed capacity
  3. A task cannot preempt other tasks in the same application
  4. A task cannot trigger preemption unless its queue is under its guaranteed capacity
  5. A task cannot be preempted unless its queue is over its guaranteed capacity
  6. A task can only preempt a task with lower or equal priority
  7. A task cannot preempt tasks outside its preemption fence

For a detailed explanation of these preemption laws, please refer to the preemption design document.
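As a rough illustration, laws 2 through 7 can be read as a chain of checks that must all pass before one task may preempt another. The sketch below is a simplified Python model, not YuniKorn's implementation: the Queue and Task records, the may_preempt function, and the inside_fence argument are hypothetical names introduced only for this example. Law 1 is left out because it describes policy intent rather than a check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Queue:
    """Hypothetical, flattened view of a queue (not YuniKorn's data model)."""
    name: str
    usage: int        # resources currently allocated in the queue
    guaranteed: int   # guaranteed resources configured for the queue


@dataclass(frozen=True)
class Task:
    """Hypothetical task record used only for this illustration."""
    app: str
    priority: int
    queue: Queue


def may_preempt(asker: Task, victim: Task, inside_fence: bool,
                victim_size: int = 1) -> bool:
    """Return True only if laws 2 to 7 all allow asker to preempt victim."""
    if asker.app == victim.app:                        # law 3: same application
        return False
    if asker.queue.usage >= asker.queue.guaranteed:    # law 4: asker queue not under guarantee
        return False
    if victim.queue.usage <= victim.queue.guaranteed:  # law 5: victim queue not over guarantee
        return False
    if victim.queue.usage - victim_size < victim.queue.guaranteed:
        return False                                   # law 2: would drop below guarantee
    if victim.priority > asker.priority:               # law 6: victim has higher priority
        return False
    return inside_fence                                # law 7: victim must be inside the fence


# Assumed numbers: one queue over its guarantee, one under it.
over = Queue("queue-1", usage=10, guaranteed=5)
under = Queue("queue-2", usage=2, guaranteed=5)
print(may_preempt(Task("app-b", 0, under), Task("app-a", 0, over), inside_fence=True))
```

With these assumed numbers the call prints True. Raising the victim's priority above the asker's, or setting inside_fence to False, makes the same call return False.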

Next, we will provide a few examples to help you understand the functionality and impact of preemption, allowing you to deploy it effectively in your environment. You can find the necessary files for the examples in the yunikorn-k8shim/deployment/example/preemption directory.

Included in the files is a YuniKorn configuration that defines the queue configuration as follows:

queues.yaml: |
  partitions:
    - name: default
      placementrules:
        - name: provided
          create: true
      queues:
        - name: root
          submitacl: '*'
          properties:
            preemption.policy: fence
            preemption.delay: 10s
          queues:
            - name: 1-normal ...
            - name: 2-no-guaranteed ...
            - name: 3-priority-class ...
            - name: 4-priority-queue ...
            - name: 5-fence ...

Each queue corresponds to a different example, and the preemption will be triggered 10 seconds after deployment, as indicated in the configuration preemption.delay: 10s.

General Preemption Case

In this case, we will demonstrate the outcome of triggering preemption when the queue resources are distributed unevenly in a general scenario.

We will deploy 10 pods, each with a resource requirement of 1, to both queue-1 and queue-2. We deploy to queue-1 first and wait a few seconds before deploying to queue-2. This ensures that the resource usage in queue-1 exceeds that of queue-2 and that the two queues together use up all resources in the parent queue, which triggers preemption.

| Queue | Max Resource | Guaranteed Resource |
| --- | --- | --- |
| normal | 12 | - (not configured) |
| normal.queue-1 | 10 | 5 |
| normal.queue-2 | 10 | 5 |

Result:

When a set of guaranteed resources is defined, preemption aims to ensure that all queues satisfy their guaranteed resources. Preemption stops once the guaranteed resources are met (law 4). A queue may be preempted if it has more resources than its guaranteed amount. For instance, in this case, if queue-1 has fewer resources than its guaranteed amount (<5), it will not be preempted (law 5).

| Queue | Resource before preemption | Resource after preemption |
| --- | --- | --- |
| normal.queue-1 | 10 (victim) | 7 |
| normal.queue-2 | 2 | 5 (guaranteed minimum) |

[Figure: preemption_normal_case]
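The numbers in the general case can be reproduced with a short calculation. The amount moved is the smaller of the asking queue's shortfall below its guarantee and the victim queue's surplus above its guarantee. The following Python sketch is only an arithmetic illustration; rebalance is a hypothetical helper, not a YuniKorn function.

```python
from typing import Tuple


def rebalance(asker_usage: int, asker_guaranteed: int,
              victim_usage: int, victim_guaranteed: int) -> Tuple[int, int]:
    """Return (victim usage, asker usage) after preemption in this toy model."""
    # The asker stops once its guarantee is met (law 4).
    shortfall = max(asker_guaranteed - asker_usage, 0)
    # The victim only gives up what it holds above its guarantee (laws 2 and 5).
    surplus = max(victim_usage - victim_guaranteed, 0)
    moved = min(shortfall, surplus)
    return victim_usage - moved, asker_usage + moved


# queue-2 asks (usage 2, guaranteed 5); queue-1 is the victim (usage 10, guaranteed 5).
print(rebalance(2, 5, 10, 5))  # (7, 5)
```

The result matches the table: queue-1 drops from 10 to 7 and queue-2 rises from 2 to its guaranteed 5.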

Priority

In general, a pod can preempt a pod with equal or lower priority. You can set the priority by defining a PriorityClass or by utilizing queue priorities.

Preemption lets service-type pods scale up or down, but it can also evict pods that should not be preempted in certain scenarios:

  1. Spark jobs, where the driver pod manages a large number of jobs; if the driver is preempted, all of those jobs are affected.
  2. Interactive pods, such as Python notebooks, where a restart has a significant impact, so preempting them should be avoided.

To address this issue, we have designed a "do not preempt me" flag. You can set the annotation yunikorn.apache.org/allow-preemption to false in the PriorityClass to request that pods using that PriorityClass not be preempted.

NOTE: The flag yunikorn.apache.org/allow-preemption is a request only. It is not guaranteed but Pods annotated with this flag will be preempted last.

PriorityClass

In this example, we will demonstrate the configuration of yunikorn.apache.org/allow-preemption using PriorityClass and observe its effect. The default value for this configuration is set to true.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: preemption-not-allow
  annotations:
    "yunikorn.apache.org/allow-preemption": "false"
value: 0

We will deploy 8 pods, each with a resource requirement of 1, to each of queue-1, queue-2, and queue-3. We deploy to queue-1 and queue-2 first and wait a few seconds before deploying to queue-3. This ensures that queue-1 and queue-2 together use up all resources in the parent queue before queue-3 receives any, which triggers preemption.

| Queue | Max Resource | Guaranteed Resource | allow-preemption |
| --- | --- | --- | --- |
| rt | 16 | - | |
| rt.queue-1 | 8 | 3 | true |
| rt.queue-2 | 8 | 3 | false |
| rt.queue-3 | 8 | 3 | true |

Result:

When preemption is triggered, queue-3 will start searching for a victim. However, since queue-2 is set with allow-preemption as false, the resources of queue-1 will be preempted.

Please note that setting yunikorn.apache.org/allow-preemption is a strong recommendation but does not guarantee the lack of preemption. When this flag is set to false, it moves the Pod to the back of the preemption list, giving it a lower priority for preemption compared to other Pods. However, in certain scenarios, such as when no other preemption options are available, Pods with this flag may still be preempted.

For example, even with allow-preemption set to false, DaemonSet pods can still trigger preemption. Additionally, if an application in queue-1 has a higher priority than one in queue-3, the application in queue-2 will be preempted because an application can never preempt another application with a higher priority. In such cases where no other preemption options exist, the allow-preemption flag may not prevent preemption.

| Queue | Resource before preemption | Resource after preemption |
| --- | --- | --- |
| rt.queue-1 | 8 (victim) | 5 |
| rt.queue-2 | 8 | 8 |
| rt.queue-3 | 0 | 3 (guaranteed minimum) |

[Figure: preemption_priorityclass_case]
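The "preempted last" behaviour of the allow-preemption flag can be pictured as an ordering of victim candidates rather than a hard filter. The Python sketch below assumes a hypothetical Pod record that already carries the flag resolved from its PriorityClass. It models only the ordering described above and is not taken from YuniKorn's code.

```python
from typing import List, NamedTuple


class Pod(NamedTuple):
    """Hypothetical pod record; allow_preemption is resolved from the PriorityClass."""
    name: str
    queue: str
    allow_preemption: bool


def order_victims(pods: List[Pod]) -> List[Pod]:
    """Put pods that allow preemption first; opted-out pods go to the back."""
    # sorted() is stable, so pods keep their relative order within each group.
    return sorted(pods, key=lambda p: not p.allow_preemption)


# Eight pods per queue, as in the example: queue-2 opts out, queue-1 does not.
pods = [Pod(f"q2-{i}", "rt.queue-2", False) for i in range(8)]
pods += [Pod(f"q1-{i}", "rt.queue-1", True) for i in range(8)]

# queue-3 needs 3 resources to reach its guarantee, so three victims are picked.
victims = order_victims(pods)[:3]
print([p.queue for p in victims])  # ['rt.queue-1', 'rt.queue-1', 'rt.queue-1']
```

Because the opted-out pods are only moved to the back of the list, they would still be selected if the list of other candidates ran out. This is why the flag is a request rather than a guarantee.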

Priority Queue

In addition to utilizing the default PriorityClass in Kubernetes, you can configure priorities directly on a YuniKorn queue.

In the following example, we will demonstrate preemption based on queue priority.

We will deploy five pods, each with a resource demand of 3, in each of the high-pri, norm-pri, and low-pri queues. We deploy to the norm-pri queue first, so that the resources in the root (parent) queue are fully used. This results in an uneven resource distribution among the queues, which triggers preemption.

| Queue | Max Resource | Guaranteed Resource | priority.offset |
| --- | --- | --- | --- |
| root | 18 | - | |
| root.high-pri | 10 | 6 | 100 |
| root.norm-pri | 18 | 6 | 0 |
| root.low-pri | 10 | 6 | -100 |

Result:

A queue with higher priority can preempt resources from a queue with lower priority, and preemption stops when the queue has preempted enough resources to satisfy its guaranteed resources.

| Queue | Resource before preemption | Resource after preemption |
| --- | --- | --- |
| root.high-pri | 0 | 6 (guaranteed minimum) |
| root.norm-pri | 18 (victim) | 12 |
| root.low-pri | 0 | 0 |

[Figure: preemption_priority_queue_case]
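One way to see why high-pri gains resources while low-pri stays at 0 is to compare effective priorities. The Python sketch below assumes that every pod has the same base priority of 0 and that a queue's priority.offset is simply added to it. The function names are hypothetical, and the model ignores how offsets combine across a deeper queue hierarchy.

```python
# priority.offset values taken from the table above.
OFFSETS = {"root.high-pri": 100, "root.norm-pri": 0, "root.low-pri": -100}


def effective_priority(queue: str, pod_priority: int = 0) -> int:
    """Assumed model: pod priority plus the queue's priority.offset."""
    return pod_priority + OFFSETS[queue]


def can_preempt(asker_queue: str, victim_queue: str) -> bool:
    """Law 6: the victim must have lower or equal priority than the asker."""
    return effective_priority(victim_queue) <= effective_priority(asker_queue)


print(can_preempt("root.high-pri", "root.norm-pri"))  # True
print(can_preempt("root.low-pri", "root.norm-pri"))   # False
```

Under these assumptions, pods in high-pri may take resources from norm-pri until high-pri reaches its guaranteed 6. Pods in low-pri have a lower effective priority than those in norm-pri and therefore cannot preempt them, which is consistent with the result table.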

Preemption Fence

In a multi-tenant environment, it is essential to prevent one tenant from occupying the resources of another tenant. In YuniKorn, we map tenants to a queue hierarchy, so the queue hierarchy can cross tenant boundaries.

To address this issue, YuniKorn introduces a preemption fence, which is a setting on the queue that prevents preemption from looking at queues outside the fence boundary. The fence is a one-way fence. It prevents going out (i.e. higher up the queue hierarchy), but does not prevent coming in (or down) the queue hierarchy.

...
queues:
  - name: default
    properties:
      preemption.policy: fence

We will use the following diagram as an example:

[Figure: preemption_fence_hierarchy_case]

In this example, we will sequentially deploy 15 pods with a resource requirement of 1 to each sub-queue.

First, we deploy to queue-1 in ten-a and wait until the application in queue-1 occupies all the resources of ten-a. Then, once the resources of ten-a are fully used, we deploy to queue-2. Next, we deploy the application in ten-b.queue-3. Finally, when the fenced queues are full, we deploy to the sys queue.

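| Queue | Max Resource | Guaranteed Resource | fence |
| --- | --- | --- | --- |
| rt | 3 | - | true |
| rt.ten-a | 15 | 5 | true |
| rt.ten-a.queue-1 | 15 | 2 | |
| rt.ten-a.queue-2 | 15 | 2 | true |
| rt.ten-b | 15 | 10 | true |
| rt.ten-b.queue-3 | 15 | 10 | |
| rt.sys | 15 | 10 | |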

Result:

In this example, two imbalances are observed:

  • Within ten-a, queue-1 occupies all the resources, while queue-2 has no resources. However, since queue-2 is configured with a fence, it cannot acquire resources from outside the fence.

    [Figure: preemption_fence_case1]

  • Inside the rt queue, both ten-a and ten-b occupy all the resources, while the sys queue has no resources, and no fence is set up. Therefore, the sys queue can acquire resources from the queues in the hierarchy until its guaranteed resources are met. In this case, the sys queue acquires resources from both ten-a and ten-b.

    [Figure: preemption_fence_case2]

| Queue | Resource before preemption | Resource after preemption |
| --- | --- | --- |
| rt.ten-a | 15 | 10 |
| rt.ten-a.queue-1 | 15 | 10 |
| rt.ten-a.queue-2 | 0 | 0 |
| rt.ten-b | 15 | 10 |
| rt.ten-b.queue-3 | 15 | 10 |
| rt.sys | 0 | 10 |
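The one-way nature of the fence can be sketched as a search that walks up from the asking queue, stops at the first fenced queue, and then treats every leaf queue below that point as a possible source of victims. The Python sketch below is a simplified model of that description using the queue names and fence settings from the table. The fence_root and visible_leaves helpers are hypothetical and do not reflect YuniKorn's internal API.

```python
from typing import List

# Fence settings and leaf queues taken from the example table.
FENCED = {"rt", "rt.ten-a", "rt.ten-a.queue-2", "rt.ten-b"}
LEAVES = ["rt.ten-a.queue-1", "rt.ten-a.queue-2", "rt.ten-b.queue-3", "rt.sys"]


def fence_root(queue: str) -> str:
    """Walk up from queue and stop at the first fenced queue (or the top)."""
    current = queue
    while current not in FENCED and "." in current:
        current = current.rsplit(".", 1)[0]  # move to the parent queue
    return current


def visible_leaves(queue: str) -> List[str]:
    """Leaf queues a request in queue may look at when searching for victims."""
    root = fence_root(queue)
    return [q for q in LEAVES if q == root or q.startswith(root + ".")]


print(visible_leaves("rt.ten-a.queue-2"))  # ['rt.ten-a.queue-2']
print(visible_leaves("rt.sys"))
```

In this model, queue-2 cannot see queue-1 because its own fence stops the upward walk immediately. The sys queue has no fence, so it walks up to rt and sees the leaf queues of both tenants. This matches the two imbalances described in the result.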