Version: 1.3.0

Priority Scheduling

Introduction

Currently, YuniKorn can sort at two levels: the applications in a queue and the resource requests within the application. The queue sorting policy is configurable and supports a Fair or FIFO policy. StateAware is considered a filtered FIFO policy. Applications cannot be sorted based on priority.

Within an application, requests are sorted based on priority of the requests. For requests that have the same priority the submission time is used as a secondary sort key. Requests are sorted in FIFO order if they have the same priority. Note that the priority of requests was not propagated from the shim to the core until YUNIKORN-1235 was implemented in v1.0.0. Until that implementation all requests were created with the default priority.

Additionally, Kubernetes uses priority as a sort when performing preemption. YuniKorn currently only does preemption of requests if a node-specific task needs to be scheduled (i.e. a DaemonSet pod, YUNIKORN-1085).

Goals

Attempt to be compatible with Kubernetes standard priority handling wherever possible
Priority should be orthogonal to application sorting policies (i.e. Fair, FIFO and StateAware policies should support prioritization)
It should be possible to limit the scope of priority handling
Design of priority scheduling should consider the impact on preemption design

Non Goals

Design of preemption
Change AllocationAsk sorting within an application

Definitions

The following terms are defined as follows and will be used throughout this document.

Priority

Priority is a numeric value associated with an Ask. Higher-priority asks have higher numeric values. To be compatible with Kubernetes, all 32-bit signed integers are considered valid priorities. An ask without a specific defined priority is assumed to have the default priority.

Minimum Priority

Minimum priority is defined as the lowest possible priority, or -2,147,483,648 (- 2³¹).

Maximum Priority

Maximum priority is defined as the highest possible priority, or 2,147,483,647 (2³¹ - 1). Note that Kubernetes does not permit custom priority classes (see below) to define priorities greater than 1,000,000,000 (1 billion). Values higher are reserved for system-level priorities.

Default Priority

Default priority is defined as the value 0.

PriorityClass

A PriorityClass is a Kubernetes object which maps between a priority name and an integer priority value. The name cannot start with system- as that prefix is reserved for Kubernetes internal usage.

Additionally, a preemptionPolicy may be set on a PriorityClass. Official GA support was added in Kubernetes 1.24. The preemption policy is only used while scheduling the pod itself. It is not taken into account after the pod is scheduled.

The policy defaults to Never, implying that pods of this priority will never preempt other pods. PreemptLowerPriority is also possible, which allows the scheduler to preempt pods with lower priority values.

When a pod is submitted with its priorityClassName set to a valid priority class, a Kubernetes-provided admission controller will automatically update the priority and preemptionPolicy fields of the pod to match those of the referenced PriorityClass. Pods that specify a priority as part of the specification are rejected if the integer value set on the pod does not match the PriorityClass value.

Pre-defined PriorityClass objects

system-cluster-critical: 2,000,000,000
system-node-critical: 2,000,000,100

Current state

The framework for priority is minimally implemented within YuniKorn. The AllocationAsk structure within the scheduler interface already contains a priority field which is populated by the shim based on the priority field on a Pod.

One priority can be set on creation of the AllocationAsk. Each AllocationAsk within an application may have its own priority. AllocationAsk sorting is based on this priority. This means that within the application asks or pods are sorted based on priority. High-priority asks will be scheduled before low-priority asks.

Priorities are currently also used for the minimal DaemonSet preemption process implemented in YUNIKORN-1085. Possible victims are sorted based on their priority among other things. The priority of the Allocation is the same as the AllocationAsk they represent.

One use of the capability to set a priority for each AllocationAsk could be to enable dynamic allocations such as those utilized by Apache Spark. An application can submit a set of asks at high priority that will be scheduled as soon as possible, and a set of lower-priority asks that can be used for additional capacity if available.

Proposal

Priority fencing

By default, priority applies across queues globally. In other words, a higher-priority ask will be attempted to be satisfied before a lower-priority ask regardless of which queue the asks reside in.

However, it is often useful to limit the scope of priority handling. For example, it may be desirable to limit the ability of asks in one area of the queue hierarchy from jumping ahead of asks in another area. To support this, we add the concept of a Priority Fence. A queue (either leaf or parent) may be configured as priority-fenced, which prevents priorities from propagating upward.

If a queue is fenced, asks within that queue (and subqueues) will be prioritized relative to each other, but not relative to asks in other portions of the queue hierarchy. A priority fence is a YuniKorn queue subtree boundary above which priorities are no longer considered. A priority fenced queue does not expose priority levels of its children to its parent queue, thereby ensuring that priorities are only considered within that queue and its descendants.

Priority fences may be nested; for example a fenced queue tenant 1 may have child queues queue A (fenced) and queue B (unfenced). In this case tasks in queue A will show as the default priority and will not be prioritized above queue B, but tasks in queue B may be prioritized over those in queue A. In no case will tasks in either queue be prioritized over tasks outside the scope of queue tenant 1, as they show as the default priority (shown in the diagram below with yellow arrows).

A similar setup is created for the tenant 2 queue. However in this case the priorities from queue 1 to queue 2 and vice-versa are fully visible (shown in the diagram below with blue arrows).

None of the tasks running in the queues below tenant 1 or tenant 2 can be prioritized above the system queue. Tasks in the system queue are not fenced. That means that a task in the system queue can be prioritized above any queue in the hierarchy (shown in the diagram below with red arrows).

priority flows

Priority offset

Priorities need to be pre-defined at the cluster level via the PriorityClass objects. At the Kubernetes level there is no way to limit which priorities can be used in which part of the system. The only exception to that rule are the pre-defined system priorities from the Kubernetes system itself. This does not give an administrator a lot of control.

Prioritizing a specific queue above another queue based on the priorities of the tasks inside the queue is thus difficult. Any user can start a task with any priority anywhere in the cluster. To allow an administrator more control we need to be able to steer, or augment, the priority of a queue. To accomplish this, we introduce the priority.offset property of a queue.

The priority.offset property on a queue allows the administrator to increase (or decrease) a queue's priority. By using the offset a queue can be boosted to become a high-priority queue. It also can be used to decrease the priority compared to its siblings.

As a general rule: the priority of a leaf queue is equal to the maximum priority of the AllocationAsks in that queue, plus the priority.offset value. The priority of a parent queue is the maximum of the priority of its child queues plus the priority.offset value. This means that priority offsets are additive when traversing up the queue hierarchy. A priority offset may be any valid signed int32 value. The offset is applied after the dynamic priority for the queue is calculated.

As a note of caution: a side effect of a very large offset value for a queue could result in all pods in that queue being clamped at the maximum priority level or exceeding the priority of Kubernetes system priority classes. This could then adversely impact the scheduling of node and or cluster critical pods in sibling queues. This effect needs to be clearly documented. Logging a warning about a configured value that high should also be considered.

A second option that was considered is limiting the offset value to 999,999,999. This would prevent the offset priority from exceeding the pre-defined system priorities. This would be a safe option but does not take into account the fact that we can have negative priorities that we want to offset. The current choice is to go with the first option to document and log.

Fencing affects the priority calculation for the queue. When a queue is fenced (i.e. the priority.policy is set to fence), queue priority calculations are disabled for that queue. The result of application priority or child queue calculations are ignored for a fenced queue. The priority of the fenced queue is always set to the default priority or to the priority offset, if specified.

Extended application sorting

Applications are currently sorted based on one of the defined application sorting policies. All application sorting policies only take into account applications that have pending requests.

Currently defined are Fair, FIFO, and StateAware:

Fair sorting sorts based on relative usage within a queue
FIFO sorts applications by creation time in a first-in, first-out manner
StateAware is similar to FIFO except that only a single "new" application is allowed at once (this is an over-simplification, but will suffice here)

To support priority scheduling, the policies Fair and FIFO will get an equivalent policy that considers the priority of an application first. For applications that have the same priority, sorting will then fall back to the secondary sort criteria (Fair or FIFO).

StateAware as a special form of FIFO will however always filter applications first and then sort. A priority based version of StateAware must change the way it filters. The single "new" application that is considered can no longer be the oldest application. That would be a major behavioral change from the current policy.

The existing policies will not be changed to maintain backwards compatibility. None of the existing policies have a tiebreak built in. In the case the comparison returns equality the order of the elements is not changed.

Instead of introducing new policies to add priority to the existing behavior a flag will be added that allows switching priorities on or off. The property will be added to the queue.

New property: appication.sort.priority

Extended queue sorting

Queue sorting currently only supports one policy: Fair. Queues cannot be sorted using a FIFO mechanism. FIFO sorting of queues is not in scope of this design. However the design has allowed for a simple addition of a new policy.

The Fair policy works similarly to application Fair sorting, except that it sorts based on usage ratios between two queues. Allocations are compared to guaranteed resources. This results in the queue furthest below its guaranteed resource to be first in the sorted queue list as it is considered "more starved". For queues that have the same usage ratio, the queue with the highest pending requests is considered more starved.

To add priority support, a new property will be added to consider the current queue priority as the main sorting factor. To break the tie between queues with the same priority, starvation like in the Fair policy will be used. Again breaking the tie for queues with the same Fair usage using the pending requests.

The existing policy will not be changed to maintain backwards compatibility. The same property as used for application sorting will be reused to control priority behavior.

New property: application.sort.priority

Priority Configuration

Queue Configuration

Three new attributes will be added to the queue configuration object to allow specifying the priority fence type and priority offset and priority sorting. All properties priority.policy, priority.offset, and application.sort.priority, can be set on any queue. The attributes are part of the queue properties as follows:

queues:
  - name: fencedqueue
    properties:
      priority.policy: fence
      priority.offset: 100
      application.sort.priority: enabled

The priority.policy and priority.offset fields are not inherited from the parent queue to a child queue; they apply only to the queue they are specified on.

Note that the application.sort.priority field is only inherited if the value is set to disabled (the enabled value is already the default). This allows priorities to be ignored for an entire queue subtree by setting the property on the parent queue.

Properties are fully supported in child templates. Support for setting these properties for dynamic queues is included as part of the existing functionality.

The supported, not case sensitive, values for the application.sort.priority are:

enabled
disabled

The proposal is to default to enabled for the application.sort.priority property. Given that the Kubernetes default scheduler and preemption support are all based on priorities that choice seems more logical than disabled.

The application.sort.priority when set on a parent queue affects the way the queues are sorted. It will change queue sorting from the standard Fair sorting to priority-based sorting. To break the tie for equal priority queues it will use the Fair comparison. When set on a leaf queue it will change the application sorting itself. All three existing policies will be supported with application.sort.priority set to enabled or disabled. For the disabled case nothing will change and the old behavior is maintained. For the enabled case the primary sort will be priority and the configured policy (Fair, FIFO, or StateAware) will be used as tiebreakers.

The priority.policy and priority.offset can be set on the root queue but have no effect. The root queue as the top of the hierarchy has no parent. This means priority cannot traverse further up the tree. No messages will be logged if the priority.policy or priority.offset are set on the root queue.

The supported, not case sensitive, values for the priority.policy are:

default
fence

If the value fence is set on a queue, the queue on which the property is set does not propagate its priority outside of the queue's subtree. The value default does not fence the queue. If no policy is set the value default is used.

The priority.offset value must be a valid integer in string form. The value will be converted into an integer using the standard parsing methods as defined via: strconv.ParseInt(s, 10, 32).

If no priority.offset property is specified in the queue the default of 0 is used. Specifying the property with an empty value is equivalent to 0. If parsing of the string fails the default of 0 is used.

Scheduler storage object changes

Application

The Application object within the scheduler core will be updated with a dynamic priority value, defined as the maximum priority of all pending asks. To avoid sorting performance degradation, this value should be cached and recomputed only when allocation asks are added, allocated, or removed.

This recalculation can be avoided in many instances. When new asks are added, an application's priority need not be recalculated unless the priority of the new ask is greater than that of the application as a whole. Similarly, when removing an ask, the priority need not be adjusted unless the ask's priority is equal to the application's priority. When recalculating priority due to adding a high-priority ask, the application's priority can simply be set to the new ask's priority. When recalculating due to removal of an ask of equal priority to the application, we can stop considering additional asks as soon as any other ask of equal priority is encountered. This means that the performance impact of application priority updates will be O(1) when adding an ask, and at worst O(n) when removing an ask.

New field: askMaxPriority

Queue

The new configuration values need to be maintained on the queue. Besides those values the Queue object within the scheduler core will be updated with a dynamic priority value. This dynamic priority will be defined as the maximum priority of all applications within the queue, plus any "priority offset" configured.

To avoid sorting performance degradation, this value should be cached and recomputed only when necessary. For a leaf queue, this is required if application priorities in the queue have changed, and for a parent queue this is required if the priorities of child queues have changed.

New fields: priorityPolicy, priorityOffset, currentPriority

Introduction​

Goals​

Non Goals​

Definitions​

Priority​

Minimum Priority​

Maximum Priority​

Default Priority​

PriorityClass​

Pre-defined PriorityClass objects​

Current state​

Proposal​

Priority fencing​

Priority offset​

Extended application sorting​

Extended queue sorting​

Priority Configuration​

Queue Configuration​

Scheduler storage object changes​

Application​

Queue​