Kubernetes v1.36: Revolutionizing Workload-Aware Scheduling with PodGroup and Workload API Separation

Kubernetes v1.36 introduces a major architectural leap in workload-aware scheduling, building on the foundation laid in v1.35. The key innovation is the clean separation of scheduling concerns: the Workload API now serves as a static template, while the brand-new PodGroup API handles runtime state. This redesign streamlines the scheduler's logic, improves scalability, and paves the way for advanced features like topology-aware scheduling and workload-aware preemption. Below, we answer common questions about these changes and their practical implications.

What are the key changes in Kubernetes v1.36 for workload-aware scheduling?

Kubernetes v1.36 transforms workload-aware scheduling by decoupling the Workload and PodGroup APIs. Previously in v1.35, both the static template and runtime state were embedded in the Workload resource. Now, the Workload API becomes a pure template object, while a separate PodGroup API tracks the live scheduling state. This separation allows the kube-scheduler to read PodGroup directly without parsing Workload objects, boosting performance and scalability. Additionally, v1.36 introduces the first iterations of topology-aware scheduling and workload-aware preemption, plus ResourceClaim support for Dynamic Resource Allocation (DRA) within PodGroups. The Job controller also gets initial integration with the new APIs, demonstrating real-world readiness.

How does the new Workload API differ from the previous version?

In v1.35, the Workload API (v1alpha1) included both the group template and its runtime status, making it a monolithic object. Starting with v1.36, the Workload API (now served as scheduling.k8s.io/v1alpha2) is purely a static template. It defines podGroupTemplates, for example specifying a gang scheduling policy with a minCount of 4. Controllers then create separate PodGroup runtime objects from these templates. This separation reduces complexity in the scheduler and allows status updates to be sharded per replica through the PodGroup API, improving performance. The Workload object no longer holds any runtime state; all live scheduling information resides in the PodGroup.
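To make the template/runtime split concrete, here is a minimal Go sketch of what a Workload template's shape might look like. The struct and field names (PodGroupTemplates, Gang, MinCount) are illustrative stand-ins inferred from the description above, not the generated client types for scheduling.k8s.io/v1alpha2.

```go
// Illustrative only: local structs approximating the Workload template shape
// described in this article. These are NOT the real generated API types.
package main

import "fmt"

type GangPolicy struct {
	MinCount int `json:"minCount"` // schedule only if this many pods can be placed together
}

type SchedulingPolicy struct {
	Gang *GangPolicy `json:"gang,omitempty"`
}

type PodGroupTemplate struct {
	Name   string           `json:"name"`
	Policy SchedulingPolicy `json:"policy"`
}

type WorkloadSpec struct {
	PodGroupTemplates []PodGroupTemplate `json:"podGroupTemplates"`
}

func main() {
	// A static template: "the 'workers' group needs at least 4 pods placed
	// as a unit". No runtime state lives here; that belongs to PodGroup.
	w := WorkloadSpec{
		PodGroupTemplates: []PodGroupTemplate{
			{Name: "workers", Policy: SchedulingPolicy{Gang: &GangPolicy{MinCount: 4}}},
		},
	}
	fmt.Printf("%+v\n", w)
}
```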

What is the PodGroup API and how does it improve scheduling?

The PodGroup API is a new runtime resource introduced in v1.36 (scheduling.k8s.io/v1alpha2) that manages the live scheduling state of a group of pods. It is stamped out from a Workload's podGroupTemplates and contains the actual scheduling policy (e.g., gang scheduling parameters) and a status section that mirrors individual pod states and the overall group scheduling condition. By moving runtime state into PodGroup, the scheduler can directly access all necessary information without watching or parsing Workload objects. This improves performance and scalability, especially for large clusters with many workloads. It also enables atomic processing of the entire group during scheduling, paving the way for future enhancements like batch-aware decisions.
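In the same illustrative style, a PodGroup can be pictured as a spec copied from its template plus a status that the scheduler writes back. The fields sketched here (WorkloadName, ScheduledCount, Phase) are assumptions about what a runtime group object might track, not the actual v1alpha2 schema.

```go
// Illustrative only: a guess at the runtime object's shape, split into a spec
// stamped from the template and a status owned by the scheduler.
package main

import "fmt"

type PodGroupSpec struct {
	WorkloadName string `json:"workloadName"` // template this group was stamped from (assumed field)
	MinCount     int    `json:"minCount"`     // gang size copied from the template
}

type PodGroupStatus struct {
	ScheduledCount int    `json:"scheduledCount"` // pods that already have a binding decision
	Phase          string `json:"phase"`          // e.g. Pending, Scheduled (assumed values)
}

func main() {
	pg := struct {
		Spec   PodGroupSpec
		Status PodGroupStatus
	}{
		Spec:   PodGroupSpec{WorkloadName: "trainer", MinCount: 4},
		Status: PodGroupStatus{ScheduledCount: 0, Phase: "Pending"},
	}
	// The scheduler reads and updates this object directly; it never has to
	// resolve the Workload template at scheduling time.
	fmt.Printf("%+v\n", pg)
}
```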

How does the kube-scheduler handle PodGroups in v1.36?

The kube-scheduler in v1.36 features a dedicated PodGroup scheduling cycle. When a PodGroup is created (by a controller like the Job controller), the scheduler reads the group's specification directly from the PodGroup API object. It then processes all pods of the group atomically, ensuring that gang scheduling constraints (e.g., a minimum number of pods) are met before committing resources. This cycle is optimized for batch and AI/ML workloads where pod interdependencies matter. By using the PodGroup as the source of truth, the scheduler avoids redundant lookups of Workload objects, making the scheduling loop faster and more scalable. This design also simplifies future additions like topology-aware placement and preemption logic.
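The core of that cycle is an all-or-nothing admission rule. The sketch below is a simplified model of the gang check, not the kube-scheduler's actual code: it only counts pods that have at least one feasible node and ignores capacity conflicts between them.

```go
// Conceptual sketch of the gang admission rule: commit bindings only when at
// least minCount pods in the group can be placed. Deliberately simplified.
package main

import "fmt"

// feasibleNodes maps each pending pod in the group to the nodes it could run on.
func gangAdmissible(feasibleNodes map[string][]string, minCount int) bool {
	placeable := 0
	for _, nodes := range feasibleNodes {
		if len(nodes) > 0 {
			placeable++
		}
	}
	// All-or-nothing: if the threshold is not met, none of the pods are bound.
	return placeable >= minCount
}

func main() {
	group := map[string][]string{
		"worker-0": {"node-a"},
		"worker-1": {"node-a", "node-b"},
		"worker-2": {"node-b"},
		"worker-3": {}, // no feasible node yet
	}
	fmt.Println(gangAdmissible(group, 4)) // false: the whole group keeps waiting
}
```

The important property is in the last line: instead of binding three pods and stranding their resources, the group waits until the fourth pod can also be placed.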

What other scheduling features are introduced in v1.36 besides the API split?

Beyond the API separation, v1.36 debuts two advanced scheduling capabilities: topology-aware scheduling and workload-aware preemption. Topology-aware scheduling considers infrastructure topology (e.g., nodes, zones) to place pods of a group optimally—for instance, grouping them on the same rack for faster communication. Workload-aware preemption enables the scheduler to preempt lower-priority workloads in a way that respects group semantics, rather than handling pods individually. Additionally, ResourceClaim support for workloads unlocks Dynamic Resource Allocation (DRA) for PodGroups, allowing specialized hardware (like GPUs) to be claimed and assigned to all pods in a group. Finally, the Job controller's integration with the new APIs provides a concrete use case for real-world testing.
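The topology-aware idea can be illustrated with a toy scoring rule: given candidate placements for a group, prefer the one that spans the fewest racks. This is only a sketch of the intent, not the scheduler's real scoring plugin, and the rack labels are invented for the example.

```go
// Toy illustration of topology-aware placement: tighter groups score better.
package main

import "fmt"

// rackSpread counts how many distinct racks a placement (pod -> rack) uses.
func rackSpread(placement map[string]string) int {
	racks := map[string]struct{}{}
	for _, rack := range placement {
		racks[rack] = struct{}{}
	}
	return len(racks)
}

func main() {
	colocated := map[string]string{"worker-0": "rack-1", "worker-1": "rack-1"}
	spread := map[string]string{"worker-0": "rack-1", "worker-1": "rack-2"}
	// Lower spread means tighter placement and faster inter-pod communication.
	fmt.Println(rackSpread(colocated), rackSpread(spread)) // 1 2
}
```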

How does the integration with the Job controller work in v1.36?

The v1.36 release delivers the first phase of integration between the Kubernetes Job controller and the new Workload and PodGroup APIs. The Job controller creates Workload objects (the templates) and then stamps out PodGroup runtime instances as Jobs progress. Previously, each pod of a Job was scheduled independently, with no first-class way to express group semantics to the scheduler. Now, after defining a Workload with one or more podGroupTemplates, the Job controller automatically creates the PodGroup objects that the scheduler acts on. This integration demonstrates real-world readiness for AI/ML training jobs and batch processing. It also shows how existing controllers can adopt the new APIs incrementally: the Job controller still manages pod creation, but scheduling decisions go through the new PodGroup cycle, improving reliability and performance for gang-scheduled Jobs.
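The stamping step can be sketched as a small function: one runtime PodGroup per podGroupTemplate, named after the owning Job. The types and naming convention below are the illustrative ones used earlier in this article, not the Job controller's actual implementation.

```go
// Illustrative only: how a controller might derive runtime PodGroups from the
// templates in a Workload it created for a Job.
package main

import "fmt"

type Template struct {
	Name     string
	MinCount int
}

type PodGroup struct {
	Name     string
	MinCount int
}

// stampPodGroups emits one PodGroup per template, prefixed with the owning
// Job's name so the scheduler can treat each group independently.
func stampPodGroups(jobName string, templates []Template) []PodGroup {
	groups := make([]PodGroup, 0, len(templates))
	for _, t := range templates {
		groups = append(groups, PodGroup{
			Name:     jobName + "-" + t.Name,
			MinCount: t.MinCount,
		})
	}
	return groups
}

func main() {
	fmt.Printf("%+v\n", stampPodGroups("train-llm", []Template{{Name: "workers", MinCount: 4}}))
}
```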

What benefits does the ResourceClaim support bring for workloads?

ResourceClaim support in v1.36 allows PodGroups to use Dynamic Resource Allocation (DRA)—a feature that enables pods to claim and use specialized hardware (such as GPUs, FPGAs, or network accelerators) on demand. By extending DRA to workloads, PodGroups can request resources for all their member pods in a coordinated manner. For instance, a distributed training job requiring 4 GPUs across 4 pods can now have a single ResourceClaim that allocates the GPUs as a group, ensuring all necessary hardware is available before any pod starts. This avoids partial allocations and reduces scheduling delays. The integration is seamless: the PodGroup can reference ResourceClaims in its specification, and the scheduler evaluates availability alongside normal resource requirements. This makes v1.36 especially valuable for AI/ML workloads that need predictable, group-level hardware access.
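The group-level guarantee boils down to a simple rule: the shared claim is satisfiable only if every requested device can be allocated at once, so no pod starts on a partial allocation. The sketch below models that rule with plain integers; the real DRA objects (ResourceClaim, DeviceClass) live in the resource.k8s.io API group and carry much richer structure.

```go
// Toy model of group-level device allocation: admit all pods or none.
package main

import "fmt"

// claimSatisfiable reports whether the available devices cover the whole
// group's request (podCount pods, each needing gpusPerPod devices).
func claimSatisfiable(podCount, gpusPerPod, availableGPUs int) bool {
	return availableGPUs >= podCount*gpusPerPod
}

func main() {
	// 4 pods, one GPU each: start the group only when all 4 GPUs are free.
	fmt.Println(claimSatisfiable(4, 1, 3)) // false: wait rather than start 3 of 4 pods
	fmt.Println(claimSatisfiable(4, 1, 4)) // true: the whole gang can start
}
```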
