Kubernetes 1.36 Beta: Dynamically Adjust Job Resources While Suspended

Published: 2026-05-05 17:35:54 | Category: Education & Careers

Kubernetes 1.36 beta introduces a game-changing feature for batch processing: the ability to modify CPU, memory, GPU, and extended resource requests and limits on a suspended Job's pod template. This flexibility empowers queue controllers and cluster administrators to fine-tune resource allocations before a Job resumes, solving a long-standing pain point for ML and batch workloads. Below, we explore what this means, how it works, and why it matters.

What is the new mutable pod resources feature for suspended jobs in Kubernetes 1.36?

Kubernetes 1.36 promotes to beta the capability to alter container resource requests and limits within the pod template of a suspended Job. First introduced as alpha in v1.35, this feature permits changes to CPU, memory, GPU, and extended resource specifications on a Job while it is suspended – that is, before it starts or resumes running. No new API types are required; the existing Job and pod template structures simply relax the immutability constraint for suspended Jobs. This means queue controllers or admins can adjust resources dynamically without deleting and recreating the Job, preserving metadata, status, and history.
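
For concreteness, here is a sketch of what such a suspended Job might look like; the name sample-job, the container name, and the image are illustrative placeholders. Because spec.suspend is true, the resources block below can now be updated in place:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: sample-job        # illustrative name
    spec:
      suspend: true           # resources stay mutable only while this is true
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: main
            image: registry.example.com/batch-worker:latest
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
              limits:
                cpu: "2"
                memory: 4Gi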

Why was this feature introduced? What problem does it solve?

Batch and machine learning workloads often have resource requirements that aren't precisely known at the time of Job creation. Optimal allocation depends on current cluster capacity, queue priorities, and the availability of specialized hardware like GPUs. Previously, once a Job's pod template resource fields were set, they were immutable. If a queue controller (e.g., Kueue) decided a suspended Job needed different resources, the only option was to delete and recreate the Job, losing associated metadata, status, and history. This feature removes that painful constraint, allowing adjustments like reducing GPU count when only partial capacity is available, or bumping up resources when they become free.

How does this feature work for queue controllers and administrators?

The Kubernetes API server relaxes the immutability rule on pod template resource fields exclusively for suspended Jobs. Controllers can simply issue an update to the Job object's spec.template.spec.containers[*].resources fields while spec.suspend is true. After the update, they set spec.suspend to false, and new Pods are created with the adjusted resource specifications. The feature requires no changes to existing APIs or CRDs – it's a controlled relaxation of an existing constraint. Administrators can use standard kubectl patch or programmatic API calls to modify the resources. The Job retains its identity, labels, annotations, and any past execution history.
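
As a minimal sketch of that flow, using the hypothetical sample-job defined earlier, a controller or administrator could patch the resource fields and then resume the Job:

    # While spec.suspend is true, patch the pod template's resource fields
    kubectl patch job sample-job --type=json -p '[
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "1"},
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "2Gi"}
    ]'

    # Resume the Job; the Pods it now creates use the adjusted resources
    kubectl patch job sample-job --type=merge -p '{"spec": {"suspend": false}}'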

Can you provide an example of adjusting resources on a suspended Job?

Consider a machine learning training Job that initially requests 4 GPUs, with requests.cpu: "8" and requests.memory: "32Gi". The Job is created in a suspended state (spec.suspend: true). A queue controller determines that only 2 GPUs are available. Using this beta feature, the controller updates the Job's pod template: it changes the GPU count to 2, reduces the CPU request to 4, and lowers the memory request to 16Gi. The Job remains suspended during the update. Once the resource fields are patched, the controller sets spec.suspend: false. The Job then creates new Pods with the revised resource specifications, and training begins with the adjusted allocation. No Job deletion or loss of metadata occurs.
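
A rough sketch of that adjustment follows; the Job name ml-train, the container name trainer, and the nvidia.com/gpu extended resource name are assumptions for illustration. Note that extended resources such as GPUs must specify equal requests and limits:

    # Scale the suspended Job down from 4 GPUs / 8 CPUs / 32Gi to 2 GPUs / 4 CPUs / 16Gi
    kubectl patch job ml-train --type=strategic -p '
    {"spec": {"template": {"spec": {"containers": [{"name": "trainer", "resources": {
      "requests": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "2"},
      "limits": {"nvidia.com/gpu": "2"}}}]}}}}'

    # Resume; the new Pods request the reduced allocation
    kubectl patch job ml-train --type=merge -p '{"spec": {"suspend": false}}'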

What are the benefits for batch and machine learning workloads?

For batch and ML workloads, resource requirements are often fluid. This feature enables graceful adaptation to real-time cluster conditions. A training job that could not run at full GPU capacity can be scaled down temporarily rather than failing. Queue controllers can make smarter scheduling decisions without the overhead of job recreation. This also simplifies resource fairness – when multiple jobs are queued, a controller can reduce allocations for low-priority jobs to make room for higher-priority ones, then restore resources later. The feature preserves job history, logs, and associated metrics, which is critical for reproducibility and debugging in ML pipelines.

How does this affect CronJobs and slow-progress scenarios?

A Job spawned by a CronJob that would normally fail due to insufficient cluster capacity can instead be suspended and have its resources reduced. For example, a CronJob might generate a Job that requires 4 CPUs while the cluster is heavily loaded. With this feature, the Job can be adjusted to request 2 CPUs, allowing it to progress slowly rather than fail outright. Once the load decreases, a controller can raise the resources again for subsequent runs. This provides a graceful degradation path for time-sensitive workloads. The CronJob's history is maintained because the underlying Job object is never deleted; only its pod template resources are modified.
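
As a hedged sketch of that sequence (the Job name, matching a hypothetical CronJob child, is illustrative):

    # Suspend the struggling Job that the CronJob created
    kubectl patch job nightly-report-29181440 --type=merge -p '{"spec": {"suspend": true}}'

    # Halve the CPU request so the Job fits on the loaded cluster
    kubectl patch job nightly-report-29181440 --type=json -p '[
      {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "2"}
    ]'

    # Resume; the Job progresses slowly instead of failing outright
    kubectl patch job nightly-report-29181440 --type=merge -p '{"spec": {"suspend": false}}'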