Cluster information:
Kubernetes version: 1.29
Cloud being used: EKS
I have a Kubernetes Job that uses the fine parallel processing pattern with an external work queue; completions is not set and parallelism is generally set between 200 and 300.
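For context, here is a minimal sketch of the kind of Job spec I mean (the name, image, and worker details are placeholders; resource requests and other fields are omitted):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker              # placeholder name
spec:
  parallelism: 200                # typically somewhere between 200 and 300
  # completions is deliberately not set: each worker pulls from the external
  # work queue and exits once the queue is empty
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/queue-worker:latest   # placeholder image
```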
The available compute resources (EC2 nodes) are limited; increasing them is not an option.
Under heavy load, with multiple jobs running, pods can be left waiting in the Pending state due to resource starvation. When this happens, the pods that made it into the Running state (say 100 of 200) can actually finish the work (i.e. the work queue is empty and those pods have completed successfully). The Job itself doesn't complete, though, because of the remaining Pending pods. This extends the lifetime of the Job unnecessarily: as resources slowly become available, the remaining pods run, realise there is nothing to do, and exit.
I have downstream processes that depend on the completion of these jobs, and they in turn get delayed unnecessarily. At the moment I am simply waiting for a Job as described above to reach the Complete status/condition (let's ignore failures for now).
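Concretely, the downstream side currently does roughly the following (job name and timeout are placeholders), which only returns once every pod, including the late Pending ones, has terminated:

```sh
# Blocks until the Job reports the Complete condition
kubectl wait --for=condition=complete job/queue-worker --timeout=2h
```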
Can anyone offer advice on how to tighten this up, i.e. detect as soon as possible that the work in my Job is really done, so I can start my downstream process earlier?