Cluster information:
Kubernetes version: 1.29
Cloud being used: EKS
I have a Kubernetes Job that uses the fine parallel processing pattern with an external work queue; completions is not set and parallelism is generally set between 200 and 300.
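For context, here is a minimal sketch of the kind of Job spec I mean (the name, image, and worker details are placeholders; resource requests and other fields are omitted):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker              # placeholder name
spec:
  parallelism: 200                # typically somewhere between 200 and 300
  # completions is deliberately not set: each worker pulls from the external
  # work queue and exits once the queue is empty
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/queue-worker:latest   # placeholder image
```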
The available compute resources (EC2 nodes) are limited; increasing them is not an option.
Under heavy load, with multiple jobs running, pods can be left waiting in the Pending state due to resource starvation. When this happens, the pods that made it into the Running state (say 100 of 200) can actually finish the work (i.e. the work queue is empty and those pods have completed successfully). The Job itself doesn't complete, though, because of the remaining Pending pods. This extends the lifetime of the Job unnecessarily: as resources slowly become available, the remaining pods run, realise there is nothing to do, and exit.
I have downstream processes that depend on the completion of these jobs, and they in turn get delayed unnecessarily. At the moment I am simply waiting for a Job as described above to reach the Complete status/condition (let's ignore failures for now).
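Concretely, the downstream side currently does roughly the following (job name and timeout are placeholders), which only returns once every pod, including the late Pending ones, has terminated:

```sh
# Blocks until the Job reports the Complete condition
kubectl wait --for=condition=complete job/queue-worker --timeout=2h
```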
Can anyone offer advice on how to tighten this up, i.e. detect as soon as possible that the work in my Job is really done, so I can start my downstream process earlier?