Hi,
I’ve been tasked with migrating an old monolith to GKE Autopilot (my first time with GKE Autopilot).
The monolith can scale horizontally, but it runs jobs with Quartz, and these jobs can last for hours. The issue is scale-down: pod termination has to be deferred until the jobs currently being processed finish, because some of them are critical and their work would be lost.
The monolith is large, we don't fully know everything it does, and a full rewrite isn't feasible right now, but I can add a few small patches, for example two APIs (sketched right after this list):
- one to stop enqueuing new jobs
- one to report whether any job executions are currently running
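
Since the scheduler is Quartz, both patches seem to map onto existing Scheduler methods: standby() stops triggers from firing without interrupting running jobs, and getCurrentlyExecutingJobs() lists in-flight executions in the local JVM. Something like this is what I have in mind, assuming the monolith can expose a small JDK HttpServer on a side port (the port and paths here are made up):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import org.quartz.Scheduler;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public final class LifecycleEndpoints {

    /** Registers /quiesce and /busy next to the app, on a hypothetical side port. */
    public static void register(Scheduler scheduler) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);

        // POST /quiesce: stop firing new triggers; jobs already running keep going.
        server.createContext("/quiesce", exchange -> {
            int code = 200;
            String body = "quiesced";
            try {
                scheduler.standby();
            } catch (Exception e) {
                code = 500;
                body = String.valueOf(e.getMessage());
            }
            respond(exchange, code, body);
        });

        // GET /busy: how many jobs are executing in this JVM right now.
        // getCurrentlyExecutingJobs() only sees the local scheduler instance,
        // which is what we want here (each pod reports for itself).
        server.createContext("/busy", exchange -> {
            int code = 200;
            String body;
            try {
                body = String.valueOf(scheduler.getCurrentlyExecutingJobs().size());
            } catch (Exception e) {
                code = 500;
                body = String.valueOf(e.getMessage());
            }
            respond(exchange, code, body);
        });

        server.start();
    }

    private static void respond(HttpExchange ex, int code, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(code, bytes.length);
        try (OutputStream os = ex.getResponseBody()) {
            os.write(bytes);
        }
    }
}
```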
Given that GKE Autopilot caps terminationGracePeriodSeconds at 10 minutes, what would be a robust strategy for this migration that doesn't interrupt in-flight work?
Ideas I’m considering:
- Set cluster-autoscaler.kubernetes.io/safe-to-evict to "false" on the pod whenever a job is taken. The Autopilot documentation mentions "up to 7 days" of protection against scale-down and auto-upgrade. A few questions here: does Autopilot honor the annotation if a pod changes it dynamically, or is it expected to live in the pod template and not change? And is the 7-day timer counted from pod startup rather than from each time the annotation is toggled?
- Set a high controller.kubernetes.io/pod-deletion-cost whenever a job is enqueued, so that on a ReplicaSet scale-in the lower-cost (idle) pods are removed first (a sketch of patching both annotations is after this list).
- Add a preStop hook that stops enqueuing new jobs and then waits for the running jobs to finish (see the drain sketch after this list).
- Even with these in place, I'm unsure about the robustness; I suspect concurrency issues, for example a pod being chosen for scale-down at the exact moment it picks up a job. Is there an alternative that eliminates this race?
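
For the first two ideas, this is roughly how I imagine the app patching its own pod from inside the container, by calling the Kubernetes API directly with a strategic merge patch (no client library). It assumes the pod's service account has RBAC permission to patch pods and that POD_NAME is injected via the Downward API; whether Autopilot actually honors the toggled annotation is exactly my open question above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public final class PodAnnotations {

    private static final String SA = "/var/run/secrets/kubernetes.io/serviceaccount";

    /** Marks this pod busy/idle by toggling safe-to-evict and pod-deletion-cost. */
    public static void markBusy(boolean busy) throws Exception {
        String token = Files.readString(Path.of(SA + "/token")).trim();
        String ns = Files.readString(Path.of(SA + "/namespace")).trim();
        String pod = System.getenv("POD_NAME"); // injected via the Downward API

        // Strategic merge patch that only touches the two annotations.
        String patch = String.format(
            "{\"metadata\":{\"annotations\":{"
            + "\"cluster-autoscaler.kubernetes.io/safe-to-evict\":\"%s\","
            + "\"controller.kubernetes.io/pod-deletion-cost\":\"%s\"}}}",
            busy ? "false" : "true",
            busy ? "10000" : "0");

        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("https://kubernetes.default.svc/api/v1/namespaces/"
                    + ns + "/pods/" + pod))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "application/strategic-merge-patch+json")
            .method("PATCH", HttpRequest.BodyPublishers.ofString(patch))
            .build();

        HttpResponse<String> resp =
            newClientTrustingClusterCa().send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() / 100 != 2) {
            throw new IllegalStateException("patch failed: " + resp.body());
        }
    }

    // Builds an HttpClient that trusts the in-cluster CA certificate.
    private static HttpClient newClientTrustingClusterCa() throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate ca;
        try (var in = Files.newInputStream(Path.of(SA + "/ca.crt"))) {
            ca = cf.generateCertificate(in);
        }
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null);
        ks.setCertificateEntry("k8s-ca", ca);
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ssl = SSLContext.getInstance("TLS");
        ssl.init(null, tmf.getTrustManagers(), null);
        return HttpClient.newBuilder().sslContext(ssl).build();
    }
}
```

A Quartz JobListener (jobToBeExecuted / jobWasExecuted) would be the natural place to call this, though with concurrent jobs I'd presumably need a running-job counter rather than a plain toggle.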
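And for the preStop idea, a sketch of the drain logic the hook could trigger (e.g. a preStop exec that curls a hypothetical /drain endpoint backed by this method). It is still bounded by terminationGracePeriodSeconds, so on Autopilot this alone would only cover jobs that finish within the ~10-minute cap:

```java
import org.quartz.Scheduler;

public final class Drainer {

    /** Stops new triggers from firing, then blocks until running jobs complete. */
    public static void drain(Scheduler scheduler, long pollMillis) throws Exception {
        scheduler.standby(); // no new triggers fire; running jobs are not interrupted
        while (!scheduler.getCurrentlyExecutingJobs().isEmpty()) {
            Thread.sleep(pollMillis);
        }
        // Alternative: scheduler.shutdown(true) also waits for running jobs to
        // complete, but the scheduler cannot be restarted afterwards in this JVM.
    }
}
```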
Thanks!