Hi,
I’ve been tasked with migrating an old monolith to GKE Autopilot (my first time with GKE Autopilot).
The monolith can scale horizontally, but it runs jobs with Quartz, and these jobs can last for hours. The issue is scale-down: pod termination has to be deferred until the jobs currently being processed finish, because some of them are critical and their work would be lost.
The monolith is large, we don't fully know everything it does, and a full rewrite isn't feasible right now, but I can add a few small patches, for example two APIs (sketched right after this list):
- one to stop enqueuing new jobs
- one to report whether any job executions are currently running
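
Since the scheduler is Quartz, both patches seem to map onto existing Scheduler methods: standby() stops triggers from firing without interrupting running jobs, and getCurrentlyExecutingJobs() lists in-flight executions in the local JVM. Something like this is what I have in mind, assuming the monolith can expose a small JDK HttpServer on a side port (the port and paths here are made up):

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import org.quartz.Scheduler;

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public final class LifecycleEndpoints {

    /** Registers /quiesce and /busy next to the app, on a hypothetical side port. */
    public static void register(Scheduler scheduler) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8090), 0);

        // POST /quiesce: stop firing new triggers; jobs already running keep going.
        server.createContext("/quiesce", exchange -> {
            int code = 200;
            String body = "quiesced";
            try {
                scheduler.standby();
            } catch (Exception e) {
                code = 500;
                body = String.valueOf(e.getMessage());
            }
            respond(exchange, code, body);
        });

        // GET /busy: how many jobs are executing in this JVM right now.
        // getCurrentlyExecutingJobs() only sees the local scheduler instance,
        // which is what we want here (each pod reports for itself).
        server.createContext("/busy", exchange -> {
            int code = 200;
            String body;
            try {
                body = String.valueOf(scheduler.getCurrentlyExecutingJobs().size());
            } catch (Exception e) {
                code = 500;
                body = String.valueOf(e.getMessage());
            }
            respond(exchange, code, body);
        });

        server.start();
    }

    private static void respond(HttpExchange ex, int code, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(code, bytes.length);
        try (OutputStream os = ex.getResponseBody()) {
            os.write(bytes);
        }
    }
}
```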
Given that GKE Autopilot caps terminationGracePeriodSeconds at 10 minutes, what would be a robust strategy for this migration that doesn't interrupt in-flight work?
Ideas I’m considering:
- Set cluster-autoscaler.kubernetes.io/safe-to-evict to "false" on the pod whenever a job is taken. The Autopilot documentation mentions "up to 7 days" of protection against scale-down and auto-upgrade. A few questions here: does Autopilot honor the annotation if a pod changes it dynamically, or is it expected to live in the pod template and not change? And is the 7-day timer counted from pod startup rather than from each time the annotation is toggled?
- Set a high controller.kubernetes.io/pod-deletion-cost whenever a job is enqueued, so that on a ReplicaSet scale-in the lower-cost (idle) pods are removed first (a sketch of patching both annotations is after this list).
- Add a preStop hook that stops enqueuing new jobs and then waits for the running jobs to finish (see the drain sketch after this list).
- Even with these in place, I'm unsure about the robustness; I suspect concurrency issues, for example a pod being chosen for scale-down at the exact moment it picks up a job. Is there an alternative that eliminates this race?
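
For the first two ideas, this is roughly how I imagine the app patching its own pod from inside the container, by calling the Kubernetes API directly with a strategic merge patch (no client library). It assumes the pod's service account has RBAC permission to patch pods and that POD_NAME is injected via the Downward API; whether Autopilot actually honors the toggled annotation is exactly my open question above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.CertificateFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public final class PodAnnotations {

    private static final String SA = "/var/run/secrets/kubernetes.io/serviceaccount";

    /** Marks this pod busy/idle by toggling safe-to-evict and pod-deletion-cost. */
    public static void markBusy(boolean busy) throws Exception {
        String token = Files.readString(Path.of(SA + "/token")).trim();
        String ns = Files.readString(Path.of(SA + "/namespace")).trim();
        String pod = System.getenv("POD_NAME"); // injected via the Downward API

        // Strategic merge patch that only touches the two annotations.
        String patch = String.format(
            "{\"metadata\":{\"annotations\":{"
            + "\"cluster-autoscaler.kubernetes.io/safe-to-evict\":\"%s\","
            + "\"controller.kubernetes.io/pod-deletion-cost\":\"%s\"}}}",
            busy ? "false" : "true",
            busy ? "10000" : "0");

        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("https://kubernetes.default.svc/api/v1/namespaces/"
                    + ns + "/pods/" + pod))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "application/strategic-merge-patch+json")
            .method("PATCH", HttpRequest.BodyPublishers.ofString(patch))
            .build();

        HttpResponse<String> resp =
            newClientTrustingClusterCa().send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() / 100 != 2) {
            throw new IllegalStateException("patch failed: " + resp.body());
        }
    }

    // Builds an HttpClient that trusts the in-cluster CA certificate.
    private static HttpClient newClientTrustingClusterCa() throws Exception {
        CertificateFactory cf = CertificateFactory.getInstance("X.509");
        Certificate ca;
        try (var in = Files.newInputStream(Path.of(SA + "/ca.crt"))) {
            ca = cf.generateCertificate(in);
        }
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null);
        ks.setCertificateEntry("k8s-ca", ca);
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(ks);
        SSLContext ssl = SSLContext.getInstance("TLS");
        ssl.init(null, tmf.getTrustManagers(), null);
        return HttpClient.newBuilder().sslContext(ssl).build();
    }
}
```

A Quartz JobListener (jobToBeExecuted / jobWasExecuted) would be the natural place to call this, though with concurrent jobs I'd presumably need a running-job counter rather than a plain toggle.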
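And for the preStop idea, a sketch of the drain logic the hook could trigger (e.g. a preStop exec that curls a hypothetical /drain endpoint backed by this method). It is still bounded by terminationGracePeriodSeconds, so on Autopilot this alone would only cover jobs that finish within the ~10-minute cap:

```java
import org.quartz.Scheduler;

public final class Drainer {

    /** Stops new triggers from firing, then blocks until running jobs complete. */
    public static void drain(Scheduler scheduler, long pollMillis) throws Exception {
        scheduler.standby(); // no new triggers fire; running jobs are not interrupted
        while (!scheduler.getCurrentlyExecutingJobs().isEmpty()) {
            Thread.sleep(pollMillis);
        }
        // Alternative: scheduler.shutdown(true) also waits for running jobs to
        // complete, but the scheduler cannot be restarted afterwards in this JVM.
    }
}
```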
Thanks!