After upgrading to GKE 1.27, we see roughly 10K 502 responses per day with statusDetails “backend_timeout”. Before the upgrade we were running 1.25 and had zero backend_timeout errors. The application is configured to time out after 10 seconds and return a 503. The LB logs show these 502s timing out at 30 seconds, mostly on POST requests that normally complete in under 10 ms.
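For reference, this is roughly the Cloud Logging filter I've been using to isolate these entries (field names per the HTTP(S) LB log schema; I've left out the forwarding-rule/URL-map filter specific to our setup):

```
resource.type="http_load_balancer"
httpRequest.status=502
jsonPayload.statusDetails="backend_timeout"
httpRequest.requestMethod="POST"
```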
The cluster was subsequently upgraded to 1.27.3, then 1.27.5 and 1.27.7, with no change in the backend_timeout behavior. Nothing changed at the application level, nor in the Deployments, Services, Ingress, or other configuration. I have gone through every available log in GCP trying to pinpoint the cause. It seems to happen around node pool scale-up/scale-down events, but not consistently. Occasionally pods are killed without ever receiving a SIGTERM, with no apparent reason in the logs.
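One thing I'm considering, in case the LB is still routing to pods that disappear during scale-down: delaying pod shutdown so the endpoint can be drained first. A minimal sketch of what that would look like in our Deployment (container name and durations are hypothetical, not our actual values):

```yaml
# Sketch: keep the container serving briefly after termination starts,
# giving the LB time to remove the endpoint before the process exits.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app            # hypothetical container name
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "15"]   # delay SIGTERM to the process
```

I haven't confirmed this is the failure mode, though, which is why I'd like to debug further before changing the manifests.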
I would appreciate advice on how to debug this further. I have also noticed some LB-related fixes in the 1.27.8 release. Is upgrading again worth a shot?
Kubernetes version: 1.27.7
Cloud being used: GCP