Increased 502 "backend_timeout" responses after upgrading to GKE 1.27

Hello,

After upgrading to GKE 1.27, we started seeing roughly 10K 502 “backend_timeout” responses per day. Before the upgrade we were running 1.25 and had zero backend_timeout issues. The application is configured to time out after 10 seconds and return a 503. The 502s show 30-second timeouts in the LB logs, mostly on POST requests that normally complete in under 10 ms.
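For context, the 30-second figure in the LB logs matches the default backend service timeout of GCP HTTP(S) load balancers, so the LB appears to be giving up on its own timeout rather than relaying the app's 10-second 503. A minimal sketch of where that knob lives, using a BackendConfig attached to the Service (names and ports here are illustrative, not from the actual setup):

```yaml
# Illustrative only: shows which setting produces the 30 s value seen in the LB logs.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: app-backendconfig        # hypothetical name
spec:
  timeoutSec: 15                 # backend service timeout; defaults to 30 if unset
---
apiVersion: v1
kind: Service
metadata:
  name: app                      # hypothetical name
  annotations:
    cloud.google.com/backend-config: '{"default": "app-backendconfig"}'
spec:
  selector:
    app: app
  ports:
    - port: 80
      targetPort: 8080
```

Lowering the timeout would not fix the underlying 502s, but it helps confirm which component is holding the request for 30 seconds.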

The cluster was upgraded to 1.27.3 and later to 1.27.5 and 1.27.7 with no change in the backend timeout behavior. Nothing changed at the application level or in the Deployments, Services, Ingress, or other configuration. I have gone through every available log in GCP to try to pinpoint the issue. It seems to happen around node pool scale-up or scale-down, but not consistently, and pods are occasionally killed without receiving a SIGTERM, with no apparent reason in the logs.

I would appreciate advice on how to debug this further. I have also noticed some LB-related fixes in 1.27.8; is upgrading again worth a shot?

Cluster information:

Kubernetes version: 1.27.7
Cloud being used: GCP

The root cause of the increased 502 “backend_timeout” responses after the GKE 1.27 upgrade could be how the load balancer interacts with node pool scaling: with instance-group-based (non-container-native) load balancing, the LB sends traffic to node VMs and relies on kube-proxy to forward it to a pod, so node scale-downs and abrupt pod terminations, like the ones observed without a SIGTERM, can leave the LB waiting on backends that no longer route to a healthy pod until its 30-second timeout fires.
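If scale-down terminations are the trigger, one commonly used mitigation (not what ultimately fixed it here) is to give the load balancer time to drain a pod before it exits, via a preStop sleep and a matching termination grace period. A sketch, assuming a hypothetical Deployment named app whose image includes a shell:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app                               # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      terminationGracePeriodSeconds: 60   # leave room for the LB to stop sending traffic
      containers:
        - name: app
          image: gcr.io/my-project/app:latest   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Keep serving while the pod is removed from LB rotation;
                # assumes "sh" exists in the container image.
                command: ["sh", "-c", "sleep 30"]
```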

The issue was resolved by switching the Services to container-native load balancing (see the GKE documentation page “Container-native load balancing through Ingress”).
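For reference, a Service is opted into container-native load balancing by adding the NEG annotation, so the Ingress-provisioned LB routes directly to pod IPs instead of through node instance groups and kube-proxy. A minimal sketch with an illustrative Service name and ports:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app                    # hypothetical name
  annotations:
    # Enables container-native load balancing (NEG backends) for Ingress.
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: app
  ports:
    - port: 80
      targetPort: 8080
```

Depending on cluster version and network mode, VPC-native clusters may already default to NEGs, but setting the annotation explicitly makes the intent visible; with NEG backends the LB health-checks the pods directly, which avoids the extra hop through the nodes during node churn.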