Increased 502 "backend_timeout" responses after upgrading to GKE 1.27

mdzigurski · February 1, 2024, 7:45pm

Hello,

After upgrading to GKE 1.27, we are seeing about 10K 502 “backend_timeout” responses per day. Before the upgrade, we were running 1.25 and had zero backend_timeout issues. The application is configured to time out after 10 seconds and return a 503. These 502 backend timeouts show 30-second timeouts in the LB logs, mostly on POST requests. Those POST requests typically take <10ms.

The cluster was upgraded to 1.27.3 and later to 1.27.5 and 1.27.7, with no change in the backend timeout behavior. Nothing changed at the application level or in the deployment, services, ingress, or configuration. I looked into every available log in GCP and tried to pinpoint where the issue could be. It seems to happen around node pool scale-up or scale-down, but not consistently. Pods are occasionally killed without a SIGTERM signal, and the logs show no apparent reason.

I would appreciate advice on how I could debug this further. I have also noticed some fixes around LB in the 1.27.8 version. Is upgrading again worth a shot?

Cluster information:

Kubernetes version: 1.27.7
Cloud being used: GCP

timwolfe94022 · February 2, 2024, 4:36am

The root cause of the increased 502 “backend_timeout” responses after upgrading to GKE 1.27 could be related to changes in the load balancer behavior or node pool scaling mechanisms introduced in the newer Kubernetes version. Inconsistencies around node scaling and pod lifecycle events, like unexpected pod terminations without SIGTERM signals, might be contributing to these timeout issues.

mdzigurski · February 7, 2024, 3:07pm

The issue was resolved by switching the services to container-native load balancing. See "Container-native load balancing through Ingress" in the Google Kubernetes Engine (GKE) documentation.
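
For anyone hitting the same problem: container-native load balancing is normally enabled per Service with the `cloud.google.com/neg` annotation, so the load balancer targets pod IPs through network endpoint groups instead of going through node instance groups. Below is a minimal sketch, not the exact manifest used in this thread; the Service name, selector, and ports are hypothetical placeholders, and it assumes a VPC-native cluster.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                # hypothetical name
  annotations:
    # Ask GKE Ingress to create NEGs and route directly to pod IPs.
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: ClusterIP
  selector:
    app: my-app               # hypothetical label
  ports:
    - port: 80                # hypothetical service port
      targetPort: 8080        # hypothetical container port
```

After the annotation is applied, the Ingress backend should be reconciled onto NEGs; it is worth confirming in the load balancer's backend service that the backends are listed as network endpoint groups rather than instance groups.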
