Hey y’all! I’m having some issues with GKE’s HTTP(S) ingress. I set it up with GKE Autopilot and a simple container-native load balancing setup, and the initial config seems to work just fine. After deploying, however, something happens and the Ingress returns a 502 error for about 5-6 minutes. The pods are online during this time, and the service does seem functional (the pods pass service health checks). It seems like there’s a large delay between when the pods get created and when they actually get added to the ingress as a backend.
Not entirely sure if that’s it, but I’ve included all the config that I’m using to make this happen. It’s probably a dumb mistake that I’m missing somewhere. Does anyone know what’s going on with this?
Cluster information:
Kubernetes version: 1.18.12-gke.1210
Cloud being used: Google Cloud with GKE Autopilot
YAML
Config for the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-deployment
spec:
  strategy:
    # Added after reading this GitHub issue: https://github.com/kubernetes/ingress-gce/issues/34#issuecomment-398831429
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
  selector:
    matchLabels:
      app: foundation-web
  template:
    metadata:
      labels:
        app: foundation-web
    spec:
      serviceAccountName: # Connects to our database account
      containers:
        # Run Cloud SQL proxy so we can safely connect to Postgres on localhost.
        - name: cloud-sql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.17
          resources:
            requests:
              cpu: "250m"
              memory: 100Mi
            limits:
              cpu: "500m"
              memory: 100Mi
          command:
            # Connects to Cloud SQL Proxy
          securityContext:
            runAsNonRoot: true
        - name: foundation-web
          image: # Pulls latest version of image from GCR
          imagePullPolicy: Always
          env:
            # Env-specific config
          resources:
            requests:
              memory: "500Mi"
              cpu: "2"
            limits:
              memory: "1000Mi"
              cpu: "2"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 5
            periodSeconds: 5
          ports:
            - containerPort: 4000
Config for the service:
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/backend-config: '{"ports": {"4000": "foundation-ingress-config"}}'
    cloud.google.com/neg: '{"ingress": true}'
spec:
  type: NodePort
  selector:
    app: foundation-web
  ports:
    - port: 4000
      targetPort: 4000
Config for the BackendConfig:
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-ingress-config
spec:
  timeoutSec: 40
  connectionDraining:
    drainingTimeoutSec: 60
Config for the Ingress:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  backend:
    serviceName: foundation-web-service
    servicePort: 4000
You can format your yaml by highlighting it and pressing Ctrl-Shift-C, it will make your output easier to read.
Are you scaling the endpoints to 0 before scaling up? Deployment lets you set maxSurge so you never go to 0.
Thanks for the reply! I currently only have one pod running, so I would think that maxSurge would be 1? The default seems to be 25% and it rounds up. I’ll try setting it to a higher value and see if that helps any.
When I deploy, I’m just re-applying the file with the deployment config. Should I be recreating the service each time and then updating the ingress to point to the new service, like a blue/green deploy?
The LB takes a little time to program, so if it finds no endpoints at any stage, it could return 5xx. I think there are a bunch of variables.
Is it a VPC-Native cluster?
Make sure there’s never a period of 0 ready endpoints. I just confirmed that updating a deployment should round up on max-surge and the new pod should be ready before the old one is torn down.
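To make the surge behaviour discussed above explicit rather than relying on the defaults, here is a minimal sketch of the Deployment's strategy block. The `replicas: 1` line is an assumption (the original spec omits it, which defaults to 1), and the explicit `maxSurge: 1` simply spells out what the default 25% rounds up to for a single replica. This is an illustration, not a confirmed fix for the 502s.

```yaml
# Fragment to merge into the existing Deployment spec (illustrative values).
spec:
  replicas: 1            # assumption: the original spec relies on the default of 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # allow one extra pod to be created during the rollout
      maxUnavailable: 0  # never remove an old pod before a new one reports Ready
```

With these values the Deployment controller should start a new pod first and only terminate the old one once the new pod is Ready.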
What should I look at to confirm there’s never a period of zero ready endpoints? Is a ready endpoint one that’s marked as healthy by the LB?
Yep, pretty sure this is a VPC-native cluster (I think that’s the default with Autopilot, I’ll double check to be sure). It should be using container-native load balancing.
It seems like the pods launch at the same time, the service confirms that they work successfully, and then for whatever reason the old pod shuts down before the LB adds the new pod successfully and confirms that it works.
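On the question of what counts as a ready endpoint: with container-native load balancing, GKE documents a pod readiness gate that ties the pod's Ready condition to the load balancer's view of the endpoint. A rough sketch of what to look for in the output of `kubectl get pod <pod-name> -o yaml` follows. The excerpt is trimmed and illustrative, and whether the gate is actually present on this cluster is an assumption worth verifying.

```yaml
# Excerpt of a Pod as returned by the API server (trimmed; illustrative only).
spec:
  readinessGates:
    # Expected on pods that sit behind a NEG-backed Service.
    - conditionType: cloud.google.com/load-balancer-neg-ready
status:
  conditions:
    - type: cloud.google.com/load-balancer-neg-ready
      status: "True"   # set once the endpoint is programmed and seen as healthy
    - type: Ready
      status: "True"   # stays "False" until every readiness gate is "True"
```

If the gate is missing from new pods, the rollout can treat them as Ready before the load balancer has picked them up, which would match the gap described above.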
Okay so I got some logs and I’m even more confused by what’s going on. It looks like the updated deployment creates the pods successfully (it waits for Autopilot to provision another 2.5 CPU node then creates the pods). The control plane waits a minute for health checks to pass, then it adds a network endpoint and immediately deletes the old pod. Then two minutes later it removes the network endpoint attached to that pod.
It looks like the control plane (for whatever reason) isn’t waiting enough time after adding the network endpoint for the Ingress’ health checks to pass. Really confused about this.
I checked GitHub and StackOverflow for info about this, but there doesn’t seem to be an issue that applies to this. There’s this issue that describes services being marked as UNHEALTHY (mine aren’t marked as unhealthy, they’re just unknown for two minutes). I double checked the health checks to make sure something isn’t up, and I think that might be an issue?
The health check port is supposed to be 4000, but it seems like it’s port 80? But the health checks return successfully, so I’m not sure if that’s an issue or not.
Screenshot of Kubernetes logs showing this happening (can’t put them on the same post for some reason lol)
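On the health check port question: the BackendConfig CRD has a `healthCheck` block that lets you pin the load balancer's health check to a specific port and path instead of relying on whatever GKE infers. A minimal sketch, reusing the names from the config above; the `checkIntervalSec` value is an arbitrary assumption, and nothing in this thread confirms that the port mismatch is the cause of the 502s.

```yaml
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-ingress-config
spec:
  timeoutSec: 40
  connectionDraining:
    drainingTimeoutSec: 60
  healthCheck:
    type: HTTP
    requestPath: /healthz   # same path as the readiness probe
    port: 4000              # serving port of the foundation-web container
    checkIntervalSec: 5     # assumption: illustrative value, tune as needed
```

After applying it, the health check shown in the Cloud Console should list port 4000, which at least removes one unknown from the investigation.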
Looking at the deployment spec, there is only 1 pod, so during a rolling update there is likely a race between the old pod shutting down and the new pod starting up. If the new pod starts very slowly while the old pod has already been removed from the LB, the LB would return 502s during that period. You can probably find the exact reason the LB returns 502s in the LB access logs.
One solution is to set maxUnavailable to 0 so that it forces the deployment controller to wait for the new pod to become ready before killing the old pod.
Here is some documentation on troubleshooting problems related to 502s:
maxUnavailable is already set to 0 in the deployment spec.
I also upped the number of pods to 3. The issue still persists, except this time it sticks around for longer.
Really, really disappointed that such a basic config just… isn’t working?? And I can’t seem to find what the issue is???
I looked through the documentation for 502 errors and added all of the suggested stuff. Does not seem to make a difference.
Hmmm. If the usual suspect for 502s does not apply here, then we need to dig deeper. I am very curious.
Here is a way to look up HTTP LB access logs. In the log, you should be able to pinpoint the 502 requests/responses.
In the access log, you should be able to find out the exact reason of the 502s. HTTP(S) Load Balancing Logging and Monitoring | Google Cloud
I’m having this exact issue. Causing problems on a production system whenever we deploy an update. The new containers are ready and passing health checks, but for 3-5m, the load balancer will just be returning 502 errors. Did you manage to find a solution?
I’m also having the exact same issue. Kubernetes says the ingress and service are healthy, but I can see multiple backends with ‘No endpoints configured.’ It seems to happen almost at random on new deploys. I think it might be related to how long the deploy takes. Some of them take longer because Autopilot temporarily can’t schedule the pod until a new node becomes available.
Something I noticed is that this only happens to my deployment with cloud sql auth proxy sidecar. It never happens to my other deployment which is the frontend with no sidecar
For me it was related to one of the containers in my pod (ironically, GCP’s own cloud sql auth proxy) not handling SIGTERM gracefully. I posted what worked for me here: docker - GKE Autopilot Ingress returns 502 error for 5-15 minutes after deploying - Stack Overflow
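The linked answer is not reproduced here, but a generic pattern for pods that stop too abruptly on SIGTERM is to delay shutdown with a `preStop` hook and give the pod a longer termination grace period, so the load balancer has time to stop routing to it. A hedged sketch under several assumptions: the 30 and 60 second values are illustrative, the hook is shown on the `foundation-web` container because the proxy image may not include a shell or a `sleep` binary, and the web image is assumed to ship `sleep`.

```yaml
# Fragment to merge into the Deployment's pod template (only added fields shown).
spec:
  template:
    spec:
      # Covers the preStop delay plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: foundation-web
          lifecycle:
            preStop:
              exec:
                # Assumption: the image ships a `sleep` binary.
                command: ["sleep", "30"]
```

Note that all containers in the pod receive SIGTERM around the same time, so a sidecar that exits immediately can still break in-flight requests that depend on it; how to delay the sidecar depends on its image and flags.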
Been observing this issue as well. We don’t use cloud sql auth proxy, but the issue still persists. Wondering if anyone found a resolution? Tried all the possible resolutions from this post: docker - GKE Autopilot Ingress returns 502 error for 5-15 minutes after deploying - Stack Overflow
But nothing worked.