GKE HTTP(S) Ingress returns 502 errors for 5-6 minutes after deploying

Hey y’all! I’m having some issues with GKE’s HTTP(S) Ingress. I set it up with GKE Autopilot and a simple container-native load balancing setup, and the initial config seems to work just fine. After deploying, however, something happens and the Ingress returns a 502 error for about 5-6 minutes. The pods are online during this time, and the service does seem functional (the pods pass service health checks). It seems like there’s a large delay between when the pods get created and when they actually get added to the Ingress as a backend.

Not entirely sure if that’s it, but I’ve included all the config that I’m using to make this happen. It’s probably a dumb mistake that I’m missing somewhere. Does anyone know what’s going on with this?

Cluster information:

Kubernetes version: 1.18.12-gke.1210
Cloud being used: Google Cloud with GKE Autopilot

YAML

Config for the deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: foundation-deployment
    spec:
      strategy:
        # Added after reading this GitHub issue: https://github.com/kubernetes/ingress-gce/issues/34#issuecomment-398831429
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
      selector:
        matchLabels:
          app: foundation-web
      template:
        metadata:
          labels:
            app: foundation-web
        spec:
          serviceAccountName: # Connects to our database account
          containers:
            # Run Cloud SQL proxy so we can safely connect to Postgres on localhost.
            - name: cloud-sql-proxy
              image: gcr.io/cloudsql-docker/gce-proxy:1.17
              resources:
                requests:
                  cpu: "250m"
                  memory: 100Mi
                limits:
                  cpu: "500m"
                  memory: 100Mi
              command:
                # Connects to Cloud SQL Proxy
              securityContext:
                runAsNonRoot: true
            - name: foundation-web
              image: # Pulls latest version of image from GCR
              imagePullPolicy: Always
              env:
               # Env-specific config
              resources:
                requests:
                  memory: "500Mi"
                  cpu: "2"
                limits:
                  memory: "1000Mi"
                  cpu: "2"
              livenessProbe:
                httpGet:
                  path: /healthz
                  port: 4000
                initialDelaySeconds: 5
                periodSeconds: 5
              readinessProbe:
                httpGet:
                  path: /healthz
                  port: 4000
                initialDelaySeconds: 5
                periodSeconds: 5
              ports:
                - containerPort: 4000

Config for the service:

    apiVersion: v1
    kind: Service
    metadata:
      name: foundation-web-service
      annotations:
        cloud.google.com/backend-config: '{"ports": {"4000": "foundation-ingress-config"}}'
        cloud.google.com/neg: '{"ingress": true}'
    spec:
      type: NodePort
      selector:
        app: foundation-web
      ports:
        - port: 4000
          targetPort: 4000

Config for the BackendConfig:

    apiVersion: cloud.google.com/v1
    kind: BackendConfig
    metadata:
      name: foundation-ingress-config
    spec:
      timeoutSec: 40
      connectionDraining:
        drainingTimeoutSec: 60

Config for the Ingress:

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: foundation-web-ingress
      labels:
        name: foundation-web-ingress
    spec:
      backend:
        serviceName: foundation-web-service
        servicePort: 4000
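
As an aside: `networking.k8s.io/v1beta1` Ingress is deprecated and removed in Kubernetes 1.22+. On 1.18 the config above still works, but for clusters at 1.19 or later the same single-backend Ingress would be written against the v1 API, roughly like this (a sketch, reusing the service name and port from the config above):

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: foundation-web-ingress
    spec:
      # v1 renames "backend" to "defaultBackend" and nests the service reference
      defaultBackend:
        service:
          name: foundation-web-service
          port:
            number: 4000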

You can format your yaml by highlighting it and pressing Ctrl-Shift-C, it will make your output easier to read.

Are you scaling the endpoints to 0 before scaling up? Deployment lets you set maxSurge so you never go to 0.

Thanks for the reply! I currently only have one pod running, so I would think that maxSurge would be 1? The default seems to be 25% and it rounds up. I’ll try setting it to a higher value and see if that helps any.
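
For reference, the surge settings can be pinned explicitly rather than relying on the 25%-rounded-up default. A sketch of what the strategy stanza could look like with a single replica (with `maxSurge: 1` and `maxUnavailable: 0`, the deployment controller must create and pass readiness on the new pod before deleting the old one):

    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 1        # create the replacement pod first
        maxUnavailable: 0  # never take the only pod down before the new one is Ready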

When I deploy, I’m just re-applying the file with the deployment config. Should I be recreating the service each time and then updating the ingress to point to the new service, like a blue/green deploy?

The LB takes a little time to program, so if it finds no endpoints at any stage, it could return 5xx. I think there are a bunch of variables.

Is it a VPC-Native cluster?

Make sure there’s never a period of 0 ready endpoints. I just confirmed that updating a deployment rounds up on maxSurge, and the new pod should be ready before the old one is torn down.

What should I look at to confirm there’s never a period of zero ready endpoints? Is a ready endpoint one that’s marked as healthy by the LB?

Yep, pretty sure this is a VPC-native cluster (I think that’s the default with Autopilot, I’ll double check to be sure). It should be using container-native load balancing.

It seems like the pods launch at the same time, the service confirms that they work successfully, and then for whatever reason the old pod shuts down before the LB adds the new pod successfully and confirms that it works.

Okay, so I got some logs and I’m even more confused by what’s going on. It looks like the updated deployment creates the pods successfully (it waits for Autopilot to provision another 2.5 CPU node, then creates the pods). The control plane waits a minute for health checks to pass, then it adds a network endpoint and immediately deletes the old pod. Then, two minutes later, it removes the network endpoint attached to that pod.

It looks like the control plane (for whatever reason) isn’t waiting enough time after adding the network endpoint for the Ingress’ health checks to pass. Really confused about this.

I checked GitHub and StackOverflow for info about this, but there doesn’t seem to be an existing issue that applies. There’s one issue that describes services being marked as UNHEALTHY (mine aren’t marked as unhealthy, they’re just unknown for two minutes). I double-checked the health checks to make sure nothing is off there, and I think I may have found something:

The health check port is supposed to be 4000, but it looks like the LB health check is using port 80? The health checks return successfully, though, so I’m not sure whether that’s actually a problem.
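
If the LB health check really is probing port 80, it can be pinned to the app port through the BackendConfig, which supports a `healthCheck` section on GKE. A hedged sketch, extending the BackendConfig above and reusing the `/healthz` path and port 4000 from the deployment (the interval/timeout values are illustrative, not recommendations):

    apiVersion: cloud.google.com/v1
    kind: BackendConfig
    metadata:
      name: foundation-ingress-config
    spec:
      timeoutSec: 40
      connectionDraining:
        drainingTimeoutSec: 60
      healthCheck:
        type: HTTP
        requestPath: /healthz   # match the pod's readiness probe
        port: 4000              # probe the container port, not 80
        checkIntervalSec: 5
        timeoutSec: 5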

Screenshot of Kubernetes logs showing this happening (can’t put them on the same post for some reason lol)

Looking at the deployment spec, there is only 1 pod. So during the rolling update, there is likely a race between the old pod getting shut down and the new pod starting up. If the new pod starts up very slowly while the old pod got removed from LB, LB would return 502s in this period. You can probably find the exact reason why LB returns 502s in the LB access logs.

One solution is to set maxUnavailable to 0 so that it forces the deployment controller to wait for the new pod to become ready before killing the old pod.
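
A related mitigation sometimes used with NEG-backed ingress is to keep the terminating pod serving a little longer with a preStop sleep, so the LB has time to program the new endpoint and drain the old one before the pod actually exits. A sketch of what that could look like on the app container (the 60s sleep is a guess, not a measured number, and it assumes the image ships a `sleep` binary):

    spec:
      terminationGracePeriodSeconds: 90  # must exceed the preStop sleep
      containers:
        - name: foundation-web
          lifecycle:
            preStop:
              exec:
                # Hold the pod open while the LB detaches the old endpoint
                command: ["sleep", "60"]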

Here is some documentation on troubleshooting problems related to 502s:

maxUnavailable is already set to 0 in the deployment spec :frowning:

I also upped the number of pods to 3. The issue still persists, except this time it sticks around for longer.

Really, really disappointed that such a basic config just… isn’t working?? And I can’t seem to find what the issue is???

I looked through the documentation for 502 errors and added all of the suggested fixes. It doesn’t seem to make a difference.

Hmmm. If the usual suspect for 502s does not apply here, then we need to dig deeper. I am very curious.

Here is a way to look up HTTP LB access logs. In the log, you should be able to pinpoint the 502 requests/responses.

In the access log, you should be able to find the exact reason for the 502s. HTTP(S) Load Balancing Logging and Monitoring  |  Google Cloud