GCE Internal Load Balancer and preemptible instance/k8s node issue

We’ve been experiencing some long standing issues regarding GCE internal load balancers and those load balancers reattaching preemptible instances. In our case these GCE instances are acting as GKE nodes which is why I’m posting this on a Kubernetes forum. After hours of troubleshooting and working with GCP support(to no avail) I’m strongly leaning in the direction that this issue is caused by the GCE load balancer health check failing to our “internal” pod /healthz endpoints when the preemptible instance is terminated, recreated(due to being provisioned by a managed instance group), and a new “internal” pod is deployed again to the preemptible instance.

There has been an existing issue open in the following which describes the issue we’re experiencing.https://github.com/kubernetes/kubernetes/issues/69362
Summary:
We are using GKE with internal load balancer cloud.google.com/load-balancer-type: “internal” and preemptible instances. Sometimes, when instances are recreated, new instance is not added to google load balancer. We have nginx ingress behind this balancer, so cluster loses it’s ingress in that case.

We don’t experience this issue during every 24 hour rotation in which preemptible instances are terminated/recreated but we have 10+ GKE clusters in various projects and do experience the issue at least once a week if not more frequently.

The odd part of this is that the managed instance groups all have healthy instances, the instances themselves respond and appear healthy, and “kubectl get nodes” shows those instances as active members of our GKE cluster node pools.

The only culprit I have been able to really track down is the GCE load balancer health check potentially timing out during the instance termination/recreation and internal GKE pod deployment process. We are creating the ILB using cloud.google.com/load-balancer-type: “internal” (which is declared in our deployment files and deployed through a CI/CD pipeline). With the ILB creation it looks like some default health check settings are being configured and these settings have fairly low timeout settings, especially considering the nature of our configuration and the longer time frame for a healthy “internal” pod to come back online. The health check timeout settings being created with the ILB are below:
timeoutSec: 1
type: HTTP
unhealthyThreshold: 3

My best guess is the health check timeouts are failing while searching for a pod with the correct “app: internal” label (the LoadBalancer service is configured with a selector to match app: internal). Frequently preemptible instances are not recreated right away either as there looks to be frequent resource contention from the pool that preemptible instances draw from(I can see this in GCE instance logs).

Has anybody run across a similar issue or have any other avenues of troubleshooting? As far as I know GCP hasn’t enabled health check logs for customer access so don’t have much ability to troubleshoot the ILB health checks. To get ahead of “why don’t you just use non-preemptible instances” these GKE clusters are being used purely in a development environment(before staging) for our multiple development teams which frequently scale up and down so using preemptible nodes is a huge cost savings for us in environments running workloads that won’t care about the instances restarting.

I don’t have much experience with GCP, but:

  • Which kubernetes version are you using? Do you know around which version it started happening? Or have you checked the changelog just in case something seems relevant?

  • It surprises me that you can’t get logs or charts for the health checks. Isn’t there anything? :frowning:

And Google support didn’t help to shed any light on this? :frowning:

Once it is broken, how do you fix it? Modifying the service triggers a resync and fixes it, for example? Or what ways do you know to fix it once it is broken?

I’d see which ways can be fixed, to see if you can easily have a workaround or to automatically fix it when it breaks (if it’s for dev, seems like it’s no problem), or just having the ingress on another node.

Once you have a ““working”” setup, you might want to dig into the kubernetes integration with GCP. I’d try to increase logging and see what you can get from the nodes POV.

Thanks for the reply Rata.

  • We’re currently on version 1.12.7-gke.10. I don’t know the exact version around which this started happening but I personally have witnessed the issue occurring for the past 3+ months. Our development teams have only started heavily using these clusters within the last several weeks which has made the issue much more noticeable.

  • Yep, GCE health checks currently aren’t exposed to customers which has been confirmed by GCP support. I believe there’s an alpha feature to view their global HTTPS health check logs but nothing that I know of for their L4 internal load balancers.

Google support has edged around admitting that this isn’t standard behavior. The closest they came was stating that preemptible nodes with GKE and internal load balancers is a beta feature and thus not subject to SLAs though I can’t find or have missed any documentation supporting this claim. The issue I linked in my original post had some /lifecycle settings updated just a few hours ago by somebody from the Google team which makes me believe this really is a bug and they’re finally looking into it (prompted by me annoying them for weeks via my ticket or not :slight_smile: )

When the internal load balancer loses all associated nodes the only fix I’ve found is to remove our “internal” HelmRelease (we manage our Kube cluster deployments using Flux) which triggers associated resources to be deleted and when the HelmRelease deploys again, a new internal load balancer is created when the “internal” service is created, along with new “internal” pods that the new load balancer associates with and the issue is the temporarily fixed for some time until the load balancer again loses all of the nodes.

I’m looking into some automation to stop/start our preemptible nodes to see if that helps alleviate the issue, mainly based on how GCP terminates preemptible nodes ungracefully and then recreates them as part of our instance groups scaling back up. My thoughts are perhaps the ungraceful nature of the instance termination is causing some odd behavior with the “internal” pods on those instances though all of the other data i’ve seen doesn’t support any evidence to support this. Overall I truly believe this is a bug, especially given how GCP support won’t give an answer that this isn’t a supported setup and skirts around my questions digging deeper. While we can hopefully find some sort of work around on our own I think we ultimately need the GKE team to put out a fix.

Minhan is looking into this and has a theory,

1 Like

cuzz333

    June 12

Thanks for the reply Rata.

  • We’re currently on version 1.12.7-gke.10. I don’t know the exact version around which this started happening but I personally have witnessed the issue occurring for the past 3+ months. Our development teams have only started heavily using these clusters within the last several weeks which has made the issue much more noticeable.
  • Yep, GCE health checks currently aren’t exposed to customers which has been confirmed by GCP support. I believe there’s an alpha feature to view their global HTTPS health check logs but nothing that I know of for their L4 internal load balancers.

Google support has edged around admitting that this isn’t standard behavior. The closest they came was stating that preemptible nodes with GKE and internal load balancers is a beta feature and thus not subject to SLAs though I can’t find or have missed any documentation supporting this claim. The issue I linked in my original post had some /lifecycle settings updated just a few hours ago by somebody from the Google team which makes me believe this really is a bug and they’re finally looking into it (prompted by me annoying them for weeks via my ticket or not :slight_smile: )

Cool, Tim just answered that they are looking into it! :slight_smile:

When the internal load balancer loses all associated nodes the only fix I’ve found is to remove our “internal” HelmRelease (we manage our Kube cluster deployments using Flux) which triggers associated resources to be deleted and when the HelmRelease deploys again, a new internal load balancer is created when the “internal” service is created, along with new “internal” pods that the new load balancer associates with and the issue is the temporarily fixed for some time until the load balancer again loses all of the nodes.

Oh, quite an invasive fix to recreate :-/

I’m looking into some automation to stop/start our preemptible nodes to see if that helps alleviate the issue, mainly based on how GCP terminates preemptible nodes ungracefully and then recreates them as part of our instance groups scaling back up. My thoughts are perhaps the ungraceful nature of the instance termination is causing some odd behavior with the “internal” pods on those instances though all of the other data i’ve seen

I doubt internal pods have anything to do with this. I’m not sure how this is handled on Google cloud, but my bet is that just their API is failing to associate nodes.

I used spot instances in AWS and LB registration was done completely independent of kubernetes, and never had an issue with this. But, again, not sure how this is articulated on Google.

Another work around to have in mind is if you can’t have the load balancer associated with the autoscaling group (not sure about the names in GCP), instead of using type service with annotations to create the LB.

You can handle it via terraform, and see if that way works reliably? Not sure it differs on what is done today by kubernetes, though :slight_smile:

doesn’t support any evidence to support this. Overall I truly believe this is a bug, especially given how GCP support won’t give an answer that this isn’t a supported setup and skirts around my questions digging deeper. While we can hopefully find some sort of work around on our own I think we ultimately need the GKE team to put out a fix.

Definitely, this seems to be an issue where they "ASG"s are not properly registered to their LB (in this special case).

Until they fix it, probably we can only try to workaround or (if it is in kubernetes) fix it. But most likely is on Google side and difficult to debug without their insight on what is happening on their side :slight_smile:

Thanks for the update @thockin