We’ve been experiencing some long standing issues regarding GCE internal load balancers and those load balancers reattaching preemptible instances. In our case these GCE instances are acting as GKE nodes which is why I’m posting this on a Kubernetes forum. After hours of troubleshooting and working with GCP support(to no avail) I’m strongly leaning in the direction that this issue is caused by the GCE load balancer health check failing to our “internal” pod /healthz endpoints when the preemptible instance is terminated, recreated(due to being provisioned by a managed instance group), and a new “internal” pod is deployed again to the preemptible instance.
There has been an existing issue open in the following which describes the issue we’re experiencing.https://github.com/kubernetes/kubernetes/issues/69362
We are using GKE with internal load balancer cloud.google.com/load-balancer-type: “internal” and preemptible instances. Sometimes, when instances are recreated, new instance is not added to google load balancer. We have nginx ingress behind this balancer, so cluster loses it’s ingress in that case.
We don’t experience this issue during every 24 hour rotation in which preemptible instances are terminated/recreated but we have 10+ GKE clusters in various projects and do experience the issue at least once a week if not more frequently.
The odd part of this is that the managed instance groups all have healthy instances, the instances themselves respond and appear healthy, and “kubectl get nodes” shows those instances as active members of our GKE cluster node pools.
The only culprit I have been able to really track down is the GCE load balancer health check potentially timing out during the instance termination/recreation and internal GKE pod deployment process. We are creating the ILB using cloud.google.com/load-balancer-type: “internal” (which is declared in our deployment files and deployed through a CI/CD pipeline). With the ILB creation it looks like some default health check settings are being configured and these settings have fairly low timeout settings, especially considering the nature of our configuration and the longer time frame for a healthy “internal” pod to come back online. The health check timeout settings being created with the ILB are below:
My best guess is the health check timeouts are failing while searching for a pod with the correct “app: internal” label (the LoadBalancer service is configured with a selector to match app: internal). Frequently preemptible instances are not recreated right away either as there looks to be frequent resource contention from the pool that preemptible instances draw from(I can see this in GCE instance logs).
Has anybody run across a similar issue or have any other avenues of troubleshooting? As far as I know GCP hasn’t enabled health check logs for customer access so don’t have much ability to troubleshoot the ILB health checks. To get ahead of “why don’t you just use non-preemptible instances” these GKE clusters are being used purely in a development environment(before staging) for our multiple development teams which frequently scale up and down so using preemptible nodes is a huge cost savings for us in environments running workloads that won’t care about the instances restarting.