GKE: node is not usable but still marked as ready

minhdanh · June 10, 2019, 3:50am

I’ve just had a problem with a Kubernetes node on GKE. The pods scheduled on the node were killed and failed to start. When I checked, I saw the errors below:

 Normal   Killing    <invalid> (x130 over <invalid>)   kubelet, gke-pool-01-91602ba2-rdpc  Killing container with id docker://portus-postgresql:Container failed liveness probe.. Container will be killed and recreated.
  Warning  BackOff    <invalid> (x1532 over <invalid>)  kubelet, gke-pool-01-91602ba2-rdpc  Back-off restarting failed container
  Warning  Unhealthy  <invalid> (x2330 over <invalid>)  kubelet, gke-pool-01-91602ba2-rdpc  (combined from similar events): Readiness probe failed: OCI runtime exec failed: write /tmp/runc-process647055078: no space left on device: unknown

The key error is clearly OCI runtime exec failed: write /tmp/runc-process647055078: no space left on device: unknown.
This looks like an issue with the node rather than with a Docker container or Docker volume. I then tried to SSH into the K8s node that was hosting the pod, but the server wouldn’t respond.
Interestingly, though, K8s still considers the node “Ready”:

gke-pool-01-91602ba2   Ready    <none>   13d    v1.13.5-gke.10
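
Since the Ready status did not reflect the problem here, one way to notice it is to scan event text for the disk-full message. Below is a minimal sketch in Python, assuming the event lines look like the ones pasted above; the function name `nodes_with_full_disk` and the regular expression are illustrative only, not part of any Kubernetes tooling.

```python
import re

# Substring copied from the event message in this post.
DISK_FULL = "no space left on device"

# Matches the "kubelet, <node-name>" source column as it appears in the
# event lines above (an assumption about the output layout).
NODE_RE = re.compile(r"kubelet,\s+(\S+)")


def nodes_with_full_disk(event_text):
    """Return the set of node names whose events mention a full disk."""
    nodes = set()
    for line in event_text.splitlines():
        if DISK_FULL in line:
            match = NODE_RE.search(line)
            if match:
                nodes.add(match.group(1))
    return nodes
```

Feeding it the saved output of the events section would return the names of the affected nodes, which could then be checked by hand.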

I had to cordon the node and drain it so that the pods were rescheduled onto other nodes.
But I’m afraid this can happen again, as there was no way to figure out the cause.
Is anybody else seeing the same problem?
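
For anyone hitting the same thing, the cordon-and-drain step can be scripted. This is a rough sketch only, assuming `kubectl` is on the PATH and already pointed at the cluster; the helper names `drain_commands` and `cordon_and_drain` are made up for illustration, and the `--ignore-daemonsets` flag is my assumption about what a typical drain needs, not something taken from the steps above.

```python
import subprocess


def drain_commands(node):
    """Build the kubectl commands that cordon and then drain a node."""
    return [
        ["kubectl", "cordon", node],
        # Assumption: DaemonSet-managed pods are present, in which case
        # drain does not proceed without --ignore-daemonsets.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
    ]


def cordon_and_drain(node, dry_run=True):
    """Print the commands by default; run them only when dry_run is False."""
    for cmd in drain_commands(node):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if kubectl exits non-zero.
            subprocess.run(cmd, check=True)
```

Calling `cordon_and_drain("<node-name>")` only prints the two commands; pass `dry_run=False` to actually run them.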

Cluster information:

Kubernetes version: 1.13.5-gke.10
Cloud being used: GKE
Installation method:
Host OS: Container-Optimized OS (cos)
CNI and version: Not sure
CRI and version: Not sure
