Readiness and liveness probes fail randomly on GKE

Hello,

I’m seeing readiness and liveness probes fail randomly on a GKE cluster. It happens almost every day for deployed pods, but under no specific condition.

Here’s the error message for the failed probes:

Liveness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:87: adding pid 1523879 to cgroups caused \"failed to write 1523879 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/besteffort/pod2e7c2d61-ecba-11e8-87ce-42010aa40071/c0da6a7b84c1cd7b48080128f0581f51207e61473e9287769d3825f974537032/cgroup.procs: invalid argument\""

Almost the same with readiness probes:

Readiness probe failed: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:87: adding pid 2588835 to cgroups caused \"failed to write 2588835 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod2e54f377-ecba-11e8-87ce-42010aa40071/53699bd79a393e6ac2ca887dedb4ca0578a8488e4017132a5807fa15647d1721/cgroup.procs: invalid argument\""

My GKE cluster version is 1.10.9-gke.5. Does anybody else have the same problem?

Hi, GKE engineer here. We are aware of the issue. It is most likely caused by a runc bug that has been fixed recently. We are working on updating our images to include the fix.

In the meantime, you may want to relax the failure thresholds for your probes to tolerate the flakes.
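As a rough sketch of what relaxing the thresholds could look like, the fragment below raises failureThreshold on exec probes (the errors above come from exec probes). The container name, image, probe command, and all numeric values are assumptions for illustration, not values from this thread; tune them for your workload.

```yaml
# Hypothetical container spec fragment; names, command, and values are illustrative only.
containers:
- name: app                     # hypothetical container name
  image: example/app:1.0        # hypothetical image
  livenessProbe:
    exec:
      command: ["cat", "/tmp/healthy"]   # hypothetical health check command
    periodSeconds: 10           # probe every 10 seconds
    failureThreshold: 6         # default is 3; allow more consecutive failures before a restart
  readinessProbe:
    exec:
      command: ["cat", "/tmp/healthy"]   # hypothetical health check command
    periodSeconds: 10
    failureThreshold: 6         # allow more consecutive failures before marking the pod unready
```

With these example values, the container would be restarted only after roughly 60 seconds of consecutive probe failures (periodSeconds times failureThreshold), so an occasional flaky exec should no longer trigger a restart. The trade-off is that a container that is really unhealthy is also detected later.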


Ok. Thank you.

Also, how can I know when the images have been fixed? Will I need to upgrade the cluster manually then?

The fix will likely be included in GKE 1.13. Please check the GKE/COS release notes to confirm that before upgrading.


The latest GKE version so far is 1.11.6-gke.2.
It looks like GKE 1.13 will take a long time to be released.
I wonder why it takes so long for a fix to be included in GKE. Do you have an estimate of when GKE 1.13 will be released, @yujuhong?