How has Kubernetes failed for you?


#1

Let’s share how Kubernetes has failed for us and what we’ve learned from it. It will probably help more than sharing our successes.

I’ll go first.

Last year I had 2 etcd nodes go down. I wasn’t monitoring etcd, so the cluster went into read-only mode and was unhealthy. The first sign of a problem was that kube-dns started failing to resolve new pod IPs. I’m not exactly sure how long etcd was down (my best guess is 2 days). It took me a while to debug because I suspected kube-proxy, which was a common source of problems for me.
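
For anyone wondering why losing 2 nodes was enough to make the cluster read-only, here is a rough sketch of the majority arithmetic. It assumes a 3-member etcd cluster (the size is my assumption for illustration), and the function names are made up for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are healthy to keep accepting writes."""
    return healthy >= quorum(members)


# Assumed cluster size of 3: one failure is tolerated, two are not.
for healthy in (3, 2, 1):
    print(f"3 members, {healthy} healthy -> quorum: {has_quorum(3, healthy)}")
```

With 3 members the majority is 2, so a single surviving member cannot commit writes on its own.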

I’d love to hear other stories.


#2

Hi Justin - what did you do to fix your etcd issue?

We recently had a master node go down due to an issue on the underlying hardware - a new master came up automatically and there were no issues apart from a delay in the etcd volumes detaching from the old master.

We’ve also seen kube-dns take a long time to resolve DNS queries while we were rolling out an update to the cluster, but that went away after a pod restart.


#3

In my case the etcd cluster was external to the Kubernetes cluster and the services were simply not running. The data was still available, so all I had to do was restart the failed etcd containers. They automatically re-synced the data, and Kubernetes was happy again once etcd was healthy.


#4

I have two: one with Kops / AWS a year ago (between k8s 1.4 and 1.6), and another that happened a month ago.

For the one that happened a year ago, I checked all the masters and nodes and found no clue at the time. Killing nodes and bringing up new ones didn’t help: the nodes just stopped working with a network communication problem, even though ssh and ping to other nodes worked totally fine. The master reported:
NetworkUnavailable False 39 minutes 39 minutes RouteCreated RouteController created a route
while the rest of the nodes reported:
NetworkUnavailable True 5 seconds 5 seconds NoRouteCreated RouteController failed to create a route
The root cause was that my cluster used spot instances at the time to save cost, and I hadn’t checked that shrinking down / hibernating the whole cluster would leave a lot of “blackhole” entries in the AWS VPC routing table, which has a limit on entries per region (maximum 50, and each blackhole counts as one). It took me a whole day to figure out, and it seems to have been fixed after k8s 1.6.
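
A minimal sketch of the check I wish I had run back then. It assumes the routes have already been fetched as a list of dicts carrying a "State" field of "active" or "blackhole" (the shape the EC2 API uses for route entries). The limit of 50 is the figure from my story above, the helper names are hypothetical, and the sample data is invented.

```python
from typing import Iterable, List, Mapping

ROUTE_LIMIT = 50  # figure quoted above; adjust if your account's limit differs


def blackhole_routes(routes: Iterable[Mapping[str, str]]) -> List[Mapping[str, str]]:
    """Return the routes whose target no longer exists."""
    return [r for r in routes if r.get("State") == "blackhole"]


def remaining_slots(routes: Iterable[Mapping[str, str]], limit: int = ROUTE_LIMIT) -> int:
    """How many more routes fit before the limit is hit (blackholes count too)."""
    return limit - len(list(routes))


# Invented sample: one live node route and one left behind by a terminated node.
sample = [
    {"DestinationCidrBlock": "100.96.1.0/24", "State": "active"},
    {"DestinationCidrBlock": "100.96.2.0/24", "State": "blackhole"},
]
print(len(blackhole_routes(sample)), "blackhole route(s),", remaining_slots(sample), "slots left")
```

If the remaining slots reach zero, new nodes cannot get a route, which would match the NoRouteCreated condition above.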

For the recent one, the AMI I used was the Debian-based 2018-01-xx version. I had just watched AWS re:Invent 2017 and thought I could benefit from m5 family instances, so I switched my previously stable cluster to the m5 family. The nodes then started evicting all their pods to other nodes, as if passing the buck to each other. I ssh’d into at least two nodes and found that although there was plenty of space, the nodes had not allocated it because of a driver incompatibility. After switching back to m4, everything went back to normal.


#5

The cluster was installed a little over a year ago. Recently we learned that the Kubernetes API certificate was set up to last only 12 months; it was a bunch of fuss to recover from.
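
A small sketch of the kind of expiry check that would have warned us in time. It assumes you already have the certificate's notAfter string in the usual OpenSSL text form (for example from `openssl x509 -enddate -noout`). The 30-day warning threshold and the function names are arbitrary choices for illustration.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return (expires - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the certificate is inside the warning window (or already expired)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    sample = "Jan  5 09:34:43 2018 GMT"  # made-up date for the example
    print(needs_renewal(sample, datetime.now(timezone.utc)))
```

Running something like this from a scheduled job and alerting on the result is one option; the point is just to check the date before the 12 months run out.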


#6

Hi Justin,

I just started a compilation of links to Kubernetes Failure Stories: https://github.com/hjacobs/kubernetes-failure-stories

The list is rather short so far, so please contribute your failure stories / outages :)

Thanks!


#7

When the HPA cools down, say CPU utilisation comes down to 80% (and stays stable for 5 min), a downscale happens.
What happens to the other processes/jobs running on the downscaled pod? Do they move to an active pod?