I have two: one on Kops / AWS about a year ago (somewhere between k8s 1.4 and 1.6), and another that happened a month ago.
For the one from a year ago, I checked all the masters and nodes and found no clue at the time. Killing nodes and bringing up replacements didn't help - the new nodes just stopped working with network communication problems. The master showed

    NetworkUnavailable   False   39 minutes   39 minutes   RouteCreated     RouteController created a route

while the rest of the nodes showed

    NetworkUnavailable   True    5 seconds    5 seconds    NoRouteCreated   RouteController failed to create a route

even though ssh and ping between the nodes worked totally fine. The root cause: the cluster used spot instances for cost saving, and I hadn't realized that shrinking down / hibernating the cluster leaves a lot of "blackhole" entries in the AWS VPC route table, which has a limit on routes per table (50 by default, and each blackhole counts as one). Once the table filled up, the RouteController couldn't add routes for new nodes. It took me a whole day to figure out, and it seems to have been fixed after k8s 1.6.
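For anyone hitting something similar today, a quick way to check is to look for blackhole routes with the AWS CLI. This is just a sketch; the route-table ID and CIDR below are made-up examples:

    # List blackhole routes (dead entries pointing at terminated instances) per route table
    aws ec2 describe-route-tables \
      --query 'RouteTables[].{Id: RouteTableId, Blackholes: Routes[?State==`blackhole`].DestinationCidrBlock}' \
      --output json

    # Clean one up by its destination CIDR (values here are hypothetical)
    aws ec2 delete-route \
      --route-table-id rtb-0123456789abcdef0 \
      --destination-cidr-block 100.96.3.0/24

Once the dead entries are deleted, the RouteController has room to create routes for new nodes again.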
For the recent one, the AMI I used was a 2018-01-xx Debian-based image. I had just watched AWS re:Invent 2017 and thought I could benefit from the m5 instance family, so I switched the previously stable cluster to m5. The nodes then started evicting all their pods onto other nodes, each one passing the buck to the next. I ssh'd into at least two nodes and found that although there was plenty of disk space, the nodes couldn't allocate it because of a driver incompatibility (m5 exposes EBS volumes as NVMe devices, which that AMI's kernel didn't handle properly). Switching back to m4 brought everything back to normal.
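If you suspect the same thing, a minimal sanity check looks like this (the node name is a placeholder):

    # From the cluster: DiskPressure is what triggers the eviction cascade
    kubectl describe node ip-10-0-1-23.ec2.internal | grep -A 6 'Conditions:'

    # From the node itself: on m5/c5, EBS shows up as NVMe, not /dev/xvda
    lsblk
    dmesg | grep -i nvme   # kernel/driver complaints about NVMe land here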