Got Kubernetes 1.11.9 deployed by kops on AWS and I'm facing exactly the same problem. After stopping 1 master node there are some issues with canal networking.
My pod is stuck in the ContainerCreating state. When I run kubectl describe on it I get:
Warning FailedCreatePodSandBox 28m kubelet, ip-10-26-11-110.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to set up pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host, failed to clean up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to teardown pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host]
Normal SandboxChanged 3m19s (x100 over 28m) kubelet, ip-10-26-11-110.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.
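For reference, the describe command was essentially the following (pod name as it appears in the events above):
kubectl describe pod project-api-db75bcb89-8j7gh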
I've got 3 nodes and 1 master. canal is not healthy on 2 of the nodes (only 2/3 containers running):
kube-system canal-5k54f 2/3 Running 0 73d 10.26.56.35 ip-10-26-56-35.eu-west-1.compute.internal
kube-system canal-7lj5j 3/3 Running 5 73d 10.26.36.234 ip-10-26-36-234.eu-west-1.compute.internal
kube-system canal-d9zhr 3/3 Running 0 4d 10.26.46.136 ip-10-26-46-136.eu-west-1.compute.internal
kube-system canal-nrxc8 2/3 Running 0 73d 10.26.35.244 ip-10-26-35-244.eu-west-1.compute.internal
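That list comes from something along these lines (the grep is just one way to narrow it down):
kubectl get pods -n kube-system -o wide | grep canal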
When I display the logs of the kube-flannel container I get:
kubectl logs canal-5k54f -n kube-system kube-flannel
E0729 11:15:12.083011 1 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:284: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: no route to host
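100.64.0.1 is the ClusterIP of the in-cluster kubernetes service (the kops default), so canal itself cannot reach the API server through the service IP. One way to cross-check the service and its endpoints after the master restart would be:
kubectl get service kubernetes -n default
kubectl get endpoints kubernetes -n default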
I've found people with a similar issue: Restarting master node causes cluster outage · Issue #6349 · kubernetes/kops · GitHub. The suggested solution there is to do a rolling-update with the --cloudonly flag on the master node; however, it does not help in my case.
13:49 $ kops rolling-update cluster --instance-group master-eu-west-1a --yes --force --cloudonly
Using cluster from kubectl context: live.k8s.local
NAME STATUS NEEDUPDATE READY MIN MAX
master-eu-west-1a Ready 0 1 1 1
W0729 13:50:18.226978 17572 instancegroups.go:160] Not draining cluster nodes as 'cloudonly' flag is set.
I0729 13:50:18.226993 17572 instancegroups.go:301] Stopping instance "i-017e5…", in group "master-eu-west-1a.masters.live.k8s.local" (this may take a while).
I0729 13:50:18.509124 17572 instancegroups.go:198] waiting for 5m0s after terminating instance
W0729 13:55:18.509423 17572 instancegroups.go:206] Not validating cluster as cloudonly flag is set.
I0729 13:55:18.509588 17572 rollingupdate.go:184] Rolling update completed for cluster "live.k8s.local"!
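Since the cloudonly update skips validation (see the warning above), I assume the cluster state afterwards can still be checked manually with:
kops validate cluster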
When I do a non-cloudonly rolling-update I get:
I0729 13:47:00.832851 11250 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "canal-5k54f" is not healthy.
E0729 13:47:28.712062 11250 instancegroups.go:214] Cluster did not validate within 5m0s
master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of "5m0s""
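For reference, the non-cloudonly attempt was the same command as above without the --cloudonly flag, i.e. the equivalent of:
kops rolling-update cluster --instance-group master-eu-west-1a --yes --force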