Cluster communication

Hi Folks,

I have a Kubernetes cluster which was deployed via kops. There is one master server and 6 hosts. The master server was rebooted a few days back and a new master was appointed/created by kops. However, I am now seeing the following errors in the cluster dump. I would appreciate any advice on how to resolve these. Because of this, communication between pods is not working as expected.

E0207 05:59:28.845507 1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:189: Failed to list *v1.Endpoints: Get https://100.64.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: no route to host

E0204 22:54:14.963385 1 autoscaler_server.go:86] Error while getting cluster status: Get https://100.64.0.1:443/api/v1/nodes: dial tcp 100.64.0.1:443: getsockopt: connection refused
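
For context, 100.64.0.1 is the in-cluster address of the default kubernetes service, so these errors mean pods cannot reach the API server through it. A quick sanity check (only a generic one, assuming a standard setup) is whether that service still has an endpoint pointing at the new master and whether kube-proxy is running on the nodes:

kubectl get endpoints kubernetes
kubectl -n kube-system get pods -o wide | grep kube-proxy
# on an affected node, check that the service IP is programmed in iptables
sudo iptables-save | grep 100.64.0.1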

Thanks


Kops uses an ASG for masters, and I think instances terminate when you stop them. But maybe a reboot is handled differently? I don’t know.

What if you take a snapshot of the EBS volumes and stop the master that you rebooted? Also, are the EBS volumes for etcd, etc., mounted on the new master?

Something that will probably work is to just stop all the masters; the ASG will create them again and properly mount the volumes.
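
If it helps, one way to check whether the etcd volumes actually got attached and mounted on the replacement master is to look on the instance itself. This is only a rough sketch assuming a default kops layout, where the etcd volumes are normally mounted under /mnt:

# on the new master
lsblk
mount | grep /mnt
# the etcd containers should also be up
docker ps | grep etcd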


I’ve got Kubernetes 1.11.9 deployed by kops on AWS and am facing exactly the same problem.
After stopping one master node there are some issues with canal networking.

My pod is stuck in the ContainerCreating state.
When I run kubectl describe on it I get:

Warning FailedCreatePodSandBox 28m kubelet, ip-10-26-11-110.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to set up pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host, failed to clean up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to teardown pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host]
Normal SandboxChanged 3m19s (x100 over 28m) kubelet, ip-10-26-11-110.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.
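
Just to separate a routing problem from an API server problem, it may be worth probing the service address directly from the affected node (100.64.0.1 is the cluster's service IP for the API; the curl below is only a connectivity check, the 401/403 response body doesn't matter):

# from ip-10-26-11-110
curl -k https://100.64.0.1:443/version
# compare with the endpoints the service currently points at
kubectl describe service kubernetes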

I’ve got 3 nodes and 1 master. Canal is not working correctly on 2 nodes:

kube-system canal-5k54f 2/3 Running 0 73d 10.26.56.35 ip-10-26-56-35.eu-west-1.compute.internal
kube-system canal-7lj5j 3/3 Running 5 73d 10.26.36.234 ip-10-26-36-234.eu-west-1.compute.internal
kube-system canal-d9zhr 3/3 Running 0 4d 10.26.46.136 ip-10-26-46-136.eu-west-1.compute.internal
kube-system canal-nrxc8 2/3 Running 0 73d 10.26.35.244 ip-10-26-35-244.eu-west-1.compute.internal

When I display the logs of the kube-flannel container I get:

kubectl logs canal-5k54f -n kube-system kube-flannel
E0729 11:15:12.083011 1 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:284: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: no route to host
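
Given that the two unhealthy canal pods have been up for 73 days and seem to be holding on to stale state from before the master change, one thing worth trying (just a guess, the DaemonSet should recreate them) is to delete them and watch them come back:

kubectl -n kube-system delete pod canal-5k54f canal-nrxc8
kubectl -n kube-system get pods -o wide | grep canal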

I’ve found that other people have had a similar issue: Restarting master node causes cluster outage · Issue #6349 · kubernetes/kops · GitHub
and the solution there is to do a rolling-update with the --cloudonly flag on the master node; however, it does not help in my case.

13:49 $ kops rolling-update cluster --instance-group master-eu-west-1a --yes --force --cloudonly
Using cluster from kubectl context: live.k8s.local

NAME STATUS NEEDUPDATE READY MIN MAX
master-eu-west-1a Ready 0 1 1 1
W0729 13:50:18.226978 17572 instancegroups.go:160] Not draining cluster nodes as 'cloudonly' flag is set.
I0729 13:50:18.226993 17572 instancegroups.go:301] Stopping instance "i-017e5…", in group "master-eu-west-1a.masters.live.k8s.local" (this may take a while).
I0729 13:50:18.509124 17572 instancegroups.go:198] waiting for 5m0s after terminating instance
W0729 13:55:18.509423 17572 instancegroups.go:206] Not validating cluster as cloudonly flag is set.
I0729 13:55:18.509588 17572 rollingupdate.go:184] Rolling update completed for cluster "live.k8s.local"!
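
Since a --cloudonly roll skips validation entirely, it probably makes sense to check by hand afterwards that the replacement master actually registered and is Ready, e.g.:

kops validate cluster
kubectl get nodes -o wide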

When I do a rolling-update without --cloudonly I get:

I0729 13:47:00.832851 11250 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "canal-5k54f" is not healthy.
E0729 13:47:28.712062 11250 instancegroups.go:214] Cluster did not validate within 5m0s

master not healthy after update, stopping rolling-update: "error validating cluster after removing a node: cluster did not validate within a duration of \"5m0s\""
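
Since validation keeps failing on the same canal-5k54f pod, one option (again only a sketch) is to get the canal DaemonSet healthy first, then retry the normal rolling update once every canal pod reports 3/3:

kubectl -n kube-system get pods -o wide | grep canal
kops rolling-update cluster --instance-group master-eu-west-1a --yes --force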