Etcd cluster: one pod is not working... why?


#1

Hi, please forgive my bad English.

I’m trying to spawn an etcd cluster with 3 pods. I have a 5-server k8s cluster (3 managers and 5 nodes).
I have created 3 pods:

NAME    READY   STATUS    RESTARTS   AGE   IP               NODE           NOMINATED NODE   READINESS GATES
etcd0   1/1     Running   0          20m   10.233.74.21     kube-node2     <none>           <none>
etcd1   1/1     Running   0          20m   10.233.73.84     kube-node1     <none>           <none>
etcd2   1/1     Running   0          20m   10.233.105.143   kube-master2   <none>           <none>

and 3 services to get valid DNS resolution:

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE   SELECTOR
etcd0         ClusterIP   10.233.32.88    <none>        2379/TCP,2380/TCP   21m   etcd_node=etcd0
etcd1         ClusterIP   10.233.8.202    <none>        2379/TCP,2380/TCP   21m   etcd_node=etcd1
etcd2         ClusterIP   10.233.60.23    <none>        2379/TCP,2380/TCP   21m   etcd_node=etcd2

From etcd0 and etcd2 there are no problems: they can reach all the other members. But etcd1 cannot reach either etcd0 or etcd2:

2019-01-16 13:18:23.898150 W | etcdserver: failed to reach the peerURL(http://etcd0:2380) of member cf1d15c5d194b5c9 (Get http://etcd0:2380/version: dial tcp 10.233.32.88:2380: i/o timeout)
2019-01-16 13:18:23.898176 W | etcdserver: cannot get the version of member cf1d15c5d194b5c9 (Get http://etcd0:2380/version: dial tcp 10.233.32.88:2380: i/o timeout)
2019-01-16 13:18:25.898418 W | etcdserver: failed to reach the peerURL(http://etcd2:2380) of member d282ac2ce600c1ce (Get http://etcd2:2380/version: dial tcp 10.233.60.23:2380: i/o timeout)
2019-01-16 13:18:25.898444 W | etcdserver: cannot get the version of member d282ac2ce600c1ce (Get http://etcd2:2380/version: dial tcp 10.233.60.23:2380: i/o timeout)
2019-01-16 13:18:28.497224 W | rafthttp: health check for peer cf1d15c5d194b5c9 could not connect: dial tcp 10.233.32.88:2380: i/o timeout
2019-01-16 13:18:28.501596 W | rafthttp: health check for peer d282ac2ce600c1ce could not connect: dial tcp 10.233.60.23:2380: i/o timeout
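
To see at a glance which addresses are timing out, the warnings can be summarized with a small script. This is only a minimal sketch: the function name `timed_out_targets` and the regular expression are my own, and it assumes the log lines keep the `dial tcp <ip>:<port>: i/o timeout` shape shown above. On the two sample lines it lists 10.233.32.88 and 10.233.60.23, which are the ClusterIPs of the etcd0 and etcd2 services, not the pod IPs.

```python
import re

# Matches the "dial tcp <ip>:<port>: i/o timeout" fragment of the etcd warnings.
# The pattern is an assumption based on the log lines quoted in this post.
DIAL_TIMEOUT = re.compile(r"dial tcp (\d+\.\d+\.\d+\.\d+):(\d+): i/o timeout")


def timed_out_targets(log_text):
    """Return the sorted, de-duplicated (ip, port) pairs that etcd failed to dial."""
    return sorted({(m.group(1), int(m.group(2))) for m in DIAL_TIMEOUT.finditer(log_text)})


# Two of the warnings from etcd1, copied from the log above.
SAMPLE_LOG = """\
2019-01-16 13:18:28.497224 W | rafthttp: health check for peer cf1d15c5d194b5c9 could not connect: dial tcp 10.233.32.88:2380: i/o timeout
2019-01-16 13:18:28.501596 W | rafthttp: health check for peer d282ac2ce600c1ce could not connect: dial tcp 10.233.60.23:2380: i/o timeout
"""

if __name__ == "__main__":
    for ip, port in timed_out_targets(SAMPLE_LOG):
        print(f"{ip}:{port}")
```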

etcd1 is currently on kube-node1, but if I delete all the pods and recreate them, a different pod on a different node has the problem. If I use the pod IPs from etcd1 instead of the service names, it works, but that is not a good solution.
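
The comparison between service names and pod IPs can be sketched as a plain TCP connect test, run from a pod scheduled on the affected node. This is a rough sketch rather than a definitive diagnostic: `can_connect`, `report` and `TARGETS` are hypothetical names, the 3-second timeout is arbitrary, and the pod IPs are the ones from the listing above, so they change whenever the pods are recreated.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures.
        return False


# Peer port 2380, once through the service name and once through the pod IP.
# The pod IPs are taken from the pod listing in this post (assumption: still current).
TARGETS = {
    "etcd0 via service": ("etcd0", 2380),
    "etcd0 via pod IP": ("10.233.74.21", 2380),
    "etcd2 via service": ("etcd2", 2380),
    "etcd2 via pod IP": ("10.233.105.143", 2380),
}


def report(targets, timeout=3.0):
    """Check every target and return a {label: reachable} mapping."""
    return {
        label: can_connect(host, port, timeout)
        for label, (host, port) in targets.items()
    }
```

Calling `report(TARGETS)` from a pod on kube-node1 and again from a pod on a healthy node shows whether only the service path fails. If the pod IPs answer while the service names time out, the pod network itself is fine and the problem is more likely in the service routing on that node.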

[EDIT] I have tested some combinations of nodes, and it seems that pods on kube-node1 and kube-master1 have the connectivity problem. I don’t know why … :face_with_symbols_over_mouth::angry::triumph::sob:

k8s version: 1.13.0

Any ideas?