Hi all,
I’ve run into a slightly puzzling problem.
I created, just for testing purposes, a cluster with 2 control plane nodes and 2 worker nodes.
Everything was fine until I turned off cp-node2: cp-node1 stopped responding. When I turned node2 back on, the cluster came back.
Is this normal?
This is the opposite of failover…
Could someone help me understand this behavior?
Thanks in advance.
Cluster information:
Kubernetes version:
Cloud being used: (put bare-metal if not on a public cloud)
Installation method: kubeadm
Host OS: ubuntu
CNI and version: calico
CRI and version: containerd containerd.io 1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8
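A 2-node control plane cluster is already in a degraded state. Etcd needs a majority (quorum) of its members to be reachable, and with 2 members that majority is 2, so losing either one stops the cluster. For HA, use an odd number of members, 3 or more.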
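To make the arithmetic concrete, here is a minimal sketch of the majority rule behind that statement. It is plain Python with no etcd client involved, and the function names `quorum` and `fault_tolerance` are made up for illustration. It assumes the usual majority definition: quorum is floor(n/2) + 1 voting members.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print the quorum and tolerated failures for cluster sizes 1 through 5.
    for n in range(1, 6):
        print(f"members={n} quorum={quorum(n)} "
              f"tolerated_failures={fault_tolerance(n)}")
```

Under this rule a 2-member cluster tolerates zero failures, the same as a single member, and a 4-member cluster tolerates no more failures than a 3-member one, which is why odd sizes are the usual recommendation.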
Hi @mrbobbytables and thanks for your reply,
I knew this, but I didn’t expect that the whole cluster would stop.
I “gracefully” excluded the node (cordoning it, etc.), but a few minutes after the 2nd node stopped responding, the cluster got stuck. After a reboot, the remaining (single) Kubernetes control plane node wouldn’t start again and couldn’t even contact itself:
root@k-master-1:~# kg po
E0907 07:40:24.503955 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.504203 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.505445 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.509766 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.510938 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
The connection to the server k-master-1.cluster:6443 was refused - did you specify the right host or port?
Could it be an etcd problem?
How can I investigate it and, more importantly, how can I solve it?
If this happened in a production environment, having to reset the cluster would be a “little” problem.
Thanks in advance.
Yes, it is completely normal. HA requires at least 3 nodes. A 3-node etcd cluster can tolerate 1 failure as long as the remaining 2 can still communicate. You created a cluster that was already in a degraded state, and stopping one node left the other without a majority: it cannot tell whether its peer died or whether it is itself the cut-off node, so it stops serving. Please read the docs on etcd, Kubernetes, and HA.
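The scenario from this thread can be checked with a short sketch. This is only an illustration of the strict-majority test, not etcd's actual code, and `has_quorum` is a hypothetical helper name.

```python
def has_quorum(total_members: int, reachable_members: int) -> bool:
    """Return True if the reachable members form a strict majority.

    `reachable_members` counts the local member itself plus every peer
    it can still talk to.
    """
    if not 0 <= reachable_members <= total_members:
        raise ValueError("reachable_members must be between 0 and total_members")
    return reachable_members > total_members // 2


if __name__ == "__main__":
    # The 2-member cluster from this thread, with one member powered off.
    print("2 members, 1 reachable:", has_quorum(2, 1))
    # The same failure in a 3-member cluster.
    print("3 members, 2 reachable:", has_quorum(3, 2))
```

With 2 members and 1 reachable the check fails, which matches the API server on the surviving node refusing connections; with 3 members and 2 reachable it passes.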
Thanks a lot!
This had been puzzling me.