Kubernetes cluster with multiple control plane nodes stops working when one control plane node fails

Roberto_D_Maggi · September 5, 2023, 10:14am

Hi all,
I’ve run into a puzzling problem.

I created, just for testing purposes, a cluster with 2 control plane nodes and 2 worker nodes.
Everything was fine until I turned off cp-node2: cp-node1 then stopped responding. When I turned cp-node2 back on, the cluster came back.
Is this normal?
This is the opposite of failover…

Could someone help me understand this behavior?
Thanks in advance

Cluster information:

Kubernetes version:
Cloud being used: (put bare-metal if not on a public cloud)
Installation method: kubeadm
Host OS: ubuntu
CNI and version: calico
CRI and version: containerd containerd.io 1.6.21 3dce8eb055cbb6872793272b4f20ed16117344f8


mrbobbytables · September 5, 2023, 11:01am

A 2-node control plane cluster is already in a degraded state. Etcd requires an odd number of nodes for HA.

Roberto_D_Maggi · September 7, 2023, 8:01am

Hi @mrbobbytables, and thanks for your reply.
I knew this, but I didn’t expect that the whole cluster would stop.
I “gracefully” excluded the node (cordoning it, etc.), but a few minutes after the 2nd node stopped responding the cluster got stuck, and after a reboot the remaining (single) Kubernetes control plane node would not start again, as it was unable to contact itself:

root@k-master-1:~# kg po
E0907 07:40:24.503955 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.504203 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.505445 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.509766 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
E0907 07:40:24.510938 2136 memcache.go:265] couldn't get current server API group list: Get "https://k-master-1.cluster:6443/api?timeout=32s": dial tcp 192.168.1.110:6443: connect: connection refused
The connection to the server k-master-1.cluster:6443 was refused - did you specify the right host or port?
Could it be an etcd problem?
How can I investigate it and, more importantly, how can I solve it?
If this happened in a production environment, having to reset the cluster would be a “little” problem.
Thanks in advance

mrbobbytables · September 7, 2023, 9:42am

Yes, it is completely normal. HA requires at least 3 nodes. A 3-node etcd cluster can tolerate 1 failure as long as the other 2 can still communicate. You created a setup where the cluster was already in a degraded state: with only 2 members, stopping one left the other without a majority, so it behaved as if it were the failed node. Please read the docs on etcd, K8s and HA.
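
To make the arithmetic concrete, here is a minimal Python sketch of the majority rule etcd uses: a cluster of n voting members needs floor(n/2) + 1 of them reachable to keep working. The function names `quorum` and `fault_tolerance` are made up for this illustration and are not part of etcd or Kubernetes.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Compare the cluster sizes discussed in this thread (and a few more).
    for n in range(1, 6):
        print(f"{n} member(s): quorum={quorum(n)}, "
              f"tolerated failures={fault_tolerance(n)}")
```

With 2 members the quorum is 2, so no failure can be tolerated, which matches what happened here. Going from 3 to 4 members does not buy any extra tolerance either, which is why odd sizes are the usual recommendation.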

Roberto_D_Maggi · September 7, 2023, 9:50am

Thanks a lot!
This had really been puzzling me.
