Hi everyone,
I’m relatively new to production Kubernetes, and we’re trying to build a dedicated cluster for some security functions. It contains the following at present:
- Stacked etcd cluster with host-based haproxy and keepalived in front of the apiservers (set up per the official kubeadm HA documentation; a trimmed sketch of the haproxy mapping is below)
- Multiple worker nodes in different subnets
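The haproxy side is essentially the kubeadm HA guide’s example with the frontend moved to 2443. This is a trimmed sketch, not my exact config, and the backend IPs are placeholders:

frontend kube-apiserver
    bind *:2443
    mode tcp
    option tcplog
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    mode tcp
    option tcp-check
    balance roundrobin
    server k8s-cp-01 192.168.5.21:6443 check   # placeholder IPs
    server k8s-wn-01 192.168.5.22:6443 check
    server k8s-wn-02 192.168.5.23:6443 check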
The problem: the kube-apiserver sits behind a VIP that listens on port 2443 externally and forwards to 6443 internally. Worker nodes join and access the cluster just fine via the VIP on 2443, but when we deploy workloads they start trying to reach the control-plane nodes’ direct IPs on 6443.
Cluster information:
Kubernetes version: 1.29.5
Cloud being used: Bare-Metal
Installation method: kubeadm (manual?)
Host OS: Ubuntu/RHEL
CNI and version: Flannel (0.25.2)
CRI and version: Containerd (1.6.32)
Further configuration info and error messages below:
Bootstrap primary control node
- sudo kubeadm init --control-plane-endpoint 192.168.5.20:2443 --pod-network-cidr=.../16 --upload-certs
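For reference, the same init expressed as a kubeadm config file would look roughly like this (a sketch; the podSubnet shown is flannel’s default, standing in for our redacted /16):

kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
kubernetesVersion: v1.29.5
controlPlaneEndpoint: "192.168.5.20:2443"
networking:
  podSubnet: "10.244.0.0/16"  # placeholder for our real /16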
Flannel deployment (standard aside from CIDR)
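The only edit to the stock kube-flannel.yml was the Network value in the kube-flannel-cfg ConfigMap’s net-conf.json (again with flannel’s default standing in for our real CIDR):

net-conf.json: |
  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }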
Provisioning of backup HA nodes:
kubeadm join 192.168.5.20:2443 --token * \
  --discovery-token-ca-cert-hash sha256:* \
  --control-plane --certificate-key *
Output for Control Plane
kubectl get pods -A output:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-flannel kube-flannel-ds-gm7kb 1/1 Running 0 2m29s
kube-flannel kube-flannel-ds-jf5v8 1/1 Running 0 25s
kube-flannel kube-flannel-ds-n5mh2 1/1 Running 0 18s
kube-system coredns-76f75df574-g9w9v 1/1 Running 0 3m59s
kube-system coredns-76f75df574-xlv6x 1/1 Running 0 3m59s
kube-system etcd-k8s-cp-01 1/1 Running 0 4m11s
kube-system etcd-k8s-wn-01 1/1 Running 0 25s
kube-system etcd-k8s-wn-02 1/1 Running 0 16s
kube-system kube-apiserver-k8s-cp-01 1/1 Running 0 4m11s
kube-system kube-apiserver-k8s-wn-01 1/1 Running 0 25s
kube-system kube-apiserver-k8s-wn-02 1/1 Running 0 16s
kube-system kube-controller-manager-k8s-cp-01 1/1 Running 0 4m12s
kube-system kube-controller-manager-k8s-wn-01 1/1 Running 0 25s
kube-system kube-controller-manager-k8s-wn-02 0/1 Running 0 16s
kube-system kube-proxy-55lzw 1/1 Running 0 25s
kube-system kube-proxy-g25h9 1/1 Running 0 3m59s
kube-system kube-proxy-sd5j6 1/1 Running 0 18s
kube-system kube-scheduler-k8s-cp-01 1/1 Running 0 4m11s
kube-system kube-scheduler-k8s-wn-01 1/1 Running 0 25s
kube-system kube-scheduler-k8s-wn-02 1/1 Running 0 16s
Adding Worker Node
kubeadm join 192.168.5.20:2443 --token * \
  --discovery-token-ca-cert-hash sha256:*
After doing this, all the kube-system pods look healthy and have successfully come up on the worker node using 192.168.5.20:2443, but the worker node’s flannel pod shows:
kube-flannel-ds-nf4ps 0/1 CrashLoopBackOff 3 (37s ago) 3m29s 192.168.2.121 k8s-wn-test
Interestingly enough, running kubectl logs against it gives me the following:
Defaulted container "kube-flannel" out of: kube-flannel, install-cni-plugin (init), install-cni (init)
Unable to connect to the server: unexpected EOF
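Since kubectl logs dies with that EOF, the container logs can still be pulled on the worker node itself through the CRI; assuming containerd’s default socket path:

sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep flannel
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs <container-id>  # id from the ps output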
kubectl describe shows the following events:
Events:
Type Reason Age From Message
Normal Scheduled 7m37s default-scheduler Successfully assigned kube-flannel/kube-flannel-ds-nf4ps to k8s-wn-test
Normal Pulling 7m35s kubelet Pulling image "docker.io/flannel/flannel-cni-plugin:v1.4.1-flannel1"
Normal Pulled 7m32s kubelet Successfully pulled image "docker.io/flannel/flannel-cni-plugin:v1.4.1-flannel1" in 2.729s (2.729s including waiting). Image size: 4710551 bytes.
Normal Created 7m32s kubelet Created container install-cni-plugin
Normal Started 7m32s kubelet Started container install-cni-plugin
Normal Pulling 7m32s kubelet Pulling image "docker.io/flannel/flannel:v0.25.2"
Normal Pulled 7m25s kubelet Successfully pulled image "docker.io/flannel/flannel:v0.25.2" in 4.365s (6.485s including waiting). Image size: 31403254 bytes.
Normal Created 7m25s kubelet Created container install-cni
Normal Started 7m25s kubelet Started container install-cni
Normal Pulled 5m15s (x4 over 7m24s) kubelet Container image "docker.io/flannel/flannel:v0.25.2" already present on machine
Normal Created 5m15s (x4 over 7m24s) kubelet Created container kube-flannel
Normal Started 5m15s (x4 over 7m24s) kubelet Started container kube-flannel
Warning BackOff 2m31s (x12 over 6m22s) kubelet Back-off restarting failed container kube-flannel in pod kube-flannel-ds-nf4ps_kube-flannel(2e95d79b-9d32-4a77-8ac3-93466ae2128b)
When double-checking the firewall I can see the worker node does most of its traffic on 2443, but some of it goes to 6443 and straight to the ‘real’ IPs of the control-plane nodes, bypassing the VIP entirely.
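For what it’s worth, the API endpoints handed to in-cluster clients come from the default kubernetes Service, which can be inspected with:

kubectl get svc kubernetes -n default
kubectl get endpoints kubernetes -n default

That would explain the traffic to the real IPs on 6443 if the Endpoints list those rather than the VIP.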
I’ve looked at the kube-apiserver.yaml static pod manifest and see --advertise-address; if I change that and also configure --bind-address, the traffic starts going to the correct IP address, but still not the correct port.
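For context, the relevant part of /etc/kubernetes/manifests/kube-apiserver.yaml looks roughly like this (trimmed sketch with a placeholder node IP; kubeadm generates many more flags):

spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --advertise-address=192.168.5.21  # the node's real IP (placeholder)
    - --secure-port=6443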
Am I missing something stupidly obvious? Whatever the control-plane nodes do among themselves on 6443 is not a problem; I just can’t have the worker nodes use anything other than the VIP, otherwise it defeats the point of having a redundant control plane altogether. I can’t use port 6443 on the load balancer, so right now I’m considering reconfiguring everything to non-standard ports, but it feels like I’m missing something obvious.
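If I do end up going the non-standard route, my understanding is it would be set at init time, something like the untested sketch below, where localAPIEndpoint.bindPort moves the apiserver itself off 6443:

kind: InitConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
localAPIEndpoint:
  bindPort: 2443
---
kind: ClusterConfiguration
apiVersion: kubeadm.k8s.io/v1beta3
controlPlaneEndpoint: "192.168.5.20:2443"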
I’m happy to be educated if I’m doing something fundamentally wrong!
Thank you