3-node ha-cluster confusion regarding datastore master node IP addressing

So, without thinking about the consequences and wanting to get up and running quickly to start investigating the viability of a Raspberry Pi 4 based ha-cluster for my local home workloads, I neglected to wire each node to an ethernet switch before I started playing, and two of the three nodes had ONLY their wlan interfaces active when I joined them to the first (wired) node. This has resulted in a situation that concerns me, and I cannot find the way to resolve it that will do the least damage to the cluster.

# microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 192.168.1.240:19001 192.168.1.117:19001 192.168.1.118:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # CoreDNS
    ha-cluster           # Configure high availability on the current node
    storage              # Storage class; allocates storage from host directory

So we are basically dealing with two IP ranges in the same /24 subnet: ethernet is the .240 range, and wifi is the .117 and .118 addresses you see above. Notice that the ha-cluster addon lists the wlan addresses (which were the only interfaces available on those nodes when they joined).
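In case it helps to pin down exactly which addresses the datastore has recorded, I believe the dqlite membership data lives under the snap's backend directory; the path below is an assumption based on the default MicroK8s snap layout:

# cat /var/snap/microk8s/current/var/kubernetes/backend/info.yaml      # this node's registered Address
# cat /var/snap/microk8s/current/var/kubernetes/backend/cluster.yaml   # all voting members

If I understand it correctly, cluster.yaml lists every voter as Address: ip:19001, so the .117 and .118 wlan addresses should be visible there.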

But… interestingly enough, nothing else in k8s has decided to prefer the wlan interfaces over the ethernet ones except this ha-cluster addon. This:

# kubectl get node -o wide
NAME   STATUS   ROLES    AGE   VERSION                    INTERNAL-IP     EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION      CONTAINER-RUNTIME
m1     Ready    <none>   13d   v1.21.1-3+08fd9d63ea534e   192.168.1.240   <none>        Ubuntu 21.04   5.11.0-1012-raspi   containerd://1.4.4
w1     Ready    <none>   13d   v1.21.1-3+08fd9d63ea534e   192.168.1.241   <none>        Ubuntu 21.04   5.11.0-1012-raspi   containerd://1.4.4
w2     Ready    <none>   13d   v1.21.1-3+08fd9d63ea534e   192.168.1.242   <none>        Ubuntu 21.04   5.11.0-1012-raspi   containerd://1.4.4

and this:

# kubectl get all -A -o wide | grep 192.168.1.117
# kubectl get all -A -o wide | grep 192.168.1.118

(no result)

Of further concern: the two "wifi" nodes used to give the same output as the wired one when running microk8s status, but when I started taking a shot at tackling this today I noticed a worrying change. The two "wifi" nodes are now both reporting the following:

microk8s is running
high-availability: no
  datastore master nodes: none
  datastore standby nodes: none
addons:
  enabled:
    dns                  # CoreDNS
    ha-cluster           # Configure high availability on the current node
    storage              # Storage class; allocates storage from host directory

But the cluster has remained fully functional for almost 2 weeks now, as you can see above from the output of kubectl get nodes -o wide.

I have some important workloads running in this cluster now, and I want to get it 'healthy' with minimal downtime for those workloads.

What would be the least impactful method of resolving this? My initial thought was to just microk8s stop on all nodes, disable all wlan interfaces, and see whether it would start up again and whether the nodes would become Ready. Failing that, I was thinking of having each of the two nodes that joined with the wrong IP leave the cluster and then joining them again with wlan disabled (rough sketch below).
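To make that second option concrete, here is the rough sequence I have in mind for each wifi node, taken one at a time. I have not run this yet, and the node name and token are placeholders:

On the node being re-homed, with wlan already disabled so only the wired interface is up:

# microk8s leave

On the wired node (m1), in case the departed member is still listed:

# microk8s remove-node w1   # assuming w1 is one of the two wifi-joined nodes

Still on m1, generate a fresh join token:

# microk8s add-node

Then back on the re-homed node, joining via the wired address printed by add-node:

# microk8s join 192.168.1.240:25000/<token>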

Any ideas on the best route forward? There is data in PVs that I don't want to lose or have to restore. I am hoping to just be able to scale my deployments down to zero replicas, stop microk8s, "fix the cluster" and the network settings, start it back up, and scale the deployments back up to 1 replica for each workload (sketch below).
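For reference, the scale-down/scale-up wrapper I am picturing around whatever the actual fix turns out to be is roughly this (the "apps" namespace is a placeholder for wherever the workloads live):

# kubectl scale deployment --all --replicas=0 -n apps

Then on every node:

# microk8s stop

...disable the wlan interfaces / fix the network settings, then again on every node:

# microk8s start
# microk8s status --wait-ready

And finally:

# kubectl scale deployment --all --replicas=1 -n apps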

Thanks!

It was quite a while ago, so I hope you remember… did you get anywhere with this?

I'm having the exact same problem with my 5-node 1.23 cluster: I can't upgrade nodes, I can't do anything, it's just stuck like this… workloads are working fine, but everything is slow as molasses, and short of migrating all my workloads to another cluster I can't seem to resolve the issue :-/

Not very much related to microk8s, but I do back up my cluster to S3 using Velero.
Set up a new cluster, then sync the backed-up resources and restore them into the new cluster.
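As a rough sketch, assuming Velero is already installed on both clusters and pointed at the same S3 bucket (names below are placeholders):

On the old cluster:

# velero backup create full-backup

On the new cluster, once it can see the same backup location:

# velero restore create --from-backup full-backup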

Yeah, I was looking at that yesterday but wasn't sure if an on-prem MicroK8s cluster would be supported… the docs were a little unclear.

ANYTHING of the sort would make getting off this mess easier, as it's all full of Argo stuff with CRDs and just, bleh… it would not be fun to try to migrate by hand! REALLY hoping I can get this dang cluster back to healthy and upgraded to 1.24, though! :-/

MicroK8s is a Kubernetes distro, so it should be supported. I have also seen implementations where Argo CD is used to deploy everything to another cluster, so that would work too.
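Roughly something like this, assuming Argo CD already manages the apps and the new cluster's kubeconfig context is available (context, app name, and API address below are placeholders; 16443 is the default MicroK8s API port):

# argocd cluster add new-cluster-context
# argocd app set my-app --dest-server https://<new-cluster-api>:16443

After that, Argo CD should sync the same manifests into the new cluster.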
