Recovery of HA MicroK8s Clusters

Recovering failed ha-cluster-enabled microk8s deployments

Background

When the ha-cluster add-on is enabled on microk8s 1.19 and newer (it is now the default microk8s clustering method), microk8s uses dqlite as its backing datastore instead of etcd.

Whilst work is ongoing to improve the stability and self-healing ability of both microk8s and dqlite, a catastrophic cluster failure or data corruption may still leave you in a situation where you have to recover the cluster manually. The process is different from recovering an etcd cluster, but it is fairly straightforward.

The following steps will restore a failed microk8s cluster, assuming that at least one node in the cluster has non-corrupted data.

Identify a single node with healthy data

Identifying which node has the most up-to-date data is the first step in recovering a dqlite/microk8s ha-cluster datastore. When ha-cluster is enabled, microk8s stores its dqlite data in the /var/snap/microk8s/current/var/kubernetes/backend directory.
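
As a quick sanity check on each node, you can confirm that the HA datastore is in use and look at its on-disk state. The status output format varies between microk8s releases, so treat the grep below as illustrative only:

# check for the high-availability / ha-cluster lines in the status output
microk8s status | grep -i ha
# inspect the dqlite data directory on this node
ls -l /var/snap/microk8s/current/var/kubernetes/backend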

An example follows. In this cluster there are two healthy nodes and one corrupted or out-of-sync node. The entire cluster is down and will not return to a healthy state if started.

Backend dqlite data on the first and second (healthy) example nodes:

# ls -l /var/snap/microk8s/current/var/kubernetes/backend
total 180808
-rw-rw---- 1 root microk8s  8381048 Sep 21 20:42 0000000002833997-0000000002834348
-rw-rw---- 1 root microk8s  8369456 Sep 21 20:43 0000000002834349-0000000002834881
-rw-rw---- 1 root microk8s  8383280 Sep 21 20:43 0000000002834882-0000000002835492
-rw-rw---- 1 root microk8s   437016 Sep 21 20:59 0000000002835493-0000000002835520
-rw-rw---- 1 root microk8s  8381768 Sep 21 20:59 0000000002835521-0000000002836110
-rw-rw---- 1 root microk8s  8370896 Sep 21 20:59 0000000002836111-0000000002836720
-rw-rw---- 1 root microk8s  6001392 Sep 21 21:00 0000000002836721-0000000002837139
-rw-rw---- 1 root microk8s  8388304 Sep 21 21:10 0000000002837140-0000000002837536
-rw-rw---- 1 root microk8s  8378144 Sep 21 21:11 0000000002837537-0000000002838019
-rw-rw---- 1 root microk8s  8380544 Sep 21 21:14 0000000002838020-0000000002838364
-rw-rw---- 1 root microk8s  7976984 Sep 21 21:33 0000000002838365-0000000002838809
-rw-rw---- 1 root microk8s  8383424 Sep 21 21:33 0000000002838810-0000000002839422
-rw-rw---- 1 root microk8s  8375704 Sep 21 21:49 0000000002839423-0000000002840044
-rw-rw---- 1 root microk8s  3894160 Sep 21 21:49 0000000002840045-0000000002840323
-rw-rw---- 1 root microk8s  8382136 Sep 21 22:06 0000000002840324-0000000002840862
-rw-rw---- 1 root microk8s  8175792 Sep 21 22:06 0000000002840863-0000000002841499
-rw-rw---- 1 root microk8s  4116408 Sep 21 22:23 0000000002841500-0000000002841736
-rw-rw---- 1 root microk8s  8384048 Sep 21 22:41 0000000002841737-0000000002842247
-rw-rw---- 1 root microk8s  8163120 Sep 21 22:41 0000000002842248-0000000002842879
-rw-rw---- 1 root microk8s  8248608 Sep 21 22:58 0000000002842880-0000000002843333
-rw-rw---- 1 root microk8s     2224 Sep 17 10:14 cluster.crt
-rw-rw---- 1 root microk8s     3276 Sep 17 10:14 cluster.key
-rw-rw---- 1 root microk8s      200 Sep 21 22:58 cluster.yaml
-rw-rw---- 1 root microk8s       60 Sep 17 10:21 info.yaml
srw-rw---- 1 root microk8s        0 Sep 21 22:41 kine.sock
-rw-rw---- 1 root microk8s       32 Sep 21 22:58 metadata1
-rw-rw---- 1 root microk8s       32 Sep 21 22:58 metadata2
-rw-rw---- 1 root microk8s 12230792 Sep 21 22:06 snapshot-2036-2841277-351540538
-rw-rw---- 1 root microk8s      128 Sep 21 22:06 snapshot-2036-2841277-351540538.meta
-rw-rw---- 1 root microk8s 11588072 Sep 21 22:41 snapshot-2041-2842301-353599873
-rw-rw---- 1 root microk8s      128 Sep 21 22:41 snapshot-2041-2842301-353599873.meta
-rw-rw---- 1 root microk8s 13676544 Sep 21 22:58 snapshot-2043-2843325-354658372
-rw-rw---- 1 root microk8s      128 Sep 21 22:58 snapshot-2043-2843325-354658372.meta

Compare this with the remaining node in the cluster, shown below: the dqlite replication sequence numbers show that the first node has the most recent data and should be used as the source of truth for recovering the cluster.

Backend dqlite data on the third (out-of-sync) node:

# ls -l /var/snap/microk8s/current/var/kubernetes/backend
total 175716
-rw-rw---- 1 root microk8s  8380744 Sep 21 06:17 0000000002713664-0000000002714240
-rw-rw---- 1 root microk8s  8380976 Sep 21 06:19 0000000002714241-0000000002714591
-rw-rw---- 1 root microk8s  8379032 Sep 21 06:21 0000000002714592-0000000002714972
-rw-rw---- 1 root microk8s  6830856 Sep 21 06:21 0000000002714973-0000000002715461
-rw-rw---- 1 root microk8s  8277328 Sep 21 06:22 0000000002715462-0000000002716085
-rw-rw---- 1 root microk8s  8377088 Sep 21 06:24 0000000002716086-0000000002716439
-rw-rw---- 1 root microk8s  8388392 Sep 21 06:26 0000000002716440-0000000002716836
-rw-rw---- 1 root microk8s  8384648 Sep 21 06:26 0000000002716837-0000000002717466
-rw-rw---- 1 root microk8s  7310848 Sep 21 06:32 0000000002717467-0000000002718008
-rw-rw---- 1 root microk8s  8380912 Sep 21 06:33 0000000002718009-0000000002718598
-rw-rw---- 1 root microk8s  8381696 Sep 21 06:34 0000000002718599-0000000002719187
-rw-rw---- 1 root microk8s  8375000 Sep 21 06:34 0000000002719188-0000000002719797
-rw-rw---- 1 root microk8s  8386736 Sep 21 06:34 0000000002719798-0000000002720399
-rw-rw---- 1 root microk8s  8363584 Sep 21 06:42 0000000002720400-0000000002721086
-rw-rw---- 1 root microk8s    28808 Sep 21 06:42 0000000002721087-0000000002721087
-rw-rw---- 1 root microk8s  8387112 Sep 21 06:42 0000000002721088-0000000002721696
-rw-rw---- 1 root microk8s  8380256 Sep 21 06:42 0000000002721697-0000000002722322
-rw-rw---- 1 root microk8s     2224 Sep 17 10:14 cluster.crt
-rw-rw---- 1 root microk8s     3276 Sep 17 10:14 cluster.key
-rw-rw---- 1 root microk8s      200 Sep 21 06:42 cluster.yaml
-rw-rw---- 1 root microk8s       61 Sep 17 10:14 info.yaml
srw-rw---- 1 root microk8s        0 Sep 21 06:42 kine.sock
-rw-rw---- 1 root microk8s       32 Sep 21 06:42 metadata1
-rw-rw---- 1 root microk8s       32 Sep 21 06:42 metadata2
-rw-rw---- 1 root microk8s  8388608 Sep 21 06:43 open-3
-rw-rw---- 1 root microk8s  8388608 Sep 21 06:42 open-4
-rw-rw---- 1 root microk8s  8388608 Sep 21 06:42 open-5
-rw-rw---- 1 root microk8s 10887672 Sep 21 06:42 snapshot-1930-2721395-296105524
-rw-rw---- 1 root microk8s      128 Sep 21 06:42 snapshot-1930-2721395-296105524.meta
-rw-rw---- 1 root microk8s 12412072 Sep 21 06:42 snapshot-1930-2722419-296124261
-rw-rw---- 1 root microk8s      128 Sep 21 06:42 snapshot-1930-2722419-296124261.meta

From the replication sequence numbers on the third node, it is apparent that its data is out of date and that replication has failed on this node.
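
If the listings are long, a one-liner can make the comparison easier. This is only a convenience sketch, assuming the closed-segment files keep the zero-padded NNN-NNN naming shown above; run it on each node and compare the results:

# print the highest closed dqlite segment on this node
# (zero-padded names sort correctly with a plain lexical sort)
ls /var/snap/microk8s/current/var/kubernetes/backend | grep -E '^[0-9]+-[0-9]+$' | sort | tail -n 1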

Recovery

  1. Make sure microk8s is not running on any cluster node: run sudo snap stop microk8s (or sudo microk8s stop) on every node.
  2. Take a backup of the data directory on a known-good node (in this example, node 1 or 2), excluding the info.yaml, metadata1 and metadata2 files. For example, create a tarball of the data with: tar -c -v -z --exclude=*.yaml --exclude=metadata* -f dqlite-data.tar.gz /var/snap/microk8s/current/var/kubernetes/backend (note the *.yaml glob also skips cluster.yaml). This creates dqlite-data.tar.gz, containing a known-good replica of the data. A consolidated command sketch follows this list.
  3. Copy dqlite-data.tar.gz to any nodes with older data, for example with scp.
  4. On each node with stale data, switch to the root user with sudo su and change to the root of the filesystem with cd /.
  5. Still on the node(s) with stale data, extract the archive. If you copied it to /home/ubuntu with scp, run tar xzvf /home/ubuntu/dqlite-data.tar.gz.
  6. Verify that the files have been extracted into /var/snap/microk8s/current/var/kubernetes/backend; the latest sequence numbers in the data file names should now match between hosts.
  7. Before starting anything, compare the contents of /var/snap/microk8s/current/var/kubernetes/backend on each node. Make sure the data files (the numbered dqlite segment files, e.g. 0000000002834690-0000000002835307) match on every host; you can compare sha256sum results for each file to be sure. The list of files should match on each node, and the same applies to the snapshot-* files in the same directory. Once you are sure these files match, proceed to the next step.
  8. Start each node one at a time with sudo microk8s start, beginning with a server that previously had up-to-date data (in this example, node 1 or node 2, not node 3). If all data files are now in sync, microk8s should start after a short delay.
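
Putting the steps above together, the end-to-end flow looks roughly like the sketch below. It is a minimal outline of the commands, not a drop-in script: it assumes the healthy node is reachable as node1, the stale node as node3, that you can scp between them as the ubuntu user, and that the standard paths from this post apply; adjust names and paths for your environment.

# on every node: stop microk8s
sudo microk8s stop

# on the known-good node (node1 in this example): archive the dqlite data,
# excluding the node-specific yaml and metadata files
sudo tar -c -v -z --exclude=*.yaml --exclude=metadata* \
  -f /home/ubuntu/dqlite-data.tar.gz \
  /var/snap/microk8s/current/var/kubernetes/backend

# copy the archive to the stale node (node3 in this example)
scp /home/ubuntu/dqlite-data.tar.gz ubuntu@node3:/home/ubuntu/

# on the stale node: extract over the old data, starting from the filesystem root
sudo su
cd /
tar xzvf /home/ubuntu/dqlite-data.tar.gz

# on each node: compare checksums of the segment and snapshot files before starting
sha256sum /var/snap/microk8s/current/var/kubernetes/backend/0000000* \
          /var/snap/microk8s/current/var/kubernetes/backend/snapshot-*

# start the nodes one at a time, previously healthy node(s) first
sudo microk8s start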

Verification

Once each node has started and the microk8s start command has finished running on the last node, verify that every node has finished replicating and starting up by running microk8s status. After you start microk8s it may take 5-10 minutes before replication is up to date and all nodes are caught up and running; during this time the status may show a connection error, but after waiting it should show that the cluster has returned to full health. If you have to wait more than 10-15 minutes, validate the data files and repeat this process as necessary.

Once replication has recovered, you should also be able to run microk8s kubectl get all -A on each node and see all cluster resources, confirming that microk8s is back to full health.
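
As a rough verification pass on each node (the --wait-ready flag blocks until the local node reports ready; drop it if your microk8s release behaves differently):

# wait for the local node to report ready, then check cluster-wide state
microk8s status --wait-ready
microk8s kubectl get nodes
microk8s kubectl get all -A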

thanks for this. Shall I find a place in the documentation for it?

I vote for yes please! As this will be super helpful. :blush: