Recovering failed ha-cluster enabled microk8s deployments
Background
When using the ha-cluster add-on for microk8s 1.19 and newer, which is now the default microk8s clustering method, microk8s uses an alternative backing store based on dqlite instead of etcd.
Whilst work is always ongoing to improve the stability and self-healing ability of both microk8s and dqlite, in the case of catastrophic cluster failure or data corruption you may find yourself having to recover your cluster manually. The process is different to recovering an etcd cluster, but is fairly straightforward.
The following steps will restore a failed microk8s cluster assuming that at least one node in the cluster has non-corrupted data.
Identify a single node with healthy data
Identifying which node has the most up-to-date data is the first step in recovering a dqlite/microk8s ha-cluster datastore. When ha-cluster is enabled, the dqlite data is stored in the /var/snap/microk8s/current/var/kubernetes/backend directory.
An example follows. In this cluster there are two healthy nodes and one corrupted or out-of-sync node. The entire cluster is down, and will not return to a healthy state if started.
First and second example node backend dqlite data:
# ls -l /var/snap/microk8s/current/var/kubernetes/backend
total 180808
-rw-rw---- 1 root microk8s 8381048 Sep 21 20:42 0000000002833997-0000000002834348
-rw-rw---- 1 root microk8s 8369456 Sep 21 20:43 0000000002834349-0000000002834881
-rw-rw---- 1 root microk8s 8383280 Sep 21 20:43 0000000002834882-0000000002835492
-rw-rw---- 1 root microk8s 437016 Sep 21 20:59 0000000002835493-0000000002835520
-rw-rw---- 1 root microk8s 8381768 Sep 21 20:59 0000000002835521-0000000002836110
-rw-rw---- 1 root microk8s 8370896 Sep 21 20:59 0000000002836111-0000000002836720
-rw-rw---- 1 root microk8s 6001392 Sep 21 21:00 0000000002836721-0000000002837139
-rw-rw---- 1 root microk8s 8388304 Sep 21 21:10 0000000002837140-0000000002837536
-rw-rw---- 1 root microk8s 8378144 Sep 21 21:11 0000000002837537-0000000002838019
-rw-rw---- 1 root microk8s 8380544 Sep 21 21:14 0000000002838020-0000000002838364
-rw-rw---- 1 root microk8s 7976984 Sep 21 21:33 0000000002838365-0000000002838809
-rw-rw---- 1 root microk8s 8383424 Sep 21 21:33 0000000002838810-0000000002839422
-rw-rw---- 1 root microk8s 8375704 Sep 21 21:49 0000000002839423-0000000002840044
-rw-rw---- 1 root microk8s 3894160 Sep 21 21:49 0000000002840045-0000000002840323
-rw-rw---- 1 root microk8s 8382136 Sep 21 22:06 0000000002840324-0000000002840862
-rw-rw---- 1 root microk8s 8175792 Sep 21 22:06 0000000002840863-0000000002841499
-rw-rw---- 1 root microk8s 4116408 Sep 21 22:23 0000000002841500-0000000002841736
-rw-rw---- 1 root microk8s 8384048 Sep 21 22:41 0000000002841737-0000000002842247
-rw-rw---- 1 root microk8s 8163120 Sep 21 22:41 0000000002842248-0000000002842879
-rw-rw---- 1 root microk8s 8248608 Sep 21 22:58 0000000002842880-0000000002843333
-rw-rw---- 1 root microk8s 2224 Sep 17 10:14 cluster.crt
-rw-rw---- 1 root microk8s 3276 Sep 17 10:14 cluster.key
-rw-rw---- 1 root microk8s 200 Sep 21 22:58 cluster.yaml
-rw-rw---- 1 root microk8s 60 Sep 17 10:21 info.yaml
srw-rw---- 1 root microk8s 0 Sep 21 22:41 kine.sock
-rw-rw---- 1 root microk8s 32 Sep 21 22:58 metadata1
-rw-rw---- 1 root microk8s 32 Sep 21 22:58 metadata2
-rw-rw---- 1 root microk8s 12230792 Sep 21 22:06 snapshot-2036-2841277-351540538
-rw-rw---- 1 root microk8s 128 Sep 21 22:06 snapshot-2036-2841277-351540538.meta
-rw-rw---- 1 root microk8s 11588072 Sep 21 22:41 snapshot-2041-2842301-353599873
-rw-rw---- 1 root microk8s 128 Sep 21 22:41 snapshot-2041-2842301-353599873.meta
-rw-rw---- 1 root microk8s 13676544 Sep 21 22:58 snapshot-2043-2843325-354658372
-rw-rw---- 1 root microk8s 128 Sep 21 22:58 snapshot-2043-2843325-354658372.meta
Compare this to another node in the cluster: from the dqlite replication sequence numbers we can see that the first node has the most recent data, and should be used as the source of truth for recovering the cluster.
Third out-of-sync node backend dqlite data:
# ls -l /var/snap/microk8s/current/var/kubernetes/backend
total 175716
-rw-rw---- 1 root microk8s 8380744 Sep 21 06:17 0000000002713664-0000000002714240
-rw-rw---- 1 root microk8s 8380976 Sep 21 06:19 0000000002714241-0000000002714591
-rw-rw---- 1 root microk8s 8379032 Sep 21 06:21 0000000002714592-0000000002714972
-rw-rw---- 1 root microk8s 6830856 Sep 21 06:21 0000000002714973-0000000002715461
-rw-rw---- 1 root microk8s 8277328 Sep 21 06:22 0000000002715462-0000000002716085
-rw-rw---- 1 root microk8s 8377088 Sep 21 06:24 0000000002716086-0000000002716439
-rw-rw---- 1 root microk8s 8388392 Sep 21 06:26 0000000002716440-0000000002716836
-rw-rw---- 1 root microk8s 8384648 Sep 21 06:26 0000000002716837-0000000002717466
-rw-rw---- 1 root microk8s 7310848 Sep 21 06:32 0000000002717467-0000000002718008
-rw-rw---- 1 root microk8s 8380912 Sep 21 06:33 0000000002718009-0000000002718598
-rw-rw---- 1 root microk8s 8381696 Sep 21 06:34 0000000002718599-0000000002719187
-rw-rw---- 1 root microk8s 8375000 Sep 21 06:34 0000000002719188-0000000002719797
-rw-rw---- 1 root microk8s 8386736 Sep 21 06:34 0000000002719798-0000000002720399
-rw-rw---- 1 root microk8s 8363584 Sep 21 06:42 0000000002720400-0000000002721086
-rw-rw---- 1 root microk8s 28808 Sep 21 06:42 0000000002721087-0000000002721087
-rw-rw---- 1 root microk8s 8387112 Sep 21 06:42 0000000002721088-0000000002721696
-rw-rw---- 1 root microk8s 8380256 Sep 21 06:42 0000000002721697-0000000002722322
-rw-rw---- 1 root microk8s 2224 Sep 17 10:14 cluster.crt
-rw-rw---- 1 root microk8s 3276 Sep 17 10:14 cluster.key
-rw-rw---- 1 root microk8s 200 Sep 21 06:42 cluster.yaml
-rw-rw---- 1 root microk8s 61 Sep 17 10:14 info.yaml
srw-rw---- 1 root microk8s 0 Sep 21 06:42 kine.sock
-rw-rw---- 1 root microk8s 32 Sep 21 06:42 metadata1
-rw-rw---- 1 root microk8s 32 Sep 21 06:42 metadata2
-rw-rw---- 1 root microk8s 8388608 Sep 21 06:43 open-3
-rw-rw---- 1 root microk8s 8388608 Sep 21 06:42 open-4
-rw-rw---- 1 root microk8s 8388608 Sep 21 06:42 open-5
-rw-rw---- 1 root microk8s 10887672 Sep 21 06:42 snapshot-1930-2721395-296105524
-rw-rw---- 1 root microk8s 128 Sep 21 06:42 snapshot-1930-2721395-296105524.meta
-rw-rw---- 1 root microk8s 12412072 Sep 21 06:42 snapshot-1930-2722419-296124261
-rw-rw---- 1 root microk8s 128 Sep 21 06:42 snapshot-1930-2722419-296124261.meta
It is apparent from the replication sequence numbers that the third node is out of date, and that replication has failed on it.
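The sequence-number comparison above can be scripted rather than done by eye. A minimal sketch, assuming a POSIX shell on each node; the latest_closed_segment helper is our own name, and the default path is the backend directory shown above. Run it on every node and compare the printed values: the node with the largest value holds the most recent data.

```shell
#!/bin/sh
# Print the highest end-sequence number among the closed dqlite
# segment files (named <start>-<end>) in a backend directory.
# usage: latest_closed_segment [backend-dir]
latest_closed_segment() {
    dir="${1:-/var/snap/microk8s/current/var/kubernetes/backend}"
    ls "$dir" \
        | grep -E '^[0-9]+-[0-9]+$' \
        | cut -d- -f2 \
        | sort -n \
        | tail -n 1
}
```

The grep pattern deliberately skips open-*, snapshot-*, metadata and yaml files, so only closed segment files are compared.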
Recovery
- Ensure that microk8s is not running on any of the cluster nodes, using sudo snap stop microk8s or sudo microk8s stop.
- Take a backup of the data from a known-good node (in this example, node 1 or 2), excluding the node-specific info.yaml, cluster.yaml, metadata1 and metadata2 files. For example, to create a tarball of the data: tar -c -v -z --exclude=*.yaml --exclude=metadata* -f dqlite-data.tar.gz /var/snap/microk8s/current/var/kubernetes/backend. This will create dqlite-data.tar.gz, containing a known-good replica of the data.
- Copy dqlite-data.tar.gz to each node with older data, for example using scp.
- On each node with stale data, switch to the root user with sudo su, and change to the root of the filesystem with cd /.
- Still on the node(s) with stale data, extract the archive. If you copied the archive to the /home/ubuntu directory with scp, run tar zxvf /home/ubuntu/dqlite-data.tar.gz.
- Verify that the updated files have been extracted into /var/snap/microk8s/current/var/kubernetes/backend; the latest sequence numbers in the data file names should match between hosts.
- Before proceeding, compare the files in /var/snap/microk8s/current/var/kubernetes/backend on each node. Make sure that the data files (the numbered dqlite segment files, e.g. 0000000002834690-0000000002835307) match on every host; you can compare sha256sum results for each file to be sure. The list of files should be identical on each node. Check the snapshot-* files in the same directory in the same way. Once you are sure these files match, proceed to the next step.
- Start each node one at a time with sudo microk8s start, beginning with a node which previously had up-to-date data (in this example, node 1 or node 2, not node 3). If all data files are now in sync, microk8s should start after a short delay.
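The per-file checksum comparison in the steps above can be sketched as follows, assuming sha256sum from GNU coreutils is available on each node; the manifest helper name and the /tmp output path are our own choices. Each node writes a sorted manifest which can then be copied to one host and compared with diff.

```shell
#!/bin/sh
# Write a sorted sha256 manifest of the dqlite segment and snapshot
# files, so the copies on each node can be compared with diff.
manifest() {
    # Only the numbered segment files and snapshot-* files are
    # expected to match between nodes; yaml, metadata and open-*
    # files are node-specific and are skipped.
    ( cd "$1" && ls \
        | grep -E '^([0-9]+-[0-9]+|snapshot-.*)$' \
        | xargs -r sha256sum ) \
        | sort -k 2
}

backend=/var/snap/microk8s/current/var/kubernetes/backend
[ -d "$backend" ] && manifest "$backend" > "/tmp/backend-$(hostname).sha256"
# Then copy the manifests to one host and compare, e.g.:
#   diff backend-node1.sha256 backend-node2.sha256
```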
Verification
Once each node has started, and the microk8s start command has finished running on the last node, verify that each node has finished replicating and starting up using microk8s status. After you start microk8s, it may take 5-10 minutes for replication to catch up and for all nodes to be up to date and running, so please be aware of this; the status may show a connection error during this time, but after waiting the cluster should return to full health. If you have to wait more than 10-15 minutes, validate the data files and repeat this process as necessary.
You should also be able to run microk8s kubectl get all -A on each node and see all cluster resources once replication has recovered, validating that microk8s is back to full health.
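Rather than re-running microk8s status by hand during the catch-up window, you can poll it in a loop. A sketch, assuming a POSIX shell; the wait_ready helper name and the 15-minute ceiling (90 tries at 10 seconds) are our own choices.

```shell
#!/bin/sh
# Poll a command until it succeeds, allowing for the 5-10 minute
# replication catch-up described above.
# usage: wait_ready "command" [tries] [delay-seconds]
wait_ready() {
    cmd="$1"; tries="${2:-90}"; delay="${3:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if $cmd >/dev/null 2>&1; then
            echo "ready after $((i * delay))s"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "not ready after $((tries * delay))s" >&2
    return 1
}
# On each node:
#   wait_ready "microk8s status --wait-ready" 90 10
```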