Hello,
we hit the same issue today, and I was able to solve it as follows.
On my control plane node (where the etcd static pod runs):
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
endpoint status --write-out=table
Output looks like this:
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://127.0.0.1:2379 | 123456789ABCDEFG |   3.5.6 |  2.1 GB |      true |       985 |  269601165 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
The etcd database has grown very large and exceeds the backend quota (2 GiB by default), so etcd raises an alarm and stops accepting writes. Check the etcd alarms:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
alarm list
Output shows the alarm: NOSPACE:
memberID:123456789ABCDEFG alarm:NOSPACE
Get the current revision, compact the keyspace up to it, and defragment the database:
REVISION=$(ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
endpoint status --write-out="json" | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+')
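For reference, here is the extraction pipeline in isolation, run against an abbreviated sample of the JSON (values are made up for illustration; the real output of endpoint status contains more fields). This makes it easy to sanity-check the parsing before running it against a live cluster; grep -E is equivalent to the egrep used above:

```shell
# Abbreviated, hypothetical sample of `etcdctl endpoint status --write-out=json`.
STATUS_JSON='[{"Endpoint":"https://127.0.0.1:2379","Status":{"header":{"revision":269601165}}}]'

# Pull out the numeric revision with the same two-step grep pipeline.
REVISION=$(echo "$STATUS_JSON" | grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+')
echo "$REVISION"   # prints 269601165
```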
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
compact ${REVISION}
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
defrag
Output should look like this:
compacted revision 111111111
Finished defragmenting etcd member[https://127.0.0.1:2379]
The NOSPACE alarm is not cleared automatically; disarm it so that etcd accepts writes again:
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt \
--key=/etc/kubernetes/pki/etcd/peer.key \
alarm disarm
When you now repeat the status request from the beginning, it should look like this:
+------------------------+------------------+---------+---------+-----------+-----------+------------+
|        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://127.0.0.1:2379 | 123456789ABCDEFG |   3.5.6 |  228 MB |      true |       985 |  269601165 |
+------------------------+------------------+---------+---------+-----------+-----------+------------+
Note that the DB SIZE is now a lot smaller again.
After that, my cluster worked again.
We will now enable automatic compaction and, to be safe, increase the backend quota. See details for etcd maintenance here: etcd maintenance docs
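As a sketch of that follow-up on a kubeadm cluster (the file path, retention window, and quota value below are assumptions; adjust them to your setup), the corresponding etcd flags can be added to the static pod manifest, e.g. /etc/kubernetes/manifests/etcd.yaml:

```yaml
# Excerpt from the etcd static pod spec; only the added flags are shown.
spec:
  containers:
  - command:
    - etcd
    # ... existing flags ...
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=8h    # compact history older than 8 hours
    - --quota-backend-bytes=8589934592  # raise the quota to 8 GiB (default is 2 GiB)
```

The kubelet restarts the etcd pod automatically when the manifest changes; a periodic defragmentation is still needed separately, since compaction alone does not return disk space.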
Hope this helps someone solve their problems, too.
Kind regards
Timo (b+m Informatik AG)