Cluster information:
Kubernetes version: 1.28.15
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.4 LTS
CNI and version: weaveworks/weave-kube:latest
CRI and version: cri-dockerd 0.3.4
Issue Summary:
I had a single-master, single-worker cluster running Kubernetes 1.27. Since Kubernetes 1.27 was no longer available via the package repository, I upgraded the cluster to 1.28 before attempting to add a new master node.
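For context, the upgrade followed the standard kubeadm flow, roughly like this (the exact patch-version pin is illustrative, not the literal command history):

```
# On the master: upgrade kubeadm first, then plan and apply the upgrade
apt-get update && apt-get install -y kubeadm='1.28.15-*'
kubeadm upgrade plan
kubeadm upgrade apply v1.28.15

# Then upgrade kubelet/kubectl and restart the kubelet
apt-get install -y kubelet='1.28.15-*' kubectl='1.28.15-*'
systemctl daemon-reload && systemctl restart kubelet
```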
After preparing the new node with the required prerequisites, I attempted to join it using `kubeadm`, but the process got stuck at the following step:
```
[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.
```
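For reference, the join was issued as a control-plane join, with a command of this shape (token, CA cert hash, and certificate key redacted):

```
kubeadm join [current-server-ip]:6443 \
    --token <redacted> \
    --discovery-token-ca-cert-hash sha256:<redacted> \
    --control-plane \
    --certificate-key <redacted>
```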
At the same time, the cluster became unstable: the kube-apiserver pod started crashing due to an etcd failure. Checking the etcd container logs revealed the following errors:
{"level":"info","ts":"2025-03-10T18:28:48.145367Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d is starting a new election at term 2"}
{"level":"warn","ts":"2025-03-10T18:28:48.285205Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"2176616c9266be39","rtt":"0s","error":"dial tcp [server-ip]:2380: connect: connection refused"}
{"level":"warn","ts":"2025-03-10T18:28:48.354150Z","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":7587885302587698060,"retry-timeout":"500ms"}
It seems that attempting to join the new master node disrupted the etcd setup, rendering the cluster unstable.
Current etcd Member Status (`etcdctl member list`):
```
2176616c9266be39, unstarted, , https://[new-server-ip]:2380,
8e9e05c52164694d, started, k8s-master, http://localhost:2380, https://[current-server-ip]:2379
```
The new etcd member (`2176616c9266be39`) appears unstarted, while the existing member (`8e9e05c52164694d`) is running.
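For reference, the logs and member list above can be gathered with commands along these lines (the runtime here is Docker via cri-dockerd, and the certificate paths are the kubeadm defaults, so adjust if yours differ):

```
# etcd container logs (Docker runtime via cri-dockerd)
docker ps | grep etcd
docker logs <etcd-container-id>

# Member list, using the default kubeadm etcd certificate paths
ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list
```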
Key Observations:
- Adding the new master node disrupts etcd, causing the kube-apiserver to crash.
- The new etcd member is stuck in an unstarted state (see the cleanup sketch after this list).
- If I attempt to add the new node as a worker instead of a master, it joins successfully without any issues.
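To recover between attempts, the stuck member can be removed and the partial join wiped, roughly like this (my own sketch based on the etcdctl and kubeadm docs, not a verified procedure):

```
# On the existing master: remove the stuck, unstarted etcd member
ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member remove 2176616c9266be39

# On the new node: wipe the partial join state before retrying
kubeadm reset -f
```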
Request for Help:
- What steps should I take to successfully add this new master node without breaking etcd?
Additional Notes:
- The OS versions, Docker versions, and cri-dockerd versions on the new and existing servers differ (see the version-check commands after this list).
- This discrepancy is why I had to upgrade Kubernetes before adding the new master.
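For completeness, the relevant versions on each machine can be compared with commands like:

```
kubeadm version -o short
kubelet --version
docker version --format '{{.Server.Version}}'
cri-dockerd --version
lsb_release -d
```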
Would appreciate any insights or guidance. Thanks in advance!