Issue Joining a New Master Node to an Existing Cluster


Cluster information:

Kubernetes version: 1.28.15
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.4 LTS
CNI and version: weaveworks/weave-kube:latest
CRI and version: cri-dockerd 0.3.4

Issue Summary:

I had a single-master, single-worker cluster running Kubernetes 1.27. Since Kubernetes 1.27 was no longer available via the package repository, I upgraded the cluster to 1.28 before attempting to add a new master node.

After preparing the new node with the required prerequisites, I attempted to join it using kubeadm, but the process got stuck at the following step:

[etcd] Announced new etcd member joining to the existing etcd cluster
[etcd] Creating static Pod manifest for "etcd"
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s

[kubelet-check] Initial timeout of 40s passed.
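For reference, the control-plane join was invoked along these lines (the token, CA hash, and certificate key below are placeholders, not the actual values used):

```shell
# Hypothetical reconstruction of the join command; <token>, <hash>, and
# <certificate-key> are placeholders for the values generated on the
# existing master (e.g. via `kubeadm init phase upload-certs --upload-certs`
# and `kubeadm token create --print-join-command`).
kubeadm join <current-server-ip>:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>
```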

At the same time, the cluster became unstable: the kube-apiserver pod started crashing due to an etcd failure. Checking the etcd container logs revealed the following errors:

{"level":"info","ts":"2025-03-10T18:28:48.145367Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d is starting a new election at term 2"}
{"level":"warn","ts":"2025-03-10T18:28:48.285205Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"2176616c9266be39","rtt":"0s","error":"dial tcp [server-ip]:2380: connect: connection refused"}
{"level":"warn","ts":"2025-03-10T18:28:48.354150Z","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":7587885302587698060,"retry-timeout":"500ms"}

It seems that attempting to join the new master node disrupted the etcd setup, rendering the cluster unstable.

Current etcd Member Status (etcdctl member list):

2176616c9266be39, unstarted, , https://[new-server-ip]:2380, 
8e9e05c52164694d, started, k8s-master, http://localhost:2380, https://[current-server-ip]:2379

The new etcd member (2176616c9266be39) appears unstarted, while the existing member (8e9e05c52164694d) is running. Note that the existing member advertises http://localhost:2380 as its peer URL rather than a routable address, which may be why the new member cannot establish a peer connection to it.
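As a recovery step (to get the cluster stable again before retrying), the stuck, unstarted member can be removed from the etcd member list. This is a sketch using the member ID from the output above and the default kubeadm certificate paths:

```shell
# Remove the stuck, unstarted member so the remaining single member
# returns to a clean one-node configuration. The member ID comes from
# the `member list` output above.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member remove 2176616c9266be39
```

After removing the member, `kubeadm reset` should be run on the new node before attempting another join, so no stale etcd state or manifests remain there.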

Key Observations:

  • Adding the new master node disrupts etcd, causing the kube-apiserver to crash.
  • The new etcd member is stuck in an unstarted state.
  • If I attempt to add the new node as a worker instead of a master, it joins successfully without any issues.
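To narrow down why the new member never starts, the kubelet logs and the etcd container state on the new node can be inspected during the join attempt. A sketch, assuming cri-dockerd is backing the kubelet (so static pods appear as Docker containers):

```shell
# On the NEW master node, while the join is stuck:

# Watch the kubelet for static-pod / etcd startup errors.
journalctl -u kubelet -f

# List etcd containers (including exited ones) and grab their logs.
docker ps -a --filter name=etcd
docker logs <etcd-container-id>
```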

Request for Help:

  1. What steps should I take to successfully add this new master node without breaking etcd?

Additional Notes:

  • The OS versions, container runtime versions (Docker), and cri-dockerd versions on the new and existing servers differ.
  • This version mismatch is why I upgraded Kubernetes on the existing cluster before attempting to add the new master.
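To document the exact skew between the two nodes, the relevant versions can be collected on each machine. A sketch:

```shell
# Run on both the existing master and the new node, and compare.
kubeadm version -o short
kubelet --version
docker --version
cri-dockerd --version
lsb_release -d
```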

Would appreciate any insights or guidance. Thanks in advance!