Error while dialing dial tcp 192.168.95.10:2379: connect: connection refused

Hi,
I am quite new to administration and Kubernetes in general, but I had to take over the management of our cluster from the previous person. The cluster consists of 3 worker nodes and 3 master nodes that are VMs built on the same physical machines (each master node is a VM running on the same machine as one of the worker nodes).

My current problem is that the cluster crashed and I can't think of a safe way to bring it back to life. My suspicion is that there is a problem with memory: some old snapshots may be occupying the cache, as no one has done any cleanup on the nodes recently.

When executing any command that accesses the Kubernetes API, such as `kubectl get pods`, the response I receive is:
The connection to the server localhost:6445 was refused - did you specify the right host or port?

But since nothing has changed in the configuration files and everything worked previously, I believe the host and port are correct.
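
For completeness, this is roughly how I confirmed which API server address kubectl is pointing at (it just reads the `server:` field of the active kubeconfig, so it works even while the API server is down):

```
# Print the API server URL from the active kubeconfig (should show localhost:6445)
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
```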

When executing `ETCDCTL_API=3 etcdctl endpoint health` with the endpoints and certificates specified (the full command is sketched below), the output is:
{"level":"warn","ts":"2023-10-02T10:50:54.191Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-ddd51cd3-4591-4c30-91ce-cc1268f240d3/192.168.95.10:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 192.168.95.10:2379: connect: connection refused\""} Error: failed to fetch endpoints from etcd cluster member list: context deadline exceeded

I suspect that I should clean up the space used by snapshots etc. on the worker nodes, but I am not sure how to do that without removing the data saved in some of the PVCs.
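
In case it is relevant, this is roughly how I plan to check how much space etcd is actually using on the master VMs before touching anything (assuming the default kubeadm data directory /var/lib/etcd):

```
# Free space on the filesystem holding the etcd data directory
df -h /var/lib/etcd

# Size of the etcd write-ahead log and snapshot directories
sudo du -sh /var/lib/etcd/member/wal /var/lib/etcd/member/snap
```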

So my questions are: how can I fix this issue, and where can I find reliable information about k8s administration so that I can avoid such situations in the future?

Thanks in advance!

Cluster information:

Kubernetes version:
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.3 LTS
CNI and version: calico
CRI and version: containerd
