AWS Unable to connect to the Kubernetes API server on port 6443

Cluster information:

Kubernetes version: 1.28.2
Cloud being used: AWS
Installation method: command line
Host OS: Ubuntu

I have installed Kubernetes on 3 servers: one control plane and two workers. If I issue kubectl get nodes, I receive the error: The connection to the server 172.31.20.146:6443 was refused - did you specify the right host or port?

Once I reboot the server, I can see all my nodes:

ubuntu@controlplane:~$ kubectl get nodes
NAME           STATUS   ROLES           AGE    VERSION
controlplane   Ready    control-plane   136m   v1.28.2
worker1        Ready    <none>          109m   v1.28.2
worker2        Ready    <none>          108m   v1.28.2
ubuntu@controlplane:~$ kubectl get all
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   136m

But after a minute it goes down and I see the error again: The connection to the server 172.31.20.146:6443 was refused - did you specify the right host or port?

I have enough resources, such as CPU and memory.

I was on a call with AWS support and they confirmed that the security groups have the correct access.

There are no issues with any of the VPC components for this EC2 instance. The details are as follows:

- Security Group allows all inbound traffic from 172.31.0.0/16 and allows all outbound traffic.
- Network ACL is open for all incoming/outgoing traffic.
- Route Table routes VPC traffic 172.31.0.0/16 locally.
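To double-check basic reachability at the OS level, port 6443 can also be probed directly from a worker while the API server is up (assuming netcat is installed), for example:

nc -zv 172.31.20.146 6443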

Hi,
Worth checking the kubelet and kube-apiserver logs. At the moment there is no info to work with.
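For example, on the control plane node (assuming a kubeadm install with containerd, so crictl is available), something along these lines:

# kubelet runs as a systemd unit, so its logs are in the journal
journalctl -u kubelet --no-pager -n 100

# kube-apiserver runs as a static pod; if kubectl is down,
# ask the container runtime directly
sudo crictl ps -a | grep kube-apiserver
sudo crictl logs <apiserver-container-id>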

The issue you’re encountering, where the Kubernetes API server at 172.31.20.146:6443 becomes inaccessible shortly after a reboot, suggests a problem that could be related to the Kubernetes control plane components themselves, networking on your host, or resource constraints that only manifest under certain conditions.

After rebooting and when the nodes are visible, quickly check the status of all control plane components. You can do this by running the following command on your control plane node:


sudo kubectl get pods --namespace=kube-system

Look for any components that are not in the Running state or are showing repeated restarts.
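If the kube-apiserver pod is the one failing, its events and previous logs usually point at the root cause, for example (in a kubeadm cluster the pod is named kube-apiserver-<node-name>):

kubectl -n kube-system describe pod kube-apiserver-controlplane
kubectl -n kube-system logs kube-apiserver-controlplane --previous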

Thank you @timwolfe94022 for this ChatGPT-generated answer. I may not have great knowledge of Kubernetes, but I do know how to use ChatGPT.

@fox-md, here is the output from journalctl -u kubelet:

Feb 16 13:15:17 controlplane kubelet[19063]: E0216 13:15:17.265276   19063 run.go:74] "command failed" err="failed to l>
Feb 16 13:15:17 controlplane systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 13:15:17 controlplane systemd[1]: kubelet.service: Failed with result 'exit-code'.
Feb 16 13:15:27 controlplane systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 9.
Feb 16 13:15:27 controlplane systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Feb 16 13:15:27 controlplane systemd[1]: Started kubelet: The Kubernetes Node Agent.
Feb 16 13:15:27 controlplane kubelet[19069]: E0216 13:15:27.517399   19069 run.go:74] "command failed" err="failed to l>
Feb 16 13:15:27 controlplane systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 13:15:27 controlplane systemd[1]: kubelet.service: Failed with result 'exit-code'.
Feb 16 13:15:37 controlplane systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 10.
Feb 16 13:15:37 controlplane systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Feb 16 13:15:37 controlplane systemd[1]: Started kubelet: The Kubernetes Node Agent.
Feb 16 13:15:37 controlplane kubelet[19075]: E0216 13:15:37.771138   19075 run.go:74] "command failed" err="failed to l>
Feb 16 13:15:37 controlplane systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Feb 16 13:15:37 controlplane systemd[1]: kubelet.service: Failed with result 'exit-code'.
Feb 16 13:15:47 controlplane systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 11.

And from journalctl -u kubelet | grep kube-apiserver:

Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.489152   12745 topology_manager.go:215] "Topology Admit Handler" podUID="b9adef3da2c2babf52f5eae940481211" podNamespace="kube-system" podName="kube-apiserver-controlplane"
Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.594373   12745 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-share-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/b9adef3da2c2babf52f5eae940481211-usr-share-ca-certificates\") pod \"kube-apiserver-controlplane\" (UID: \"b9adef3da2c2babf52f5eae940481211\") " pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.594603   12745 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"ca-certs\" (UniqueName: \"kubernetes.io/host-path/b9adef3da2c2babf52f5eae940481211-ca-certs\") pod \"kube-apiserver-controlplane\" (UID: \"b9adef3da2c2babf52f5eae940481211\") " pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.594759   12745 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/b9adef3da2c2babf52f5eae940481211-etc-ca-certificates\") pod \"kube-apiserver-controlplane\" (UID: \"b9adef3da2c2babf52f5eae940481211\") " pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.594794   12745 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"k8s-certs\" (UniqueName: \"kubernetes.io/host-path/b9adef3da2c2babf52f5eae940481211-k8s-certs\") pod \"kube-apiserver-controlplane\" (UID: \"b9adef3da2c2babf52f5eae940481211\") " pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:10 controlplane kubelet[12745]: I0219 10:49:10.594825   12745 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-local-share-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/b9adef3da2c2babf52f5eae940481211-usr-local-share-ca-certificates\") pod \"kube-apiserver-controlplane\" (UID: \"b9adef3da2c2babf52f5eae940481211\") " pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:40 controlplane kubelet[12745]: E0219 10:49:40.668709   12745 kubelet.go:1890] "Failed creating a mirror pod for" err="Post \"https://172.31.20.146:6443/api/v1/namespaces/kube-system/pods\": dial tcp 172.31.20.146:6443: connect: connection refused" pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:41 controlplane kubelet[12745]: E0219 10:49:41.400598   12745 kubelet.go:1890] "Failed creating a mirror pod for" err="Post \"https://172.31.20.146:6443/api/v1/namespaces/kube-system/pods\": dial tcp 172.31.20.146:6443: connect: connection refused" pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:43 controlplane kubelet[12745]: E0219 10:49:43.725226   12745 kubelet.go:1890] "Failed creating a mirror pod for" err="pods \"kube-apiserver-controlplane\" already exists" pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:44 controlplane kubelet[12745]: E0219 10:49:44.423426   12745 kubelet.go:1890] "Failed creating a mirror pod for" err="pods \"kube-apiserver-controlplane\" already exists" pod="kube-system/kube-apiserver-controlplane"
Feb 19 10:49:50 controlplane kubelet[12745]: E0219 10:49:50.677600   12745 kubelet.go:1890] "Failed creating a mirror pod for" err="pods \"kube-apiserver-controlplane\" already exists" pod="kube-system/kube-apiserver-controlplane"

Hi,
Thank you for the logs. It seems that kubelet keeps crashing. Check the syslogs.

Hey @fox-md ,

Here is the output from syslog. It looks like there is an issue with the Calico network, but I am not sure I am reading this correctly.

Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.050058     838 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=kube-apiserver pod=kube-apiserver-controlplane_kube-system(b9adef3da2c2babf52f5eae940481211)\"" pod="kube-system/kube-apiserver-controlplane" podUID="b9adef3da2c2babf52f5eae940481211"
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.086927     838 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"9970fd54df49fc7e575a725a62cb9d78e2be0a9d85d3a3c8e515ba11a99acba3\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: Get \"https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused" podSandboxID="9970fd54df49fc7e575a725a62cb9d78e2be0a9d85d3a3c8e515ba11a99acba3"
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.086974     838 kuberuntime_manager.go:1375] "Failed to stop sandbox" podSandboxID={"Type":"containerd","ID":"9970fd54df49fc7e575a725a62cb9d78e2be0a9d85d3a3c8e515ba11a99acba3"}
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.122038     838 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"1cbe503c80d7ac609455ae1097678c32b222ba77473c2398e331bf2ca9106598\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: Get \"https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused" podSandboxID="1cbe503c80d7ac609455ae1097678c32b222ba77473c2398e331bf2ca9106598"
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.122085     838 kuberuntime_manager.go:1375] "Failed to stop sandbox" podSandboxID={"Type":"containerd","ID":"1cbe503c80d7ac609455ae1097678c32b222ba77473c2398e331bf2ca9106598"}
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.156718     838 remote_runtime.go:222] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox \"432025be92372752bc77e920d5549be53d12a473db22c086cde90ab25292a570\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: Get \"https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default\": dial tcp 10.96.0.1:443: connect: connection refused" podSandboxID="432025be92372752bc77e920d5549be53d12a473db22c086cde90ab25292a570"
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.156780     838 kuberuntime_manager.go:1375] "Failed to stop sandbox" podSandboxID={"Type":"containerd","ID":"432025be92372752bc77e920d5549be53d12a473db22c086cde90ab25292a570"}
Feb 18 08:59:48 controlplane kubelet[838]: E0218 08:59:48.156829     838 kuberuntime_manager.go:1075] "killPodWithSyncResult failed" err="failed to \"KillPodSandbox\" for \"9ecf61e6-9e90-40c7-a63b-fe42a151d6f0\" with KillPodSandboxError: \"rpc error: code = Unknown desc = failed to destroy network for sandbox \\\"432025be92372752bc77e920d5549be53d12a473db22c086cde90ab25292a570\\\": plugin type=\\\"calico\\\" failed (delete): error getting ClusterInformation: Get \\\"https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default\\\": dial tcp 10.96.0.1:443: connect: connection refused\""

Before I started I made sure I had connectivity between all the nodes.

Any clue what to do next?

Hi,
Calico cannot connect to the kube-apiserver. There must be a reason for the kubelet failure. You need to have kubelet in the running state, otherwise the static pods will not start.
Can you check the /var/log/syslog file? There should be some info related to kubelet.
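With a default kubeadm layout, the quickest checks are roughly:

# is kubelet actually running, and why did it last exit?
systemctl status kubelet

# static pod manifests that kubelet is supposed to start
ls /etc/kubernetes/manifests/
# expect etcd.yaml, kube-apiserver.yaml, kube-controller-manager.yaml, kube-scheduler.yaml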

Thank you @fox-md, it was Calico. I rebuilt the cluster with the new custom resources from Calico and everything works like a charm.

https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml
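In case anyone else lands here: the usual operator-based install goes roughly like this (the cidr in custom-resources.yaml has to match the --pod-network-cidr passed to kubeadm init):

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/custom-resources.yaml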