Kubernetes cluster doesn't start after installation - I am clueless

Hi all,
I created a single-master, two-node cluster with a cluster endpoint. Everything went smoothly until I restarted the master node, after which the cluster no longer starts.

The details of the cluster
192.168.0.201 kube001 - MASTER
192.168.0.201 cluster-endpoint - ENDPOINT, same as master
192.168.0.202 kube002 - NODE
192.168.0.203 kube003 - NODE

network = Weave Net
lb = MetalLB
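
For reference, the cluster was initialized along these lines (a sketch from memory, not my exact commands; cluster-endpoint resolves to the master via /etc/hosts on every node):

# on kube001 (master)
$ sudo kubeadm init --control-plane-endpoint=cluster-endpoint:6443 --upload-certs
# on kube002 and kube003, using the join command printed by kubeadm init
$ sudo kubeadm join cluster-endpoint:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>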

The only step beyond a typical cluster installation was installing cri-dockerd, after which I had to make the change below in config.toml to get the cluster to start.
# /etc/containerd/config.toml
Changed from
disabled_plugins = ["cri"]
to
disabled_plugins = []
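
For anyone repeating this step: after editing the file, restarting containerd and checking that the cri plugin actually loads confirms the change took effect (the ctr CLI ships with containerd):

$ sudo systemctl restart containerd
$ sudo ctr plugins ls | grep cri
# the io.containerd.grpc.v1 cri plugin should report STATUS "ok"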

After the restart, I get the following error:
The connection to the server cluster-endpoint:6443 was refused - did you specify the right host or port?
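
At this point a quick sanity check is whether anything is listening on the API server port at all (a sketch; the crictl endpoint assumes containerd's default socket):

$ sudo ss -tlnp | grep 6443
$ sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube-apiserver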

Checking the logs, I can see that the cluster fails to start even though the kubelet and docker services are running. The logs below, extracted from /var/log/messages while starting the kubelet service, point to cluster-related issues.

Jun 16 22:59:30 kube001 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jun 16 22:59:30 kube001 kubelet[10245]: Flag --container-runtime has been deprecated, will be removed in 1.27 as the only valid value is 'remote'
Jun 16 22:59:30 kube001 kubelet[10245]: Flag --pod-infra-container-image has been deprecated, will be removed in 1.27. Image garbage collector will get sandbox image information from CRI.
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.620368 10245 server.go:193] "--pod-infra-container-image will not be pruned by the image garbage collector in kubelet and should also be set in the remote runtime"
Jun 16 22:59:30 kube001 kubelet[10245]: Flag --container-runtime has been deprecated, will be removed in 1.27 as the only valid value is 'remote'
Jun 16 22:59:30 kube001 kubelet[10245]: Flag --pod-infra-container-image has been deprecated, will be removed in 1.27. Image garbage collector will get sandbox image information from CRI.
Jun 16 22:59:30 kube001 systemd[1]: Started Kubernetes systemd probe.
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.642129 10245 server.go:399] "Kubelet version" kubeletVersion="v1.24.1"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.642213 10245 server.go:401] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.642913 10245 server.go:813] "Client rotation is on, will bootstrap in background"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.646478 10245 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.654266 10245 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/etc/kubernetes/pki/ca.crt"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.654826 10245 server.go:648] "--cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.655544 10245 container_manager_linux.go:262] "Container manager verified user specified cgroup-root exists" cgroupRoot=
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.655703 10245 container_manager_linux.go:267] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: KubeletOOMScoreAdj:-999 ContainerRuntime: CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map SystemReserved:map HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:} {Signal:nodefs.available Operator:LessThan Value:{Quantity: Percentage:0.1} GracePeriod:0s MinReclaim:} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity: Percentage:0.05} GracePeriod:0s MinReclaim:} {Signal:imagefs.available Operator:LessThan Value:{Quantity: Percentage:0.15} GracePeriod:0s MinReclaim:}]} QOSReserved:map ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory: ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.655749 10245 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.655776 10245 container_manager_linux.go:302] "Creating device plugin manager" devicePluginEnabled=true
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.655835 10245 state_mem.go:36] "Initialized new in-memory state store"
Jun 16 22:59:30 kube001 systemd[1]: run-rcf63cf64925c41e6a2088afccdcc07f8.scope: Succeeded.
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.662295 10245 kubelet.go:376] "Attempting to sync node with API server"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.662374 10245 kubelet.go:267] "Adding static pod path" path="/etc/kubernetes/manifests"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.662415 10245 kubelet.go:278] "Adding apiserver pod source"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.662439 10245 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Jun 16 22:59:30 kube001 kubelet[10245]: W0616 22:59:30.664849 10245 reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.Node: Get "https://cluster-endpoint:6443/api/v1/nodes?fieldSelector=metadata.name%3Dkube001&limit=500&resourceVersion=0": dial tcp 192.168.0.201:6443: connect: connection refused
Jun 16 22:59:30 kube001 kubelet[10245]: E0616 22:59:30.665055 10245 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://cluster-endpoint:6443/api/v1/nodes?fieldSelector=metadata.name%3Dkube001&limit=500&resourceVersion=0": dial tcp 192.168.0.201:6443: connect: connection refused
Jun 16 22:59:30 kube001 kubelet[10245]: W0616 22:59:30.665180 10245 reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Get "https://cluster-endpoint:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 192.168.0.201:6443: connect: connection refused
Jun 16 22:59:30 kube001 kubelet[10245]: E0616 22:59:30.665275 10245 reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://cluster-endpoint:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 192.168.0.201:6443: connect: connection refused
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.668267 10245 kuberuntime_manager.go:239] "Container runtime initialized" containerRuntime="containerd" version="1.6.6" apiVersion="v1"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.669183 10245 server.go:1181] "Started kubelet"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.672216 10245 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.682781 10245 server.go:150] "Starting to listen" address="0.0.0.0" port=10250
Jun 16 22:59:30 kube001 kubelet[10245]: I0616 22:59:30.685606 10245 server.go:410] "Adding debug handlers to kubelet server"
Jun 16 22:59:30 kube001 kubelet[10245]: E0616 22:59:30.692191 10245 cri_stats_provider.go:455] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs"
Jun 16 22:59:30 kube001 kubelet[10245]: E0616 22:59:30.692236 10245 kubelet.go:1298] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"


Could you please help me identify the problem, so I can start the cluster again without losing my deployments?

Try commenting out the entire line, like the following:

# disabled_plugins = ["cri"]

Save the file, then execute the following command:
$ systemctl restart containerd
Then try again.
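
If editing the file by hand keeps causing trouble, regenerating the stock config also works, since containerd's defaults leave the cri plugin enabled (note: this overwrites any local customizations):

$ containerd config default | sudo tee /etc/containerd/config.toml
$ sudo systemctl restart containerd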

Thanks for your kind reply. I figured out the problem using the two commands below.

crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs -f a729fdd387b0a

where a729fdd387b0a is the etcd container. According to the logs, the abrupt master node shutdown left the etcd database in an inconsistent state. After deleting all the files under /var/lib/etcd/member/wal and /var/lib/etcd/member/snap, the cluster starts, but without any data (namespaces and pods).
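
For anyone else hitting this: before deleting anything under /var/lib/etcd, keep a copy so a restore stays possible. A sketch, assuming the standard kubeadm certificate paths:

# etcd was not serving here, so a plain file-level copy is the safety net
$ sudo cp -a /var/lib/etcd /var/lib/etcd.bak
# if etcd is still serving, a proper snapshot is better:
$ sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save /root/etcd-backup.db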

But API invocations like the one below still fail:

https://10.96.0.1:443/api/v1/namespaces/XXXXX

Because of this, secrets and configs in YAML files don't work. I may need help there.
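
A quick way to see whether the in-cluster API address is wired up at all is to check the endpoints behind the kubernetes service (sketch):

$ kubectl get endpoints kubernetes
# should list the real apiserver address, e.g. 192.168.0.201:6443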

Hi,
Calico is crashing after starting the cluster with a fresh etcd (created by deleting the etcd files).

Cluster info is below:
[root@kube001 ~]# kubectl get services
NAME         TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.96.0.1    <none>        443/TCP        16h
nginx        LoadBalancer   10.99.16.7   <pending>     80:32648/TCP   15h
[root@kube001 ~]#

[root@kube001 ~]# crictl --runtime-endpoint unix:///var/run/cri-dockerd.sock logs -f e1ea631fd5201
2022-06-17 16:15:36.575 [INFO][9] startup/startup.go 425: Early log level set to info
2022-06-17 16:15:36.576 [INFO][9] startup/utils.go 127: Using NODENAME environment for node name kube001
2022-06-17 16:15:36.576 [INFO][9] startup/utils.go 139: Determined node name: kube001
2022-06-17 16:15:36.576 [INFO][9] startup/startup.go 94: Starting node kube001 with version v3.23.1
2022-06-17 16:15:36.577 [INFO][9] startup/startup.go 430: Checking datastore connection
2022-06-17 16:16:06.614 [INFO][9] startup/startup.go 445: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout
2022-06-17 16:16:08.630 [INFO][9] startup/startup.go 445: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: no route to host
2022-06-17 16:16:09.642 [INFO][9] startup/startup.go 445: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: connect: no route to host

Why is the cluster IP not reachable?
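
My understanding is that traffic to 10.96.0.1 depends on kube-proxy rewriting the service IP to the apiserver's real address on every node, so I assume the first things to check (a sketch, assuming kube-proxy in its default iptables mode) are:

$ kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
$ sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.1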