CNI not starting after switching from docker to containerd

Cluster information:

Kubernetes version: 1.16.13
Cloud being used: bare metal
Installation method: kubeadm
Host OS: Ubuntu 18.04
CNI and version: canal: calico v3.8.4, flannel v0.11.0
CRI and version: containerd 1.2.13-2

I am trying to remove docker from a cluster, so that it runs with pure containerd via CRI.

It looked straightforward at first. The first step was to reconfigure containerd to enable its CRI interface:

mv /etc/containerd/config.toml{,.old}
containerd config default > /etc/containerd/config.toml
vi /etc/containerd/config.toml   # set systemd_cgroup = true
systemctl restart containerd
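For reference, the vi step only flips one key. Here is a non-interactive equivalent, demonstrated on a throwaway sample file rather than the real /etc/containerd/config.toml; the [plugins.cri] section name matches the containerd 1.2 default config, which (as far as I can tell) also points the CRI plugin's CNI settings at /etc/cni/net.d and /opt/cni/bin by default:

```shell
# Build a minimal sample config (a stand-in for /etc/containerd/config.toml,
# so this sketch is safe to run anywhere).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
[plugins.cri]
  systemd_cgroup = false
EOF

# The same edit made interactively in vi above:
sed -i 's/systemd_cgroup = false/systemd_cgroup = true/' "$cfg"
grep systemd_cgroup "$cfg"
```

After making the real edit, `systemctl restart containerd` picks up the change, as in the steps above.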

Next, rejoin the node: kubeadm reset followed by kubeadm join ... --cri-socket /run/containerd/containerd.sock. (The --cri-socket flag is needed because, if both containerd and docker are detected, docker takes precedence.)

Disable docker: systemctl stop docker; systemctl disable docker

The node starts its system pods happily:

root@dar7:~# crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
bfddb2d712e7b       13 days ago         Ready               kube-proxy-7ftdm    kube-system         0
71ec1a994af1d       13 days ago         Ready               canal-r76bs         kube-system         0
root@dar7:~# crictl ps
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
b3f2b9362b693       83b416d242055       13 days ago         Running             calico-node         6                   71ec1a994af1d
3172760b7ea99       9b65a0f78b091       13 days ago         Running             kube-proxy          3                   bfddb2d712e7b
12aba76a0797c       8a9c4ced3ff92       13 days ago         Running             kube-flannel        0                   71ec1a994af1d

I have not noticed anything out of the ordinary in those pod logs.

However, the node stays in a NotReady state, reporting that the CNI plugin is not initialized:

  Ready            False   Fri, 31 Jul 2020 16:39:34 +0100   Fri, 31 Jul 2020 13:55:32 +0100   KubeletNotReady              runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

I am stuck now trying to work out how to fix this.

If I compare this node with an old node that still has docker:

# Node which has been switched from docker to containerd
root@dar7:~# grep KUBELET_KUBEADM_ARGS /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --resolv-conf=/run/systemd/resolve/resolv.conf"

# Node where docker is still being used
root@dar25:~# grep KUBELET_KUBEADM_ARGS /var/lib/kubelet/kubeadm-flags.env
KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd --network-plugin=cni --resolv-conf=/run/systemd/resolve/resolv.conf"

I can see that the new node doesn’t have --network-plugin=cni. I found this in the documentation:

Depending on the CRI runtime your cluster uses, you may need to specify different flags to the kubelet. For instance, when using Docker, you need to specify flags such as --network-plugin=cni, but if you are using an external runtime, you need to specify --container-runtime=remote and specify the CRI endpoint using the --container-runtime-endpoint=<path>.

The way I read this is that --network-plugin=cni is not required when not using Docker. Or is it saying that CNI only works when Docker is present? That would be very surprising.

This cluster was set up by someone else around Nov 2019, so I don’t know exactly how networking was configured. I can see there’s a canal daemonset:

$ kubectl get daemonset  --all-namespaces
NAMESPACE     NAME         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   canal        28        28        27      28           27          253d
kube-system   kube-proxy   28        28        27      28           27          253d

The daemonset YAML references the images calico/node:v3.8.4, calico/cni:v3.8.4 and calico/pod2daemon-flexvol:v3.8.4.

I have tried googling various combinations of “calico without docker” and “calico with containerd”, but have not found anything useful.
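One avenue I can still check is whether canal’s calico/cni init container ever wrote a CNI config for the runtime to find. The paths below are the conventional kubelet/containerd defaults, and the 10-canal.conflist file name is an assumption based on typical canal manifests:

```shell
# List a CNI directory, or report that it is empty/absent (an absent or
# empty config dir would explain "cni plugin not initialized").
check_cni() {
  ls "${1:?dir}" 2>/dev/null || echo "nothing at $1"
}

check_cni /etc/cni/net.d   # expect a conflist, e.g. 10-canal.conflist
check_cni /opt/cni/bin     # expect plugin binaries: calico, flannel, loopback, ...
```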

Any clues please?



Update: on the off-chance, I rebooted the nodes. Now they come up in state “Ready”, so I can deploy pods to them.

Of the four nodes I’ve reconfigured, three have working pod networking (i.e. I can ping the pod’s IP, and I can exec into the pod and access other pods and the external network). One doesn’t, so I’ll need to dig into that one a bit more, but otherwise it seems whatever was wrong was fixed by a simple reboot.

EDIT: see below, they were only working because docker had come up.

Oops - docker restarted on the reboot. I had disabled docker.service but not docker.socket.

After disabling that too and rebooting, the nodes come up as Ready, but pod networking does not work on any of them.
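For what it’s worth, here is the sanity check I’m now running after each reboot to confirm the kubelet really is talking to containerd and that docker is fully out of the picture (the socket paths are the usual defaults, so treat them as assumptions):

```shell
# Report whether a unix socket exists at the given path.
sock_state() { [ -S "$1" ] && echo present || echo absent; }

echo "docker.sock:     $(sock_state /var/run/docker.sock)"             # want: absent
echo "containerd.sock: $(sock_state /run/containerd/containerd.sock)"  # want: present
```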

Any clues where to look next?