We have a bare-metal k8s cluster spread across about 12 nodes at the moment. This has been working fine for almost 2 years at this point, however in the last few days strange things started to happen.
Pods seemingly hang in the ContainerCreating state with this error:

```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container network for pod : networkPlugin cni failed to set up pod network: netplugin failed with no error message: signal: killed
```
We noticed this first with our cron jobs in the cluster. We don't currently dynamically scale anything, so the only things created periodically are the containers that run the cron tasks atm — we deploy our own apps onto it, but nothing has been deployed for over 2 weeks, and nothing in the base k8s install has been changed in a very long time. These cron pods started to pile up in this state with the error above. Unfortunately it doesn't actually say what the problem is, and we couldn't find anything meaningful in the logs. I tried looking for answers, but everything I find has a concrete error message that makes it easy to resolve — this one doesn't.
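For reference, this is roughly how we're finding the stuck pods and pulling the error out (standard kubectl, nothing custom; the pod/namespace names are placeholders):

```shell
# Pods stuck in ContainerCreating show up as Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# The sandbox error appears in the events of an affected pod
kubectl describe pod <pod-name> -n <namespace> | tail -n 20
```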
What seemed to help was killing all the hanging jobs and/or restarting the Weave pods on the nodes.
Within a day the issue comes back. Does anyone have any idea how to diagnose this further?
Actually, slight edit: clearing the stuck pods and cycling Weave on all nodes doesn't work anymore. The really annoying bit is that if I hard-reboot a node, the services start up on it with no problem — but once they're all up, literally a minute later I'm unable to start anything on that node anymore. I tried random hello-world deployments and they don't start, with the same issue. I tried restarting services that had started up just fine right after the reboot, and they don't start either… (The nodes are not tainted in any way.)
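The hello-world test I mentioned is nothing fancy — roughly this, plus the logs I've been checking so far (the weave pod name is a placeholder; the dmesg check is just a guess, since "signal: killed" made me wonder whether something is SIGKILLing the CNI binary, e.g. the OOM killer):

```shell
# Throwaway deployment to check whether anything at all can start
kubectl create deployment hello-test --image=nginx
kubectl get pods -l app=hello-test -w     # sits in ContainerCreating
kubectl delete deployment hello-test      # clean up afterwards

# Weave logs on the affected node
kubectl -n kube-system logs <weave-pod> -c weave --tail=100

# Kernel log on the node, in case the netplugin process is being killed
dmesg -T | grep -i -E 'killed process|oom'
```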
Kubernetes version: 1.21.1
Cloud being used: bare-metal
Host OS: ubuntu 20.04
CNI and version: weave 2.8.1