Root cause analysis: pods stuck in ContainerCreating on a worker node after a restart or power cycle.
While the master was scheduling pods across the worker nodes, one worker node went down due to a power cycle. Once that worker node came back up, the master was able to schedule the remaining pods to it. However, all the pods scheduled to that node remained stuck in the ContainerCreating state for a long time, leaving the recovered node effectively unusable.
The networking pod (weave) skipped the network setup for all of these containers on the restarted node. With no IP assigned to these pods, they stay in the ContainerCreating state.
Jul 26 15:45:06 k8sworker3 kubelet: W0726 15:45:06.622277 1832 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "job-5d242b93c6ba2500011bfe3b-1564172924508-h9vw5_": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "bf4a489a1d46705163fdc228486398d8d33d2c6e41dc354f32de5f5d6986abcc"
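To confirm which pods are affected before cleaning up, one approach (a sketch, not from the original post) is to parse the output of `kubectl get pods -o json` and list pods on the restarted node that have no pod IP and containers stuck in a waiting state. The field paths follow the standard Kubernetes Pod API object; the helper name and sample pod names are hypothetical.

```python
def stuck_pods(pod_list, node_name):
    """Return names of pods on `node_name` that appear stuck in
    ContainerCreating: Pending/waiting containers with no podIP."""
    stuck = []
    for pod in pod_list["items"]:
        if pod["spec"].get("nodeName") != node_name:
            continue
        status = pod.get("status", {})
        # Pods stuck at sandbox/CNI setup never receive a podIP.
        if status.get("podIP"):
            continue
        waiting = [
            cs for cs in status.get("containerStatuses", [])
            if "waiting" in cs.get("state", {})
        ]
        if waiting or status.get("phase") == "Pending":
            stuck.append(pod["metadata"]["name"])
    return stuck

# Synthetic example in the shape of `kubectl get pods -o json` output:
example = {
    "items": [
        {
            "metadata": {"name": "job-abc-h9vw5"},
            "spec": {"nodeName": "k8sworker3"},
            "status": {
                "phase": "Pending",
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "ContainerCreating"}}}
                ],
            },
        },
        {
            "metadata": {"name": "job-def-ok"},
            "spec": {"nodeName": "k8sworker3"},
            "status": {"phase": "Running", "podIP": "10.32.0.5"},
        },
    ]
}
print(stuck_pods(example, "k8sworker3"))  # ['job-abc-h9vw5']
```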
Recovery: delete all stuck pods on worker3 to release the node's pod capacity (the default limit of 110 pods per node). Once capacity is freed, newly created pods receive an IP address and proceed to completion.
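The recovery step can be sketched as a small helper that assembles the `kubectl delete pod` command for the stuck pods found on the affected node. The function name, namespace, and pod names are illustrative assumptions, not from the original post.

```python
def delete_command(pod_names, namespace="default"):
    """Build the kubectl command that deletes the stuck pods so the
    node's pod capacity (110 by default) is released."""
    if not pod_names:
        return None
    return "kubectl delete pod -n {} {}".format(namespace, " ".join(pod_names))

cmd = delete_command(["job-abc-h9vw5", "job-abc-x2k8q"])
print(cmd)  # kubectl delete pod -n default job-abc-h9vw5 job-abc-x2k8q
```

Alternatively, `kubectl delete pods --field-selector spec.nodeName=k8sworker3` should target every pod on the node in one command (assuming the cluster version supports the `spec.nodeName` field selector, which 1.15 does for listing pods).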
Kubernetes version: 1.15
Cloud being used: (put bare-metal if not on a public cloud)
Installation method: Ansible script
Host OS: CentOS 7
CNI and version: Weave (version not specified)
CRI and version: