Recently I’ve been facing a very strange network issue with our k8s cluster. Some pods suddenly lose all connectivity to pods located on other nodes, while still being able to reach the internet, for example. When that happens, deleting the pod so that another one replaces it sometimes fixes the problem, and the new pod works correctly. The affected pods seem to be picked at random. When this issue started we were running Kubernetes version 16.9, and we have since migrated to version 16.10, but due to the somewhat elusive nature of this problem we’re having a hard time replicating it on our test cluster or catching it “red-handed” in other environments.
Is this a known issue? Is it something that is definitely resolved in version 16.10?
Here’s our information:
Kubernetes version: 16.9/16.10
Cloud being used: AWS
Installation method: Kops
Host OS: Debian buster
CNI and version: v1.6.0
CRI and version: docker://18.9.9, docker://18.6.3
KERNEL-VERSION: 4.19.0-8-cloud-amd64, 4.19.86-coreos
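
One way to catch this "red-handed" would be a small probe run from inside a pod that compares reachability of pods on other nodes against an external host. The following is only a minimal sketch using the Python standard library; the function names, the result labels, and any addresses or ports passed in are hypothetical placeholders, not values from this cluster.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def classify(peer_results, external_ok):
    """Label one probe round.

    peer_results: list of booleans, one per pod on another node.
    external_ok: whether the external (internet) target was reachable.
    """
    if not peer_results:
        return "no-peers"
    if all(peer_results):
        return "healthy"
    if not any(peer_results):
        # The symptom described above: other nodes unreachable, internet fine.
        return "cross-node-blackout" if external_ok else "total-outage"
    return "partial"


def probe_once(peers, external):
    """Probe every (host, port) in peers plus one external (host, port)."""
    peer_results = [can_connect(host, port) for host, port in peers]
    external_ok = can_connect(*external)
    return classify(peer_results, external_ok)
```

Calling `probe_once` on a timer and logging the label with a timestamp and the node name would show whether a "cross-node-blackout" lines up with a particular node or event. The peer list would need to contain pod IPs known to be scheduled on other nodes.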