Kubectl works and all the pods are up but no traffic goes through

We have a small in-house cluster of 5 nodes on which we run our platform. The platform consists of different components that communicate over HTTP or AMQP, both among themselves and with services outside the cluster.

Since yesterday, no traffic reaches the components and they have become unreachable even though they are up. There are no errors, neither in our components nor in the Kubernetes components (DNS, proxy, etc.). BUT I can access the cluster and the components via kubectl, and all kubectl commands work properly: I can run kubectl exec , kubectl logs , helm install , etc. However, if I try to open a web page I get "This site can’t be reached", and there are no logs in the nginx pod or in any of the Kubernetes components, which means they never received the request and no traffic is getting through.

How can I troubleshoot this situation? Can anyone help please?

  • Kubernetes version (use kubectl version ):
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-08T17:11:31Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    In house private cluster contains of 5 nodes and set up with kubeadm.
  • OS (e.g: cat /etc/os-release ):
    All machines are running Ubuntu 18.04.3
  • Kernel (e.g. uname -a ):
Linux k8s-master 4.15.0-62-generic #69-Ubuntu SMP Wed Sep 4 20:55:53 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    kubeadm
  • Network plugin and version (if this is a network-related bug):
    Weave net

This sounds like a CNI networking issue – anything in the weave net container logs?
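If it helps, this is roughly how I would pull those logs (assuming the default kubeadm layout, where the weave DaemonSet pods carry the `name=weave-net` label and the router runs in a container called `weave`):

```shell
# Tail the weave router logs across all weave-net pods:
kubectl logs -n kube-system -l name=weave-net -c weave --tail=100
```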


Yes, I agree. I have had this problem since yesterday, but just 3 hours ago I saw this error in the weave logs:
ERRO: 2019/09/27 13:00:03.360130 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
I am trying to troubleshoot it at the moment. And this is the status of weave on my master node:

/home/weave$ ./weave --local status
        Version: 2.5.2 (up to date; next check at 2019/09/27 15:12:49)
        Service: router
       Protocol: weave 1..2
           Name: 02:01:5b:b9:8e:fd(k8s-master)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 1
    Connections: 5 (4 established, 1 failed)
          Peers: 5 (with 20 established connections)
 TrustedSubnets: none
        Service: ipam
         Status: ready

I don’t know much about weave, but I would do the following to debug.

  1. Assume that no traffic at all goes through the Kubernetes cluster. Can you attach to a random pod and ping another pod to confirm this?

  2. In the weave status, why does one connection fail, as below? Can you compare this with a working case by restarting weave?
    Connections: 5 (4 established, 1 failed)

  3. The ERROR in the weave log sounds odd to me. The k8s-master has the MAC address 02:01:5b:b9:8e:fd(k8s-master) and receives an Ethernet frame addressed to 02:01:5b:b9:8e:fd. Why does it complain that this address is associated with another peer?

from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
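For (1), a rough sketch of that check — the pod names and IPs below are placeholders you would take from the `kubectl get pods -o wide` output:

```shell
# List pods with their IPs and the nodes they run on; pick two on different nodes:
kubectl get pods -o wide

# Raw pod-to-pod connectivity (tests the CNI data path directly):
kubectl exec <pod-a> -- ping -c 3 <pod-b-ip>

# Service path (adds kube-proxy and cluster DNS on top of the CNI):
kubectl exec <pod-a> -- curl -sS --max-time 5 http://<service>.<namespace>.svc.cluster.local
```

If pod-to-pod pings fail but kubectl still works, that points at the overlay network rather than the control plane, since kubectl talks to the API server over the node network, not the CNI.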

  1. Sometimes I can curl other services from within a pod and sometimes it fails.
  2. That one failing connection is OK (I think), because it is a connection to itself, and it is like that on all the nodes.
  3. I have no idea, but after digging through weave’s GitHub issues I think the main problem is that weave is hitting its memory limit. This error shows up in their issues mostly when weave hits the 200M memory limit, and when I checked just now, 4 out of my 5 weave pods were also hitting the memory limit. I think this is the main problem.
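For reference, I compared `kubectl top pod -n kube-system -l name=weave-net` (needs metrics-server) against the limit in the DaemonSet spec. A tiny helper for the comparison — the 90% threshold is just my own choice:

```shell
# near_limit USAGE LIMIT: succeeds if usage (e.g. "190Mi") is at
# least 90% of the limit (e.g. "200Mi"), both given in Mi.
near_limit() {
  local mem=${1%Mi} limit=${2%Mi}
  [ $((mem * 10)) -ge $((limit * 9)) ]
}

near_limit 190Mi 200Mi && echo "weave is close to its memory limit"
```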

The thing that worries me is that we need a reliable cluster, and such ambiguous errors do not help at all. I will restart weave to see what happens, but one question: do you think other CNIs like Calico are more stable?

[AV]: Sometimes I can curl other services within a pod and sometimes it fails.
[BB]: Do you know why it fails sometimes? I would expect it to pass all the time. Is it due to #2?

[AV]: I have no idea, but after digging through weave’s GitHub issues I think the main problem is that weave is hitting its memory limit. This error shows up in their issues mostly when weave hits the 200M memory limit, and when I checked just now, 4 out of my 5 weave pods were also hitting the memory limit. I think this is the main problem.
[BB]:

  1. I really don’t know much about them, but if weave takes all the traffic into its own process (userspace) for switching/routing, you would deal with memory or CPU issues all the time, whereas Calico handles routing in the kernel itself.
  2. Before going down that path, I would really check whether memory is the issue and only then decide on any drastic step.
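If memory does turn out to be the culprit, raising the limit is far less drastic than swapping CNIs. A sketch, assuming the stock weave-net DaemonSet (container named `weave`, default limit 200Mi):

```shell
# Raise the weave container's memory limit in place; the DaemonSet
# then rolls the pods node by node:
kubectl -n kube-system set resources daemonset weave-net \
  -c weave --limits=memory=400Mi

# Watch the rollout complete:
kubectl -n kube-system rollout status daemonset weave-net
```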

What’s the scale of your deployment?