We have a small in-house cluster consisting of 5 nodes, on which we run our platform. The platform consists of different components that communicate via HTTP or AMQP, both among themselves and with services outside the cluster.
Since yesterday, no traffic reaches the components and they have become unreachable, even though they are up. There is no error, neither in our components nor in the k8s components (DNS, proxy, etc.), BUT I can access the cluster and the components via kubectl, and all of the kubectl commands work properly. What I mean is that I can run kubectl exec, kubectl logs, helm install, etc., but if I try to open a web page I get "This site can't be reached", and there are no logs in either the nginx pod or any of the k8s components, which means they never received the request and no traffic goes through.
How can I troubleshoot this situation? Can anyone help, please?
Yes, I agree. I have had this problem since yesterday, but just 3 hours ago I saw this error in the weave logs:
ERRO: 2019/09/27 13:00:03.360130 Captured frame from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
I am trying to troubleshoot it at the moment. This is the status of weave on my master node:
/home/weave$ ./weave --local status
Version: 2.5.2 (up to date; next check at 2019/09/27 15:12:49)
Service: router
Protocol: weave 1..2
Name: 02:01:5b:b9:8e:fd(k8s-master)
Encryption: disabled
PeerDiscovery: enabled
Targets: 1
Connections: 5 (4 established, 1 failed)
Peers: 5 (with 20 established connections)
TrustedSubnets: none
Service: ipam
Status: ready
I don't know much about weave, but I would do the following to debug:
1. I assume that no traffic at all goes through the Kubernetes cluster. Can you attach to a random pod and ping another pod to confirm this? (See the example commands after this list.)
2. In the weave status, why does one connection fail, as shown below? Can you compare this with the working case by restarting weave?
Connections: 5 (4 established, 1 failed)
3. The ERROR from weave does not sound right to me. The k8s-master has MAC address 02:01:5b:b9:8e:fd(k8s-master) and receives an Ethernet frame destined for 02:01:5b:b9:8e:fd. Why does it complain that this address is associated with another peer?
from MAC (d2:14:2a:47:62:d9) to (02:01:5b:b9:8e:fd) associated with another peer 4a:8d:75:d7:59:ff(serflex-argus-2)
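For point 1, a minimal sketch of that check, assuming two hypothetical pods pod-a and pod-b whose images include ping and nslookup (replace the names and the placeholder IP with real values from your cluster):

kubectl get pods -o wide                                # note each pod's IP and node
kubectl exec -it pod-a -- ping -c 3 10.44.0.5           # 10.44.0.5 stands in for pod-b's IP
kubectl exec -it pod-a -- nslookup kubernetes.default   # also check that cluster DNS resolves

For point 2, weave normally runs as a DaemonSet in kube-system, so restarting it is just deleting its pods and letting the DaemonSet recreate them:

kubectl -n kube-system delete pod -l name=weave-net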
Sometimes I can curl other services from within a pod and sometimes it fails.
That one failing connection is OK (I think), because it is the connection to itself, and it is like that on all the nodes.
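In case it is useful, this is how I looked at the individual connections on each node (same weave script as above); the only failed entry on every node is its own address:

/home/weave$ ./weave --local status connections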
I have no idea, but after digging through weave's GitHub issues I think the main problem is that weave is hitting its memory limit. That error shows up in their issues mostly when weave hits the 200M memory limit, and when I checked just now, 4 out of 5 of my weave pods were also hitting their memory limits, so I think this is the main problem.
The thing that worries me is that we need a reliable cluster, and such ambiguous errors do not help at all. I will restart weave to see what happens, but one question: do you think other CNIs like Calico are more stable?
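For reference, this is roughly how I checked it, assuming the default weave-net DaemonSet labels and that metrics-server is installed (otherwise kubectl top will not work):

kubectl -n kube-system top pod -l name=weave-net        # current memory usage of each weave pod
kubectl -n kube-system get pods -l name=weave-net       # a high restart count hints at OOM kills
kubectl -n kube-system describe pod -l name=weave-net | grep -i -A 2 "last state"   # shows Reason: OOMKilled if the limit was hit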
[AV]: Sometimes I can curl other services from within a pod and sometimes it fails.
[BB]: Do you know why it fails sometimes? I would expect it to pass all the time. Is it due to #2?
[AV]: I have no idea, but after digging through weave's GitHub issues I think the main problem is that weave is hitting its memory limit. That error shows up mostly when weave hits the 200M memory limit, and when I checked, 4 out of 5 of my weave pods were also hitting their memory limits, so I think this is the main problem.
[BB]:
I really don't know much about them, but if a CNI takes all the traffic into its own application (userspace) for switching/routing, you would deal with memory or CPU issues all the time, whereas Calico handles routing in the kernel itself.
Before going down that path, I would really check whether memory is the issue and only then decide on any drastic step.
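If memory does turn out to be the problem, a less drastic first step than switching CNI might be to raise the memory limit on the weave container. A rough sketch, assuming the DaemonSet is named weave-net, its main container is named weave, and 400Mi is just an illustrative value:

kubectl -n kube-system set resources daemonset weave-net -c weave --limits=memory=400Mi
kubectl -n kube-system rollout status daemonset weave-net   # watch the pods come back with the new limit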