Mr_TK
December 20, 2022, 5:13pm
#1
Hello Community,
I'm new to this forum but not to Kubernetes, so please excuse any noise.
I was going through the Konnectivity server & agent docs to understand the communication between the control plane and the nodes. From a design perspective, the konnectivity agent has to run as a DaemonSet on all nodes, and based on the GitHub issue quoted below, I think that if the konnectivity agent on a node fails, we may lose kubectl exec/logs/cp and similar commands even though the kubelet itself is healthy. The failures I'm most concerned about here are the ones caused by volume mounts, CRI-O, image pulls, and so on, excluding network-related failures (mainly node-to-node communication).
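To make clear what I mean by the data path, here is a rough sketch (my own simplification in Go, not the project's actual code or protocol) of what an agent conceptually does when it gets a DIAL_REQ: it opens a connection to the requested target on its node and shuttles bytes between the tunnel and that target. If the agent is down, that endpoint simply doesn't exist, which is why exec/logs/cp break even with a healthy kubelet. The listener address and the kubelet port here are just placeholders.

```go
package main

import (
	"io"
	"log"
	"net"
)

// handleDial stands in for what a konnectivity agent conceptually does on a
// DIAL_REQ: dial the requested target (e.g. the kubelet) and pump bytes
// between the tunnel and that target. In the real project the tunnel side is
// a gRPC stream back to the konnectivity server; here it is just another
// net.Conn to keep the sketch small.
func handleDial(tunnel net.Conn, target string) {
	defer tunnel.Close()

	backend, err := net.Dial("tcp", target)
	if err != nil {
		// If the agent (or its dial) fails, the frontend request has no
		// data path at all -- the situation behind the DIAL_REQ warnings.
		log.Printf("dial %s failed: %v", target, err)
		return
	}
	defer backend.Close()

	// Copy in both directions until either side closes.
	go io.Copy(backend, tunnel)
	io.Copy(tunnel, backend)
}

func main() {
	// Hypothetical listener standing in for the tunnel from the konnectivity
	// server; every accepted connection is forwarded to the kubelet port on
	// this node (10250 is the default kubelet port).
	ln, err := net.Listen("tcp", "127.0.0.1:9090")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handleDial(conn, "127.0.0.1:10250")
	}
}
```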
GitHub issue (opened 15 Oct 2020, closed 4 Feb 2021, lifecycle/stale):
Hello,
I have a Kubernetes 1.18 cluster where Konnectivity v0.0.12 is running.
Most of the time, everything works perfectly, but after ~20 days of konnectivity-server uptime, some DIAL_REQ are getting nowhere.
For example, `kubectl logs -n kube-system konnectivity-agent-h6bt9 -f` only worked after the third execution.
Server log:
```
I1015 11:35:55.013608 1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:35:55.013887 1 server.go:257] start serving frontend stream
I1015 11:35:55.013926 1 server.go:268] >>> Received DIAL_REQ
I1015 11:35:55.013940 1 backend_manager.go:170] pick agentID=f42e628f-f3a5-42cc-bb75-8065768b1be1 as backend
I1015 11:35:55.014079 1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:36:10.378211 1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:36:10.378345 1 server.go:257] start serving frontend stream
I1015 11:36:10.378357 1 server.go:268] >>> Received DIAL_REQ
I1015 11:36:10.378363 1 backend_manager.go:170] pick agentID=f42e628f-f3a5-42cc-bb75-8065768b1be1 as backend
I1015 11:36:10.378418 1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:36:18.243790 1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:36:18.243873 1 server.go:257] start serving frontend stream
I1015 11:36:18.244049 1 server.go:268] >>> Received DIAL_REQ
I1015 11:36:18.244062 1 backend_manager.go:170] pick agentID=15fb1f4a-55a5-454d-9257-a717e70dc5dd as backend
I1015 11:36:18.244138 1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:36:18.247179 1 server.go:522] <<< Received DIAL_RSP(rand=2352227733928774978), agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84)
I1015 11:36:18.247268 1 server.go:144] register frontend &{grpc 0xc020108df0 <nil> 0xc02cc82960 84 15fb1f4a-55a5-454d-9257-a717e70dc5dd {13824409201659358837 1806906010189152 0x227f060} 0xc02cd49da0} for agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84
I1015 11:36:18.247867 1 server.go:308] >>> Received 239 bytes of DATA(id=84)
I1015 11:36:18.248012 1 server.go:324] >>> DATA sent to Backend
I1015 11:36:18.256637 1 server.go:551] <<< Received 2171 bytes of DATA from agentID 15fb1f4a-55a5-454d-9257-a717e70dc5dd, connID 84
...
```
Agent log:
```
Nothing before...
I1015 11:36:18.244396 1 client.go:271] [tracing] recv packet, type: DIAL_REQ
I1015 11:36:18.244482 1 client.go:280] received DIAL_REQ
I1015 11:36:18.247643 1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.247680 1 client.go:339] received DATA(id=84)
I1015 11:36:18.247775 1 client.go:413] [connID: 84] write last 239 data to remote
I1015 11:36:18.255019 1 client.go:384] received 2171 bytes from remote for connID[84]
I1015 11:36:18.257829 1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.257850 1 client.go:339] received DATA(id=84)
I1015 11:36:18.257908 1 client.go:413] [connID: 84] write last 64 data to remote
I1015 11:36:18.258552 1 client.go:271] [tracing] recv packet, type: DATA
I1015 11:36:18.258661 1 client.go:339] received DATA(id=84)
I1015 11:36:18.258777 1 client.go:413] [connID: 84] write last 203 data to remote
I1015 11:36:18.260344 1 client.go:384] received 106 bytes from remote for connID[84]
I1015 11:36:18.264715 1 client.go:384] received 183 bytes from remote for connID[84]
```
Sometimes, when I want to view logs from the second agent, I get these warnings:
Server log:
```
I1015 11:51:13.078407 1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:51:13.078709 1 server.go:257] start serving frontend stream
I1015 11:51:13.078763 1 server.go:268] >>> Received DIAL_REQ
I1015 11:51:13.078790 1 backend_manager.go:170] pick agentID=d700cd66-6ae4-416f-89be-4523adda93c3 as backend
W1015 11:51:13.078907 1 server.go:288] >>> DIAL_REQ to Backend failed: rpc error: code = Unavailable desc = transport is closing
I1015 11:51:13.078966 1 server.go:290] >>> DIAL_REQ sent to backend
I1015 11:51:15.008189 1 server.go:224] proxy request from client, userAgent [grpc-go/1.26.0]
I1015 11:51:15.008413 1 server.go:257] start serving frontend stream
I1015 11:51:15.008458 1 server.go:268] >>> Received DIAL_REQ
I1015 11:51:15.008485 1 backend_manager.go:170] pick agentID=d700cd66-6ae4-416f-89be-4523adda93c3 as backend
W1015 11:51:15.008574 1 server.go:288] >>> DIAL_REQ to Backend failed: rpc error: code = Unavailable desc = transport is closing
I1015 11:51:15.008627 1 server.go:290] >>> DIAL_REQ sent to backend
```
Agent logs nothing.
I believe that restarting the konnectivity-server (+ kube-apiserver) pod would help (because it helped in the past), but I don't want to do it yet, in case you need some more tests.
In such cases, the control plane components lose their connection to the node. Why isn't the connection established from the other nodes where the agent is healthy, given that those agents should still be able to route traffic to the affected node?
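For reference, my mental model of how the server chooses an agent is roughly the following (a simplified sketch of my own, not the actual apiserver-network-proxy code): the server keeps a list of registered agent backends and picks one for each DIAL_REQ, and if a backend whose transport has silently broken is never removed from that list, the server can keep picking it. The agent IDs below are just shortened versions of the ones in the logs above.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"sync"
)

// backend stands in for one agent's tunnel to the konnectivity server.
// "healthy" is a stand-in for whether the underlying stream is still usable.
type backend struct {
	agentID string
	healthy bool
}

// backendManager is a simplified sketch of the server-side bookkeeping:
// it only drops an agent when it is explicitly deregistered, so a backend
// whose transport broke silently can still be picked for new DIAL_REQs.
type backendManager struct {
	mu       sync.Mutex
	backends []*backend
}

func (m *backendManager) register(b *backend) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.backends = append(m.backends, b)
}

// pick chooses a random registered backend, mirroring the
// "pick agentID=... as backend" lines in the server log above.
func (m *backendManager) pick() (*backend, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.backends) == 0 {
		return nil, errors.New("no registered backends")
	}
	return m.backends[rand.Intn(len(m.backends))], nil
}

func main() {
	m := &backendManager{}
	// Stream broke but the backend was never deregistered.
	m.register(&backend{agentID: "d700cd66", healthy: false})
	m.register(&backend{agentID: "15fb1f4a", healthy: true})

	for i := 0; i < 5; i++ {
		b, _ := m.pick()
		if !b.healthy {
			// Corresponds to: DIAL_REQ to Backend failed: transport is closing
			fmt.Printf("picked %s: dial failed, transport is closing\n", b.agentID)
			continue
		}
		fmt.Printf("picked %s: DIAL_REQ sent to backend\n", b.agentID)
	}
}
```

If that picture is roughly right, it would explain why some requests keep landing on the broken agent instead of automatically going through a healthy one, but I'd appreciate confirmation from people who know the server's backend selection better.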