Bare Metal K8S Load Balancer Service Routing Delay

I’ve been struggling for a couple of days trying to figure out two problems with my MetalLB-enabled LoadBalancer Service.

Here I will focus on the first problem; if I can solve that, I will move on to the second.

Setup
My cluster has two nodes.
Master: tonga.corp.sensis.com
Worker: spain.corp.sensis.com

tonga is configured to allow pods to be scheduled.
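
For context, tonga was untainted in the usual way so that pods can be scheduled there; on a kubeadm cluster that is roughly:

# Removes the default NoSchedule taint so regular pods can land on tonga
# (assumes a kubeadm-style master taint; adjust the key if yours differs).
kubectl taint nodes tonga.corp.sensis.com node-role.kubernetes.io/master:NoSchedule-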

At this point I am pretty sure MetalLB is doing its part, and the problem I am seeing is either in my configuration or in K8S (perhaps kube-proxy).

I’ve got a simple RESTful service that listens on 8080. The “kubia-unhealthy” service code was drawn from the Kubernetes in Action book.

I’ve deployed 2 replicas within a ReplicaSet.

I have no problem browsing to these replicas using a NodePort.

curl tonga.corp.sensis.com:30123 

and

curl spain.corp.sensis.com:30123

both work.
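
For reference, the workload and NodePort Service behind those curls look roughly like the sketch below (the image and labels follow the book’s kubia examples; the names are illustrative rather than my exact manifests):

# Illustrative sketch only, not my exact manifests.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: kubia-unhealthy
spec:
  replicas: 2
  selector:
    matchLabels:
      app: kubia-unhealthy
  template:
    metadata:
      labels:
        app: kubia-unhealthy
    spec:
      containers:
      - name: kubia
        image: luksa/kubia-unhealthy
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: kubia-nodeport
spec:
  type: NodePort
  selector:
    app: kubia-unhealthy
  ports:
  - port: 8080
    targetPort: 8080
    nodePort: 30123
EOF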

I have created a LoadBalancer service, and MetalLB is properly assigning an IP address. In my case, I’ve just used the MetalLB layer 2 configuration. It assigns my service an ExternalIP 192.168.1.241.
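
My MetalLB setup is essentially the stock layer 2 configuration; roughly (the pool range and Service name below are illustrative, not my exact manifests):

# Illustrative only. Pre-CRD MetalLB reads its layer 2 pool from a ConfigMap
# in metallb-system; the external IP is assigned out of this range.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.240-192.168.1.250
EOF

# The Service just asks for type LoadBalancer; MetalLB assigns the external
# IP (192.168.1.241 in my case) from the pool above.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: kubia-loadbalancer
spec:
  type: LoadBalancer
  selector:
    app: kubia-unhealthy
  ports:
  - port: 8080
    targetPort: 8080
EOF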

I browse to the service using its port and ExternalIP

curl 192.168.1.241:8080

This works when I curl from either tonga or spain.

However, the response takes just over a minute if the request is routed to the pod on the other node. I know this because the RESTful service responds with its hostname, which is just the pod name.

In the following, the first curl request is routed to the pod (kubia-unhealthy-r5kb1) on the local node (tonga), and it returns immediately.

The next curl request is routed to the pod (kubia-unhealthy-thtds) on the other node (spain), and it takes about a minute to return the RESTful service response.

By monitoring (with tcpdump) the veth to the Docker container, I’ve verified that the request does not reach the pod on spain until the minute has elapsed.
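
Roughly the capture I used on spain (the veth name is a placeholder; use whichever host-side interface pairs with the kubia container on that node):

# Run on spain. vethXXXX is a placeholder for the container's host-side veth;
# watch for traffic arriving on the app port.
tcpdump -ni vethXXXX 'tcp port 8080'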

In this context a healthy service response says “I’m not well. Please restart me”, as seen in the screenshot.

Can someone please help me to isolate why it consistently takes a minute to route the request to the pod on the other node?

Cluster information:

Kubernetes version: 1.17
Cloud being used: bare-metal
Installation method:
Host OS: CentOS 7
CNI and version: Flannel 0.11
CRI and version:

I just ran another experiment. This time I monitored the traffic to the POD IP.

As soon as I perform the curl, I see several packets flowing to port 8080 (“webcache” in the tcpdump output). It’s not until about a minute later that I see the SYN packet that finally reaches the pod itself… then the pod responds.

It’s interesting that the SYN packets leading up to the successful request are spaced at increasing intervals (in seconds): 1, 2, 4, 8, 16, 32… Then we see the SYN-ACK.
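
For completeness, the capture was roughly the following (the pod IP is a placeholder for the kubia pod on spain):

# Run on tonga. POD_IP is a placeholder; kubectl get pods -o wide shows the
# real pod IP. The tcpdump timestamps make the 1, 2, 4, 8, 16, 32 second
# spacing of the retransmitted SYNs easy to see.
POD_IP=10.244.1.23
tcpdump -ni any "host $POD_IP and tcp port 8080"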

I’ve now verified that I see the same delay using the ClusterIP. If I curl any service that I create using its ClusterIP, the request is delayed by about 60 seconds whenever it is delivered to a pod on a node other than the one I issued the curl from.
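
Roughly how I tested it (the service name is illustrative):

# Grab the ClusterIP of the service and curl it from a node. The ~60 second
# delay shows up whenever the chosen endpoint is a pod on the other node.
CLUSTER_IP=$(kubectl get svc kubia-loadbalancer -o jsonpath='{.spec.clusterIP}')
time curl "http://${CLUSTER_IP}:8080"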

The fact that, during the one-minute period, the failed SYN packets are sent out at 1, 2, 4, 8, 16, and 32 second intervals points to a piece of software that is designed to double its timeout with each retry.
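
That 1, 2, 4, 8, 16, 32 pattern matches the Linux kernel’s TCP SYN retransmission backoff (an initial 1 second timeout, doubled on each retry), which suggests the initial SYNs are simply going unanswered until the last retry. The retry setting can be checked with:

# Default is 6 retries, i.e. 1+2+4+8+16+32 = 63 seconds of backoff, which
# lines up with the "just over a minute" delay I am seeing.
sysctl net.ipv4.tcp_syn_retries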

Anyone have guidance on where to look? Is this a kube-proxy issue?

It might be helpful to narrow down the cause of the problem… is it CNI/flannel, or the load balancer?

Can you use an image like https://github.com/nicolaka/netshoot to ping and/or create TCP connections directly from a pod on one node to the pod on the other node (bypassing the LB), to see if it exhibits the same delay?

If you run netshoot on both nodes, one as a server and one as a client, can you recreate the issue with direct pod-to-pod communication? If yes, I suppose it’s a flannel issue.
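
Something along these lines, for example (pod names and the override syntax are just illustrative; a small manifest with spec.nodeName set would work just as well):

# Server pod pinned to spain, listening on an arbitrary port
# (nc flag syntax may vary by netcat variant).
kubectl run netshoot-server --image=nicolaka/netshoot --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"spain.corp.sensis.com"}}' \
  -- nc -l 9000

# Look up the server pod's IP.
kubectl get pod netshoot-server -o wide

# Client pod pinned to tonga, connecting straight to the server pod's IP and
# bypassing the Service/LB entirely. If this also stalls for ~60 seconds, the
# delay is in the pod network (flannel) rather than in MetalLB or kube-proxy.
kubectl run netshoot-client --image=nicolaka/netshoot --restart=Never -it \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"tonga.corp.sensis.com"}}' \
  -- nc -v <server-pod-ip> 9000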

You mentioned tcpdump; is it possible to see the outgoing SYN from the load balancer? You could perhaps determine whether the delay is caused by the LB not attempting to open the connection to the remote node for 1, 2, 4 seconds…

Finally, I noticed the nodes have been running for 50 and 91 days, and they have slightly different versions on them. That shouldn’t make a difference, but it might be interesting to reboot one of the nodes to see if that changes the situation.

I’m sorry I cannot provide any really useful suggestions.

@bkcsfi I appreciate your response very much. I will try netshoot. Please keep in mind that I’ve completely ruled out the load balancer: I see the same problem when I just use the ClusterIP.

So in your mind it’s not kube-proxy? Do the packets that get forwarded to the other node actually go through flannel then? Is it one of the flannel pods running in kube-system that you think might be the source of the issue?

I’ll post back on Monday when I run your suggested experiments.

I had not thought that this might be a problem with flannel. But it sounds like that may be the culprit.