I would be glad if you could help me with this.
Kubernetes version: 1.24.9
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu-22.04
CNI and version: Calico-typha 3.25.1
CRI and version: containerd
I recently added more nodes to the cluster, which has more than 50 worker nodes and 3 VMs that serve as control-plane nodes.
This cluster has been upgraded over time from k8s version 1.17 to the current 1.24.
Calico-typha is using the default encapsulation setup, with IPIP for all traffic.
After adding the new set of servers to the cluster, DNS stopped working, but only for pods scheduled on the new nodes.
Tests show that all TCP/UDP traffic from the new nodes to pods on the existing nodes fails.
ping/ICMP works with no issue.
I do see that the packets arrive at the destination pods, but they are blocked by iptables, under a Calico chain, as “ctstate INVALID”.
What could be the issue?
I thought about migrating from the IPIP Calico setup to VXLAN encapsulation, to avoid any BGP-related issues that might come from the surrounding network setup, which also uses BGP.
Thanks in advance!
So after further investigation, together with our network engineers, we found that checksum-offloading was causing the issue.
Disabling it on the node interfaces solved the issue.
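For anyone who wants to check the same thing on their nodes, here is a minimal Python sketch of the check-and-disable step described above. It only parses text in the format printed by `ethtool -k <iface>` and builds the `ethtool -K` command line; it does not run anything. The interface name `eth0`, the sample output, and the choice to turn off both `rx` and `tx` checksumming are assumptions for illustration, since the thread does not say exactly which features were switched off.

```python
def parse_checksum_features(ethtool_output):
    """Return {feature_name: enabled} for every checksum-related feature
    found in text formatted like the output of `ethtool -k <iface>`."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip lines without a "name: value" shape and unrelated features.
        if not sep or "checksum" not in name:
            continue
        words = value.split()
        if words:
            # The first word is "on" or "off"; a trailing "[fixed]" is ignored.
            features[name] = words[0] == "on"
    return features


def disable_checksum_command(iface):
    """Build the argv for turning off rx and tx checksum offload.
    Assumption: disabling both directions is what is wanted."""
    return ["ethtool", "-K", iface, "rx", "off", "tx", "off"]


# Hypothetical sample in the format of `ethtool -k eth0`, not taken from the thread.
SAMPLE = """Features for eth0:
rx-checksumming: on
tx-checksumming: on
\ttx-checksum-ipv4: off [fixed]
\ttx-checksum-ip-generic: on
scatter-gather: on
"""

if __name__ == "__main__":
    for name, enabled in parse_checksum_features(SAMPLE).items():
        print(name, "on" if enabled else "off")
    print(" ".join(disable_checksum_command("eth0")))
```

To apply it for real, the returned list could be passed to `subprocess.run` as root on the affected node. Settings changed with `ethtool -K` do not survive a reboot, so they would need to be reapplied by whatever manages the node's network configuration.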
Digging a little more, we found that the issue comes from a specific network card type or driver:
Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller
Although all of the new nodes reside on the same Arista leaf switch, when we replaced the NIC above with an Intel one, the problem was solved. So having offloading enabled does not create issues on Intel NICs.
We are still checking, but this is huge progress in understanding what the issue could be.
We are also trying to understand the pros and cons of keeping offloading disabled versus replacing the network cards.
I will update once we have more information.
If anyone has anything to add, please share.
Hi @dpeer,
Thanks for the pointers here. I ran into the same issue after upgrading to Calico version 3.24 and k8s version 1.24.
The NIC in my setup was a Mellanox ConnectX-4.
Disabling checksum offloading solved the problem for me as well, but I was wondering: the same NIC was working perfectly fine with older Calico versions. Is it something to do with how newer Calico versions handle this, or is it only the NIC?
P.S. Intel NIC works fine for me as well.