@fox-md @thockin
The Issue:
Any Pod can reach any other Pod (regardless of the node) via the Pod’s IP.
Service IPs aren’t reachable from within the Pods (the failure message is always a variation of “network unreachable”), but are reachable from the hosts.
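For reference, this is roughly how I’m checking the two behaviours above (pod names and IPs are placeholders, and I’m assuming a container image that ships ping and nc):

# pod-to-pod across nodes works (<other-pod-ip> is the IP of a Pod on a different node)
kubectl exec -it <some-pod> -- ping -c 1 <other-pod-ip>
# any service IP fails with "network is unreachable" (10.96.0.1 is the kubernetes.default ClusterIP)
kubectl exec -it <some-pod> -- nc -zv -w 2 10.96.0.1 443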
Another thing to note is that CoreDNS isn’t working properly, but that’s just a symptom of the issue above, because CoreDNS tries to reach the K8s API via the kubernetes.default service IP, which fails like any other service IP does from within a Pod:
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: connect: network is unreachable
[INFO] plugin/ready: Still waiting on: "kubernetes"
I tried all the steps in https://kubernetes.io/docs/tasks/debug/debug-application/debug-service/ and they all succeeded (in both iptables and ipvs mode) except for the hairpin section. I noticed that hairpin wasn’t enabled, so I added steps to my provisioning automation to enable hairpinning, but it still fails (i.e. a Pod can’t reach its own service IP even with hairpinning enabled).
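Roughly speaking, by enabling hairpinning I mean turning on hairpin mode on the bridge ports, along these lines (a sketch, assuming the flannel bridge is named cni0):

# run on each node: enable hairpin mode on every port of the cni0 bridge
for port in /sys/class/net/cni0/brif/*; do
  echo 1 > "$port/hairpin_mode"
done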
What I discovered:
In the network namespaces of the Pods there is neither an interface on the service subnet nor a default route, so traffic to IPs in the service subnet has nowhere to go.
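This is how I’m inspecting a Pod’s namespace (assuming CRI-O pins Pod network namespaces under /var/run/netns so that ip netns can see them; the namespace name is a placeholder):

# on the node hosting the Pod
ip netns list
# interfaces and routes inside the Pod's namespace
ip netns exec <pod-netns> ip addr
ip netns exec <pod-netns> ip route   # only the pod-subnet route, no default route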
If I enter the network namespace of a Pod and manually add a default route via the interface (in that namespace) that sits on the Pod subnet, then services become reachable from within that Pod.
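Concretely, something like this (the flannel node subnet 10.244.1.0/24, the gateway 10.244.1.1 on cni0 and the in-Pod interface eth0 are placeholders for the actual values on that node):

# inside the Pod's namespace, add a default route via the pod-subnet interface
ip netns exec <pod-netns> ip route add default via 10.244.1.1 dev eth0
# after this, e.g. 10.96.0.1:443 becomes reachable from inside the Pod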
I’m trying to understand why no default gateway/route is set up in the Pods’ network namespaces: https://kubernetes.slack.com/archives/C09QYUH5W/p1701189954362389. Apparently flannel is the culprit.
My cluster setup:
I have a bare-metal K8s 1.28.2 cluster on Ubuntu 22.04 using CRI-O v1.28.1~0 backed by cri-o-runc v1.0.1~2 (the versions are those of the apt packages I’m installing).
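(Roughly, the packages are pinned like this; repository setup is omitted and the exact apt version strings may differ slightly:)

apt-get install -y cri-o=1.28.1~0 cri-o-runc=1.0.1~2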
I’m deploying flannel v0.23.0 via the manifests at https://github.com/flannel-io/flannel/releases/download/v0.23.0/kube-flannel.yml, but I’m also downloading the flannel CNI plugin v1.2.0 and writing the following to /etc/cni/net.d/10-crio.conf, as per instructions here:
{
"name": "crio",
"type": "flannel"
}
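In my provisioning this boils down to roughly the following (the download step for the plugin binary is omitted; /opt/cni/bin is the standard CNI binary directory CRI-O looks in):

# install the flannel CNI plugin binary (v1.2.0) next to the other CNI plugins
install -m 0755 flannel /opt/cni/bin/flannel
# write the minimal CNI config shown above
cat > /etc/cni/net.d/10-crio.conf <<'EOF'
{
"name": "crio",
"type": "flannel"
}
EOF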
I tried both iptables and ipvs kube-proxy modes; the issue appears with both.
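For reference, switching and verifying the mode looks roughly like this (the ConfigMap-based switch assumes a kubeadm-style kube-proxy deployment; 10249 is the default kube-proxy metrics port):

# switch the mode: set mode: "ipvs" (or "iptables") and restart kube-proxy
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system delete pod -l k8s-app=kube-proxy
# on a node, confirm which mode kube-proxy is actually running in
curl -s localhost:10249/proxyMode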
My pod CIDR is 10.244.0.0/16 and my service CIDR is 10.96.0.0/24 (originally it was 10.96.0.0/12; the issue appears in both cases).
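(For reference, this is how the two CIDRs show up in the cluster; the kube-apiserver manifest path assumes a kubeadm-style control plane:)

# per-node Pod CIDRs allocated out of 10.244.0.0/16
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# the service CIDR as configured on the API server
grep service-cluster-ip-range /etc/kubernetes/manifests/kube-apiserver.yaml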