CoreDNS fails to connect to kube-api via the kubernetes service

Cluster information:

Kubernetes version: v1.28.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.3 LTS
CNI and version: cilium 1.14.1
CRI and version: containerd 1.6.22

Today, after upgrading to v1.28.1, I noticed that my test cluster is unable to get CoreDNS ready:

$ k get po -A | grep core
kube-system   coredns-5dd5756b68-hchqq            0/1     Running   0             57m
kube-system   coredns-5dd5756b68-r768b            0/1     Running   0             57m

Inspecting the logs, there seems to be a connectivity issue between CoreDNS and the kube-apiserver:

$ k -n kube-system logs coredns-5dd5756b68-hchqq | tail -2
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"

cilium connectivity test seems to run into the same issue:

$ cilium connectivity test
ℹ️  Monitor aggregation detected, will skip some flow validation steps
⌛ [kubernetes] Waiting for deployment cilium-test/client to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/client2 to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [kubernetes] Waiting for CiliumEndpoint for pod cilium-test/client-78f9dffc84-g5z5l to appear...
⌛ [kubernetes] Waiting for CiliumEndpoint for pod cilium-test/client2-59b578d4bb-jttvw to appear...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach DNS server on cilium-test/echo-same-node-54cc4f75b8-xt4cf pod...
⌛ [kubernetes] Waiting for pod cilium-test/client2-59b578d4bb-jttvw to reach DNS server on cilium-test/echo-same-node-54cc4f75b8-xt4cf pod...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach DNS server on cilium-test/echo-other-node-5b87f6f4f4-cdmtl pod...
⌛ [kubernetes] Waiting for pod cilium-test/client2-59b578d4bb-jttvw to reach DNS server on cilium-test/echo-other-node-5b87f6f4f4-cdmtl pod...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach default/kubernetes service...
connectivity test failed: timeout reached waiting for lookup for kubernetes.default from pod cilium-test/client-78f9dffc84-g5z5l to succeed (last error: context deadline exceeded)

Accessing the kube-api from outside the cluster works fine, as demonstrated by kubectl itself working. :wink:

Cilium status seems OK:

$ cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    disabled (using embedded mode)
 \__/¯¯\__/    Hubble Relay:       OK
    \__/       ClusterMesh:        disabled

Deployment             hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
Deployment             hubble-ui          Desired: 1, Ready: 1/1, Available: 1/1
Deployment             cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet              cilium             Desired: 4, Ready: 4/4, Available: 4/4
Containers:            cilium             Running: 4
                       hubble-relay       Running: 1
                       hubble-ui          Running: 1
                       cilium-operator    Running: 2
Cluster Pods:          8/8 managed by Cilium
Helm chart version:    1.14.1
Image versions         cilium             quay.io/cilium/cilium:v1.14.1@sha256:edc1d05ea1365c4a8f6ac6982247d5c145181704894bb698619c3827b6963a72: 4
                       hubble-relay       quay.io/cilium/hubble-relay:v1.13.2: 1
                       hubble-ui          quay.io/cilium/hubble-ui:v0.11.0@sha256:bcb369c47cada2d4257d63d3749f7f87c91dde32e010b223597306de95d1ecc8: 1
                       hubble-ui          quay.io/cilium/hubble-ui-backend:v0.11.0@sha256:14c04d11f78da5c363f88592abae8d2ecee3cbe009f443ef11df6ac5f692d839: 1
                       cilium-operator    quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7: 2

There are no network policies I can find to blame.

$ k get ciliumnetworkpolicies.cilium.io -A
No resources found
$ k get networkpolicies.networking.k8s.io -A
No resources found

There are endpoints, which I believe are implicitly targeted by the service (the default kubernetes service has no selector; its endpoints are managed directly by the apiserver):

$ k get endpointslices.discovery.k8s.io 
NAME         ADDRESSTYPE   PORTS   ENDPOINTS                                      AGE
kubernetes   IPv4          6443    192.168.100.10,192.168.100.11,192.168.100.12   140d
$ k get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2023-09-01T08:00:11Z"
    labels:
      component: apiserver
      provider: kubernetes
    name: kubernetes
    namespace: default
    resourceVersion: "2726902"
    uid: 5e7c32c9-ab89-47e4-8940-db010c2ffc4d
  spec:
    clusterIP: 10.96.0.1
    clusterIPs:
    - 10.96.0.1
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: 6443
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""

I don’t believe I have any funny business in the coredns config:

$ k get all -A -l k8s-app=kube-dns
NAMESPACE     NAME                           READY   STATUS    RESTARTS        AGE
kube-system   pod/coredns-5dd5756b68-hchqq   0/1     Running   1 (6m39s ago)   4h46m
kube-system   pod/coredns-5dd5756b68-r768b   0/1     Running   1 (6m38s ago)   4h46m

NAMESPACE     NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-system   service/kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   4h46m

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           4h46m

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-5dd5756b68   2         2         0       4h46m

$ k -n kube-system describe cm coredns 
Name:         coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
Corefile:
----
.:53 {
    errors
    health {
       lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
       max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}


BinaryData
====

Events:  <none>

There is a DNS server running in the container, but it does not seem to hold any data, probably because it cannot connect to the API:

$ kubectl -n kube-system debug -it pod/coredns-5dd5756b68-hchqq --image=nicolaka/netshoot --target=coredns
coredns-5dd5756b68-hchqq  ~  ss -lnp | grep :53
udp   UNCONN 0      0                  *:53               *:*    users:(("coredns",pid=1,fd=12))
tcp   LISTEN 0      4096               *:53               *:*    users:(("coredns",pid=1,fd=11))
coredns-5dd5756b68-hchqq  ~  dig @localhost kubernetes.default

; <<>> DiG 9.18.13 <<>> @localhost kubernetes.default
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29162
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 7fdf8d625c0b48eb (echoed)
;; QUESTION SECTION:
;kubernetes.default.            IN      A

;; Query time: 0 msec
;; SERVER: ::1#53(localhost) (UDP)
;; WHEN: Fri Sep 01 13:23:08 UTC 2023
;; MSG SIZE  rcvd: 59
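A side note on the NXDOMAIN above: `dig` does not apply the pod's resolv.conf search path by default, so the literal name `kubernetes.default.` falls outside the `cluster.local` zone served by the kubernetes plugin and is forwarded upstream. To exercise the plugin directly, query the FQDN (a hedged sketch; with the API connection down it still won't return the ClusterIP):

```shell
# Query the fully qualified service name so the request stays inside
# the cluster.local zone handled by the kubernetes plugin.
dig @localhost kubernetes.default.svc.cluster.local +short
# A healthy CoreDNS would answer 10.96.0.1; while the API watch is
# down, expect no answer or SERVFAIL instead.
```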

I can reach the API from the pod via the external IP, but not via the service IP:

 coredns-5dd5756b68-hchqq  ~  ping kubernetes              
ping: kubernetes: Try again

 coredns-5dd5756b68-hchqq  ~  ping k8s       
PING k8s.kubenet (192.168.100.5) 56(84) bytes of data.
64 bytes from k8s.kubenet (192.168.100.5): icmp_seq=1 ttl=62 time=0.139 ms
64 bytes from k8s.kubenet (192.168.100.5): icmp_seq=2 ttl=62 time=0.147 ms
^C
--- k8s.kubenet ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.139/0.143/0.147/0.004 ms

 coredns-5dd5756b68-hchqq  ~  curl -k https://k8s:6443
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

 coredns-5dd5756b68-hchqq  ~  curl -k https://10.96.0.1:443 
curl: (28) Failed to connect to 10.96.0.1 port 443 after 130812 ms: Couldn't connect to server
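When a ClusterIP times out like this, watching the datapath for drops usually narrows things down. Since Hubble Relay is up, something along these lines (hedged; flag names per the hubble CLI shipped with cilium 1.14) should show whether the SYNs toward the apiserver are being dropped, and with what drop reason:

```shell
# Stream dropped flows involving the apiserver ClusterIP while
# re-running the curl from the CoreDNS debug container.
kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
  hubble observe --verdict DROPPED --to-ip 10.96.0.1 -f
```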

What am I missing?

I have the same problem after upgrading to 1.28.

cilium image (running): 1.14.0

> kubectl get nodes                           
NAME    STATUS   ROLES           AGE    VERSION
k8s-a   Ready    control-plane   205d   v1.28.0
k8s-b   Ready    <none>          190d   v1.27.4
k8s-c   Ready    <none>          190d   v1.28.0

M.

I’ve cross-posted this to the Cilium GitHub; it seems likely to be a version mismatch: Coredns fails connecting to kube-api via kubernetes service · Issue #27900 · cilium/cilium · GitHub

It turns out there was a regression in 1.28.1 that caused problems with init containers, which in turn caused the DNS problems described above.

The regression is fixed in 1.28.2.
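For anyone landing here: the standard kubeadm path to the fixed patch release looks roughly like this (a hedged sketch; Ubuntu/apt package names from the upstream Kubernetes repo assumed — run on the control-plane node first, then `kubeadm upgrade node` plus the kubelet bump on the workers):

```shell
# Control-plane node: upgrade kubeadm, apply the patch release,
# then upgrade kubelet/kubectl. Adjust package pins to your repo.
apt-get update
apt-get install -y kubeadm=1.28.2-*
kubeadm upgrade apply v1.28.2
apt-get install -y kubelet=1.28.2-* kubectl=1.28.2-*
systemctl daemon-reload && systemctl restart kubelet
```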