Cluster information:
Kubernetes version: v1.28.1
Cloud being used: bare-metal
Installation method: kubeadm
Host OS: Ubuntu 22.04.3 LTS
CNI and version: cilium 1.14.1
CRI and version: containerd 1.6.22
Today, after upgrading to 1.28.1, I realized that my test cluster is unable to get coredns ready:
$ k get po -A | grep core
kube-system coredns-5dd5756b68-hchqq 0/1 Running 0 57m
kube-system coredns-5dd5756b68-r768b 0/1 Running 0 57m
Inspecting the logs, there seems to be a connectivity issue between coredns and the kube-apiserver:
$ k -n kube-system logs coredns-5dd5756b68-hchqq | tail -5 | tail -2
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.96.0.1:443/version": dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
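The dial tcp 10.96.0.1:443: i/o timeout suggests that the kubernetes ClusterIP is never translated to one of the API server backends for this pod. One thing I still need to check is whether that translation is supposed to happen in kube-proxy or in Cilium's kube-proxy replacement; assuming the agent reports its mode, something like this should tell (ds/cilium picks an arbitrary agent pod):
$ kubectl -n kube-system exec ds/cilium -- cilium status | grep -i KubeProxyReplacement
$ kubectl -n kube-system get ds kube-proxy   # does a kube-proxy DaemonSet even exist?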
cilium connectivity test seems to run into the same issue:
$ cilium connectivity test
ℹ️ Monitor aggregation detected, will skip some flow validation steps
⌛ [kubernetes] Waiting for deployment cilium-test/client to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/client2 to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/echo-same-node to become ready...
⌛ [kubernetes] Waiting for deployment cilium-test/echo-other-node to become ready...
⌛ [kubernetes] Waiting for CiliumEndpoint for pod cilium-test/client-78f9dffc84-g5z5l to appear...
⌛ [kubernetes] Waiting for CiliumEndpoint for pod cilium-test/client2-59b578d4bb-jttvw to appear...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach DNS server on cilium-test/echo-same-node-54cc4f75b8-xt4cf pod...
⌛ [kubernetes] Waiting for pod cilium-test/client2-59b578d4bb-jttvw to reach DNS server on cilium-test/echo-same-node-54cc4f75b8-xt4cf pod...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach DNS server on cilium-test/echo-other-node-5b87f6f4f4-cdmtl pod...
⌛ [kubernetes] Waiting for pod cilium-test/client2-59b578d4bb-jttvw to reach DNS server on cilium-test/echo-other-node-5b87f6f4f4-cdmtl pod...
⌛ [kubernetes] Waiting for pod cilium-test/client-78f9dffc84-g5z5l to reach default/kubernetes service...
connectivity test failed: timeout reached waiting for lookup for kubernetes.default from pod cilium-test/client-78f9dffc84-g5z5l to succeed (last error: context deadline exceeded)
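The failing step is again the lookup against the kubernetes service, so this looks like the same ClusterIP problem rather than something specific to the test workloads. If it adds anything, basic node-to-node and endpoint connectivity could also be probed from inside one of the agents (assuming cilium-health is enabled, which I believe is the default):
$ kubectl -n kube-system exec ds/cilium -- cilium-health status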
Accessing the kube-api from outside the cluster works fine, as demonstrated by kubectl itself working.
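Since kubectl from the outside works, the API server itself is clearly reachable; what I have not verified is whether the agents actually have a mapping from 10.96.0.1:443 to the API servers on port 6443 programmed. Assuming Cilium is the one handling ClusterIP services here, my understanding is that this can be listed from an agent pod:
$ kubectl -n kube-system exec ds/cilium -- cilium service list | grep -A3 10.96.0.1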
Cilium status seems OK:
$ cilium status
/¯¯\
/¯¯\__/¯¯\ Cilium: OK
\__/¯¯\__/ Operator: OK
/¯¯\__/¯¯\ Envoy DaemonSet: disabled (using embedded mode)
\__/¯¯\__/ Hubble Relay: OK
\__/ ClusterMesh: disabled
Deployment hubble-relay Desired: 1, Ready: 1/1, Available: 1/1
Deployment hubble-ui Desired: 1, Ready: 1/1, Available: 1/1
Deployment cilium-operator Desired: 2, Ready: 2/2, Available: 2/2
DaemonSet cilium Desired: 4, Ready: 4/4, Available: 4/4
Containers: cilium Running: 4
hubble-relay Running: 1
hubble-ui Running: 1
cilium-operator Running: 2
Cluster Pods: 8/8 managed by Cilium
Helm chart version: 1.14.1
Image versions cilium quay.io/cilium/cilium:v1.14.1@sha256:edc1d05ea1365c4a8f6ac6982247d5c145181704894bb698619c3827b6963a72: 4
hubble-relay quay.io/cilium/hubble-relay:v1.13.2: 1
hubble-ui quay.io/cilium/hubble-ui:v0.11.0@sha256:bcb369c47cada2d4257d63d3749f7f87c91dde32e010b223597306de95d1ecc8: 1
hubble-ui quay.io/cilium/hubble-ui-backend:v0.11.0@sha256:14c04d11f78da5c363f88592abae8d2ecee3cbe009f443ef11df6ac5f692d839: 1
cilium-operator quay.io/cilium/operator-generic:v1.14.1@sha256:e061de0a930534c7e3f8feda8330976367971238ccafff42659f104effd4b5f7: 2
There are no network policies I can find to blame.
$ k get ciliumnetworkpolicies.cilium.io -A
No resources found
$ k get networkpolicies.networking.k8s.io -A
No resources found
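To also rule out policy enforcement inside Cilium itself (with no policies present I assume the default is allow-all), the per-endpoint enforcement state can, as far as I know, be listed from an agent pod:
$ kubectl -n kube-system exec ds/cilium -- cilium endpoint list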
There are endpoints which I believe should be implicitly targeted by the service:
$ k get endpointslices.discovery.k8s.io
NAME ADDRESSTYPE PORTS ENDPOINTS AGE
kubernetes IPv4 6443 192.168.100.10,192.168.100.11,192.168.100.12 140d
$ k get svc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2023-09-01T08:00:11Z"
    labels:
      component: apiserver
      provider: kubernetes
    name: kubernetes
    namespace: default
    resourceVersion: "2726902"
    uid: 5e7c32c9-ab89-47e4-8940-db010c2ffc4d
  spec:
    clusterIP: 10.96.0.1
    clusterIPs:
    - 10.96.0.1
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: 6443
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
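So the Service and its EndpointSlice look like the stock objects. What I have not checked is whether that mapping actually made it into the datapath; if Cilium's BPF load balancer is what implements ClusterIP services here, I assume the frontend and backends should show up in:
$ kubectl -n kube-system exec ds/cilium -- cilium bpf lb list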
I don’t believe I have any funny business in the coredns config:
$ k get all -A -l k8s-app=kube-dns
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-5dd5756b68-hchqq 0/1 Running 1 (6m39s ago) 4h46m
kube-system pod/coredns-5dd5756b68-r768b 0/1 Running 1 (6m38s ago) 4h46m
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 4h46m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/2 2 0 4h46m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-5dd5756b68 2 2 0 4h46m
$ k -n kube-system describe cm coredns
Name: coredns
Namespace: kube-system
Labels: <none>
Annotations: <none>
Data
====
Corefile:
----
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}
BinaryData
====
Events: <none>
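As far as I understand, the kubernetes plugin talks to the API server via the in-cluster configuration (the KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT variables injected by the kubelet), which is where the https://10.96.0.1:443 target in the log comes from and which matches the Service above. To find out whether the ClusterIP is unreachable only for coredns or for pods in general, a throwaway pod could be used (curl-test is just a hypothetical name):
$ kubectl run curl-test --rm -it --restart=Never --image=nicolaka/netshoot -- \
    curl -sk -m 5 https://10.96.0.1:443/version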
The DNS server is running in the container, but it does not seem to hold any data, probably because it cannot connect to the API:
$ kubectl -n kube-system debug -it pod/coredns-5dd5756b68-hchqq --image=nicolaka/netshoot --target=coredns
coredns-5dd5756b68-hchqq ~ ss -lnp | grep :53
udp UNCONN 0 0 *:53 *:* users:(("coredns",pid=1,fd=12))
tcp LISTEN 0 4096 *:53 *:* users:(("coredns",pid=1,fd=11))
coredns-5dd5756b68-hchqq ~ dig @localhost kubernetes.default
; <<>> DiG 9.18.13 <<>> @localhost kubernetes.default
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 29162
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 7fdf8d625c0b48eb (echoed)
;; QUESTION SECTION:
;kubernetes.default. IN A
;; Query time: 0 msec
;; SERVER: ::1#53(localhost) (UDP)
;; WHEN: Fri Sep 01 13:23:08 UTC 2023
;; MSG SIZE rcvd: 59
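If I read the ready plugin right, it serves its readiness status on :8181 by default, so the failing readiness probe should also be visible from the same debug shell (assuming no custom ready port is configured):
coredns-5dd5756b68-hchqq ~ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8181/ready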
I can access the API from the pod on the external IP, but not on the service IP:
coredns-5dd5756b68-hchqq ~ ping kubernetes
ping: kubernetes: Try again
coredns-5dd5756b68-hchqq ~ ping k8s
PING k8s.kubenet (192.168.100.5) 56(84) bytes of data.
64 bytes from k8s.kubenet (192.168.100.5): icmp_seq=1 ttl=62 time=0.139 ms
64 bytes from k8s.kubenet (192.168.100.5): icmp_seq=2 ttl=62 time=0.147 ms
^C
--- k8s.kubenet ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.139/0.143/0.147/0.004 ms
coredns-5dd5756b68-hchqq ~ curl -k https://k8s:6443
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {},
"code": 403
}#
coredns-5dd5756b68-hchqq ~ curl -k https://10.96.0.1:443
curl: (28) Failed to connect to 10.96.0.1 port 443 after 130812 ms: Couldn't connect to server
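To find out whether packets to the ClusterIP are being dropped or simply never rewritten, my next step is to watch for drops from the Cilium agent on the node running this pod while the curl above retries (ds/cilium picks an arbitrary agent pod, so the one on the right node may need to be targeted explicitly), and to check whether anything on the node has NAT rules for the VIP at all:
$ kubectl -n kube-system exec ds/cilium -- cilium monitor --type drop
$ sudo iptables-save -t nat | grep 10.96.0.1   # on the node, only relevant if kube-proxy is in use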
What am I missing?