Kubernetes version: v1.21.1
Cloud being used: bare-metal
Installation method: Rancher Kubernetes Engine
Host OS: RHEL 7
Container Runtime: Docker v20.1
Problem Statement: K8s environment - a curl from a Pod to an Endpoint (that points to an external database) fails with a timeout.
K8s Cluster details: 3-node cluster hosted using Rancher Kubernetes Engine (RKE) with Docker as the container runtime.
Nodes:
NAME ROLES
worker1 etcd,worker
worker2 etcd,worker
master controlplane,etcd
As this setup uses RKE, the apiserver and kubelet run as Docker containers on all the nodes.
The Pod and the Endpoint reside in the same namespace.
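For reference, here is a minimal sketch of the kind of objects this setup relies on, assuming a selector-less Service backed by a manually created Endpoints object. The function name and its four arguments (name, namespace, external IP, port) are hypothetical placeholders, not our actual values.

```shell
# Print a selector-less Service plus a matching Endpoints object.
# Arguments (all hypothetical placeholders): NAME NAMESPACE EXTERNAL_IP PORT
external_db_manifest() {
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: $1
  namespace: $2
spec:
  ports:
    - port: $4
      targetPort: $4
---
apiVersion: v1
kind: Endpoints
metadata:
  name: $1   # must match the Service name
  namespace: $2
subsets:
  - addresses:
      - ip: $3
    ports:
      - port: $4
EOF
}
```

The output can be compared against what is actually deployed, for example with kubectl get svc,endpoints -n [namespace] -o yaml, to confirm that the names, IP and ports line up.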
Command inside pod:
curl :
The curl times out and fails. However, if we curl the actual database IP and port from outside the pod (that is, on the node), we get the expected response.
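To tell a silent drop apart from an active refusal, a plain TCP probe can be run from both the pod and the node. This is only a sketch: it assumes bash (with /dev/tcp support) and GNU timeout are available in the container, and the function names probe_tcp and classify_probe are made up for illustration.

```shell
# Map the exit status of the probe below to a short label.
# 124 is the status GNU timeout returns when the time limit is hit.
classify_probe() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timeout" ;;
    127) echo "probe-tool-missing" ;;
    *)   echo "refused-or-unreachable" ;;
  esac
}

# Probe HOST:PORT over TCP, giving up after SECS seconds (default 5).
# Usage: probe_tcp HOST PORT [SECS]
probe_tcp() {
  host="$1"; port="$2"; secs="${3:-5}"
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
  classify_probe "$?"
}
```

Running probe_tcp against the service IP and against the database IP, once inside the pod and once on the node, gives four results to compare. A "timeout" means no answer came back at all, whereas "refused-or-unreachable" means something along the path replied with an error.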
We are trying to trace the route that the packet issued by the curl takes, using the ip route and tracepath utilities.
The ip route command gives the same response every time it is issued in the pod. However, tracepath shows a different path each time, as multiple pods in the cluster share the same IP (the IP of the node where the pod’s containers are created).
[root@master etc]# kubectl exec -it -n [namespace] [pod-name] sh
sh-4.2# ip route get [service ip of endpoint]
[service ip of endpoint] via [inaccessible software router ip] dev eth0 src [pod ip]
cache
sh-4.2# ip route get [database ip for which we created the endpoint]
[database ip for which we created the endpoint] via [inaccessible software router ip] dev eth0 src [pod ip]
cache
The different paths that come up on each execution of the tracepath command are shown below.
Please note that once the request reaches the gateway, the path is always the same for the service IP of our endpoint. Even for something general like google.com, the path after the gateway is the same.
Hence, the part of the path after the gateway has been removed and only the part of the output that keeps changing is shown.
‘[the-ip]’ is the same IP address in all cases: it is the IP address of the node where our pod (from which we are curling) is running.
The hops before the gateway in each case are different pods that Rancher runs as daemonsets.
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local 0.075ms
1: [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local 0.041ms
2: [some-gateway] 0.486ms asymm 3
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].rancher-monitoring-prometheus-node-exporter.cattle-monitoring-system.svc.cluster.local 0.065ms
1: [the-ip].rancher-monitoring-ingress-nginx.ingress-nginx.svc.cluster.local 0.031ms
2: [some-gateway] 0.496ms asymm 3
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local 0.093ms
1: [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local 0.054ms
2: [some-gateway] 0.480ms asymm 3
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].rancher-monitoring-ingress-nginx.ingress-nginx.svc.cluster.local 0.082ms
1: [the-ip].pushprox-kube-proxy-client.cattle-monitoring-system.svc.cluster.local 0.041ms
2: [some-gateway] 0.626ms asymm 3
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local 0.103ms
1: [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local 0.052ms
2: [some-gateway] 0.501ms asymm 3
sh-4.2# tracepath [service ip of the endpoint] -p [service port of the endpoint]
1?: [LOCALHOST] pmtu 1450
1: [the-ip].pushprox-kube-etcd-client.cattle-monitoring-system.svc.cluster.local 0.159ms
1: [the-ip].rancher-monitoring-kubelet.kube-system.svc.cluster.local 0.082ms
2: [some-gateway] 0.725ms asymm 3
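To make the variation across runs easier to see, the tracepath output can be condensed into a count of distinct names per hop. The helper below is a rough sketch (summarize_hops is a name invented here); it reads the concatenated output of several runs on stdin and ignores the "1?: [LOCALHOST]" lines.

```shell
# Read tracepath output on stdin and print, for each hop number,
# how many distinct host names were reported for that hop.
summarize_hops() {
  awk '
    # Hop lines start with "N:"; the "1?:" local line does not match.
    $1 ~ /^[0-9]+:$/ {
      hop = $1
      sub(/:$/, "", hop)
      key = hop SUBSEP $2
      if (!(key in seen)) {
        seen[key] = 1
        names[hop]++
      }
    }
    END { for (h in names) print h, names[h] }
  ' | sort -n
}
```

For the six runs above this should report several names for hop 1 and a single name for hop 2. If the changing names come from reverse DNS lookups of the same address (an assumption we have not verified), rerunning with tracepath -n, which prints addresses numerically, should show a stable first hop.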
We have also run tcpdump on the pod’s eth0 interface and identified that the request goes through the coredns pod in the kube-system namespace. However, the coredns pod logs show no sign of the request coming in or going out. The same happens when curl-ing a general website like google.com, except that the curl to google.com succeeds while the curl to the database endpoint’s service IP and port fails (our case).
Any pointers on specific logging or components that we should look at, to identify at which point of the route the request fails, would be appreciated.