Kubernetes frontend service latency astronomically higher than Docker Compose-based setup

I am seeing a huge latency difference between my Kubernetes setup and a Docker Compose setup. A request to the frontend service on Kubernetes takes close to 10 seconds to respond, while the same request against a Docker Compose instance of the exact same image takes about 200 ms.

Kubernetes Node Components:

  • Nginx server
  • API service
  • Postgres
  • Redis
  • Web frontend service

Network Configuration:

  • Using Calico’s Tigera operator; the machine is on subnet 192.168.88.0/24 and the pod network (Calico IP pool) is 10.10.0.0/16
  • No other customizations apart from the subnet.
  • Followed the Calico quickstart guide.

Additionally:

  • None of the system’s deployments specify resource limits.
  • No Ingress resource is specified; instead, a NodePort is declared on the nginx service. However, I cannot access it externally using the hostname nginx.my-namespace.svc.cluster.local (nslookup reports no such hostname). Since this suggests some sort of DNS resolution issue, I’m wondering whether it could be related to the latency (see the check sketched after this list).
  • The latency issue is consistent even when testing inside the frontend pod with curl http://localhost....
  • There is no difference in latency between static pages or pages with dynamically generated content.
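
For reference, my (possibly wrong) understanding is that *.svc.cluster.local names only resolve through the cluster’s internal DNS, so from outside the cluster the NodePort should be reached via the node’s own IP, roughly like this (the port and IP below are placeholders):

$ kubectl -n my-namespace get svc nginx      # shows the assigned NodePort, e.g. 80:3XXXX/TCP
$ curl http://<node-ip>:<node-port>/         # <node-ip> = the machine's 192.168.88.x address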

I’m convinced it’s NOT an application issue and suspect it to be related to the network.

What could be causing this significant latency in my Kubernetes setup, and what steps can I take to diagnose and resolve this issue?

Any insights or diagnostic steps would be greatly appreciated!

Cluster information:

Machine: bare-metal Ubuntu 22.04, 8 cores, 32 GiB RAM
Kubernetes version: v1.28.11 (registry.k8s.io/kube-apiserver:v1.28.11)
CNI: Calico v3.28.0 (docker.io/calico/apiserver:v3.28.0)
CRI: containerd (containerd.io) 1.6.33

10 seconds is not a performance issue, it’s a configuration issue. It’s not like “k8s is less efficient with the network” or something 🙂

Something somewhere is timing out while it waits for a response that never comes. It’s hard to pin down who or why without disassembling the application layer by layer.

I’d first figure out which part of the app is slow by comparing log timestamps, or even just watching the logs in real time.
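
For example (the deployment names here are just placeholders):

$ kubectl logs -f --timestamps deploy/frontend     # when does the frontend start and finish handling the request?
$ kubectl logs -f --timestamps deploy/api          # when does the backend actually see the call?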

It so happens that many DNS configs have a default 5-second timeout, so it’s possible that the heart of the problem is that DNS is not set up properly.
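
Since you already have curl in the pod, a quick timing breakdown can confirm whether the wait is in name resolution (the URL and port below are placeholders):

$ kubectl exec deploy/frontend -- \
    curl -o /dev/null -s -w 'lookup=%{time_namelookup}s connect=%{time_connect}s total=%{time_total}s\n' \
    http://api:8080/
# if lookup is ~5 s (or ~10 s after a retry) while connect and total add almost nothing on top, it's DNS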

Thanks for the insight, @thockin! I didn’t know about the DNS 5-second timeout but it does seem to point more strongly towards issues with DNS resolution.

Do you happen to know of a good guide or resource for setting up DNS properly in a Kubernetes cluster, especially when using Calico? Any recommendations or best practices would be greatly appreciated!

DNS shouldn’t be hard to set up.

  1. Can all your pods talk to each other, across nodes?
  2. Do kube Services work?
  3. Do you have a DNS service (in cluster or out)?
  4. Are your pods correctly using that DNS service (via kubelet config)?

This doc may help: Debug Services | Kubernetes
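
Concretely, assuming a kubeadm-style cluster with CoreDNS in kube-system (names, IPs, and ports below are placeholders), those checks might look like:

# 1. can pods reach each other across nodes?
$ kubectl get pods -o wide                                   # note pod IPs and nodes
$ kubectl exec deploy/frontend -- curl -s http://<pod-ip-on-other-node>:<port>/

# 2. do Services work when addressed by ClusterIP?
$ kubectl get svc api
$ kubectl exec deploy/frontend -- curl -s http://<api-cluster-ip>:<port>/

# 3. is a cluster DNS actually running?
$ kubectl -n kube-system get pods -l k8s-app=kube-dns

# 4. are pods pointed at it?
$ kubectl exec deploy/frontend -- cat /etc/resolv.conf       # nameserver should be the kube-dns ClusterIP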

I was just able to confirm our suspicions that DNS misconfiguration is involved. I ran tcpdump -i any port 53, then went inside the relevant service pod and hit an endpoint directly on 127.0.0.1. Sure enough, I got a bunch of DNS queries for api.NAMESPACE.svc.cluster.local. This is expected because the pod does talk to the api service. What I don’t quite understand is why it is querying the fully qualified name, because the code explicitly references api, not api.NAMESPACE.svc.cluster.local. Some sort of resolution magic seems to be happening.

I can also confirm that making requests across nodes does work, but only when referencing the service’s name.

The only DNS service that I use is my router’s DNS server:

$ grep -P ^DNS /etc/systemd/resolved.conf 
DNS=192.168.88.1 1.1.1.1 8.8.8.8 8.8.4.4
$ nslookup wonga.com 192.168.88.1
Server:		192.168.88.1
Address:	192.168.88.1#53

Non-authoritative answer:
Name:	wonga.com
Address: 104.18.79.205

Seems like I will have to fix the *.cluster.local DNS resolution after all!
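
The checks I’m planning to run to see what is actually serving cluster DNS (paths assume a kubeadm-installed kubelet, which is what the Calico quickstart uses):

# on the node: which DNS server kubelet hands to pods
$ sudo grep -A 2 clusterDNS /var/lib/kubelet/config.yaml

# in the cluster: is CoreDNS running, and what is its Service IP?
$ kubectl -n kube-system get deploy coredns
$ kubectl -n kube-system get svc kube-dns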

When running a pod, we drop a bunch of DNS search paths into its resolv.conf, so api first looks for a kube Service named api in the same namespace as the client.
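
For illustration, a pod’s /etc/resolv.conf typically looks something like this (the nameserver is whatever your cluster DNS Service uses; 10.96.0.10 is just a common kubeadm default):

$ kubectl exec deploy/frontend -- cat /etc/resolv.conf
search my-namespace.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
# with ndots:5, a short name like "api" is tried against each search domain in turn,
# so the first query on the wire is api.my-namespace.svc.cluster.local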

If you don’t want that, you can use the Pod’s dnsConfig to override it.
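
A minimal sketch of such an override (pod name, namespace, and image are placeholders; lowering ndots makes names that already contain a dot resolve directly instead of being expanded through every cluster search domain first):

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: frontend-dns-tuned            # placeholder name
  namespace: my-namespace
spec:
  containers:
  - name: app
    image: my-frontend:latest         # placeholder image
  dnsPolicy: ClusterFirst             # keep using the in-cluster DNS
  dnsConfig:
    options:
    - name: ndots
      value: "1"                      # short names like "api" still use the search list
EOF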