Help troubleshooting scenario

I have a 9-node cluster (3 etcd, 2 masters, 4 compute). The compute nodes are 16-core, 32 GB RAM, running RHEL 7.5 (kernel 3.10), and they are mostly idle (at the time of writing, total reported CPU usage is close to 6 cores and memory is around 105Gi).

The application is a .NET Core 2.0 web API. The image is based on Debian 10. The web API is very simple and basically only queries a database and another internal WCF service (not hosted on Kubernetes). The database queries are very simple, don't even involve a join, and are all index-optimised.

The pod deployment requests 100m of CPU and 100Mi of memory, with limits of 1 CPU and 1Gi.
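
For reference, the relevant bit of the pod template looks roughly like this (trimmed fragment; the container name is illustrative, not the real one):

```yaml
# Trimmed sketch of the resource settings described above; container name is made up.
containers:
- name: webapi
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: "1"
      memory: 1Gi
```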

The scenario I need help with is:

  • This morning we were peaking around 8000 requests per min (133 per sec).
  • We had 10 pods running the app, spread as evenly as possible across the nodes.
  • The pods were reporting CPU usage around 10-25% and 300-450Mi of memory
  • Response times were absolute garbage. We're talking 50-second response times.

The database guys said that everything was smooth on their side, and our monitoring of the WCF service showed that it was also doing fine (responding in roughly 50-300ms). Basically, our requests were stuck inside the pods, not waiting on some kind of network I/O.

There were no indications of CPU or memory pressure, but we decided to double the number of pods to 20, and the problem went away.
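
The scale-up itself was nothing fancy, just the replica count on the same deployment, roughly like this:

```yaml
# Trimmed fragment of the same deployment spec; we only changed the replica count.
spec:
  replicas: 20   # bumped from 10
```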

Can anyone explain this behaviour?

Cluster information:

Kubernetes version: 1.8.3
Cloud being used: DXC - but IaaS
Installation method: don’t know
Host OS: RHEL 7.5 (3.10.0)
CNI and version:
CRI and version:

Before anything: is this the actual version you're running, or is it 1.18.3? 1.8.3 has been out of support and hasn't been touched in a few years now :grimacing:

Sadly yes… 1.8 is the actual version… well, thanks anyway.