Setting namespace maximum CPU usage

Cluster information:

Kubernetes version: 1.18.4
Cloud being used: bare-metal 
Host OS: Ubuntu

Dear all,
I have been setting ResourceQuotas and LimitRanges on my single-node Kubernetes cluster to ensure that a certain namespace never uses more than 42 of the 48 available cpus, because I want to be sure the other namespaces always have at least 6 cpus available to them.

In particular, I have set:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 192Gi
    limits.cpu: "42"
    limits.memory: 192Gi
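
(For context, the LimitRanges I mentioned just set per-container defaults and maximums in the usual way; the values below are only illustrative, not my exact settings.)

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range     # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:         # request applied when a container specifies none
      cpu: "1"
    default:                # limit applied when a container specifies none
      cpu: "24"
    max:                    # hard per-container ceiling
      cpu: "24"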

I expected to be able to schedule up to 40 single-container pods with a CPU request of 1, and that each of them would individually be able to burst up to 24 cores by setting its limits.cpu to 24.
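
For reference, each pod's resources stanza would look something like this (pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: sim-example          # placeholder name
spec:
  containers:
  - name: sim
    image: my-sim-image      # placeholder image
    resources:
      requests:
        cpu: "1"             # guaranteed share, counted against requests.cpu in the quota
      limits:
        cpu: "24"            # burst ceiling, counted against limits.cpu in the quota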

However, if I do that, the second pod I create is rejected with:

Error creating: pods “sim-nv87acee-n4hzm” is forbidden: exceeded quota: mem-cpu-demo, requested: limits.cpu=24, used: limits.cpu=24, limited: limits.cpu=42

Contrary to what I believed, the 42 cpu limit is not enforced at runtime; it is enforced at pod creation, because the sum of all the pods' limits must never exceed 42.

So, if I specify any limit for my pods (which I do want to do, because I want them to be able to burst, so limit=request would not be good), I can no longer use the full 42 cpus.
Actually, in the case above, with pod limits.cpu=24, I can only run 1 pod, so most of the time I would only use 2 cpus with all the rest of the server sitting unused.

Is there a way to cap namespace total cpu usage at runtime (leaving the ability for each pod to burst)?

To enforce the cap of 42 at runtime, we would have to aggregate data across pods, across nodes in real-time. That simply isn’t feasible today. As you discovered, this is an API-time check.

Imagine a 2-node cluster. You schedule 2 pods, one on each node, with a CPU request of 20 and a limit of 24. But you want the aggregate limit to be lower than the sum of the individual limits. Suppose both pods start burning CPU at the same time. We would have to send messages over the network between nodes to share status and negotiate a reduced limit. If one pod stops burning CPU, we need to renegotiate so the other can use more.

While not impossible, this is not what Kubernetes is doing (yet?).

The good news is - I don’t think it matters.

In a cgroup-based runtime (docker, containerd, cri-o), CPU limits are generally not necessary. CPU requests govern how much CPU you will get under contention and that is (usually) all you care about.

In other words:

Imagine a 4-core node. You schedule pod A with CPU request=1, limit=100. You schedule another pod B with CPU request=3, limit=10. Your node is full. Assume pod A is going to use as much CPU as it can - it’s a hotloop. If there is CPU idle on the machine, A will use it. However, as soon as B needs CPU, it will get it INSTEAD of A. Now, if B hotloops, it will take CPU back from A, but only up to 3 full cores.

As long as they are both hotlooping, the CPU allocation should be 1 for A and 3 for B. As soon as either one backs off the CPU, the other can use it. This works very well (in most cases).
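
For illustration, a sketch of those two pods (names, image and the hotloop command are placeholders):

# Pod A: tiny guarantee, effectively unthrottled burst
apiVersion: v1
kind: Pod
metadata:
  name: pod-a                # placeholder name
spec:
  containers:
  - name: app
    image: busybox           # placeholder image
    command: ["sh", "-c", "while true; do :; done"]   # hotloop for illustration
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "100"
---
# Pod B: bigger guarantee, capped at 10 cores
apiVersion: v1
kind: Pod
metadata:
  name: pod-b                # placeholder name
spec:
  containers:
  - name: app
    image: busybox           # placeholder image
    command: ["sh", "-c", "while true; do :; done"]   # hotloop for illustration
    resources:
      requests:
        cpu: "3"
      limits:
        cpu: "10"

Under contention, it is the cgroup CPU shares derived from those requests that produce the 1:3 split described above; the limits only cap how far each pod can burst when there is idle CPU.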

Dear @thockin,
thanks for your nice and clear answer.

Yes, in the more common scenario where the cluster (and the namespace) has multiple nodes, enforcing the cpu limit at runtime would require network communication to coordinate CPU usage… with high latencies, I guess that would be very complex and possibly inefficient to do!

Thanks for the clarifying example.
So, in my use case, if I want to protect the 6 cpus for the other namespaces, what I have to do is:
a) enforce a maximum of 42 cpus of requests on the main namespace (so that the last 6 of the 48 cannot be requested there);
b) have the applications in the other namespaces explicitly request their 6 cpus.
Point b) ensures that they will always be able to get their 6 cpus even when the applications in the main namespace burst beyond the 42 cpus they can request.
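
Concretely, I think something along these lines should do it (namespace and object names below are placeholders):

# Main namespace: cap total cpu requests at 42, so the remaining 6 of the
# 48 cores can never be requested from here.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-requests-cap         # placeholder name
  namespace: main                # placeholder namespace
spec:
  hard:
    requests.cpu: "42"
---
# Other namespace: the protected application explicitly requests its 6 cpus,
# so it always wins them back under contention.
apiVersion: v1
kind: Pod
metadata:
  name: protected-app            # placeholder name
  namespace: other               # placeholder namespace
spec:
  containers:
  - name: app
    image: protected-app-image   # placeholder image
    resources:
      requests:
        cpu: "6"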

I think my use case is solved, thanks!