Trying to gain some additional perspective here.
We originally did not set resource limits on our workloads. We've had instances where, during startup, the node would hang at 100% CPU indefinitely.
Our first mistake was not setting kube-reserved and system-reserved, which we have since corrected. However, even with both reservations set to what seems a reasonable 100m each, I can still reproduce the issue by scaling up a bunch of pods at once.
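For reference, this is roughly what our reservations look like in the kubelet config (a sketch using the `KubeletConfiguration` API; the 100m values are the ones mentioned above, everything else is illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reserve CPU for the kubelet/container runtime and for OS system daemons.
kubeReserved:
  cpu: "100m"
systemReserved:
  cpu: "100m"
# Enforce the pods allocatable boundary so pods can't consume the reserved share.
enforceNodeAllocatable:
  - pods
```

Note that without `enforceNodeAllocatable`, the reservations only shrink the node's Allocatable for scheduling purposes; they don't cgroup-enforce anything at runtime, which may be why startup spikes can still starve the node.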
I've now taken the approach of setting requests equal to limits, but I'm having a hard time deciding whether this is the right approach or whether there's something else I'm missing.
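Concretely, the requests = limits approach looks like this per container (a hypothetical pod spec; the name, image, and values are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app   # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      resources:
        requests:
          cpu: "500m"
          memory: "256Mi"
        limits:
          cpu: "500m"      # equal to requests -> pod gets Guaranteed QoS
          memory: "256Mi"
```

One known trade-off: equal requests and limits put the pod in the Guaranteed QoS class and cap the CPU burst, which prevents one pod from monopolizing the node during startup, but CPU throttling can also make startup itself slower.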