Hi, for one of our applications we need to monitor pods for high CPU usage (for example, caused by an infinite loop). Once we detect this abnormal CPU usage, we want to terminate the pod.
But under normal circumstances, we still need to use HPA to auto-scale the pods (based on number of requests, CPU, and memory).
I was thinking about the following strategies:
- Kill pods smartly: monitor abnormal CPU activity over a prolonged period (1 min)
  - We watch the CPU activity of specific pods, selected by a specific label
  - If we find abnormal behavior (max CPU usage for 1 min), we terminate the ill pod and allow HPA to create a new one
- Kill specific pods randomly
  - Assuming all pods under the same label are stateless, we can schedule killing a random 25% of the running pods each minute, allowing fresh ones to start if needed
  - We can use a label to select the target pod group
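
To make the two strategies concrete, here is a minimal Python sketch of the selection logic only. It assumes CPU usage is sampled elsewhere (for example from the metrics API) as a fraction of each pod's CPU limit, and that actually deleting the pods is handled separately. The names `CpuWatchdog` and `pick_random_victims`, the 0.95 threshold, and the 6-sample window (6 samples at 10 s intervals, roughly 1 min) are hypothetical choices for illustration, not part of any existing tool.

```python
import random
from collections import defaultdict, deque


class CpuWatchdog:
    """Strategy 1: flag pods whose CPU stays near the limit for a full window."""

    def __init__(self, threshold=0.95, window=6):
        # threshold: fraction of the CPU limit treated as "abnormal" (assumed value)
        # window: number of consecutive samples that must all exceed the threshold
        self.threshold = threshold
        self.window = window
        # Keep only the most recent `window` samples for each pod.
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, pod, cpu_fraction):
        """Store one CPU sample (usage / limit) for the given pod name."""
        self.samples[pod].append(cpu_fraction)

    def pods_to_kill(self):
        """Return pods with a full window of samples, all at or above the threshold."""
        return sorted(
            pod
            for pod, recent in self.samples.items()
            if len(recent) == self.window and min(recent) >= self.threshold
        )


def pick_random_victims(pods, fraction=0.25, rng=random):
    """Strategy 2: pick a random share of the pods that match the label.

    The count is rounded down, so groups smaller than 1/fraction lose no pods.
    """
    pods = list(pods)
    count = int(len(pods) * fraction)
    return rng.sample(pods, count)
```

A single dip below the threshold resets a pod's eligibility until the window fills with high samples again, which is what separates a stuck loop from a short burst of load. Note also that a terminated pod would normally be recreated by its owning controller (for example the Deployment's ReplicaSet), while the HPA only adjusts the desired replica count.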
What is the best solution we can have? Are there existing tools to accomplish this task?