How To Restart Pods That Run Out Of Memory With More Memory

Some of my workflows have short-lived pods that occasionally run out of memory. When that happens, I would like them to be restarted automatically with double their initial memory request=limit. For example, if podA has a memory request=limit of 32Gi and exceeds it, I would like a new pod to replace it with a request=limit of 64Gi. I was thinking of experimenting with the Vertical Pod Autoscaler for this, but I don't think it will have the desired effect: a given workflow will use less than 32Gi on most runs and exceed 32Gi only a couple of times, so I'm not sure the Recommender would make good decisions from that historical data, and I'm not sure it works well with Kubernetes Jobs. I could be wrong, though, and will definitely try it out.
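To make the doubling concrete, here is a minimal sketch (not from any existing tool) of the arithmetic on Kubernetes memory quantity strings. It assumes the quantity is a plain integer plus an optional suffix like `Gi` or `Mi`; real Kubernetes quantities also allow decimals and exponent forms, which this deliberately ignores.

```python
import re

def double_quantity(qty: str) -> str:
    """Double a Kubernetes-style quantity such as '32Gi', '512Mi', or '1073741824'.

    Assumes integer-plus-suffix form only (a simplification).
    """
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", qty)
    if match is None:
        raise ValueError(f"unsupported quantity format: {qty}")
    value, suffix = match.groups()
    return f"{int(value) * 2}{suffix}"
```

So `double_quantity("32Gi")` yields `"64Gi"`, which would become the new request=limit on the replacement pod.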

That being said, I am wondering what the community has done to solve this sort of problem: you have important production jobs that sometimes use too much memory, and teams need those jobs restarted immediately and automatically with more memory. Some teams don't put enough thought into the memory usage of a given job, and in the middle of the night they need their OOM-killed (or evicted) pods restarted automatically with more memory.
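One in-cluster approach I could imagine is a small watcher that looks for pods terminated with reason `OOMKilled` and resubmits their parent Job with doubled memory. A hedged sketch follows: the `kubernetes` client calls are the official Python client's real APIs, but the `-retry` naming and the `retry-of` label are illustrative assumptions, and a production version would need deduplication so the same OOM event isn't handled twice.

```python
import re

def doubled_resources(resources: dict) -> dict:
    """Return a copy of a container resources dict with memory request/limit doubled.

    Assumes memory values are integer-plus-suffix strings like '32Gi'.
    """
    def double(qty: str) -> str:
        value, suffix = re.fullmatch(r"(\d+)([A-Za-z]*)", qty).groups()
        return f"{int(value) * 2}{suffix}"
    out = {k: dict(v) for k, v in resources.items()}
    for section in ("requests", "limits"):
        if "memory" in out.get(section, {}):
            out[section]["memory"] = double(out[section]["memory"])
    return out

def run(namespace: str = "default"):
    # Requires the official `kubernetes` Python client and cluster credentials.
    from kubernetes import client, config, watch
    config.load_incluster_config()  # or load_kube_config() outside the cluster
    core = client.CoreV1Api()
    batch = client.BatchV1Api()
    for event in watch.Watch().stream(core.list_namespaced_pod, namespace):
        pod = event["object"]
        for cs in (pod.status.container_statuses or []):
            term = cs.state.terminated if cs.state else None
            if term is None or term.reason != "OOMKilled":
                continue
            owner = next((o for o in pod.metadata.owner_references or []
                          if o.kind == "Job"), None)
            if owner is None:
                continue
            # Fetch the owning Job, double its memory, resubmit under a new name.
            job = batch.read_namespaced_job(owner.name, namespace)
            ctr = job.spec.template.spec.containers[0]
            ctr.resources = client.V1ResourceRequirements(
                **doubled_resources({
                    "requests": ctr.resources.requests or {},
                    "limits": ctr.resources.limits or {},
                }))
            # Clear server-populated fields so the Job can be recreated.
            job.spec.selector = None
            job.spec.template.metadata = None
            job.metadata = client.V1ObjectMeta(
                name=f"{owner.name}-retry",          # illustrative naming scheme
                labels={"retry-of": owner.name})     # illustrative label
            batch.create_namespaced_job(namespace, job)
```

The pure `doubled_resources` helper carries the actual policy; the watch loop is just plumbing, which is why a Job-scheduler-level version of the same idea is also workable.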

I know that I can also handle this at the job-scheduler level (e.g. Jenkins, TeamCity, etc.), but it would be cool to hear ideas for handling it within Kubernetes itself. Sorry if this is over-explained; I just wanted to be sure to cover the problem I'm solving.
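For comparison, the job-scheduler-level version might look like the sketch below: a pipeline step submits the Job, waits for it, and on failure resubmits with doubled request=limit. The manifest fields are standard `batch/v1` Job fields, but `submit_and_wait` is a hypothetical hook where you would shell out to `kubectl` or call the API, and the three-attempt cap is an arbitrary choice.

```python
import re

def job_manifest(name: str, image: str, memory: str) -> dict:
    """Render a batch/v1 Job with request == limit for memory."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # let the wrapper, not the Job, do the retries
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "main",
                    "image": image,
                    "resources": {
                        "requests": {"memory": memory},
                        "limits": {"memory": memory},
                    },
                }],
            }},
        },
    }

def run_with_memory_doubling(name, image, memory="32Gi", attempts=3,
                             submit_and_wait=None):
    """Try the Job, doubling memory each time it fails; return the winning size.

    `submit_and_wait` is a hypothetical callable: it takes a manifest dict,
    runs the Job to completion, and returns True on success.
    """
    for i in range(attempts):
        if submit_and_wait(job_manifest(f"{name}-{i}", image, memory)):
            return memory  # succeeded at this size
        value, suffix = re.fullmatch(r"(\d+)([A-Za-z]*)", memory).groups()
        memory = f"{int(value) * 2}{suffix}"
    raise RuntimeError(f"{name} still failing at {memory}")
```

The upside of this placement is that the retry policy lives next to the job definition the team already owns; the downside is that every scheduler has to reimplement it, whereas an in-cluster watcher covers everything at once.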