Is there a better way to handle pods running into CrashLoopBackOff state?

I administer the Kubernetes cluster for my lab. I recently noticed a bunch of pods that fail constantly because the process inside the container doesn’t exit gracefully. They are restarted over and over, failing every time, and end up in a CrashLoopBackOff state. I can, of course, force-delete these misbehaving pods, but I’m wondering if there is something smarter I can do in an automated fashion. Any suggestions, please?
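For finding these pods in bulk, something like the following should work (a sketch — it assumes `kubectl` is configured for your cluster and checks only each pod’s first container):

```shell
# List namespace + name of pods whose first container is waiting in
# CrashLoopBackOff (assumes kubectl access to the cluster).
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{" "}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}' \
  | awk '$3 == "CrashLoopBackOff" {print $1, $2}'

# Destructive: pipe the matches into a delete. Note that a Deployment or
# ReplicaSet will simply recreate the pods, so fix the root cause first.
#   ... | while read -r ns pod; do kubectl delete pod "$pod" -n "$ns"; done
```

The `awk` filter does the real work; the jsonpath template just flattens each pod to one `namespace name reason` line.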

Pods that enter a CrashLoopBackOff state are failing to keep their main daemon/service process (PID 1) running. In other words, the application is crashing over and over again, and the kubelet is applying an exponential back-off delay between restart attempts.
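The first step is to find out why the container keeps dying. Two kubectl commands cover most of it (the pod name here is a placeholder):

```shell
# Events, restart count, and the back-off status for the pod
# ("my-pod" is a placeholder name).
kubectl describe pod my-pod

# Logs from the *previous* (crashed) container instance — these usually
# contain the actual error that killed PID 1, which the current
# (freshly restarted) container's logs may not show.
kubectl logs my-pod --previous
```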

The typical cause is an application’s inability to connect to, or maintain, a database connection.
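If that is the case here, one common pattern is to gate the application behind an init container that waits for the database to accept connections, so the app container doesn’t start (and crash) until its dependency is reachable. A sketch — the image, service name (`my-db`), and port are assumptions for your environment:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-db-wait      # illustrative name
spec:
  initContainers:
    - name: wait-for-db
      image: busybox:1.36
      # Loop until the (hypothetical) database service accepts TCP connections.
      command: ['sh', '-c', 'until nc -z my-db 5432; do echo waiting for db; sleep 2; done']
  containers:
    - name: app
      image: my-app:latest    # your application image
```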

Please provide some more information regarding your application service(s).

So while @yomateod is completely right, I wanted to add that although CrashLoopBackOff does indicate an issue with the containers in your pod, your applications could be failing for an out-of-band reason. If, for instance, your application can’t connect to an external dependency, the problem could lie with kube-dns, with security groups restricting flannel networking, or with a NAT gateway.

In these cases, when things are just weird and you can’t get your app to work, consider running a debug container such as phusion/baseimage in the same environment as your application. This gives you something you can exec into, with multiple useful utilities, to figure out whether the problem is lower-level.
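For a quick one-off, you can launch a throwaway interactive pod and poke at DNS and network reachability from inside the cluster (the image and the dependency hostname/port are illustrative):

```shell
# Start an interactive throwaway pod; --rm deletes it when the shell exits.
kubectl run debug-shell --rm -it --image=busybox:1.36 --restart=Never -- sh

# Inside the pod, check cluster DNS and reachability of a dependency:
#   nslookup kubernetes.default
#   nc -zv my-db 5432
```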

One last thing: I’ve also seen pods, especially those that depend on the Kubernetes API or other APIs, work for a time and then fail, or, within a ReplicaSet, some replicas be fine while others fail. This can happen when you’ve scaled up clients that consume and constantly communicate with a single backend, saturating its connection limits. You should ensure your application handles this properly (retries with back-off, connection pooling, etc.) and that you’ve configured appropriate readiness/liveness probes.
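For reference, probes are configured per container in the pod spec. A minimal sketch — the paths, port, and timings here are assumptions you’d tune for your app:

```yaml
containers:
  - name: app
    image: my-app:latest      # illustrative
    readinessProbe:           # take the pod out of Service endpoints when not ready
      httpGet:
        path: /healthz        # assumed health endpoint
        port: 8080
      periodSeconds: 5
    livenessProbe:            # restart the container only if it is truly wedged
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15 # give the app time to start before probing
      periodSeconds: 10
```

A too-aggressive liveness probe can itself cause CrashLoopBackOff under load, so keep its delays generous relative to the readiness probe.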