I've been experimenting with deploying a Kubernetes cluster for my application.
I have a 3-node system (3 Ubuntu VMs) with microk8s installed in HA mode with the dns, hostpath-storage and ingress addons. On the cluster I installed the kubernetes-dashboard, prometheus, rabbitmq and redis services from helm. I also installed my own set of services (simple dotnet microservices).
It seemed to run well enough before I left for the holidays. When I came back 2 weeks later, everything was broken and nothing was responding.
Every single service had several hundred pods in the “Evicted”, “ContainerStatusUnknown”, “Error” and even “Completed” (?) states. This affected not only my own services: even in the kube-system namespace, there were hundreds of “hostpath-provisioner-**” pods in the same states.
Using “kubectl get events” I could find lines saying “The node was low on resource: ephemeral-storage”. Ok, that’s a hint. I looked at the disks on my VMs and they were indeed all full. This directory was taking 12GB:
/var/snap/microk8s/common/default-storage/default-data-rabbitmq-2-pvc-c9cc3490-4338-4e5a-bc5b-9bd9f175635e
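For reference, this is roughly the kind of check that surfaces the full directory. It is a minimal sketch: the helper name `biggest_dirs` is made up for illustration, and the path assumes the microk8s default-storage location shown above.

```shell
# Hypothetical helper: list the entries of a directory with their sizes in KiB,
# smallest first, so the biggest consumer ends up on the last line.
biggest_dirs() {
  du -sk "$1"/* 2>/dev/null | sort -n
}

# Events that mention the storage pressure (needs a working kubectl):
#   kubectl get events -A | grep ephemeral-storage
# Disk usage per volume directory (may need root to read the volume directories):
#   biggest_dirs /var/snap/microk8s/common/default-storage
```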
Ok, the rabbitmq config was not limiting storage, so maybe a queue filled up and filled the disks. That’s possible. But I find it worrying that a single misbehaving service could crash EVERYTHING, even the kube-system processes.
What’s the best way to ensure this does not happen again? Should I set something under resources.limits in the values.yaml file of each helm chart I use?
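To make the question concrete, here is a sketch of what I mean by setting limits through a values override. Everything in it is an assumption: the chart is assumed to expose a top-level `resources` block (key names vary by chart), the 1Gi/2Gi figures are arbitrary, and the release and chart names in the usage comment are placeholders.

```shell
# Write a values override that requests and caps ephemeral storage.
# Assumption: the chart reads a top-level `resources` block; check the chart's
# own values.yaml for the real key names before using this.
write_limits_override() {
  cat > "$1" <<'EOF'
resources:
  requests:
    ephemeral-storage: 1Gi
  limits:
    ephemeral-storage: 2Gi
EOF
}

# Usage (placeholder release and chart names):
#   write_limits_override limits.yaml
#   helm upgrade my-release my-repo/my-chart -f limits.yaml
```

I am not sure whether such a limit also applies to data written to a hostpath-backed volume like the rabbitmq one above, which is part of what I am asking.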
Next, how do I clean this up? I started with a bash pipeline of grep / awk / xargs to send a delete command to all pods in the unwanted states, which seems to have deleted them (was that the best method?).
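The cleanup was along these lines. This is a reconstruction as a sketch rather than the exact commands I ran: the function names are made up, and it assumes the column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, ...).

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# "namespace name" for every pod whose STATUS column is an unwanted state.
unwanted_pods() {
  awk '$4 ~ /^(Evicted|ContainerStatusUnknown|Error|Completed)$/ { print $1, $2 }'
}

# Read "namespace name" pairs on stdin and delete each pod.
delete_listed_pods() {
  while read -r ns name; do
    kubectl delete pod -n "$ns" "$name"
  done
}

# Usage (run the first two stages alone to preview what would be deleted):
#   kubectl get pods -A --no-headers | unwanted_pods
#   kubectl get pods -A --no-headers | unwanted_pods | delete_listed_pods
```

Matching on the STATUS column rather than on pod names keeps the filter the same for every namespace, including kube-system.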
I then uninstalled rabbitmq with helm, then deleted its pvc and pv resources, which had lingered. But the files on disk on my VM, in the microk8s default-storage folder mentioned earlier, are still there. Do I need to delete them manually, and why?
Thanks for any help. Sorry, I’m still a bit new at this, so there are a lot of day-to-day operations that should be simple but that I don’t quite grasp yet.