Microk8s hostpath storage - How to clean up and protect it from filling up

I'm playing around with deploying a Kubernetes cluster for my application.

I have a 3-node system (3 Ubuntu VMs) with MicroK8s installed in HA mode and the dns, hostpath-storage and ingress addons enabled. On the cluster I installed the kubernetes-dashboard, prometheus, rabbitmq and redis services from Helm charts, plus my own set of services (simple .NET microservices).

It seemed to run well enough before I left for the holidays. Coming back 2 weeks later, everything was broken and nothing was responding.

Every single service had hundreds of pods in the “Evicted”, “ContainerStatusUnknown”, “Error” and even “Completed” (?) states. Not only my own services: even in the kube-system namespace there were hundreds of “hostpath-provisioner-**” pods in the same states.

Using “kubectl get events” I could find lines saying “The node was low on resource: ephemeral-storage”. OK, that’s a hint. I looked at the disks on my VMs and they were indeed all full. This directory alone was taking 12GB:

/var/snap/microk8s/common/default-storage/default-data-rabbitmq-2-pvc-c9cc3490-4338-4e5a-bc5b-9bd9f175635e
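
For reference, this is roughly how I tracked it down (a sketch from memory; the grep pattern and paths may need adjusting for your setup):

```bash
# Find the eviction events complaining about ephemeral-storage
kubectl get events -A --sort-by=.lastTimestamp | grep -i "ephemeral-storage"

# Check overall disk usage on each VM, then see which volumes are the biggest
df -h /
sudo du -sh /var/snap/microk8s/common/default-storage/* | sort -h
```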

OK, the rabbitmq config was not limiting storage, so maybe a queue filled up and filled the disks. That’s possible. But I find it worrying that a single misbehaving service could take down EVERYTHING, up to and including the kube-system processes.

What’s the best way to ensure this does not happen again? Set something under resources.limits in the values.yaml of each Helm chart I use?
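
Something like this is what I have in mind for each chart’s values.yaml (just a sketch: the exact key paths depend on the chart and the numbers are placeholders):

```yaml
resources:
  requests:
    cpu: 100m
    memory: 256Mi
    ephemeral-storage: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
    ephemeral-storage: 1Gi   # evict this pod before it can fill the node's disk

# for charts that create PVCs, cap the volume size too (key name varies per chart)
persistence:
  size: 8Gi
```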

Next, how do I clean this up? I started with a bash pipe of grep / awk / xargs to send a delete command to all pods in the unwanted states, which seems to have deleted them (was that the best method?).
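
The pipeline was roughly this (a sketch, not verbatim; the field-selector line is a cleaner alternative I found afterwards, but it only catches pods in the Failed phase):

```bash
# Delete pods stuck in the unwanted states, one namespace + pod name per line
kubectl get pods -A --no-headers \
  | grep -E 'Evicted|ContainerStatusUnknown|Error' \
  | awk '{print $1, $2}' \
  | while read ns pod; do kubectl delete pod -n "$ns" "$pod"; done

# Simpler alternative: delete everything whose phase is Failed
kubectl delete pods -A --field-selector=status.phase=Failed
```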

I then uninstalled rabbitmq with helm and deleted its pvc and pv resources that lingered. But the files on disk on my VM, in the microk8s default-storage folder mentioned earlier, are still there? Do I need to delete them manually, and why?
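
In case the answer is yes, this is the kind of manual cleanup I have in mind (a sketch; obviously double-check that nothing references the volume first):

```bash
# Confirm the PV/PVC really are gone before touching the data
kubectl get pv,pvc -A | grep rabbitmq

# Then remove the leftover directory on the node that hosted it
sudo rm -rf /var/snap/microk8s/common/default-storage/default-data-rabbitmq-2-pvc-c9cc3490-4338-4e5a-bc5b-9bd9f175635e
```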

Thanks for any help. Sorry, I’m still a bit new at this, so there are a lot of day-to-day operations that should be simple but that I don’t quite grasp yet.

Hey,
I’d say that to make sure this does not happen again, setting resource limits/volume sizes and using a different partition of the disk might be an option. Creating a separate partition and using its mount path for the hostpath-storage addon would ensure the volume storage can’t fill up the space the node itself needs. You can check this link for more information. The hostpath-storage addon is more suited for local development, so I’d suggest checking out other Kubernetes storage solutions.
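
Roughly what that would look like, going from memory of the addon docs (the /mnt/data mount point is a placeholder for your separate partition; double-check the parameter name against the linked page):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: separate-partition-hostpath
provisioner: microk8s.io/hostpath
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  pvDir: /mnt/data   # mount point of the dedicated partition (placeholder)
```

Then point your charts’ persistence settings at that storage class instead of the default microk8s-hostpath one.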

There isn’t one exact best method to clean up pods in unwanted states; sending a delete command sounds fine. As for the files not being cleaned up from the disk, that might be related to the fact that the hostpath-storage provisioner pod couldn’t function (maybe it was evicted) due to the disk exhaustion, so the files can be left over even though the volumes are removed.

If you have more questions feel free to ask, thanks.

Thanks for the reply, berkayoz,

I also found out that hostpath-storage isn’t recommended for a multi-node cluster, so I’m currently in the process of updating to 1.26.0 to use the nfs addon and converting my Helm values files to use that storage class instead.
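
For anyone doing the same conversion, the change to my values files looks roughly like this (a sketch; the key names vary per chart, and the storage class name is whatever the nfs addon created on your cluster - check with `kubectl get storageclass`):

```yaml
# e.g. values for the rabbitmq/redis charts (key paths vary per chart)
global:
  storageClass: nfs-csi
persistence:
  storageClass: nfs-csi
  size: 8Gi
```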

I did a test with a simple pvc/pv delete while the cluster was in a functional state and can confirm the files are left on disk. I found this question asking the same thing, and it seems the “Recycle” reclaim policy would call an “rm” to delete them, while “Delete” is mostly used in cloud environments to completely remove the underlying storage resource, but doesn’t seem to do anything with hostpath-storage. I guess you could change the reclaim policy to Recycle by creating a class definition using the same method as in your link above for changing the host path, but I didn’t try it since I’m in the process of moving away from hostpath-storage.
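
For completeness, checking what the existing PVs are set to is straightforward, and the patch below is how you would flip one by hand (untested on my side, and Recycle is deprecated upstream, so treat it as an experiment; `<pv-name>` is a placeholder):

```bash
# The RECLAIM POLICY column shows Delete/Retain/Recycle for each PV
kubectl get pv

# Untested: change the policy on an existing PV
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Recycle"}}'
```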