Node Disk IO saturation


Hi all, I have a 3-node cluster deployed on virtual machines on Hyper-V hosts. The cluster runs RKE2.

Over the last few weeks, every day at 00:00 and 12:00 I have received alerts from Prometheus about node disk IO saturation. The issue appears to come from the etcd backups.

How is this possible? Is there any way to prevent these issues?

When this issue happens, sometimes one or two workers have problems recreating pods or detaching a volume from them. This is very frustrating.

I don’t know what to do

Thanks in advance

Cluster information:

Kubernetes version: 1.27.15
Cloud being used: (put bare-metal if not on a public cloud) bare-metal
Installation method: RKE2
Host OS: Redhat 8.10
CNI and version:
CRI and version:


Try the following to see if it helps:

  1. If your etcd instances are running on the same nodes as your workers, consider isolating etcd onto its own set of dedicated control plane nodes. This ensures that the etcd backup process doesn’t interfere with other Kubernetes operations like pod scheduling or volume detachment.
  2. If your system supports it, use ionice (or cgroup I/O limits) to lower the I/O priority of the backup process, as long as you can accept the backup taking longer. This gives critical workloads like pod scheduling and volume detachment priority access to disk I/O. For example:

     ionice -c3 /path/to/your-backup-script.sh # Run the backup in the idle I/O scheduling class
  3. If delta or incremental backups are an option, give them a try.
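
Since the alerts fire at exactly 00:00 and 12:00, they likely line up with RKE2's built-in etcd snapshots, whose default cron schedule runs every 12 hours. If so, you can reschedule the snapshots to a quieter window via the server config. A minimal sketch of `/etc/rancher/rke2/config.yaml` on the server nodes (the cron expression and retention count below are example values, not recommendations; adjust them to your environment):

```
# /etc/rancher/rke2/config.yaml (server nodes)

# Take the etcd snapshot once a day at 03:30 instead of the
# default every-12-hours schedule ("0 */12 * * *")
etcd-snapshot-schedule-cron: "30 3 * * *"

# Number of snapshots to keep on disk
etcd-snapshot-retention: 5
```

After editing the file, you would typically restart the `rke2-server` service on each server node for the new schedule to take effect.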