The ephemeral storage issue in my Kubernetes cluster

Hi experts,

Cluster information:

Kubernetes version: v1.16.6
Cloud being used: bare-metal (private)
Installation method: kubespray
Host OS: CentOS Linux release 8.1.1911 (Core)

I set up my private Kubernetes cluster earlier this year, and it was stable until about a month ago, when I ran into an ephemeral-storage issue. Is there any way to avoid this issue, or is there a workaround provided by Kubernetes itself?

The cluster evicted my application's pods due to ephemeral-storage pressure. All pods go into the "Evicted" or "Pending" state, and I have to restart them manually to recover. Below are the log messages from /var/log/messages; they show that the issue was caused by hitting the ephemeral-storage eviction threshold:

Sep 9 12:19:26 kubelet[38806]: W0909 12:19:26.095011 38806 eviction_manager.go:330] eviction manager: attempting to reclaim ephemeral-storage
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.095087 38806 container_gc.go:85] attempting to delete unused containers
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.123579 38806 image_gc_manager.go:317] attempting to delete unused images
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.151400 38806 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" to free 39468433 bytes
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.179061 38806 kubelet_node_status.go:472] Recording NodeHasDiskPressure event message for node
Sep 9 12:19:26 kubelet[38806]: E0909 12:19:26.249577 38806 remote_image.go:135] RemoveImage "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" from image service failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "quay.io/coreos/etcd:v3.3.10" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: E0909 12:19:26.249797 38806 kuberuntime_image.go:120] Remove image "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "quay.io/coreos/etcd:v3.3.10" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: W0909 12:19:26.272336 38806 eviction_manager.go:417] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "quay.io/coreos/etcd:v3.3.10" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.294283 38806 eviction_manager.go:341] eviction manager: must evict pod(s) to reclaim ephemeral-storage
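Under DiskPressure the eviction manager keeps evicting pods, and the evicted pods linger in the API as Failed objects until they are removed by hand. A minimal sketch for cleaning them up in bulk (assumes kubectl access to the cluster; the Deployment/StatefulSet controllers then recreate the pods):

```shell
# List pods whose status.phase is Failed (this includes Evicted pods)
# across all namespaces, then delete them so their controllers recreate them.
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{" "}{.metadata.name}{"\n"}{end}' \
| while read -r ns name; do
    kubectl delete pod -n "$ns" "$name"
  done
```

Pods stuck in "Pending" usually schedule again on their own once the node's DiskPressure condition clears.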

Besides, the free disk space drops gradually. Checking with "df", I found that the /var mount point was almost full, and I strongly suspect that one of the containers is eating the disk space.

Actually, I have limited the "ephemeral-storage" resource for the following pods:
Pod resources:

    resources:
      limits:
        ephemeral-storage: 2Gi
      requests:
        ephemeral-storage: 1Gi
        memory: 200Mi

Pod lists:

alertmanager-prom-prometheus-operator-alertmanager (statefulset)
prometheus-prom-prometheus-operator-prometheus (statefulset)
prom-grafana (deployment)
prom-prometheus-node-exporter (daemonset)
influxdb (statefulset)
phpmyadmin (deployment)
sqlexporter-prometheus-mysql-exporter (deployment)
calico-kube-controllers (deployment)
calico-node (daemonset)
coredns (deployment)
dns-autoscaler (deployment)
kube-proxy (daemonset)
kubernetes-dashboard (deployment)
nodelocaldns (daemonset)
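One thing worth noting about these limits: a pod-level ephemeral-storage limit only makes the kubelet evict that particular pod once its writable layer + emptyDir + log usage exceeds 2Gi; it does not prevent the node itself from reaching DiskPressure. Node-level eviction is driven by kubelet thresholds (the defaults are roughly nodefs.available<10% and imagefs.available<15%). If those fire too easily on your disks, they can be tuned in the kubelet config file; a sketch using the kubelet.config.k8s.io/v1beta1 fields (the values shown are illustrative assumptions, not recommendations):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
      nodefs.available: "5%"
      imagefs.available: "10%"
    evictionSoft:
      nodefs.available: "10%"
    evictionSoftGracePeriod:
      nodefs.available: "2m"
    evictionMinimumReclaim:
      nodefs.available: "1Gi"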

Your help would be much appreciated.

Is your question about what caused the disk space on your root volume to be used? If so, you should check whether a lot of logs have been written there, and whether you have lots of stopped containers that have stuck around, i.e. run docker ps -a. Though the latter wouldn't explain the spike; it's just something to be aware of that takes up disk space.
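To expand on the stopped-container suggestion, a minimal sketch (assumes the Docker CLI on the node; xargs -r makes it a no-op when there are no exited containers):

```shell
# Show stopped containers that may still be holding writable layers on disk
docker ps -a --filter status=exited
# Remove all exited containers; -r skips the rm when the ID list is empty
docker ps -aq --filter status=exited | xargs -r docker rm
```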

I would like to know whether Kubernetes or Docker provides a workaround for this issue (instead of restarting all pods, which causes an interruption).
As I am not familiar with the Docker filesystem, I only noticed that the /var mount point was over 90% used during the issue. So I wonder why over 50% of the disk space under /var is still occupied by Docker containers, given that I have limited the log file size in Docker and the ephemeral storage in Kubernetes.

My Docker service options:

[Service]
Environment="DOCKER_OPTS= --iptables=false
--data-root=/var/lib/docker
--log-opt max-size=50m --log-opt max-file=5
"

All ephemeral-storage is limited in Kubernetes as shown.

    resources:
      limits:
        ephemeral-storage: 2Gi

And the docker system df is not that much after restart all pod:

$ docker system df
TYPE            TOTAL   ACTIVE  SIZE      RECLAIMABLE
Images          25      25      2.537GB   44.97MB (1%)
Containers      48      45      45.32MB   0B (0%)
Local Volumes   0       0       0B        0B
Build Cache     0       0       0B        0B

“/var” mount point

$ df -hT /var
Filesystem          Type  Size  Used  Avail  Use%  Mounted on
/dev/mapper/cl-var  ext4   98G  8.8G    85G   10%  /var

It seems like you should list the files in /var by size. You could use ncdu /var, or whatever tool you prefer, to do that. That will tell you more about what is taking up the space within /var.
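If ncdu is not installed on the node, plain du plus sort does the same job; a sketch:

```shell
# Largest first-level directories under /var, biggest first.
# -x stays on the /var filesystem so other mounts do not skew the numbers.
du -x -d1 /var 2>/dev/null | sort -rn | head -n 10
```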

I kept monitoring the /var mount point by running "du -h -d1 /var" while the problem was happening. I found that "/var/lib/docker/overlay2" occupied most of the space under /var. I then cleaned up Docker's space with "docker system prune -a -f"; however, it did not help until I restarted the K8s pods and re-pulled all the images.
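That behaviour is expected: /var/lib/docker/overlay2 holds both the image layers and each container's writable upper layer, and docker system prune -a cannot reclaim anything still referenced by a running container. Writes inside a container that land outside a volume grow there. A hedged sketch for mapping the largest writable layers back to container names (assumes the overlay2 storage driver, as in this cluster):

```shell
# Print "layer-size  container-name" for every container, largest first.
# GraphDriver.Data.UpperDir is the container's overlay2 writable directory.
for id in $(docker ps -aq); do
  dir=$(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' "$id")
  name=$(docker inspect --format '{{.Name}}' "$id")
  printf '%s\t%s\n' "$(du -sh "$dir" 2>/dev/null | cut -f1)" "$name"
done | sort -rh | head -n 10
```

Whichever container tops that list is the one filling /var, and its workload (or log path) is the thing to fix rather than pruning after the fact.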