Hi experts,
Cluster information:
Kubernetes version: v1.16.6
Cloud being used: private
Installation method: kubespray
Host OS: CentOS Linux release 8.1.1911 (Core)
I set up my private Kubernetes cluster earlier this year. It was stable until about a month ago, when I ran into an ephemeral-storage issue. Is there any way to avoid this issue, or is there a workaround provided by Kubernetes itself?
The cluster evicted my application’s pods because of ephemeral-storage pressure. All pods went into the “Evicted” or “Pending” state, and I had to start them all manually to recover. Below are the log messages from /var/log/messages; they show that the issue was caused by the ephemeral-storage threshold being hit:
Sep 9 12:19:26 kubelet[38806]: W0909 12:19:26.095011 38806 eviction_manager.go:330] eviction manager: attempting to reclaim ephemeral-storage
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.095087 38806 container_gc.go:85] attempting to delete unused containers
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.123579 38806 image_gc_manager.go:317] attempting to delete unused images
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.151400 38806 image_gc_manager.go:371] [imageGCManager]: Removing image "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" to free 39468433 bytes
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.179061 38806 kubelet_node_status.go:472] Recording NodeHasDiskPressure event message for node
Sep 9 12:19:26 kubelet[38806]: E0909 12:19:26.249577 38806 remote_image.go:135] RemoveImage "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" from image service failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "Quay" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: E0909 12:19:26.249797 38806 kuberuntime_image.go:120] Remove image "sha256:643c21638c1c966fe18ca1cc8547dd401df70e85d83ca6de76b9a7957703b993" failed: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "Quay" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: W0909 12:19:26.272336 38806 eviction_manager.go:417] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to remove repository reference "Quay" (must force) - container 595c3f90584a is using its referenced image 643c21638c1c
Sep 9 12:19:26 kubelet[38806]: I0909 12:19:26.294283 38806 eviction_manager.go:341] eviction manager: must evict pod(s) to reclaim ephemeral-storage
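For anyone reading the log above, here is a minimal Python sketch that pulls the "wanted to free ... but freed ..." figures out of the eviction manager messages. The sample lines are copied from the log with the syslog prefix dropped; the function name `reclaim_attempts` is just illustrative. The requested amount equals the largest signed 64-bit integer, which suggests (this is an assumption, not something the log states) that the kubelet was asking to reclaim as much as possible rather than a specific amount.

```python
import re

# Lines copied from the kubelet log above (syslog prefix removed).
LOG = """\
W0909 12:19:26.095011 38806 eviction_manager.go:330] eviction manager: attempting to reclaim ephemeral-storage
W0909 12:19:26.272336 38806 eviction_manager.go:417] eviction manager: unexpected error when attempting to reduce ephemeral-storage pressure: wanted to free 9223372036854775807 bytes, but freed 0 bytes space with errors in image deletion
I0909 12:19:26.294283 38806 eviction_manager.go:341] eviction manager: must evict pod(s) to reclaim ephemeral-storage
"""

WANTED = re.compile(r"wanted to free (\d+) bytes, but freed (\d+) bytes")


def reclaim_attempts(text):
    """Return (wanted, freed) byte counts for each failed reclaim message."""
    return [(int(m.group(1)), int(m.group(2))) for m in WANTED.finditer(text)]


attempts = reclaim_attempts(LOG)
for wanted, freed in attempts:
    # 2**63 - 1 is the largest value a signed 64-bit integer can hold.
    print(wanted == 2**63 - 1, freed)  # prints: True 0
```

In other words, image garbage collection freed nothing because the only candidate image was still in use by a running container, so the kubelet fell back to evicting pods.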
In addition, free disk space is dropping gradually. Checking with “df” shows that the /var mount point is almost full, and I strongly suspect that one or more containers are consuming the disk space.
I have already limited ephemeral storage for the following pods:
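A rough way to check the same thing from a script is sketched below in Python. It assumes the kubelet's default hard eviction threshold of `nodefs.available<10%` (a kubespray install may override it, so check the kubelet configuration), and the helper names `under_threshold`, `check_mount`, and `dir_size` are hypothetical. `dir_size` can be pointed at subdirectories of the container runtime's data directory to see which one is growing; the exact path depends on the runtime and is not shown in this post.

```python
import os
import shutil


def under_threshold(total_bytes, free_bytes, min_available=0.10):
    """True when the available fraction has dropped below min_available."""
    return free_bytes / total_bytes < min_available


def check_mount(path="/var"):
    """Check a mount point against the assumed 10% threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return under_threshold(usage.total, usage.free)


def dir_size(path):
    """Sum the sizes of all files under path (unreadable entries are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is not accessible
    return total
```

For example, `check_mount("/var")` returns `True` once less than 10% of the filesystem backing /var is available, which is roughly the point at which the node would report disk pressure under the default settings.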
Pod resources:
resources:
  limits:
    ephemeral-storage: 2Gi
  requests:
    ephemeral-storage: 1Gi
    memory: 200Mi
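To put the values above into bytes for comparison with the figures in the kubelet log, here is a small Python sketch. It only handles the binary suffixes (Ki, Mi, Gi, Ti) and plain integers, not the full Kubernetes quantity syntax, and `quantity_to_bytes` is a hypothetical helper name.

```python
# Binary (power-of-two) suffixes used in the resources block above.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity):
    """Convert a quantity such as '2Gi' to a byte count."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


print(quantity_to_bytes("2Gi"))    # 2147483648
print(quantity_to_bytes("1Gi"))    # 1073741824
print(quantity_to_bytes("200Mi"))  # 209715200
```

So each container with this block may use up to about 2.1 GB of ephemeral storage before it exceeds its limit, while the scheduler only reserves about 1.07 GB for it on the node.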
Pod list:
alertmanager-prom-prometheus-operator-alertmanager statefulset
prometheus-prom-prometheus-operator-prometheus statefulset
prom-grafana deployment
prom-prometheus-node-exporter daemonset
influxdb statefulset
phpmyadmin deployment
sqlexporter-prometheus-mysql-exporter deployment
calico-kube-controllers deployment
calico-node daemonset
coredns deployment
dns-autoscaler deployment
kube-proxy daemonset
kubernetes-dashboard deployment
nodelocaldns daemonset
Any help would be greatly appreciated.