Hi Kubernetes team / community.
At first I thought this was a bug, but based on the code the behavior seems to be intentional. I tried to find an explanation for it but came up empty-handed; feel free to point me in the right direction if this has already been answered somewhere else.
I’m trying to understand why stdout logs are counted towards ephemeral storage usage when resource usage is evaluated at the Pod level.
According to ‘pkg/kubelet/eviction/eviction_manager.go’:
func (m *managerImpl) podEphemeralStorageLimitEviction(podStats statsapi.PodStats, pod *v1.Pod) bool {
	podLimits := resourcehelper.PodLimits(pod, resourcehelper.PodResourcesOptions{})
	_, found := podLimits[v1.ResourceEphemeralStorage]
	if !found {
		return false
	}

	// pod stats api summarizes ephemeral storage usage (container, emptyDir, host[etc-hosts, logs])
	podEphemeralStorageTotalUsage := &resource.Quantity{}
	if podStats.EphemeralStorage != nil && podStats.EphemeralStorage.UsedBytes != nil {
		podEphemeralStorageTotalUsage = resource.NewQuantity(int64(*podStats.EphemeralStorage.UsedBytes), resource.BinarySI)
	}
	podEphemeralStorageLimit := podLimits[v1.ResourceEphemeralStorage]
	if podEphemeralStorageTotalUsage.Cmp(podEphemeralStorageLimit) > 0 {
		// the total usage of pod exceeds the total size limit of containers, evict the pod
		message := fmt.Sprintf(podEphemeralStorageMessageFmt, podEphemeralStorageLimit.String())
		if m.evictPod(pod, 0, message, nil, nil) {
			metrics.Evictions.WithLabelValues(signalEphemeralPodFsLimit).Inc()
			return true
		}
		return false
	}
	return false
}
There’s a slight hint of this behavior in the "Node-pressure Eviction" page of the Kubernetes documentation, but without an explanation:
The kubelet recognizes two specific filesystem identifiers:
nodefs: The node’s main filesystem, used for local disk volumes, emptyDir volumes not backed by memory, log storage, and more. For example, nodefs contains /var/lib/kubelet/.
Based on this implementation, a Pod could be evicted without writing any actual data to any of its ephemeral volumes.
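To make the point concrete, here is a minimal sketch of the accounting as I understand it from the comment in the function above. The `podUsage` type and `exceedsLimit` helper are hypothetical names of my own and are not Kubernetes types; the model is also simplified (it leaves out the etc-hosts file, for example). It only illustrates that log bytes alone can push the total past the limit.

```go
package main

import "fmt"

// podUsage is a hypothetical, simplified stand-in for the per-pod numbers
// that the kubelet stats API aggregates. It is NOT a real Kubernetes type.
type podUsage struct {
	RootfsBytes   int64 // writable layers of all containers
	LogBytes      int64 // container stdout/stderr logs kept on the node
	EmptyDirBytes int64 // disk-backed emptyDir volumes
}

// total sums all components, mirroring the "container, emptyDir, logs"
// summary mentioned in the eviction manager comment.
func (u podUsage) total() int64 {
	return u.RootfsBytes + u.LogBytes + u.EmptyDirBytes
}

// exceedsLimit reports whether total usage is strictly greater than the
// limit, matching the `Cmp(limit) > 0` check in the quoted function.
func exceedsLimit(u podUsage, limitBytes int64) bool {
	return u.total() > limitBytes
}

func main() {
	const mi = int64(1) << 20

	// No rootfs writes and no emptyDir data: only 11 MiB of stdout logs.
	logsOnly := podUsage{LogBytes: 11 * mi}
	fmt.Println(exceedsLimit(logsOnly, 10*mi)) // true
}
```

Because the comparison is strict, usage exactly equal to the limit would not trigger the eviction in this model.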
PoC:
apiVersion: v1
kind: Pod
metadata:
  name: log-test
spec:
  containers:
  - image: k8s.gcr.io/busybox:latest
    name: test
    command: ["/bin/sh"]
    args:
    - -c
    - |-
      yes $(printf 'Hello world!!!!\n%.0s' `seq 1 64`) | dd bs=1024 count=204800
      while sleep 3600; do
        true
      done
    resources:
      limits:
        ephemeral-storage: "10Mi"
Creating this Pod quickly results in an eviction, because it reaches the ephemeral-storage limit.
The Pod is evicted because the kubelet by default keeps 5 iterations of logs under /var/log/pods and rotates a log once it reaches the 10 MiB limit (--container-log-max-files=5 --container-log-max-size=10Mi).
My issue with this approach is that the container (or Pod manifest) has no way of knowing how log rotation is configured on the worker node. If the ‘ephemeral-storage’ limit is below container-log-max-size * container-log-max-files, the Pod could be evicted just by logging to stdout (consider a scenario where DEBUG / TRACE logging is enabled).
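The arithmetic behind that condition can be sketched as follows. The helper names `maxRetainedLogBytes` and `logsAloneCanExceed` are hypothetical and are not part of the kubelet. The product is only a rough per-container upper bound: it ignores any compression of rotated files and the timing of rotation checks.

```go
package main

import "fmt"

// maxRetainedLogBytes is a hypothetical helper returning a rough upper bound
// on the log bytes kept for ONE container: max size per file times the number
// of files. It ignores any compression of rotated files and rotation timing.
func maxRetainedLogBytes(maxSizeBytes int64, maxFiles int) int64 {
	return maxSizeBytes * int64(maxFiles)
}

// logsAloneCanExceed reports whether that upper bound is larger than the
// ephemeral-storage limit, i.e. whether stdout logging by itself could in
// principle push usage past the limit.
func logsAloneCanExceed(limitBytes, maxSizeBytes int64, maxFiles int) bool {
	return maxRetainedLogBytes(maxSizeBytes, maxFiles) > limitBytes
}

func main() {
	const mi = int64(1) << 20

	// Defaults quoted above: 10Mi per file, 5 files; limit from the PoC: 10Mi.
	fmt.Println(maxRetainedLogBytes(10*mi, 5) / mi)  // 50
	fmt.Println(logsAloneCanExceed(10*mi, 10*mi, 5)) // true
}
```

With the default rotation settings, any limit below roughly 50 MiB per container would leave the Pod exposed to this in the model above, which is something the manifest author cannot see from inside the cluster API.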
Also, because the logs are managed by the CRI (kubelet), those resources do not, in my opinion, strictly belong to the container: the container has no control over what happens to its logs after it sends them to /dev/stdout.
To me this looks counter-productive, so could somebody please explain the reasoning behind this design decision?
Br,
P1ng-W1n