Why are stdout logs counted toward ephemeral storage usage?

Hi Kubernetes team / community.

At first I thought this was a bug, but based on the code the behavior seems to be intentional. I tried to find an explanation for it but came up empty-handed; feel free to point me in the right direction if this has already been answered somewhere else.

I’m trying to understand why stdout logs are counted toward ephemeral storage usage when resource usage is evaluated at the Pod level.
According to pkg/kubelet/eviction/eviction_manager.go:

func (m *managerImpl) podEphemeralStorageLimitEviction(podStats statsapi.PodStats, pod *v1.Pod) bool {
	podLimits := resourcehelper.PodLimits(pod, resourcehelper.PodResourcesOptions{})
	_, found := podLimits[v1.ResourceEphemeralStorage]
	if !found {
		return false
	}

	// pod stats api summarizes ephemeral storage usage (container, emptyDir, host[etc-hosts, logs])
	podEphemeralStorageTotalUsage := &resource.Quantity{}
	if podStats.EphemeralStorage != nil && podStats.EphemeralStorage.UsedBytes != nil {
		podEphemeralStorageTotalUsage = resource.NewQuantity(int64(*podStats.EphemeralStorage.UsedBytes), resource.BinarySI)
	}
	podEphemeralStorageLimit := podLimits[v1.ResourceEphemeralStorage]
	if podEphemeralStorageTotalUsage.Cmp(podEphemeralStorageLimit) > 0 {
		// the total usage of pod exceeds the total size limit of containers, evict the pod
		message := fmt.Sprintf(podEphemeralStorageMessageFmt, podEphemeralStorageLimit.String())
		if m.evictPod(pod, 0, message, nil, nil) {
			metrics.Evictions.WithLabelValues(signalEphemeralPodFsLimit).Inc()
			return true
		}
		return false
	}
	return false
}

There’s a slight hint of this behavior in the "Node-pressure Eviction" page of the Kubernetes documentation, but without an explanation:

The kubelet recognizes two specific filesystem identifiers:
nodefs: The node’s main filesystem, used for local disk volumes, emptyDir volumes not backed by memory, log storage, and more. For example, nodefs contains /var/lib/kubelet/.

Based on this implementation, a Pod can be evicted without its containers writing any actual data to any of their ephemeral volumes.

PoC:

apiVersion: v1
kind: Pod
metadata:
  name: log-test
spec:
  containers:
  - image: k8s.gcr.io/busybox:latest
    name: test
    command: ["/bin/sh"]
    args:
    - -c
    - |-
      yes $(printf 'Hello world!!!!\n%.0s' `seq 1 64`) | dd bs=1024 count=204800
      while sleep 3600; do
        true
      done
    resources:
      limits:
        ephemeral-storage: "10Mi"

Creating this Pod quickly results in an eviction, because the ephemeral-storage limit is reached.

The Pod is evicted because the kubelet by default keeps up to 5 log files per container under /var/log/pods and rotates a log once it reaches 10 MiB (--container-log-max-files=5 --container-log-max-size=10Mi).

My issue with this approach is that the container (or the Pod manifest) has no way of knowing how log rotation is configured on the worker node. If the ephemeral-storage limit is below container-log-max-size * container-log-max-files, the Pod can be evicted just by logging to stdout (consider a scenario where DEBUG / TRACE logging is enabled).
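To make the arithmetic concrete, here is a minimal Go sketch of that comparison. It is not kubelet code: the helper names worstCaseLogBytes and logsAloneCanExceed are hypothetical, and it assumes that container-log-max-files * container-log-max-size is a reasonable rough upper bound on one container's log footprint (actual usage may be lower, for example if rotated files are compressed).

```go
package main

import "fmt"

// mib is one mebibyte in bytes.
const mib = int64(1) << 20

// worstCaseLogBytes is a hypothetical helper: a rough upper bound on the
// disk space one container's logs can occupy, assuming every retained
// log file is full and uncompressed.
func worstCaseLogBytes(maxFiles int, maxSizeBytes int64) int64 {
	return int64(maxFiles) * maxSizeBytes
}

// logsAloneCanExceed reports whether stdout/stderr logs by themselves
// could push usage past the given ephemeral-storage limit.
func logsAloneCanExceed(limitBytes int64, maxFiles int, maxSizeBytes int64) bool {
	return worstCaseLogBytes(maxFiles, maxSizeBytes) > limitBytes
}

func main() {
	// Values from this post: kubelet defaults (5 files, 10Mi each)
	// and the PoC's ephemeral-storage limit of 10Mi.
	limit := 10 * mib
	bound := worstCaseLogBytes(5, 10*mib)
	fmt.Printf("log bound: %d MiB, limit: %d MiB, at risk: %v\n",
		bound/mib, limit/mib, logsAloneCanExceed(limit, 5, 10*mib))
}
```

With the defaults above, the bound is 50 MiB against a 10 MiB limit, so the PoC Pod is at risk from logging alone.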

Also, because the logs are managed by the CRI (kubelet), those resources do not, in my opinion, strictly belong to the container: the container has no control over what happens to its logs after it writes them to /dev/stdout.

To me this looks counterproductive, so could somebody please explain the reasoning behind this design decision?

Br,
P1ng-W1n

I’m experiencing a similar issue. If stdout must count against ephemeral storage, maybe a configuration option could be added that discards excess (old) log data when the limit is exceeded, e.g. a ring buffer? And while we’re at it, could stdout get its own limit, separate from the other ephemeral-storage items?