Kubernetes version: 1.21
Cloud being used: AWS
Host OS: Bottlerocket
CRI and version: containerd 1.5.11
Investigating memory usage (working set bytes) as reported through both Prometheus and kubectl top.
To get the full working set bytes of all applications/processes/etc. on the system, I started with a simple query.
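For reference, the node total can be queried along these lines (this uses the standard cadvisor metric; the `id="/"` label selects the root cgroup, though label names can vary by scrape config):

```promql
# Whole-node working set: cadvisor exports the root cgroup with id="/"
container_memory_working_set_bytes{id="/"}
```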
The value returned here matches kubectl top nodes pretty closely, so far so good. However, the value seemed awfully high for what I expected on the node (about 6G).
To dig into this further, I then calculated the working set bytes for just the pods.
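Something along these lines (assuming the systemd cgroup driver layout, where all pods live under kubepods.slice; the exact id value may differ on other setups):

```promql
# Working set of everything under the kubepods slice
# (systemd cgroup driver layout; id may differ elsewhere)
container_memory_working_set_bytes{id="/kubepods.slice"}
```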
In my case this returned about 2.5G. The 3.5G difference between the total working set and the pod working set seemed high: it would mean that the OS components, kubelet, and runtime (containerd in my case) were using 3.5G of memory, almost a quarter of the total machine memory (16G instance).
Continuing further down the rabbit hole, I looked at the cgroup (v2) stats for containerd.slice and found it to be using a working set of around 1.7G, calculated as memory.current minus inactive_file from memory.stat, since that is the calculation cadvisor/runc uses. The majority of it was (active) file-backed memory. That explains about half of the 3.5G above (I'm not sure why containerd is using that much memory, as it really shouldn't be), but leaves another 1.7G unaccounted for.
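The calculation can be sketched as a small shell function (the containerd.slice path at the bottom assumes cgroup v2 mounted at /sys/fs/cgroup; adjust for your host):

```shell
#!/bin/sh
# Working set of a cgroup v2 slice, derived the same way cadvisor/runc does:
#   working_set = memory.current - inactive_file (from memory.stat)
cgroup_working_set() {
  cg="$1"
  current=$(cat "$cg/memory.current")
  inactive_file=$(awk '$1 == "inactive_file" { print $2 }' "$cg/memory.stat")
  echo $(( current - inactive_file ))
}

# Example: containerd's slice (path is host-specific)
if [ -d /sys/fs/cgroup/containerd.slice ]; then
  cgroup_working_set /sys/fs/cgroup/containerd.slice
fi
```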
Sticking to cgroup-exported stats, I calculated the working set bytes for the root cgroup (derived from /proc/meminfo minus inactive_file from the root memory.stat) and found it in line with what I was seeing above (roughly 6G). So that explains where Kubernetes is getting that value. However, after summing the child cgroups/slices to see where all the memory was being used, I still ended up with around 1.7G unaccounted for.
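For the root cgroup, my understanding (worth double-checking against your cadvisor version) is that usage is taken as MemTotal - MemFree from /proc/meminfo, with inactive_file then subtracted from the root memory.stat:

```shell
#!/bin/sh
# Node-level working set as I understand cadvisor derives it for the root
# cgroup (an assumption to verify, not confirmed from the source):
#   usage       = MemTotal - MemFree   (from /proc/meminfo, in kB)
#   working set = usage - inactive_file (from the cgroup v2 root memory.stat)
root_working_set() {
  meminfo="$1"   # e.g. /proc/meminfo
  memstat="$2"   # e.g. /sys/fs/cgroup/memory.stat
  total_kb=$(awk '$1 == "MemTotal:" { print $2 }' "$meminfo")
  free_kb=$(awk '$1 == "MemFree:" { print $2 }' "$meminfo")
  inactive_file=$(awk '$1 == "inactive_file" { print $2 }' "$memstat")
  echo $(( (total_kb - free_kb) * 1024 - inactive_file ))
}

# Only runs on a cgroup v2 host with the usual mount point
if [ -r /sys/fs/cgroup/memory.stat ]; then
  root_working_set /proc/meminfo /sys/fs/cgroup/memory.stat
fi
```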
Therefore I am left with the question as to where this memory is being consumed. This is quite important to understand, as it affects Kubernetes and when OOM/eviction thresholds are met.