Missing labels from cAdvisor metrics

Recently we noticed very high CPU usage (close to 100% almost all the time) on one node in our GKE cluster. When we queried the container_cpu_usage_seconds_total metric to identify which container was consuming that CPU, we found that some of the returned series don’t have the pod, container and namespace labels, so we can’t tell the root cause.

The exact query we ran (to find which containers consume more than 200% CPU) is:

rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>"}[5m]) * 100 > 200

It returns:

{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_instance_type="e2-highmem-4", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux",  topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods/burstable",  job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", node_kubernetes_io_instance_type="e2-highmem-4", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}

As you can see, these series don’t have those labels, so we can’t identify the root cause.
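
One thing we considered (just a sketch that reuses the label names from the output above) is listing which id values are reported for this node, but we are not sure how to map cgroup paths such as /kubepods back to individual pods:

count by (id) (container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>"})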

Some more info about our cluster:

  • Version: 1.20.15-gke.1000.
  • Region: Tokyo.
  • Machine type: e2-highmem-4.

So we want to ask:

  • Do we need to do anything to get these labels back (or enable them)?
  • If this output is expected (i.e. it is normal for some series to be missing the pod, container and namespace labels), what steps should we take to identify which workload is consuming that much CPU? (See the query sketch after this list.)
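
For the second question, the direction we had in mind is something like the query below (a sketch only; it assumes the per-container series carry non-empty pod and container labels, which is exactly what appears to be missing here):

sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>", container!=""}[5m])) * 100 > 200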

Thanks,