Missing labels from cAdvisor metrics

Recently we noticed very high CPU usage (close to 100% almost all the time) on one node in our GKE cluster. When we queried the container_cpu_usage_seconds_total metric to identify which container was consuming that CPU, we found that some of the returned series don’t have the pod, container and namespace labels, so we can’t tell the root cause.

The exact query we ran (to find which containers consume more than 200% CPU) is:

rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>"}[5m]) * 100 > 200

It returns:

{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_instance_type="e2-highmem-4", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux",  topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods/burstable",  job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", node_kubernetes_io_instance_type="e2-highmem-4", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}

As you can see, these series don’t have those labels, so we can’t identify the root cause.
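
One thing we considered (just a sketch that reuses the label names from the output above) is listing which id values are reported for this node, but we are not sure how to map cgroup paths such as /kubepods back to individual pods:

count by (id) (container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>"})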

Some more info about our cluster:

  • Version: 1.20.15-gke.1000.
  • Region: Tokyo.
  • Machine type: e2-highmem-4.

So we want to ask:

  • Do we need to do anything to get these labels back (or enable them)?
  • If this output is expected (i.e. it is normal for some series to be missing the pod, container and namespace labels), what steps should we take to identify which workload is consuming that much CPU? (See the query sketch after this list.)
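
For the second question, the direction we had in mind is something like the query below (a sketch only; it assumes the per-container series carry non-empty pod and container labels, which is exactly what appears to be missing here):

sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>", container!=""}[5m])) * 100 > 200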

Thanks,