Recently we noticed very high CPU usage (almost 100% all the time) on one node in our GKE cluster. When we queried the container_cpu_usage_seconds_total metric to identify which container is consuming that CPU, we found that some of the returned series don’t have the pod, container and namespace labels, so we can’t tell the root cause.
The exact query we run (to find which containers consume more than 200% CPU) is:
rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>"}[5m]) * 100 > 200
It returns:
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_instance_type="e2-highmem-4", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
{beta_kubernetes_io_arch="amd64", beta_kubernetes_io_os="linux", cloud_google_com_gke_boot_disk="pd-standard", cloud_google_com_gke_container_runtime="containerd", cloud_google_com_gke_netd_ready="true", cloud_google_com_gke_os_distribution="cos", cloud_google_com_machine_family="e2", cpu="total", failure_domain_beta_kubernetes_io_region="asia-northeast1", failure_domain_beta_kubernetes_io_zone="asia-northeast1-c", iam_gke_io_gke_metadata_server_enabled="true", id="/kubepods/burstable", job="kubernetes-nodes-cadvisor", kubernetes_io_arch="amd64", kubernetes_io_os="linux", node_kubernetes_io_instance_type="e2-highmem-4", topology_gke_io_zone="asia-northeast1-c", topology_kubernetes_io_region="asia-northeast1", topology_kubernetes_io_zone="asia-northeast1-c"}
As you can see, these series don’t have the pod, container and namespace labels; they only carry cgroup-level id values (/, /kubepods and /kubepods/burstable), so we can’t identify which container is responsible.
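For reference, the per-container form of the query we would normally rely on (just a sketch; it assumes the pod, container and namespace labels are present on the per-container series of this cAdvisor job) would look like:

sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>", container!=""}[5m])
) * 100 > 200

In our output above, though, only the cgroup-level series appear, so we cannot break the usage down this way.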
Some more info about our cluster:
- Version: 1.20.15-gke.1000.
- Region: Tokyo.
- Machine type: e2-highmem-4.
So we want to ask:
- Do we need to do anything to “enable” these labels again?
- If this output is correct as-is (i.e. series missing pod, container and namespace are expected), what steps should we take to identify which resource is consuming that much CPU? A rough sketch of a comparison we are considering is below.
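For context, this is the kind of cgroup-level comparison we are considering (only a sketch, reusing the <node name> placeholder and the id values visible in the query output above): compare the node’s total usage (root cgroup) with the usage of everything running inside pods (the /kubepods cgroup).

# total CPU usage of the node as reported by cAdvisor (root cgroup)
rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>", id="/"}[5m]) * 100

# CPU usage of everything running inside pods (the /kubepods cgroup)
rate(container_cpu_usage_seconds_total{kubernetes_io_hostname="<node name>", id="/kubepods"}[5m]) * 100

If the two values differ a lot, the CPU would be going to something on the node outside of pods (system daemons); if they are close, the consumer is inside a pod and we would still need the per-container labels to pin it down.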
Thanks,