Increased kube-apiserver 504s with Kubernetes 1.21

Alec_Kloss · August 17, 2021, 12:48pm

Cluster information:

Kubernetes version: v1.21.3
Cloud being used: AWS
Installation method: Kops 1.21.0
Host OS: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210720
CNI and version: calico v3.19.1
CRI and version: containerd v1.4.6

After a rolling update from a 1.20 cluster to 1.21, I’ve noticed an increase in 504 responses from the apiservers. The metric involved is

 sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_terminations_total[5m])) > 0

The rate isn’t at all high, but in prior clusters it’s been rock-steady at 0. I’m not sure there’s any meaningful action for me to take other than adding a little bit of tolerance to a low level of these errors, but it seems like someone might be interested in this little problem.
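One way to picture the "little bit of tolerance" idea is the sketch below. It is only an illustration: the helper names and the 0.01 requests-per-second threshold are assumptions, not values from this cluster. It derives a per-second rate from two counter samples, roughly in the spirit of PromQL's rate(), and keeps only the series that rise above the tolerance.

```python
from typing import Dict, Tuple

# (unix timestamp in seconds, counter value); hypothetical sample shape
Sample = Tuple[float, float]


def per_second_rate(first: Sample, last: Sample) -> float:
    """Approximate per-second increase of a counter between two samples."""
    elapsed = last[0] - first[0]
    if elapsed <= 0:
        return 0.0
    increase = last[1] - first[1]
    if increase < 0:
        # Counter went backwards (e.g. the apiserver restarted):
        # assume it restarted from zero and use the last value as the increase.
        increase = last[1]
    return increase / elapsed


def over_tolerance(rates: Dict[str, float], tolerance: float = 0.01) -> Dict[str, float]:
    """Keep only the series whose rate is strictly above the tolerance.

    The default tolerance is an arbitrary placeholder, not a recommendation.
    """
    return {series: rate for series, rate in rates.items() if rate > tolerance}
```

With something like this, a series with three terminations in five minutes sits right at the assumed threshold and would not fire, while a sustained higher rate would.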

protosam · August 17, 2021, 10:47pm

What’s the cause of the 504s from the apiserver container logs?

Alec_Kloss · August 19, 2021, 3:30pm

My best guess is messages like

1 wrap.go:54] timeout or abort while handling: GET "/api/v1/namespaces/kafka-infra/pods/infra-kafka-1"

are issued at the same time the metrics are incremented. There isn’t really anything obvious in nearby logs to explain why what seems like a pretty simple request is having problems. There’s some other suspicious stuff like

I0819 11:32:17.087600       1 healthz.go:244] etcd check failed: healthz

At this point, I’m going to destroy this cluster and try again. Maybe a fresh 1.21 install will behave better. (I do have another 1.21 cluster that was installed in a similar fashion that seems to be working fine.)
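To see whether the timeouts cluster around particular requests, the "timeout or abort while handling" lines can be tallied per verb and path. The following is a rough sketch, assuming the apiserver log has been saved as plain text and that the message format matches the line quoted above. The function name is made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Matches the message text quoted above; the format is taken from that one
# sample line and may differ in other Kubernetes versions.
TIMEOUT_RE = re.compile(
    r'timeout or abort while handling: (?P<verb>[A-Z]+) "(?P<path>[^"]+)"'
)


def count_timeouts(lines: Iterable[str]) -> Counter:
    """Count timeout/abort log lines, keyed by (verb, request path)."""
    counts: Counter = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            key: Tuple[str, str] = (match.group("verb"), match.group("path"))
            counts[key] += 1
    return counts
```

For example, `count_timeouts(open("apiserver.log"))` (the file name is a placeholder) returns a Counter whose `most_common()` lists the most frequently affected requests first.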

Alec_Kloss · August 24, 2021, 3:22am

Well, no luck: the newly provisioned cluster doesn’t work any better. I do have a theory that it’s the Strimzi Kafka operator that’s “causing” the problem, in that if I scale Strimzi’s Deployments down to zero, the API server metrics quiet down.

This is not a problem in Kubernetes 1.18.12.

protosam · August 24, 2021, 4:11am

Well, it’s weird that the api-server has a timeout when handling a request to get the pod info at /api/v1/namespaces/kafka-infra/pods/infra-kafka-1.

I’m not sure I would blame your kafka statefulset per se. My guess would be that you have something hammering etcd and/or the api.

Alec_Kloss · August 24, 2021, 12:50pm

Whatever it is, it is also hammering the 1.18 cluster, which handles it without these errors. Here are metrics from a 1.18 cluster (api servers are r5.xlarge):

Metrics from a 1.21.3 cluster (api servers are t3.2xlarge) will be in a subsequent post.

Here’s the prom query text for easy cut-n-paste. Note the 20x scaling factor just to make the chart a little easier to read.

sum(rate(node_cpu_seconds_total{instance=~"(10.2.20.41:9100)|(10.2.5.207:9100)|(10.2.8.107:9100)",mode!="idle"}[2m])) by (instance) or 20 * (sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_total{code!~"^(20[01])|(0)|(404)"}[5m])) > 1 or sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_terminations_total[5m])) > 0)

In the 1.21 cluster, I stopped the two strimzi deployments about 15 minutes ago. It hasn’t been going long enough to be super convincing, but based on metrics from another 1.21 cluster where I did the same thing last night, I have no doubt that the errors will be gone.

I’ve tried strimzi 0.18 and 0.25 and they seem to behave the same way. Note that the kafka cluster itself seems to be fine; it’s the Deployments for the operator that I scale to zero.

Alec_Kloss · August 24, 2021, 12:50pm

Metrics from 1.21.3 (api servers are t3.2xlarge):

protosam · August 24, 2021, 5:22pm

When I look at the information you’ve given, what I think is being observed is just the symptoms, not the cause.

Do you have visibility into exactly what is using the API? Like auditing wise?
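If audit logging is available, a summary of who is calling the API could look something like the sketch below. It assumes the apiserver writes audit events as JSON lines (the log backend) and reads the `stage`, `user.username`, `userAgent` and `verb` fields of each event. The function name and the top-5 cutoff are arbitrary choices for illustration.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def top_callers(lines: Iterable[str], n: int = 5) -> List[Tuple[Tuple[str, str, str], int]]:
    """Return the n most frequent (username, userAgent, verb) combinations.

    Only events at stage "ResponseComplete" are counted, so a request that is
    logged at several stages is not counted more than once.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if event.get("stage") != "ResponseComplete":
            continue
        user = event.get("user", {}).get("username", "<unknown>")
        agent = event.get("userAgent", "<unknown>")
        verb = event.get("verb", "<unknown>")
        counts[(user, agent, verb)] += 1
    return counts.most_common(n)
```

Running this over the audit log with the operator up and again with it scaled to zero would show whether one client accounts for most of the difference.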
