Increased kube-apiserver 504s with Kubernetes 1.21

Cluster information:

Kubernetes version: v1.21.3
Cloud being used: AWS
Installation method: Kops 1.21.0
Host OS: ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20210720
CNI and version: calico v3.19.1
CRI and version: containerd v1.4.6

After a rolling update from a 1.20 cluster to 1.21, I’ve noticed an increase in 504 responses from the apiservers. The metric involved is

 sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_terminations_total[5m])) > 0

The rate isn’t at all high, but in prior clusters it’s been rock-steady at 0. I’m not sure there’s any meaningful action for me to take other than adding a little bit of tolerance to a low level of these errors, but it seems like someone might be interested in this little problem.
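One way to express "a little bit of tolerance" is to compare the increase in `apiserver_request_terminations_total` against the increase in `apiserver_request_total` over the same window and only flag it above some fraction. A minimal Python sketch of that idea; the function names and the 0.001 default are assumptions for illustration, not recommended values:

```python
def termination_ratio(terminations_delta, requests_delta):
    """Fraction of terminated requests over a window, from counter increases."""
    if requests_delta <= 0:
        # No traffic in the window, so there is nothing to compare against.
        return 0.0
    return terminations_delta / requests_delta


def should_alert(terminations_delta, requests_delta, tolerance=0.001):
    """True when the terminated fraction exceeds the (assumed) tolerance."""
    return termination_ratio(terminations_delta, requests_delta) > tolerance


# One termination in 10,000 requests stays under the assumed tolerance.
print(should_alert(1, 10_000))
# Fifty terminations in 10,000 requests exceeds it.
print(should_alert(50, 10_000))
```

The same comparison could presumably be written directly as an alert expression in Prometheus; the Python form is only meant to make the arithmetic explicit.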

What’s the cause of the 504s from the apiserver container logs?

My best guess is messages like

1 wrap.go:54] timeout or abort while handling: GET "/api/v1/namespaces/kafka-infra/pods/infra-kafka-1"

are issued at the same time the metrics are incremented. There isn’t really anything obvious in nearby logs to explain why what seems like a pretty simple request is having problems. There’s some other suspicious stuff like

I0819 11:32:17.087600       1 healthz.go:244] etcd check failed: healthz
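To check whether the timeouts concentrate on particular requests, the `timeout or abort while handling` lines can be tallied by verb and path. A minimal Python sketch, assuming the message format is exactly as in the line quoted above; the function name `tally_timeouts` is made up for illustration, and the sample input is just the two log lines from this post:

```python
import re
from collections import Counter

# Matches the message format quoted above, e.g.
#   ... wrap.go:54] timeout or abort while handling: GET "/api/v1/..."
TIMEOUT_RE = re.compile(r'timeout or abort while handling: (\S+) "([^"]+)"')


def tally_timeouts(lines):
    """Count timeout messages per (verb, path); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


log_lines = [
    '1 wrap.go:54] timeout or abort while handling: GET "/api/v1/namespaces/kafka-infra/pods/infra-kafka-1"',
    'I0819 11:32:17.087600       1 healthz.go:244] etcd check failed: healthz',
]

for (verb, path), n in tally_timeouts(log_lines).most_common():
    print(n, verb, path)
```

In practice the input would be the apiserver container log read line by line instead of the two-element list.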

At this point, I’m going to destroy this cluster and try again. :crossed_fingers: a fresh 1.21 install will behave better. (I do have another 1.21 cluster that was installed in a similar fashion that seems to be working fine.)

Well, no luck: a newly provisioned cluster doesn’t work any better. I do have a theory that the Strimzi Kafka operator is “causing” the problem: if I scale Strimzi’s Deployments down to zero, the API server metrics quiet down.

This is not a problem in Kubernetes 1.18.12.

Well, it’s weird that the api-server has a timeout when handling a request to get the pod info at /api/v1/namespaces/kafka-infra/pods/infra-kafka-1.

I’m not sure I would blame your kafka statefulset per se. My guess would be that you have something hammering etcd and/or the api.

Whatever it is, it is also hammering the 1.18 cluster, which handles it without these errors. Here are metrics from a 1.18 cluster (api servers are r5.xlarge):

Metrics from a 1.21.3 cluster (api servers are t3.2xlarge) will be in a subsequent post.

Here’s the prom query text for easy cut-n-paste. Note the 20x scaling factor just to make the chart a little easier to read.

sum(rate(node_cpu_seconds_total{instance=~"(|(|(",mode!="idle"}[2m])) by (instance) or 20 * (sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_total{code!~"^(20[01])|(0)|(404)"}[5m])) > 1 or sum by(code, group, namespace, resource, subresource, version, verb) (rate(apiserver_request_terminations_total[5m])) > 0)

In the 1.21 cluster, I stopped the two strimzi deployments about 15 minutes ago. It hasn’t been going long enough to be super convincing, but based on metrics from another 1.21 cluster where I did the same thing last night, I have no doubt that the errors will be gone.

I’ve tried strimzi 0.18 and 0.25 and they seem to behave the same way. Note that the kafka cluster itself seems to be fine; it’s the Deployments for the operator that I scale to zero.

Metrics from 1.21.3 (api servers are t3.2xlarge):

When I look at the information you’ve given, what I think is being observed is just the symptoms, not the cause.

Do you have visibility into exactly what is using the API? Audit logs, for example?
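If audit logging is enabled, a tally of callers would show what is actually hitting the API. A minimal Python sketch, assuming the audit log is written as one JSON event per line in the `audit.k8s.io/v1` Event format; the function name `top_callers` and both sample events (including the service account and user agent strings) are hypothetical and only there to show the shape of the data:

```python
import json
from collections import Counter


def top_callers(audit_lines, stage="ResponseComplete"):
    """Count audit events per (username, userAgent, verb, response code).

    Only one stage is counted so each request is tallied once.
    """
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("stage") != stage:
            continue
        user = event.get("user", {}).get("username", "unknown")
        agent = event.get("userAgent", "unknown")
        code = event.get("responseStatus", {}).get("code")
        counts[(user, agent, event.get("verb"), code)] += 1
    return counts


# Hypothetical events, not taken from a real cluster.
sample_events = [
    json.dumps({
        "stage": "ResponseComplete",
        "verb": "get",
        "requestURI": "/api/v1/namespaces/kafka-infra/pods/infra-kafka-1",
        "user": {"username": "system:serviceaccount:kafka-infra:example-operator"},
        "userAgent": "example-client/1.0",
        "responseStatus": {"code": 504},
    }),
    json.dumps({
        "stage": "RequestReceived",
        "verb": "get",
        "requestURI": "/api/v1/namespaces/kafka-infra/pods/infra-kafka-1",
        "user": {"username": "system:serviceaccount:kafka-infra:example-operator"},
        "userAgent": "example-client/1.0",
    }),
]

for key, n in top_callers(sample_events).most_common(10):
    print(n, key)
```

Sorting by count, and looking at which callers account for the 504 responses in particular, should make it clearer whether one client is responsible for most of the load.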