What could cause applying CRDs to time out

Cluster information:

Kubernetes version: 1.18
Cloud being used: AWS EKS

We have a small (3-node) AWS cluster that we’ve been working on for a few months.

It’s been running ArgoCD 1.7.10, and we recently tried to upgrade that to 1.8.7.
That upgrade failed - the command didn’t return.
Since then, attempting to apply CRDs has just caused a timeout.

To try to work this out I’ve extracted the application CRD from https://raw.githubusercontent.com/argoproj/argo-cd/v1.8.7/manifests/install.yaml and trying to apply that goes like this:
$ kubectl apply -f application-crd.yaml
Error from server (Timeout): error when creating "application-crd.yaml": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)
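For reference, one way to pull just that CRD out of the multi-document install.yaml is with mikefarah's yq v4 (a sketch; the select expression here is my own assumption about which document to keep):

curl -sL https://raw.githubusercontent.com/argoproj/argo-cd/v1.8.7/manifests/install.yaml \
  | yq eval 'select(.kind == "CustomResourceDefinition" and .metadata.name == "applications.argoproj.io")' - \
  > application-crd.yaml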

We can’t find anything in any logs so we’re a bit short of ideas on where to look.
Any suggestions?

Thanks.

Can you check the argocd-application-controller deployment pod logs? Is it working as expected?
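For example, something like this (assuming the default argocd namespace and the standard labels from the install manifests):

kubectl -n argocd logs -l app.kubernetes.io/name=argocd-application-controller --tail=100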

We still have the problem with applying CRDs even after removing the entire ArgoCD namespace (and its CRDs).
So whilst the problem appears to be related to ArgoCD, it isn’t a problem in an ArgoCD pod.

It might also be worth knowing that we have Istio installed, but not configured for anything.

Thanks.

Gotcha, I misread and thought the issue was with an instance of an ArgoCD Application, not the actual CRD resource.

Can you create other CRDs without any issues? Can you check if the application CRD is still present, using for example:

❯ kubectl proxy &
[1] 90270

❯ curl -sq localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/applications.argoproj.io | head -n5
{
  "kind": "CustomResourceDefinition",
  "apiVersion": "apiextensions.k8s.io/v1",
  "metadata": {
    "name": "applications.argoproj.io",
❯ curl -sq localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/applications.argoproj.io | head -n5
{
  "kind": "CustomResourceDefinition",
  "apiVersion": "apiextensions.k8s.io/v1beta1",
  "metadata": {
    "name": "applications.argoproj.io",


We’ve had other CRDs both fail and succeed - working with a trivial one (the example from the Kubernetes docs page "Extend the Kubernetes API with CustomResourceDefinitions") it works sometimes, but not others.
It feels like the more complex the CRD the more likely it is to fail, but that might be completely spurious.
I’ve tried trimming the Argo application CRD down (ditching the schema first and then pretty much randomly removing other bits of it) and it will succeed about one time in twenty - but then trying again immediately afterwards with the same yaml will fail.
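For reference, the trivial CRD on that docs page is the CronTab example; applied via a heredoc it looks roughly like this (names are straight from the docs, nothing ArgoCD-specific):

cat <<'EOF' | kubectl apply -f -
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              cronSpec:
                type: string
              image:
                type: string
              replicas:
                type: integer
EOF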

I tried running your commands and the first run gave:

$ curl -sq localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/applications.argoproj.io |    head -n5
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

$

Wanting to see what the rest of the output was I did:

$ curl -sq localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/applications.argoproj.io | less

and that worked, so I tried increasing the number of lines output by head:

$ curl -sq localhost:8001/apis/apiextensions.k8s.io/v1/customresourcedefinitions/applications.argoproj.io | head -n6

and that is hanging!

Killing that and going back to -n5 also hangs, so it’s nothing to do with the number of lines; I think there is just something very unhappy in our etcd :frowning:
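One way to poke at that suspicion is to ask the API server about its own backends directly (these endpoints exist on 1.16+, though they may be restricted on some managed control planes, so treat this as a sketch):

kubectl get --raw '/readyz?verbose'
kubectl get --raw '/healthz/etcd'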

Can you try kubectl apply with higher verbosity (-v 8 or 9)? The kube-apiserver logs can also show you what is happening (on EKS those are in CloudWatch, aren’t they?).
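If control plane logging isn’t already enabled, something like this should turn it on and tail it (assumes AWS CLI v2; "my-cluster" is a placeholder):

# enable API server and audit logs for the cluster
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit"],"enabled":true}]}'

# once enabled, the logs land in CloudWatch under /aws/eks/<cluster-name>/cluster
aws logs tail /aws/eks/my-cluster/cluster --since 1h --filter-pattern customresourcedefinitions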

Running with -v 9 ends with:

I0324 12:29:10.548264   17384 round_trippers.go:423] curl -k -v -XPOST  -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: kubectl/v1.18.9 (linux/amd64) kubernetes/d1db3c4" 'https://10.88.1.187/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions'
I0324 12:30:10.550515   17384 round_trippers.go:443] POST https://10.88.1.187/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions 504 Gateway Timeout in 60002 milliseconds
I0324 12:30:10.550560   17384 round_trippers.go:449] Response Headers:
I0324 12:30:10.550567   17384 round_trippers.go:452]     Content-Type: text/plain; charset=utf-8
I0324 12:30:10.550573   17384 round_trippers.go:452]     Content-Length: 136
I0324 12:30:10.550578   17384 round_trippers.go:452]     Date: Wed, 24 Mar 2021 12:30:10 GMT
I0324 12:30:10.550583   17384 round_trippers.go:452]     Cache-Control: no-cache, private
I0324 12:30:10.550607   17384 request.go:1068] Response Body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
I0324 12:30:10.550714   17384 helpers.go:216] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "error when creating \"application-crd.yaml\": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)",
  "reason": "Timeout",
  "details": {
    "group": "apiextensions.k8s.io",
    "kind": "customresourcedefinitions",
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within 1m0s\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
      }
    ]
  },
  "code": 504
}]
F0324 12:30:10.550740   17384 helpers.go:115] Error from server (Timeout): error when creating "application-crd.yaml": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)

All I see in the logs are requests relating to Istio.

If it’s a managed EKS cluster, opening a support case is probably the best approach. It looks like an issue in the control plane, and that’s under the AWS managed service’s responsibility.

I agree with rael, you will need to talk about this with AWS EKS support.

We contacted AWS EKS support and their instructions led us to the cause of the problem: webhooks.
We were asked to run these two:
kubectl describe mutatingwebhookconfigurations -A
kubectl describe validatingwebhookconfigurations -A
and they listed some webhooks that were no longer required.

Removing them allowed things to work again.
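For anyone hitting the same thing, a quicker way to spot stale entries than reading through the full describe output is something like this - the giveaway is a webhook whose clientConfig points at a service that no longer exists (URL-based webhooks show up as <none> here):

# list every webhook configuration and the service it calls
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations \
  -o custom-columns='NAME:.metadata.name,SERVICE-NS:.webhooks[*].clientConfig.service.namespace,SERVICE:.webhooks[*].clientConfig.service.name'

# once identified, delete the leftover configuration ("some-old-webhook" is a placeholder)
kubectl delete mutatingwebhookconfiguration some-old-webhook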

It seems that Kubernetes can get a bit stuck when there are broken hooks lying around.
We get similarly stuck when there are problems applying AWS Ingress rules.

It would be nice if Kubernetes had a way to report all the hooks it is currently trying to call, along with the request that triggered each call and when it started, but maybe that would be too expensive to track.
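(The closest thing I know of is the API server’s per-webhook admission metrics, which at least show which hook is slow or failing - assuming /metrics is reachable on your control plane:)

kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds | head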

Thanks for the help.
