Kubernetes version: 1.18
Cloud being used: AWS EKS
We have a small (3-node) AWS cluster that we’ve been working on for a few months.
It’s been running ArgoCD 1.7.10, and we recently tried to upgrade that to 1.8.7.
That upgrade failed - the command didn’t return.
Since then, attempting to apply CRDs has just caused a timeout.
To try to work this out I extracted the Application CRD from https://raw.githubusercontent.com/argoproj/argo-cd/v1.8.7/manifests/install.yaml, and applying it on its own goes like this: kubectl apply -f application-crd.yaml
Error from server (Timeout): error when creating "application-crd.yaml": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)
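In case it helps anyone reproducing this: you can get kubectl to trace the underlying HTTP request so you can see exactly which call to the API server is stalling. A rough sketch (the filename is just our local copy of the extracted CRD):

```shell
# High verbosity makes kubectl log each API request (as a curl-style
# command) and the response status/body, via the round_trippers lines.
kubectl apply -v=9 -f application-crd.yaml

# Time the raw create as well, with a longer client-side timeout, to
# rule out kubectl itself giving up early:
time kubectl create -f application-crd.yaml --request-timeout=90s
```

In our case the trace showed the POST itself sitting for a full 60 seconds before the 504 came back, which pointed at the server side rather than anything local.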
We can’t find anything in any logs so we’re a bit short of ideas on where to look.
Any suggestions?
We have the problem with applying CRDs even after removing the complete ArgoCD namespace (and removing its CRDs).
So whilst the problem appears to be related to ArgoCD it isn’t a problem in an ArgoCD pod.
It might also be worth knowing that we have Istio installed, but not configured for anything.
We’ve had other CRDs both fail and succeed - working with a trivial one (the example from the Kubernetes docs page “Extend the Kubernetes API with CustomResourceDefinitions”) it works some of the time, but not always.
It feels like the more complex the CRD the more likely it is to fail, but that might be completely spurious.
I’ve tried trimming the Argo Application CRD down (ditching the schema first and then pretty much randomly ditching other bits of it) and it will succeed about one time in twenty - but then trying again immediately afterwards with the same YAML will fail.
I tried running your commands and the first run gave:
I0324 12:29:10.548264 17384 round_trippers.go:423] curl -k -v -XPOST -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: kubectl/v1.18.9 (linux/amd64) kubernetes/d1db3c4" 'https://10.88.1.187/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions'
I0324 12:30:10.550515 17384 round_trippers.go:443] POST https://10.88.1.187/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions 504 Gateway Timeout in 60002 milliseconds
I0324 12:30:10.550560 17384 round_trippers.go:449] Response Headers:
I0324 12:30:10.550567 17384 round_trippers.go:452] Content-Type: text/plain; charset=utf-8
I0324 12:30:10.550573 17384 round_trippers.go:452] Content-Length: 136
I0324 12:30:10.550578 17384 round_trippers.go:452] Date: Wed, 24 Mar 2021 12:30:10 GMT
I0324 12:30:10.550583 17384 round_trippers.go:452] Cache-Control: no-cache, private
I0324 12:30:10.550607 17384 request.go:1068] Response Body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
I0324 12:30:10.550714 17384 helpers.go:216] server response object: [{
"metadata": {},
"status": "Failure",
"message": "error when creating \"application-crd.yaml\": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)",
"reason": "Timeout",
"details": {
"group": "apiextensions.k8s.io",
"kind": "customresourcedefinitions",
"causes": [
{
"reason": "UnexpectedServerResponse",
"message": "{\"metadata\":{},\"status\":\"Failure\",\"message\":\"Timeout: request did not complete within 1m0s\",\"reason\":\"Timeout\",\"details\":{},\"code\":504}"
}
]
},
"code": 504
}]
F0324 12:30:10.550740 17384 helpers.go:115] Error from server (Timeout): error when creating "application-crd.yaml": the server was unable to return a response in the time allotted, but may still be processing the request (post customresourcedefinitions.apiextensions.k8s.io)
All I see in the logs are requests relating to Istio.
If it’s a managed EKS cluster, opening a support case is probably the best approach. It looks like an issue in the control plane, and that falls under AWS’s managed-service responsibility.
We contacted AWS EKS support and their instructions led us to the cause of the problem: webhooks.
We were asked to run these two:
kubectl describe mutatingwebhookconfigurations -A
kubectl describe validatingwebhookconfigurations -A
and they listed some webhooks that were no longer required.
Removing them allowed things to work again.
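For anyone else hitting this, here’s a rough sketch of how we went from those describe commands to finding the stale entries. The configuration name at the end is a placeholder, not one of our actual webhook names:

```shell
# Webhook configurations are cluster-scoped, so no namespace flag is
# needed to list them all:
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Print the service each validating webhook calls, so you can check
# whether that backend service still exists in the cluster:
kubectl get validatingwebhookconfigurations -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.webhooks[*].clientConfig.service.namespace}{"/"}{.webhooks[*].clientConfig.service.name}{"\n"}{end}'

# Once you've confirmed a configuration points at a service that no
# longer exists (e.g. left behind by a deleted install), remove it:
kubectl delete validatingwebhookconfiguration <stale-config-name>
```

The same jsonpath works for mutatingwebhookconfigurations. In our case the stale entries pointed at services in the namespace we’d already deleted.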
It seems that the Kubernetes API server can get stuck when broken admission webhooks are left lying around - requests that match a webhook wait on a backend that no longer exists.
We get similarly stuck when there are problems applying AWS Ingress rules.
It would be nice if Kubernetes had a way to report all the webhooks it is currently trying to call, along with the request that triggered each one and when the call started, but maybe that would be too expensive to track.
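The closest approximation I’ve found is dumping every webhook with its failure policy and timeout, so you can at least see which ones are capable of blocking API requests (a failurePolicy of Fail plus an unreachable backend is the hanging combination). A sketch:

```shell
# For each webhook configuration, print every webhook's name, its
# failurePolicy (Fail blocks the request on error; Ignore does not)
# and its timeoutSeconds:
for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do
  kubectl get "$kind" -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .webhooks[*]}{.name}{" failurePolicy="}{.failurePolicy}{" timeoutSeconds="}{.timeoutSeconds}{"  "}{end}{"\n"}{end}'
done
```

It doesn’t show in-flight calls, but it would have flagged our stale configurations much sooner.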