Kubernetes version: 1.21.6
Cloud being used: GCP
We noticed that these errors started after the node-pool was autoscaled and the existing Nodes were replaced with new compute instances. This also happened during a maintenance window. We’re using an NFS server to mount the volumes.
The issue appears to affect only certain Nodes in the cluster. We’ve cordoned the Nodes where the mount error occurs, and pods on the “healthy” Nodes are working. The affected pods report:
"Unable to attach or mount volumes: unmounted volumes=[vol], unattached volumes=[vol]: timed out waiting for the condition"
We’re also seeing errors on the konnectivity-agent:
"connection read failure" err="read tcp 10.4.2.34:43682->10.162.0.119:10250: use of closed network connection"
We believe the issue is triggered when autoscaling is enabled and new Nodes are added to the pool, but it appears to be completely random: sometimes the pods come up fine, and other times they get the mount error. As a workaround, we can get pods running by cordoning the Nodes that give the mount error and deleting the failing pods so that they are rescheduled onto a “healthy” Node.
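For anyone wanting to reproduce the workaround, here is a minimal Python sketch of the cordon-and-delete steps. It assumes the mount errors surface as `FailedMount` events emitted by the kubelet, with the Node name in `source.host`. The function names (`failed_mounts_by_node`, `workaround_commands`, `load_events_from_kubectl`) are hypothetical helpers, not part of any Kubernetes tooling. The script only builds the `kubectl` commands; it does not run them.

```python
import json
import subprocess
from collections import defaultdict


def failed_mounts_by_node(events: dict) -> dict:
    """Group pods that have FailedMount events by the Node reporting them.

    `events` is the parsed output of `kubectl get events -o json`.
    Assumption: the kubelet records its Node name in `source.host`.
    """
    by_node = defaultdict(set)
    for event in events.get("items", []):
        if event.get("reason") != "FailedMount":
            continue
        target = event.get("involvedObject", {})
        node = event.get("source", {}).get("host")
        if target.get("kind") != "Pod" or not node or "name" not in target:
            continue
        by_node[node].add((target.get("namespace", "default"), target["name"]))
    # Sort for stable, reviewable output.
    return {node: sorted(pods) for node, pods in by_node.items()}


def workaround_commands(by_node: dict) -> list:
    """Build (but do not run) the cordon and delete commands per Node."""
    commands = []
    for node in sorted(by_node):
        commands.append(["kubectl", "cordon", node])
        for namespace, pod in by_node[node]:
            commands.append(["kubectl", "delete", "pod", pod, "-n", namespace])
    return commands


def load_events_from_kubectl() -> dict:
    """Fetch events from the current kubectl context (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "events", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

To use it, call `workaround_commands(failed_mounts_by_node(load_events_from_kubectl()))` and review the printed commands before running any of them. Deleting a pod only reschedules it if a controller such as a Deployment or StatefulSet owns it, and events expire after a while, so older failures may not show up.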