I have a production cluster is currently running on K8s version
1.19.9, where the kube-scheduler and kube-controller-manager failed to have leader elections. The leader is able to acquire the first lease, however it then cannot renew/reacquire the lease, this has caused other pods to constantly in the loop of electing leaders as none of them could stay on long enough to process anything/stay on long enough to do anything meaningful and they time out, where another pod will take the new lease; this happens from node to node.
My duct tape recovery method was to shutdown the other candidates and disable leader elections
--leader-elect=false. We manually set a leader and let it stay on for a while, then reactivated leader elections after. This has seemed to work as intended again, the leases are renewing normally after
Could it be possible that the api-server may be too overwhelmed to expend any resources(?), because the elections have failed due to timeout? Was wondering if anyone has ever encountered such an issue.