Leader election utility

I need to do leader election in my application. I am aware that there is a leader election utility already implemented in client-go.

I am thinking of implementing the following algorithm, which I believe would be simpler, wouldn’t depend on the clocks of the nodes running the pods, and would detect leader failure very quickly without waiting for a “leaseExpiry”. I have to admit that I am not an expert on Kubernetes’ internal management of resources, so my understanding could be wrong. So, I would like to know if the following algorithm makes sense.

Let’s say there are 3 pods P1, P2 and P3 among which the leader is to be elected. The user provides a “key” (e.g. “test-leader-election”) that all pods know about upfront. (For simplicity, assume there is only one app process/container in the pod.)
The high-level idea is that each pod will try to create a configMap named “test-leader-election”. The pod will put its own identity in a known annotation in the configMap. It will also set the metadata.ownerReferences field in the configMap so that GC always deletes the configMap if the owner pod is deleted/disappears.
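If I understand it correctly, the configMap each pod races to create would look roughly like this (the annotation key, pod name and uid below are made up for illustration):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-leader-election
  annotations:
    # hypothetical annotation key holding the leader's identity
    example.com/leader-identity: P1
  ownerReferences:
  - apiVersion: v1
    kind: Pod
    name: P1                                   # the candidate pod that created this configMap
    uid: 0d5e9b7a-0000-0000-0000-000000000000  # must be the pod's actual metadata.uid
```

Note that the uid has to match the creating pod’s real metadata.uid, and the owner pod and the configMap have to live in the same namespace, otherwise GC treats the owner as missing.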
Whichever candidate pod manages to create the configMap becomes the leader, and the others place a watch on that configMap (using a field selector on the name). If a non-leader pod notices the configMap disappear, it tries to create the configMap itself to become the leader.
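To make sure I’m describing the race correctly, here is a minimal sketch of the create-or-watch loop, with the API server’s create-if-absent semantics faked by an in-memory store (all names and types here are mine for illustration, not client-go’s):

```go
package main

import (
	"fmt"
	"sync"
)

// fakeAPIServer stands in for the API server's create-if-absent semantics:
// Create succeeds only for the first caller, like POSTing a configMap,
// where everyone else gets an AlreadyExists error.
type fakeAPIServer struct {
	mu     sync.Mutex
	exists bool
	leader string // identity stored in the configMap's annotation
}

// Create tries to create the "configMap" naming owner as leader.
// It returns true only for the caller that actually created it.
func (s *fakeAPIServer) Create(owner string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.exists {
		return false
	}
	s.exists = true
	s.leader = owner
	return true
}

// Delete simulates GC removing the configMap after its owner pod dies.
func (s *fakeAPIServer) Delete() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.exists = false
	s.leader = ""
}

// Leader reports the current leader identity, if any.
func (s *fakeAPIServer) Leader() (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.leader, s.exists
}

func main() {
	api := &fakeAPIServer{}

	// All three pods race to create the configMap; exactly one wins.
	wins := 0
	for _, pod := range []string{"P1", "P2", "P3"} {
		if api.Create(pod) {
			wins++
		}
	}
	leader, _ := api.Leader()
	fmt.Println("wins:", wins, "leader:", leader) // wins: 1 leader: P1

	// The leader pod dies; GC deletes the configMap, and a watching
	// non-leader pod retries the create to take over.
	api.Delete()
	if api.Create("P2") {
		leader, _ = api.Leader()
		fmt.Println("new leader:", leader) // new leader: P2
	}
}
```

The real version would replace Create/Delete/Leader with a configMap Create call, GC, and a watch with a field selector, but the shape of the loop is the same.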

I “think” the above algorithm would provide “fencing” as well. (This assumes k8s can guarantee that the configMap can only disappear if the pod that created it disappears. But I have “heard” about corner cases where pods might become zombies, still running the process with k8s not knowing about them.)

What do you guys think?

Instead of the leaseExpiry you have to wait for the GC to do its sweep, right?

Not sure how the GC is implemented, but wouldn’t you be exchanging a configuration that you can tune as you want (like the leaseExpiry) for the Kubernetes GC’s behavior? Changes to the GC behavior are probably global, so you may lose flexibility (if that matters to your use case).

Am I missing something?

Sorry if I’m saying something obvious or, even worse, obviously wrong; I have not used this before and don’t know the client-go implementation :slight_smile:

The current leaseExpiry-based implementation will notice a leader going away only after the whole leaseExpiry has passed, which introduces a delay before the next leader can be elected. You can’t configure leaseExpiry to be too low, or else you risk “thinking” that the leader is down when it actually isn’t. Also, this approach depends on the clocks of pretty much all the nodes, because pods could be running anywhere.
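For reference, these are the knobs in the client-go implementation being compared against; the sketch below assumes a resourcelock `lock` has already been built, and the durations are just commonly seen example values, not requirements:

```go
// Sketch of client-go's leaderelection tuning knobs, for comparison.
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
	Lock:          lock,
	LeaseDuration: 15 * time.Second, // how long a leader is presumed valid; failover waits on this
	RenewDeadline: 10 * time.Second, // the leader steps down if it cannot renew within this
	RetryPeriod:   2 * time.Second,  // how often candidates retry acquiring the lock
	Callbacks: leaderelection.LeaderCallbacks{
		OnStartedLeading: func(ctx context.Context) { /* do leader work */ },
		OnStoppedLeading: func() { /* stop leader work */ },
	},
})
```

The LeaseDuration here is the “leaseExpiry” being discussed: it is the lower bound on how long the other candidates wait before one of them can take over.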

AFAIK, GC of the configMap on deletion of the pod would be faster than any leaseExpiry you can afford to set.