Custom Deployment/StatefulSet Behavior Advice Request

Kubernetes version: 1.16.8
Cloud being used: AWS
Installation method: Kops
Host OS: Debian
CNI and version: Calico 3.9.3
CRI and version: Docker 18.09.9

All of our applications follow a similar lifecycle: containers are built via a CI/CD process and installed/updated via Helm charts.

We have long-running applications that stream data and can act either as a server (direct interaction with an end user) or as a client (sending data to a relay, a repeater, or another stream). These apps have a primary instance and backup instances. This is done for several reasons: redundancy and reliability, plus mixed configuration, where the primary can stream high-quality data while the backups stream lower-quality data or different formats depending on the use case.

Deployments and Helm charts have gotten us 90% of the way to our goal, with one exception: when updating the applications to the latest version, or after a configuration change, our requirements dictate that we update the backup instances first, wait for them to become ready, and only then roll the primary instance.

Since each instance of the application has its own configuration, we can't run them as a single ReplicaSet. We looked at implementing a StatefulSet, but in some cases application instances have sidecar containers for additional data processing and others do not, so a single set definition doesn't work. There is also the added complexity that the instances can be deployed to multiple clusters in different AWS regions, which further complicates a StatefulSet implementation.
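
To illustrate why a single pod template can't cover every instance, here is a rough sketch of how the per-instance Helm values differ (the names and fields below are hypothetical, just to show the shape of the problem):

```yaml
# values-primary.yaml (hypothetical)
role: primary
streamQuality: high
sidecar:
  enabled: false

---
# values-backup-1.yaml (hypothetical)
role: backup
streamQuality: low
sidecar:
  enabled: true                      # this instance does extra data processing
  image: example/data-processor:1.0  # placeholder image
```

Because every instance carries its own configuration like this, a shared pod template (ReplicaSet or StatefulSet) is a poor fit.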

We’re investigating the Operator Framework to bridge the gap, but we’re not sure that’s the right path to go down. There are also several implementations (KUDO, Metacontroller, the Operator SDK, etc.), and it’s hard to tell which one would be the right fit.

Does anyone have advice on how we should go about meeting this requirement?

Much appreciated.

In case this helps anyone else facing this type of issue, we went with a custom init container that performs health checks against the backup pods. Combined with the RollingUpdate deployment strategy, this effectively allows the primary pod to continue running while its replacement waits until all the backups are ready. Once the backups are ready, the rolling update of the primary completes. If there is an issue, the replacement primary pod fails at initialization, allowing an operator to remedy it.
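
For anyone wanting a starting point, here is a minimal sketch of the approach. Everything in it (names, labels, Service hostnames, the /healthz endpoint, ports, and images) is a placeholder for our real setup, so adapt it to your own:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stream-primary                 # hypothetical name
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                # old primary keeps running while the new pod initializes
      maxSurge: 1
  selector:
    matchLabels:
      app: stream
      role: primary
  template:
    metadata:
      labels:
        app: stream
        role: primary
    spec:
      initContainers:
        - name: wait-for-backups
          image: curlimages/curl:7.68.0   # any small image with curl will do
          command:
            - sh
            - -c
            - |
              # Poll each backup's health endpoint until it responds,
              # failing initialization after ~5 minutes per backup.
              for backup in stream-backup-1 stream-backup-2; do   # hypothetical Service names
                tries=0
                until curl -sf "http://${backup}:8080/healthz"; do
                  tries=$((tries + 1))
                  if [ "$tries" -ge 60 ]; then
                    echo "backup ${backup} never became ready" >&2
                    exit 1
                  fi
                  sleep 5
                done
              done
      containers:
        - name: stream
          image: example/stream:1.2.3    # placeholder application image
          ports:
            - containerPort: 8080
```

With maxUnavailable: 0 and maxSurge: 1, the rolling update brings up the replacement primary pod alongside the old one, and the old primary keeps serving until the init container sees every backup healthy. If a backup never comes up, the init container exits non-zero, the rollout stalls, and the old primary stays untouched until someone intervenes.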