Best practice: deploying the same Deployment on every node?

We are planning to deploy our application on Azure AKS.

Our initial plan is to:

  1. Deploy a deployment with 3 replicas of the application on each cluster node, i.e., each node will have its own deployment, and all 3 pods belonging to that deployment will run on that node.
  2. All the deployments across nodes run the same application.
  3. Use a single Service of type LoadBalancer to load-balance across all the deployments.

The idea is that if a node needs maintenance, or a new version of the application is released, we can remove that node's deployment from the endpoints list and upgrade one node at a time.

We would like to know whether this is an anti-pattern or whether we will run into any issues with it.

I think you are treating the Kubernetes worker nodes as a “regular” cluster, just a bunch of servers, and missing the features that the Kubernetes Control Plane provides.

As I don’t know the specific requirements of your application, I may be wrong, but I don’t think it’s necessary to deploy three copies of your application (totalling 9 pods) and bind each copy to just one node.

TL;DR

Deployments are not bound to nodes; they exist as objects in the API for managing pods.

Say you have a cluster with three worker nodes (the control plane is managed by Azure). When you create a deployment and specify 3 pods, the deployment is stored in etcd. The deployment controller sees the new deployment object and checks whether the actual state of the cluster (0 pods) matches the desired state (3 pods) requested in the deployment manifest. As the desired and current states differ, the deployment controller creates a ReplicaSet, and the ReplicaSet controller in turn creates the 3 pods (there are some more steps involved, but I think they are not relevant to your question). As there are three worker nodes, the Scheduler evaluates which node is most suitable for each pod. Assuming the three nodes have the same load, the default scheduling policy will spread the pods across the nodes, so pod-a is scheduled (and created) on node-1, pod-b on node-2 and pod-c on node-3.
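
To make the "desired vs. actual state" idea concrete, here is a toy Python sketch. It is not how the real controllers or the scheduler are implemented; the function names (reconcile, schedule) and the "fewest pods wins" placement rule are assumptions made up for illustration, and the generated pod names are placeholders.

```python
from collections import Counter
from itertools import count

# Toy model only: the real controllers and scheduler are far more involved.
_pod_ids = count(1)

def reconcile(desired, pods):
    """Return names of pods to create so that len(pods) reaches `desired`."""
    missing = max(0, desired - len(pods))
    return [f"pod-{next(_pod_ids)}" for _ in range(missing)]

def schedule(pod, nodes, placement):
    """Place the pod on the node that currently runs the fewest pods."""
    load = Counter(placement.values())
    # Ties are broken by node name so the result is deterministic.
    node = min(nodes, key=lambda n: (load[n], n))
    placement[pod] = node
    return node

nodes = ["node-1", "node-2", "node-3"]
placement = {}  # pod name -> node name

# Actual state is 0 pods, desired state is 3: create and place the difference.
for pod in reconcile(desired=3, pods=placement):
    schedule(pod, nodes, placement)

print(placement)
# {'pod-1': 'node-1', 'pod-2': 'node-2', 'pod-3': 'node-3'}
```

Running reconcile again with 3 pods already placed returns an empty list, which is the point: the loop only acts when the actual state differs from the desired one.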

Node maintenance

To perform node maintenance, you first run kubectl cordon on the node (say, node-3). This tells the scheduler not to schedule new pods on that node. The pods are still running and your app is available. Then you run kubectl drain on the node (drain also cordons the node itself, so the separate cordon step is optional). The API server begins evicting the pods on the node, so your pod-c (on node-3) is removed. The ReplicaSet controller (acting on behalf of the Deployment) sees 2 pods, but the Deployment asks for 3, so a new pod is requested.

Now, as node-3 is unavailable for scheduling (because of the kubectl cordon command), the scheduler can only choose between node-1 and node-2 for the third pod of the deployment. Let’s say it schedules it on node-1; then you have pod-a and pod-d (it’s a new pod, so it has a new identifier) on node-1, and pod-b on node-2.
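
The cordon/drain sequence above can be sketched as a small simulation. This is only a toy model in Python, not the kubectl implementation: cordon, drain and schedule here are hypothetical helper functions, and the tie-breaking rule (lowest node name wins) is an assumption chosen so the outcome matches the walkthrough.

```python
from collections import Counter

def schedule(pod, placement, schedulable):
    """Place pod on the schedulable node with the fewest pods (toy policy)."""
    load = Counter(placement.values())
    # sorted() makes ties resolve to the lowest node name.
    node = min(sorted(schedulable), key=lambda n: load[n])
    placement[pod] = node

def cordon(node, schedulable):
    """Mark the node unschedulable; pods already on it keep running."""
    schedulable.discard(node)

def drain(node, placement):
    """Evict every pod from the node and return the evicted pod names."""
    evicted = [p for p, n in placement.items() if n == node]
    for p in evicted:
        del placement[p]
    return evicted

placement = {"pod-a": "node-1", "pod-b": "node-2", "pod-c": "node-3"}
schedulable = {"node-1", "node-2", "node-3"}
desired = 3

cordon("node-3", schedulable)
pods_after_cordon = len(placement)  # still 3: cordon alone evicts nothing
evicted = drain("node-3", placement)

# The controller notices 2 of 3 pods and requests replacements.
new_names = iter(["pod-d", "pod-e", "pod-f"])
while len(placement) < desired:
    schedule(next(new_names), placement, schedulable)

print(evicted)    # ['pod-c']
print(placement)  # {'pod-a': 'node-1', 'pod-b': 'node-2', 'pod-d': 'node-1'}
```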

Your application is available during the whole process (although, depending on your application’s behaviour, users connected to the evicted pod may notice it if a response takes too long and the pod is killed before it finishes processing).

One thing you need to consider is that the Deployment only takes care of the number of pods you request; to actually expose your application outside the cluster you need to deploy at least a Service. The Service acts “like a load balancer”, sending requests to all available pods that match it. Your cloud provider may provision a load balancer automatically, depending on the type of Service that you request.
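
As a rough illustration of how a Service decides which pods receive traffic, the Python sketch below picks the ready pods whose labels match a selector and rotates requests among them. The label app=myapp, the pod list and the strict round-robin order are all assumptions for the example; the real traffic distribution depends on how the cluster's proxy is configured.

```python
from itertools import cycle

# Hypothetical pods; "pod-d" has just been created and is not ready yet.
pods = [
    {"name": "pod-a", "labels": {"app": "myapp"}, "ready": True},
    {"name": "pod-b", "labels": {"app": "myapp"}, "ready": True},
    {"name": "pod-d", "labels": {"app": "myapp"}, "ready": False},
    {"name": "other", "labels": {"app": "other"}, "ready": True},
]

def endpoints(selector, pods):
    """Names of ready pods whose labels contain every selector key/value."""
    return [
        p["name"]
        for p in pods
        if p["ready"]
        and all(p["labels"].get(k) == v for k, v in selector.items())
    ]

backends = endpoints({"app": "myapp"}, pods)

# Toy load balancing: hand out four requests in rotation.
rotation = cycle(backends)
served = [next(rotation) for _ in range(4)]

print(backends)  # ['pod-a', 'pod-b']
print(served)    # ['pod-a', 'pod-b', 'pod-a', 'pod-b']
```

Note that the Service does not care which node or which Deployment a pod belongs to, only whether it matches the selector and is ready. That is why a single Deployment spread over the nodes is enough.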

Upgrading the app

The Deployment is able to manage multiple ReplicaSets. A ReplicaSet defines a set of identical pods. The reason the Deployment Controller delegates pod creation to ReplicaSets (instead of creating the pods directly) is to allow rolling updates.

So, you create a Deployment with three pods of your app (v1). Later, you want to upgrade to v2. You update the image in the Deployment manifest (from docker.com/user/app:v1 to docker.com/user/app:v2).

The Deployment Controller notices the change. It is “smart” enough to see that this is an update of an existing Deployment. It sees a difference between the current state of the cluster (3 pods “v1”, 0 pods “v2”) and the desired state (3 pods “v2”), so it starts working towards three pods “v2”.

By default, instead of removing the three pods:v1 and creating the pods:v2 all at once, the Deployment Controller creates a new ReplicaSet for pods:v2, linked to the existing Deployment, and scales it up to one pod:v2. At this point, the Deployment is managing two ReplicaSets, one with 3 pods:v1 and a second one with 1 pod:v2.

Once the system has checked the readiness of the pod:v2, it removes one of the pods:v1. Now there are 2 pods:v1 and 1 pod:v2. The Deployment Controller is still not happy (it wants 3 pods:v2), so the process is repeated: 2 pods:v1, 2 pods:v2, kill 1 pod:v1. This goes on until the system reaches the desired state (3 pods:v2) and the last pod:v1 has been removed.
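
The sequence of pod counts during such a rollout can be traced with a short Python sketch. It is a simplified model, not the Deployment Controller's code: it assumes at most one extra pod is allowed at a time (the max_surge parameter, a name chosen for this example), that no v1 pod is removed before its replacement is ready, and that every new pod becomes ready.

```python
def rolling_update(replicas, max_surge=1):
    """Yield (old, new) pod counts after each step of a surge-first rollout.

    Toy model: every new pod is assumed to become ready, and the number of
    running pods never drops below `replicas`.
    """
    if max_surge < 1:
        raise ValueError("this toy model needs max_surge >= 1 to progress")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Scale up the new ReplicaSet within the surge budget.
        surge_room = replicas + max_surge - (old + new)
        step_up = min(surge_room, replicas - new)
        if step_up > 0:
            new += step_up
            yield old, new
        # New pods are ready: scale down the old ReplicaSet while keeping
        # at least `replicas` pods running.
        step_down = min(old, old + new - replicas)
        if step_down > 0:
            old -= step_down
            yield old, new

steps = list(rolling_update(3))
print(steps)
# [(3, 0), (3, 1), (2, 1), (2, 2), (1, 2), (1, 3), (0, 3)]
```

Each tuple is (pods:v1, pods:v2). The total never falls below 3 and never exceeds 4, which is why the application stays available throughout the upgrade.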

Conclusion

Kubernetes provides functionality to avoid interrupting the service provided by the applications deployed in the cluster. Multiple components in the Kubernetes Control Plane take care of the “well-being” of the pods, checking if they are running ok and “fixing” them when they are not.

I would recommend reading up on this Kubernetes functionality, as it may help you deploy your application confidently (and maybe even more cheaply, if fewer pods or resources are needed to run the application).