Distributing pods based on Node performance

John_Gallagher · September 2, 2022, 1:21pm

Cluster information:

Kubernetes version: 1.20
Cloud being used: Bare metal
Installation method: Kubeadm
Host OS: Redhat

We have a 7 node cluster and have the following requirements:

1. Deploy specific application pods to 2 nodes only
2. Ensure the 2 nodes have enough memory/cpu before deploying them.

For #1, we tainted the two nodes, added the tolerations as well as node affinity to our statefulsets/deployments so no other pods are allowed to run besides our application.
For #2, one of the nodes keeps crashing because the scheduler is deploying pods to it even though its memory/cpu is over 80% while the other node is at less than 30% memory/cpu.

I don’t understand why this is happening but suspect it’s because for some reason the scheduler is coming up with a score based on the entire cluster and not the two nodes we’re interested in.

Does anyone know if it’s possible to configure a custom kube-scheduler to focus on two nodes only ? I think this is the only way to achieve a balance.

Thanks & Regards,
John

thockin · September 2, 2022, 3:07pm

You can use a nodeSelector in your pods to limit which nodes are considered. Setting pod resource requests is how you ask for “enough room”. Is that not working?

John_Gallagher · September 2, 2022, 3:23pm

Thanks but isn’t this the same thing as node affinity (which is a little more advanced) ? We have given the same label to each of the two nodes and the scheduler isn’t distributing to any of the other 5 nodes (as intended) but it doesn’t seem to take high memory into account when placing new pods so one of the two nodes eventually falls over.

John_Gallagher · September 2, 2022, 3:34pm

Here’s what I have in my statefulsets for affinity/toleration:

Where label app=acme

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: app
          operator: In
          values:
          - acme

where taint is app=acme

tolerations:
- effect: NoSchedule
  key: app
  operator: Equal
  value: acme

I even tried topologySpreadConstraints but took it out as I didn’t see any improvement:

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: acme

thockin · September 2, 2022, 7:37pm

Are you trying to keep other pods off these nodes (fully dedicated to one app) or just make sure this app doesn’t go to anything but these nodes?

If you are using requests.cpu you should always be able to get as much CPU as you request (unless you are over-committing by not setting allocatable properly or you are setting a too-low request).
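
On “setting allocatable properly”: the kubelet advertises allocatable as node capacity minus whatever is reserved for Kubernetes daemons, system daemons, and the hard eviction threshold. A minimal Python sketch of that arithmetic for memory follows. Every number in it is a made-up placeholder, not a value from this cluster.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2


def allocatable_bytes(capacity, kube_reserved=0, system_reserved=0, eviction_hard=0):
    """Memory the scheduler may hand out to pods: capacity minus reservations."""
    return capacity - kube_reserved - system_reserved - eviction_hard


# Hypothetical node and reservation sizes, for illustration only.
capacity = 120 * GiB
alloc = allocatable_bytes(
    capacity,
    kube_reserved=1 * GiB,
    system_reserved=2 * GiB,
    eviction_hard=100 * MiB,
)

print(f"capacity:    {capacity / GiB:.2f} GiB")
print(f"allocatable: {alloc / GiB:.2f} GiB")
```

With no reservations at all, allocatable equals capacity, so pod requests can be packed right up to the physical limit and nothing is held back for the OS or the kubelet.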

Theog75 · September 3, 2022, 6:52pm

What requests are set for your application (CPU and memory)?

You have to perform proper application profiling (see Application Profiling - Documentation); that should make sure your pods deploy ONLY if there are enough resources for the Pod to deploy.

John_Gallagher · September 4, 2022, 7:46am

Hey there:

We have 2 worker nodes with 16 CPU and 120 GB memory each

There are approx 4 statefulsets with:

requests:
  cpu: 100m
  memory: 6Gi

We also have many deployments which aren’t as resource hungry.

John_Gallagher · September 4, 2022, 7:53am

Thanks for your response, appreciate your time. We want this application to only run on the two nodes (16 CPU and 120 GB memory each). I can’t understand why the scheduler keeps assigning pods to one of the nodes which is over 80% memory while the other node is only 40% used. Is there a possibility that despite the node affinity/toleration, the scheduler is still calculating a score based on the overall cluster (8 nodes)?

If I create a custom scheduler, do you know if it’s possible to isolate it to two nodes ? I’ve been scouring the internet but can’t find a way to do this. It might be easier to control the pods if the scheduler isn’t aware of the whole cluster to begin with ?

Theog75 · September 4, 2022, 8:21am

In addition to tolerations and node selection (including affinity), the scheduler relies on CPU and memory requests.
It does not balance the load based on consumption but based on the CPU and memory requests.

My guess is that the 100 millicore request is able to deploy on that node regardless of its current usage (which the scheduler does not take into account).

The way to inform the scheduler about an application’s consumption is through requests. I think profiling these applications properly will solve your issue.

John_Gallagher · September 4, 2022, 11:47am

Thank you!

Sorry, but I don’t understand what you mean when you say it balances on requests, not consumption. Are you saying it adds up the requests of every pod on each node, and the node with the smallest total CPU and memory requests will be chosen?

Let’s take a simple example of 2 statefulsets with 2 replicas each, every replica requesting 5 CPU/5GB memory.

There are two worker nodes with 20 CPU/20GB memory each. If node A is 80% memory used and node B is 20% used, and we add an additional statefulset requesting 2 CPU/2GB, how is the scheduler going to decide which node to place this new application on?

Theog75 · September 4, 2022, 12:23pm

The scheduler nominates a Pod to a node (regardless of whether the Pod belongs to a statefulset, deployment or cronjob) based on the amount of CPU and memory already requested on that node. Usage is not relevant to the scheduler.

As per the example you gave, you have a node with 20 cores and 20GB of RAM allocatable (regardless of node capacity, i.e. if you have system-reserved or kubelet-reserved resources, the allocatable amount will be lower than the capacity). If a Pod has a request of 5 cores and 5GB of RAM, the scheduler will nominate it only to nodes that still have at least 5 cores and 5GB of memory left unrequested, regardless of actual usage on that node.
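
To make the 20 CPU/20GB example concrete, here is a rough Python sketch of the eligibility check described above. It is not the scheduler’s real code. The placement of the existing replicas (three on node A, one on node B) is an assumption made up for illustration, as are the `Node` and `fits` names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alloc_cpu: float         # allocatable CPU cores
    alloc_mem_gb: float      # allocatable memory in GB
    req_cpu: float = 0.0     # sum of CPU requests of pods already on the node
    req_mem_gb: float = 0.0  # sum of memory requests of pods already on the node
    mem_used_pct: int = 0    # actual usage; deliberately never read by fits()


def fits(node, cpu, mem_gb):
    """A node is eligible if existing requests plus the new request fit in allocatable."""
    return (node.req_cpu + cpu <= node.alloc_cpu
            and node.req_mem_gb + mem_gb <= node.alloc_mem_gb)


# Assumed placement: three 5 CPU/5GB replicas on A, one on B.
node_a = Node("A", 20, 20, req_cpu=15, req_mem_gb=15, mem_used_pct=80)
node_b = Node("B", 20, 20, req_cpu=5, req_mem_gb=5, mem_used_pct=20)

eligible = [n.name for n in (node_a, node_b) if fits(n, cpu=2, mem_gb=2)]
print(eligible)  # ['A', 'B']
```

Both nodes pass for the 2 CPU/2GB pod even though node A is at 80% actual memory usage, because 15 + 2 still fits within 20. Under the same assumed placement, a pod requesting 6 CPU would only be eligible for node B.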

You can check the requests on a node by describing it (run kubectl describe node <nodename>):

(Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3822m (47%)   4874m (60%)
  memory             2560Mi (10%)  5404Mi (22%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)

In the above example you can see that CPU requests are at 47%. Even if actual CPU usage on that node is 100%, the scheduler will still be able to nominate and deploy a Pod with a 100 millicore CPU request on that node. The Pod might then be throttled if actual CPU usage on that node is too high, but that does not matter to the scheduler.

My advice is to perform application profiling on your Pods (dynamically for each version) and make sure you set the proper requests (and limits) for each Pod.

I would love to have a chat about this; if you wish, email me.

thockin · September 4, 2022, 3:37pm

To emphasize the great answers here:

As long as the sum of the requests is less than that node’s allocatable capacity, the node is eligible.

If your apps are using more than the request, you are lying to the scheduler and you can’t be too surprised if it makes the “wrong” choice.

It’s generally OK to do this with CPU, since it can easily be re-balanced to requests, but memory cannot. If you request 4Gi of memory but actually need 8Gi, you should fix your request.
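
A small Python sketch of what “lying to the scheduler” looks like in numbers. The 6Gi request and the 120 GB node size come from this thread. The per-pod usage figures and the number of pods on the node are invented for illustration, and the sketch assumes the whole 120 GB is allocatable.

```python
NODE_ALLOCATABLE_GI = 120  # assumption: all 120 GB of the node is allocatable
REQUEST_GI = 6             # memory request per statefulset pod, as posted above

# (requested Gi, actually used Gi) per pod; the usage values are hypothetical.
pods_on_node = [(REQUEST_GI, 24), (REQUEST_GI, 24), (REQUEST_GI, 24), (REQUEST_GI, 24)]

requested = sum(req for req, _ in pods_on_node)
used = sum(use for _, use in pods_on_node)

requested_pct = 100 * requested / NODE_ALLOCATABLE_GI
used_pct = 100 * used / NODE_ALLOCATABLE_GI

# How many more 6Gi pods the scheduler still considers placeable on this node.
room_for = (NODE_ALLOCATABLE_GI - requested) // REQUEST_GI

print(f"scheduler sees {requested_pct:.0f}% of memory requested")  # 20%
print(f"node is actually at {used_pct:.0f}% memory usage")         # 80%
print(f"scheduler would still accept {room_for} more 6Gi pods")    # 16
```

Under these assumed figures the node looks 80% full to an operator but only 20% full to the scheduler, which is why it keeps receiving pods. Raising the request to match real usage closes that gap.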

John_Gallagher · September 8, 2022, 9:15am

Thanks, so you mean it basically checks the total requests of the pods on each node (kubectl describe node) and will send the pod to the node which has the lowest total requests, regardless of how much memory is actually being used?

Thanks, I thought the scheduler was more intuitive than that.

Theog75 · September 8, 2022, 10:24am

It looks at memory and CPU requests: if either of them is scarce REQUEST-WISE on a node, and the Pod’s request is higher than what is still available on that node, the Pod will not be nominated to that node.

thockin · September 8, 2022, 6:25pm

Pedantically, it’s not always the node which has the lowest total requests. It looks at the total requests of all pods currently scheduled on a given node and other factors and makes a decision. Requests are considered, limits and “actual usage” are not.
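
As a rough illustration of “requests are considered, usage is not”, here is a simplified Python sketch loosely modelled on a least-allocated style score, where a node scores higher the larger the unrequested share of its allocatable resources is. It is not the scheduler’s actual implementation: the real scheduler combines several scoring plugins with weights and also accounts for the incoming pod. The pod placement is hypothetical, and the 16 CPU/120 GB node size is treated as fully allocatable.

```python
def least_allocated_score(requested, allocatable):
    """Score 0..100: average unrequested share across resources (higher is emptier)."""
    per_resource = [
        (allocatable[r] - requested[r]) * 100 / allocatable[r]
        for r in allocatable
    ]
    return sum(per_resource) / len(per_resource)


# Node size from the thread, assumed fully allocatable.
allocatable = {"cpu_m": 16000, "mem_gi": 120}

# Hypothetical placement of the 100m / 6Gi pods: three on node A, one on node B.
node_a_requests = {"cpu_m": 300, "mem_gi": 18}
node_b_requests = {"cpu_m": 100, "mem_gi": 6}

score_a = least_allocated_score(node_a_requests, allocatable)
score_b = least_allocated_score(node_b_requests, allocatable)

print(f"node A score: {score_a:.1f}")
print(f"node B score: {score_b:.1f}")
```

Nothing in the score reads actual memory or CPU usage. With requests this small relative to node size, both nodes score close to the maximum, so the other factors mentioned above can easily decide the placement.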
