Node CPU usage increases from ~30% to ~100% with 1 additional deployment

Cluster information:

Kubernetes version: v1.23.5
Cloud being used: bare-metal
Host OS: linux ubuntu 20.04
CNI and version: calico v3.20.3
CRI and version: containerd v1.5.8

We are trying to deploy a pod running a modified version of yolov5 using only the CPU (no GPU).
The resources:requests:cpu is currently set to 6.

The node is a Linux machine with 24 CPUs.

If we start 1 instance of the pod, it takes up ~3000m of CPU and the YOLO engine takes about 200ms to analyze 1 frame.

When we start a 2nd pod instance of the same image, the resource increase is similar at ~3000m CPU and about 200ms to analyze 1 frame.

At this point, kubectl top pod shows the CPU and memory usage of the pods:
yolo-abc-1-f885d644c-54lg2 3034m 596Mi
yolo-abc-11-6565756b5f-ljc44 2890m 562Mi

kubectl top node shows the load on the node as:
workerxxx 6774m 28% 15881Mi 19%

When we try to start another pod, it starts successfully at first, but very quickly the analysis of a frame takes 12 seconds.
The load on the node also increases suddenly to 97%:
workerxxx 23396m 97% 17009Mi 21%

yolo-abc-1-f885d644c-54lg2 7101m 587Mi
yolo-abc-11-6565756b5f-ljc44 7210m 564Mi
yolo-abc-4-54846b5475-jlb7p 7114m 5499Mi

And the performance of the original 2 pods, which were running fine, is impacted as well.

The observation is consistent: it is at the 3rd pod that performance is impacted.

Any suggestions on how to approach, troubleshoot, or work around the problem are greatly appreciated.

Hi, hmnls

You seem to be experiencing performance issues and resource limitations when running multiple instances of the yolov5 pod in your Kubernetes cluster (if I understand the question correctly). Here are some troubleshooting suggestions that I hope you find helpful.

If a program itself is not running properly, dedicating a big server with a lot of resources to it won't help, so it is necessary to check the whole process chain.

First, make sure you’ve accurately defined your CPU and memory resource requests and limits in your pod specification. Each pod seems to use around 3000m CPU, which may be too much considering you have 24 CPUs on the node. Adjust these values based on the actual needs of your application.
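For reference, a minimal sketch of the container-level resources stanza (the container name, image, and numbers below are placeholders to adjust to your actual workload, not values taken from your setup):

containers:
- name: yolo                 # placeholder container name
  image: yolo-abc:latest     # placeholder image
  resources:
    requests:
      cpu: "3"               # roughly what one pod was observed to consume
      memory: 1Gi
    limits:
      cpu: "6"
      memory: 2Gi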

Second, check that the node has enough capacity to handle the increased load. A sudden increase in analysis time and node load can indicate that the node has reached its capacity. Consider scaling your cluster or optimizing resource allocation.

Check the logs of the affected pod (yolo-abc-4-…) for any error messages, warnings, or performance information. This can provide insights into what might be causing the slowdown.

Check if there are any specific YOLO settings or parameters that may be causing performance degradation. Experiment with different configurations and analyze the impact on performance.

Check if there is interference or resource conflict between pods on the same node. It is possible for the third pod to consume resources that affect the performance of other pods. Consider spreading pods across multiple nodes.
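If you do have more than one node available, a topology spread constraint (or pod anti-affinity) in the pod spec can push replicas onto different nodes. A rough sketch, assuming the pods carry a label like app: yolo (adjust to whatever labels your Deployment actually uses):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway   # use DoNotSchedule to enforce it strictly
  labelSelector:
    matchLabels:
      app: yolo                       # assumed label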

Check the overall health of your node. Make sure there are no underlying issues such as disk I/O bottlenecks or network problems affecting performance.

Also make sure you are using reasonably recent stable versions of Kubernetes, Calico, the node OS, and other components; upgrading may bring performance improvements and bug fixes.

Use profiling tools to analyze the YOLO app's performance. Determine which part of the application consumes the most resources and slows things down, then optimize the relevant code or settings. For example, run the same workload in another environment (e.g. Docker Swarm) for comparison. I usually use Elastic APM for application performance monitoring.

If the workload changes over time, implement horizontal pod autoscaling (HPA) to automatically adjust the number of pod replicas based on resource usage.
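A minimal HPA sketch based on CPU utilization (the names and thresholds below are placeholders; it relies on the CPU requests you already set):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: yolo-abc-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: yolo-abc-1            # placeholder, your Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80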
By systematically examining these aspects, you should be able to identify the root cause of the performance degradation and implement appropriate solutions or workarounds.

Hi jamallmahmoudi,

Thanks so much for the detailed information.

Please let me go through the suggestions carefully and gather more information before I make any assumptions.

I do log some of the information required, so I will provide it in a separate reply.

But I just want to say that the detailed reply is greatly appreciated.

Thanks again.

This is the additional information I was able to gather about the problem.

The resources:requests:cpu has been tested with different values: 1, 2, 4, 6, and 8.
In the case of 8 CPUs, the 3rd pod cannot be provisioned, because on a node with 24 CPUs Kubernetes cannot allocate enough CPU to start it.
For the other CPU settings (1, 2, 4, 6), the behavior did not change.
We did initially leave the CPU setting commented out, but the inference time was a few times slower than with any of the above values.

Regarding scaling the capacity, we actually do have 2 worker nodes, even in the example provided.
The behavior is almost identical in the sense that if I balance over the 2 worker nodes, the maximum number of pods that run well is 4 (2 on each node).
On the 5th pod, i.e. the 3rd pod on a specific worker, the CPU goes to 100%.
I do agree that there should be some increase in demand for system resources, but I do not know exactly which resource it is at this time.
My guess is the CPU.

On the logging of the pod, we did log the execution of each step and can pinpoint the line of code executing the prediction as taking the majority of the time, e.g.
pred = self.model(image, augment=False)
This line of code accounts for the largest percentage of the total time logged for YOLO (>90%).

As for comparing the CPU load required by the application, we did try running the Python app natively, in a Docker container, and in a pod.
The CPU usage is similar across the 3 environments, at around 400% CPU in all 3 cases.
However, I just found out that the CPU load of the pod might spike above 10 CPUs temporarily.
So I think I should be looking into reducing the CPU requirement of the app.

Hi, hmnls

Thanks for the detailed information!

It’s good that you tested different CPU resource requests, but it looks like there may still be challenges with allocating enough CPU for your pods, especially when requesting 8 CPUs. You may want to set CPU requests and limits based on the actual CPU usage patterns you observe during application execution. If the third pod reaches 100% CPU on a particular worker node, the node itself may have reached capacity. Consider monitoring the overall consumption of node resources (CPU, memory) during pod scaling. If a node is consistently reaching its resource limits, you may need to scale your cluster or optimize application resource usage.

Since you’ve identified a particular line of code (pred = self.model(image, augment=False)) as a bottleneck, you may want to look for ways to optimize this part of your program. This could include profiling the code, using more efficient algorithms, or parallelizing certain computations.
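One more thing that may be worth checking, since yolov5 runs on PyTorch: PyTorch's CPU thread pool is typically sized from the number of cores on the host, not from the pod's CPU limit, so each pod can spin up ~24 worker threads and the third pod may tip the node into oversubscription. A rough sketch of capping that via the container environment (the value 6 is only an example to match your CPU request/limit; calling torch.set_num_threads() inside the app would be an equivalent option):

env:
- name: OMP_NUM_THREADS      # OpenMP threads used by PyTorch's CPU kernels
  value: "6"                 # keep this at or below the pod's CPU limit
- name: MKL_NUM_THREADS      # only relevant if the build uses MKL
  value: "6"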

If you fail, as a last resort you should shoot the app developer

Hi jamallmahmoudi,

Sorry for the late reply, as I missed your response.

Thanks for the advice.

I have also tested limiting the CPU to fit the expected number of pods on a node.
However, while it does manage to deploy the expected number of pods, almost immediately each pod uses up all the CPU it is allocated, and the processing time for a frame keeps getting longer.
I think this is a sign of CPU thrashing (but I may be wrong).

We somehow managed to work around it (not really solve it).
I think the CPU load is heavily dependent on the size of the model.
Thus, with help from team members, by replacing the model with one that is much smaller, we are able to load more pods on a server.
So, in a way, we are now able to fit the expected number of pods on a server.

> If you fail, as a last resort you should shoot the app developer

If you're referring to the one who coded the app (not YOLO), I don't think that's a good idea.
I would have to shoot myself then. :sweat_smile::sweat_smile::sweat_smile: