Cluster information:
Kubernetes version: v1.23.5
Cloud being used: bare-metal
Host OS: Linux (Ubuntu 20.04)
CNI and version: calico v3.20.3
CRI and version: containerd v1.5.8
We are trying to deploy a pod running a modified version of YOLOv5 using only the CPU (no GPU).
The container's resources.requests.cpu is currently set to 6 (a sketch of the spec is below).
The node is a Linux machine with 24 CPUs.
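For reference, the relevant part of the deployment spec looks roughly like this. This is a minimal sketch, not the exact manifest: the deployment name, labels, and image are placeholders, and only the CPU request mentioned above is shown (no limit is assumed here).

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yolo-abc-1              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: yolo-abc-1
  template:
    metadata:
      labels:
        app: yolo-abc-1
    spec:
      containers:
      - name: yolo
        image: registry.local/yolov5-cpu:latest   # placeholder image
        resources:
          requests:
            cpu: "6"            # the requests.cpu value described above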
If we start one instance of the pod, it uses ~3000m of CPU and the YOLO engine takes about 200 ms to analyze one frame.
When we start a 2nd pod instance from the same image, the resource increase is similar: ~3000m of CPU and about 200 ms to analyze one frame.
At this point,
kubectl top pod
shows the CPU and memory usage of the pods as:
yolo-abc-1-f885d644c-54lg2 3034m 596Mi
yolo-abc-11-6565756b5f-ljc44 2890m 562Mi
kubectl top node
shows the load on the node as:
workerxxx 6774m 28% 15881Mi 19%
When we try to start a 3rd pod, it starts successfully at first, but very quickly analyzing a frame takes about 12 seconds.
The load on the node also jumps suddenly to 97%:
workerxxx 23396m 97% 17009Mi 21%
yolo-abc-1-f885d644c-54lg2 7101m 587Mi
yolo-abc-11-6565756b5f-ljc44 7210m 564Mi
yolo-abc-4-54846b5475-jlb7p 7114m 5499Mi
The performance of the original two pods, which had been running fine, is impacted as well.
The observation is consistent: it is always when the 3rd pod starts that performance degrades.
Any suggestions on how to approach, troubleshoot, or work around this problem are greatly appreciated.