Hey :=),
I am working on a custom scheduler that works roughly like a batch scheduler, but I have run into a small problem.
For example, suppose I want to do the following (oversimplified):
Schedule Pod A first but keep it waiting (via the permit phase). Then schedule Pod B. If, for whatever reason, none of the nodes satisfies our requirements, redo the scheduling process for Pod A, but filter out the node that was previously chosen for Pod A.
When I use the following in the permit phase to accomplish this:
- return framework.NewStatus(framework.Unschedulable, "Test case 2") (to discard the current Pod) or
- waitingPod.Reject(ss.Name(), "Test case 4") (to discard a previous Pod)
depending on what we want to accomplish.
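To make the flow concrete, here is a minimal, self-contained sketch of the decision logic described above. It deliberately uses stand-in types instead of the real k8s.io/kubernetes/pkg/scheduler/framework package, so it is a model of the idea and not the actual plugin. The names batchState, permit, filter, and the requirementsMet flag are hypothetical and only serve the illustration; it also shows just one possible combination, where the waiting pod is rejected and the current pod is discarded together.

```go
package main

import "fmt"

// Code is a stand-in for the status codes a Permit plugin can return.
// Assumption: only the three codes relevant to this example are modeled.
type Code int

const (
	Success Code = iota
	Wait
	Unschedulable
)

// batchState is a hypothetical helper that remembers which pod is parked
// in the permit phase and which nodes must be skipped on its next attempt.
type batchState struct {
	waitingPod  string                     // pod currently held in permit ("" if none)
	waitingNode string                     // node that was chosen for the waiting pod
	excluded    map[string]map[string]bool // pod name -> nodes to filter out
}

func newBatchState() *batchState {
	return &batchState{excluded: map[string]map[string]bool{}}
}

// filter models the Filter phase: a node is feasible for a pod unless it
// was recorded as excluded after an earlier rejection.
func (s *batchState) filter(pod, node string) bool {
	return !s.excluded[pod][node]
}

// permit models the Permit phase. It returns the code for the current pod
// and the name of a previously waiting pod that should be rejected, if any.
func (s *batchState) permit(pod, node string, requirementsMet bool) (Code, string) {
	if s.waitingPod == "" {
		// First pod of the batch: park it and wait for the next one.
		s.waitingPod, s.waitingNode = pod, node
		return Wait, ""
	}
	if requirementsMet {
		// The batch fits: the waiting pod can be allowed as well.
		s.waitingPod, s.waitingNode = "", ""
		return Success, ""
	}
	// The batch does not fit: remember the node chosen for the waiting pod
	// so that its next scheduling attempt filters it out, then reject it.
	rejected := s.waitingPod
	if s.excluded[rejected] == nil {
		s.excluded[rejected] = map[string]bool{}
	}
	s.excluded[rejected][s.waitingNode] = true
	s.waitingPod, s.waitingNode = "", ""
	return Unschedulable, rejected
}

func main() {
	s := newBatchState()
	code, _ := s.permit("pod-a", "node-1", true)
	fmt.Println("pod-a waits:", code == Wait)

	code, rejected := s.permit("pod-b", "node-2", false)
	fmt.Println("pod-b unschedulable:", code == Unschedulable, "rejected:", rejected)
	fmt.Println("node-1 still feasible for pod-a:", s.filter("pod-a", "node-1"))
}
```

In the real plugin, the Unschedulable return corresponds to the framework.NewStatus call and the rejected pod name corresponds to calling waitingPod.Reject on that pod.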
The selected pods get marked as “scheduling failed” but remain in the “Pending” state. However, the time until the scheduler tries to schedule them again is inconsistent: sometimes it retries instantly (which is what I want), and sometimes nothing happens for up to 5–6 minutes.
So basically my question is, what causes this behavior? Specifically, what determines when the scheduler retries scheduling if the scheduling of a pod fails? And is it possible to reduce the retry time? Can this be achieved by tags in the deployment file, configuration changes, or manually moving the pods into the activeQ?
Maybe somebody can help me with this.