Hello! I’m reaching out to ask for help with a specific issue I’m encountering with a CronJob in Kubernetes, and to understand what could be causing this behaviour.
I have one specific cluster where a daily CronJob is consistently delayed by 90-120 seconds. The exact same job in three other clusters runs on time, with a normal 3-4 second delay, so the failure is specific to this one cluster.
Here are the success timestamps for the same job, all scheduled for 03:00:00Z:

- Healthy Cluster 1: 2025-11-14T03:00:03.928Z (3s delay)
- Healthy Cluster 2: 2025-11-14T03:00:03.375Z (3s delay)
- Healthy Cluster 3: 2025-11-14T03:00:04.617Z (4s delay)
- Problem Cluster: 2025-11-14T03:01:38.500Z (98s delay)
We suspect the cause of the delay is a roughly 92-second stall inside the kube-controller-manager on the problem cluster:
I1112 03:00:00.107020 1 job_controller.go:568] "enqueueing job" logger="job-controller" delay="0s"
I1112 03:00:00.111085 1 job_controller.go:568] "enqueueing job" logger="job-controller" delay="0s"
I1112 03:00:00.138959 1 job_controller.go:568] "enqueueing job" logger="job-controller" delay="1s"
...
I1112 03:00:00.207072 1 job_controller.go:568] "enqueueing job" logger="job-controller" delay="1s"
[... 92-SECOND GAP IN LOGS ...]
I1112 03:01:32.532598 1 job_controller.go:568] "enqueueing job" logger="job-controller" delay="0s"
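For context, these log lines were pulled roughly like this (the pod name is a placeholder for the actual kube-controller-manager static pod on the problem cluster's control-plane node):

```bash
# Filter the job controller's "enqueueing job" entries from the controller logs
kubectl logs -n kube-system kube-controller-manager-<node-name> \
  | grep 'job_controller.go' | grep 'enqueueing job'
```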
The CronJob’s own status field confirms this consistent delay.
# From: kubectl get cronjob control-message-sender -o yaml
status:
  lastScheduleTime: "2025-11-12T03:00:00Z"
  lastSuccessfulTime: "2025-11-12T03:01:36Z"
The configurations of the “good” (on-time) jobs and the “bad” (delayed) job show no meaningful differences (such as `startingDeadlineSeconds`) that would explain this behavior.
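For reference, this is roughly how the scheduling-related fields were compared across the four clusters (the context names are placeholders for the actual kubeconfig contexts):

```bash
# Print schedule, startingDeadlineSeconds and concurrencyPolicy for each cluster side by side
for ctx in healthy-1 healthy-2 healthy-3 problem; do
  kubectl --context "$ctx" get cronjob control-message-sender \
    -o jsonpath='{.spec.schedule} {.spec.startingDeadlineSeconds} {.spec.concurrencyPolicy}{"\n"}'
done
```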
Factors Investigated (Ruled Out)
- Clock Skew: We have checked the clock on the node and in the `kube-controller-manager` pod; both are in sync with UTC. The logs from 03:00 also show other jobs being enqueued at the correct time, so the clock is not the issue. (The check is sketched after this list.)
- Ghost Resource Errors: This cluster did have a `Failed to list *v1.PartialObjectMetadata` error loop. We fixed it by restarting the controller pod. The errors are now gone, but the 90-120 second delay persists, so the error loop was a separate symptom, not the root cause of the stall.
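For reference, the clock check was done roughly along these lines (the node IP and pod name are placeholders; since Talos has no SSH, `talosctl time` is used to read the node clock, and `--timestamps` makes kubectl print the container runtime's timestamps next to klog's own):

```bash
# Current time on the problem cluster's control-plane node (Talos)
talosctl time --nodes <control-plane-ip>

# Compare the runtime timestamps against the timestamps klog itself writes
kubectl logs -n kube-system kube-controller-manager-<node-name> --since=5m --timestamps | tail -n 20
```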
Question:
Is it possible that this systematic delay on this specific cluster is due to the approximate scheduling behaviour that Kubernetes CronJobs have? Or is there another reason that could be causing it?
Thank you for your help!
Environment:
- Kubernetes version: v1.31.1
- Cloud being used: bare-metal
- Installation method: Talos OS
- Host OS: Talos Linux
- CNI and version: Cilium
- CRI and version: containerd