Seeking help from k8s experts.
I’m facing an issue with my custom operator: after a few hours (or days) of running, custom resources stop being reconciled for any CREATE/UPDATE/DELETE event. While debugging, I observed that the operator stops receiving events in the controller’s reflector watcher. No watcher errors are logged around the time the issue starts. The logs do contain general watch errors, such as “too old resource version” or “unexpected EOF”, but those are recoverable. To recover, I’m forced to restart the operator pod, after which things work until the same issue recurs a few hours or days later.
Is there any way to recover from a “silent” dead watch in this situation? I do not have access to the reflector, as it is managed internally by controller-runtime. If this cannot be handled in the operator, what else can I check to investigate the issue?
Kubernetes version: 1.34.1
Controller-Runtime version: v0.22.4
Note:
I have tried setting the cache SyncPeriod, but it only triggers periodic reconciliation of resources that are already in the operator’s cache. Newly created resources are still not picked up.
Cluster information:
Cloud being used: OCI
Installation method: Oracle Kubernetes Engine
Host OS: Oracle Linux 8.10