Hello everyone!
I need help with an incident that rendered a Kubernetes cluster unusable. Below are the details:
Specifications
- Kubernetes version: 1.29.7
- Rancher version: 2.10.0
- OS: Ubuntu 24.04 LTS (noble)
- Kernel: 6.8.0-57-generic
- Environment: On-premise VMs
Issue Description
The cluster worked fine for two months. After a VM reboot, pod communication via webhooks started failing, which broke the applications.
Attempted Fixes
- Restarted the entire environment (VMs + services).
- Upgraded Kubernetes.
- Restored VM backups from a working state (issue persisted).
- Reconfigured iptables and ip6tables.
Error Logs
1. kube-apiserver
I0413 16:30:18.599361 1 trace.go:236] Trace[520599913]: "List" accept:application/vnd.kubernetes.protobuf […]
W0413 16:30:18.868311 1 aggregator.go:166] failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec […]
E0413 16:30:19.074533 1 available_controller.go:460] v1beta1.metrics.k8s.io failed with: failing or missing response from https://:/apis/metrics.k8s.io/v1beta1 […]
2. kube-proxy
I0413 16:34:42.947162 1 server_others.go:168] "Using iptables Proxier"
E0413 16:34:43.230905 1 proxier.go:1525] "Failed to execute iptables-restore" err=<
    exit status 2: Ignoring deprecated --wait-interval option.
    Warning: Extension MARK revision 0 not supported, missing kernel module?
    ip6tables-restore v1.8.8 (legacy): unknown option "--xor-mark"
    Error occurred at line: 17
 >
3. rke2-snapshot-controller
E0413 02:35:52.434128 1 main.go:89] Failed to list v1 volumesnapshots with error=Get "https://:/apis/snapshot.storage.k8s.io/v1/volumesnapshots": dial tcp :: i/o timeout
4. kube-scheduler
E0413 02:33:48.318390 1 reflector.go:147] […] failed to list *v1.ConfigMap: Get "https://:/api/v1/namespaces/kube-system/configmaps[…]": dial tcp :: connect: connection refused
5. Rancher & Webhook
E0413 16:41:34.741026 39 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1 […]
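In case it helps anyone triaging the same kind of output: the component logs above all use the klog header layout (severity letter, month/day, time, PID, source file and line, then the message). Below is a rough sketch of how the warnings and errors could be filtered out of a larger dump. The names `parse_klog` and `problems` are made up for this illustration, and the regular expression is only an assumption based on the lines quoted in this post.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Assumed klog header layout, based on the lines quoted in this post:
#   <I|W|E|F><mmdd> <hh:mm:ss.uuuuuu> <pid> <file>:<line>] <message>
KLOG_HEADER = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<source>[^\s:]+:\d+)\]\s?"
    r"(?P<message>.*)$"
)


class KlogLine(NamedTuple):
    severity: str  # "I", "W", "E" or "F"
    date: str      # month and day, e.g. "0413"
    time: str
    pid: int
    source: str    # e.g. "proxier.go:1525"
    message: str


def parse_klog(line: str) -> Optional[KlogLine]:
    """Parse one log line; return None for lines without a klog header."""
    m = KLOG_HEADER.match(line.strip())
    if m is None:
        return None
    return KlogLine(
        m["severity"], m["date"], m["time"],
        int(m["pid"]), m["source"], m["message"],
    )


def problems(lines: Iterable[str]) -> Iterator[KlogLine]:
    """Yield only the warning, error and fatal entries."""
    for line in lines:
        parsed = parse_klog(line)
        if parsed is not None and parsed.severity in {"W", "E", "F"}:
            yield parsed
```

Continuation lines (such as the body of the multi-line `err=<` block from kube-proxy) have no header, so `parse_klog` returns `None` for them and they would need to be attached to the preceding entry separately.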
Key Observations
- Metrics API Failure: Persistent 503 errors from metrics.k8s.io/v1beta1.
- Network Issues: Timeouts (i/o timeout) and connection refused errors.
- iptables Misconfiguration: ip6tables-restore fails with --xor-mark error (legacy version incompatibility?).
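The kube-proxy output hints at two things worth checking: which iptables build and backend is in use, and whether the kernel module behind the MARK extension is loaded. Here is a minimal sketch of both checks. It assumes the banner format seen in the log above (`<tool> v<version> (<backend>)`) and that the relevant module is named `xt_mark`, which I have not confirmed for this kernel. The function names are made up for this illustration.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

# Assumed banner format, taken from the kube-proxy log line in this post,
# e.g. 'ip6tables-restore v1.8.8 (legacy)'.
BANNER = re.compile(
    r"(?P<tool>ip6?tables(?:-restore|-save)?) "
    r"v(?P<version>\d+(?:\.\d+)+) "
    r"\((?P<backend>legacy|nf_tables)\)"
)


def parse_banner(text: str) -> Optional[Tuple[str, str, str]]:
    """Return (tool, version, backend) if the text contains a banner."""
    m = BANNER.search(text)
    if m is None:
        return None
    return (m["tool"], m["version"], m["backend"])


def loaded_modules(proc_modules_text: str) -> Set[str]:
    """Module names from the contents of /proc/modules (first field per line)."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text: str, required: Iterable[str]) -> List[str]:
    """List the required modules that do not appear as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]
```

On a node, one could pass the contents of `/proc/modules` and `["xt_mark"]` to `missing_modules`, and compare the result of `parse_banner` on the host's `iptables --version` output with the banner from the kube-proxy log. Note that a module compiled directly into the kernel does not show up in `/proc/modules`, so a "missing" result here is only a hint, not proof.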