Problem with communications in a K8s RKE2 cluster

Hello everyone!

I need help with an incident that rendered a Kubernetes cluster unusable. Below are the details:

Specifications

  • Kubernetes version: 1.29.7
  • Rancher version: 2.10.0
  • OS: Ubuntu 24.04 LTS (noble)
  • Kernel: 6.8.0-57-generic
  • Environment: On-premise VMs

Issue Description

The cluster worked fine for two months. After a VM reboot, pod communication via webhooks started to fail, which broke the applications.

Attempted Fixes

  1. Restarted the entire environment (VMs + services).
  2. Upgraded Kubernetes.
  3. Restored VM backups from a working state (issue persisted).
  4. Reconfigured iptables and ip6tables.
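
Regarding step 4, it may help to record which iptables backend each node reports, since the kube-proxy error in the logs section prints a "(legacy)" banner. Below is a minimal sketch, not something I have validated on the cluster: `parse_backend` and `host_backends` are hypothetical helper names, and the regular expression assumes the banner format seen in that log line ("v1.8.8 (legacy)"). Note that the binaries inside the kube-proxy container may differ from the ones on the host.

```python
import re
import subprocess

# Assumed banner shape, e.g. "ip6tables-restore v1.8.8 (legacy)"
BANNER = re.compile(r"v(\d+(?:\.\d+)+)\s+\((legacy|nf_tables)\)")


def parse_backend(banner: str):
    """Return (version, backend) from an iptables-style banner, or None."""
    match = BANNER.search(banner)
    return (match.group(1), match.group(2)) if match else None


def host_backends(tools=("iptables", "ip6tables")):
    """Run `<tool> --version` on this host and parse each banner."""
    result = {}
    for tool in tools:
        try:
            out = subprocess.run(
                [tool, "--version"],
                capture_output=True,
                text=True,
                check=False,
            ).stdout
        except FileNotFoundError:
            out = ""  # tool not installed on this host
        result[tool] = parse_backend(out)
    return result


if __name__ == "__main__":
    # Banner copied from the kube-proxy error in this post
    print(parse_backend("ip6tables-restore v1.8.8 (legacy)"))
    print(host_backends())
```

If the host reports nf_tables while kube-proxy reports legacy (or the other way around), that mismatch would be worth mentioning in any reply.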

Error Logs

1. kube-apiserver

I0413 16:30:18.599361 1 trace.go:236] Trace[520599913]: "List" accept:application/vnd.kubernetes.protobuf […]
W0413 16:30:18.868311 1 aggregator.go:166] failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec […]
E0413 16:30:19.074533 1 available_controller.go:460] v1beta1.metrics.k8s.io failed with: failing or missing response from https://:/apis/metrics.k8s.io/v1beta1 […]

2. kube-proxy

I0413 16:34:42.947162 1 server_others.go:168] "Using iptables Proxier"
E0413 16:34:43.230905 1 proxier.go:1525] "Failed to execute iptables-restore" err=<
    exit status 2: Ignoring deprecated --wait-interval option.
    Warning: Extension MARK revision 0 not supported, missing kernel module?
    ip6tables-restore v1.8.8 (legacy): unknown option "--xor-mark"
    Error occurred at line: 17
 >
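
Because the warning above asks "missing kernel module?", one thing that can be checked on each node is whether the relevant netfilter modules are loaded. The following is a rough sketch only: the module list in `WANTED` is my assumption, not something taken from the logs, and the helper names are made up for illustration. Modules compiled into the kernel do not show up in /proc/modules, so a name reported here is a hint, not proof that it is unavailable.

```python
from pathlib import Path

# Assumed list of modules to look for; adjust as needed.
WANTED = ("xt_mark", "ip6_tables", "ip6table_nat", "ip6table_mangle")


def loaded_modules(proc_modules_text: str) -> set:
    """Extract module names (first column) from /proc/modules content."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def missing_modules(proc_modules_text: str, wanted=WANTED) -> list:
    """Return the wanted modules that are not listed as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in wanted if name not in loaded]


if __name__ == "__main__":
    text = Path("/proc/modules").read_text()
    print("not listed in /proc/modules:", missing_modules(text))
```

Running it on a healthy node and on a broken node and comparing the two lists would show whether the reboot changed which modules get loaded.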

3. rke2-snapshot-controller

E0413 02:35:52.434128 1 main.go:89] Failed to list v1 volumesnapshots with error=Get "https://:/apis/snapshot.storage.k8s.io/v1/volumesnapshots": dial tcp :: i/o timeout

4. kube-scheduler

E0413 02:33:48.318390 1 reflector.go:147] […] failed to list *v1.ConfigMap: Get "https://:/api/v1/namespaces/kube-system/configmaps[…]": dial tcp :: connect: connection refused

5. Rancher & Webhook

E0413 16:41:34.741026 39 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1 […]


Key Observations

  • Metrics API Failure: Persistent 503 errors from metrics.k8s.io/v1beta1.
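  • Network Issues: rke2-snapshot-controller times out (i/o timeout) and kube-scheduler gets connection refused when calling the API server.
  • iptables Misconfiguration: ip6tables-restore v1.8.8 (legacy) rejects the --xor-mark option and warns that the MARK extension is not supported (missing kernel module or legacy version incompatibility?).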
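
To tell the two network symptoms apart from any node or pod, a small TCP probe can help: a timeout usually means packets are being dropped somewhere, while connection refused means the host answered but nothing is listening on that port. This is a minimal sketch, assuming plain TCP reachability is what matters. The `probe` function is a hypothetical helper, and the address and port in the usage section are placeholders, since the real ones are elided in the logs above.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as open, refused, or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # host reachable, nothing listening
    except socket.timeout:
        return "timeout"   # no answer within the time limit
    except OSError as exc:
        return f"error: {exc}"  # e.g. no route to host, DNS failure


if __name__ == "__main__":
    # Placeholder target (assumption): replace with the API server
    # or service address that the failing component is trying to reach.
    print(probe("127.0.0.1", 6443))
```

Comparing the result from the host network with the result from inside a pod should show whether the failure sits in the pod network path.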