Our production cluster runs fine on k8s 1.12.3-rancher1-1, with several nodes spread across two networks: 192.168.225.0/24 (2 nodes) and 172.30.0.0/24 (6 nodes).
When the cluster is upgraded to any newer version of k8s (verified with 1.16.4-rancher1-1 and 1.17.5-rancher1-1), communication between nodes across these two networks fails.
To reproduce the issue, set up the following environment. It is not necessary to perform an upgrade from 1.12.3 to a newer version; a clean install of any newer version seems to produce the same result:
- 3 VMs running Ubuntu 16.04 LTS
- one VM: GATEWAY (172.30.0.1; 192.168.225.1), forwarding packets between the two networks and providing internet access (see the sketch after this list)
- one VM: CORE01 (172.30.0.2) as etcd, controlplane and worker
- one VM: FRONTEND01 (192.168.225.2) as worker
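
A minimal sketch of how GATEWAY can be set up for this, assuming iptables and an internet-facing interface named eth0 (adjust interface names and make the rules persistent as needed):

```bash
# enable IPv4 forwarding
sudo sysctl -w net.ipv4.ip_forward=1

# NAT everything leaving through the (assumed) internet-facing interface
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# route between the two internal segments
sudo iptables -A FORWARD -s 172.30.0.0/24 -d 192.168.225.0/24 -j ACCEPT
sudo iptables -A FORWARD -s 192.168.225.0/24 -d 172.30.0.0/24 -j ACCEPT
sudo iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
```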
cluster.yml:

```yaml
nodes:
  # frontend nodes
  - address: 192.168.225.2
    role:
      - worker
    hostname_override: frontend01
    labels:
      tier: frontend
      environment: Production
    user: deployuser
    ssh_key_path: ./frontend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent
  # core nodes
  - address: 172.30.0.2
    role:
      - controlplane
      - etcd
      - worker
    hostname_override: core01
    labels:
      tier: core
      environment: Production
    user: deployuser
    ssh_key_path: ./backend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent

# Cluster Level Options
cluster_name: production
ignore_docker_version: false
kubernetes_version: "v1.16.4-rancher1-1"

# SSH Agent
ssh_agent_auth: false # use the rke built-in agent

# deploy an ingress controller on all nodes
ingress:
  provider: nginx
  options:
    server-tokens: false
    ssl-redirect: false
```

Firewall rules:

| host | rule |
|---|---|
| FRONTEND01 | allow 8472/udp from 172.30.0.2 |
| FRONTEND01 | allow 10250/tcp from 172.30.0.2 |
| FRONTEND01 | allow ssh |
| CORE01 | allow 6443/tcp from 192.168.225.2 |
| CORE01 | allow 8472/udp from 192.168.225.2 |
| CORE01 | allow ssh |
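
A sketch of how these rules can be expressed, assuming ufw on Ubuntu (any equivalent firewall setup should do):

```bash
# on FRONTEND01
sudo ufw allow from 172.30.0.2 to any port 8472 proto udp    # overlay network (VXLAN)
sudo ufw allow from 172.30.0.2 to any port 10250 proto tcp   # kubelet
sudo ufw allow ssh

# on CORE01
sudo ufw allow from 192.168.225.2 to any port 6443 proto tcp # kube-apiserver
sudo ufw allow from 192.168.225.2 to any port 8472 proto udp # overlay network (VXLAN)
sudo ufw allow ssh
```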
- Deploy the cluster using `rke` (v1.0.8) and wait for it to be ready; a minimal invocation is sketched below.
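
Assuming the cluster.yml above sits in the current directory:

```bash
# provision the cluster; rke writes kube_config_cluster.yml next to cluster.yml
rke up --config cluster.yml
```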
- Launch a CentOS pod on one of the nodes, e.g. CORE01:
```bash
kubectl run -it centos1 --rm --image=centos --restart=Never --overrides='{"apiVersion":"v1","spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["core01"]}]}]}}}}}' --kubeconfig kube_config_cluster.yml -- /bin/bash
```
- ping your favourite external site:

```bash
for i in {1..100}; do ping -c 1 wikipedia.com; done
```

Notice that name resolution is very slow and often fails completely.
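
A sketch of checks that can narrow this down to the cross-node overlay path (assumptions: the default RKE cluster DNS ClusterIP 10.43.0.10 and the canal/flannel VXLAN port 8472):

```bash
# on CORE01: watch for VXLAN traffic towards FRONTEND01 while the pod resolves names
sudo tcpdump -ni any udp port 8472

# inside the centos pod: query the cluster DNS service directly,
# to separate DNS timeouts from ICMP problems
yum install -y bind-utils
dig +time=2 +tries=1 wikipedia.com @10.43.0.10
```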
- Stop FRONTEND01 and wait for the cluster to recognize the lost node
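
To see when the node is considered lost, watching the node list is enough (kubeconfig path as written by rke):

```bash
# wait until frontend01 switches to NotReady
kubectl get nodes -w --kubeconfig kube_config_cluster.yml
```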
- ping again
Name resolution works fast and ping succeeds every time.
- Reset all VMs and change the network configuration for GATEWAY and CORE01: put them into a 192.168.0.0/16 network segment (but not the same one as FRONTEND01!)
- Deploy the cluster
- ping some external site
Name resolution works fast and ping succeeds every time.

| component | version |
|---|---|
| OS | Ubuntu 16.04.6 |
| docker | 19.03.1 (docker-ce, docker-ce-cli) |
| k8s | 1.12.3-rancher1-1 (ok); 1.16.4-rancher1-1 (failed); 1.17.5-rancher1-1 (failed) |
| rke | 1.0.8 |
| kubectl | 1.16.1 |