Our production cluster runs fine on k8s 1.12.3-rancher1-1, with several nodes spread across two networks: 192.168.225.0/24 (2 nodes) and 172.30.0.0/24 (6 nodes).
When the cluster is upgraded to any newer version of k8s (verified with 1.16.4-rancher1-1 and 1.17.5-rancher1-1), communication between nodes across these two networks fails.
To reproduce the issue, set up the following environment. It is not necessary to perform an upgrade from 1.12.3 to a newer version; a clean install of any newer version seems to produce the same result:
- 3 VMs running Ubuntu 16.04 LTS
- one VM: GATEWAY (172.30.0.1; 192.168.225.1), forwarding packets between the two networks and providing internet access (see the sketch after this list)
- one VM: CORE01 (172.30.0.2) as etcd, controlplane and worker
- one VM: FRONTEND01 (192.168.225.2) as worker
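
A minimal sketch of how GATEWAY can be set up for this, assuming iptables and an internet-facing interface named eth0 (adjust interface names and make the rules persistent as needed):

```bash
# enable IPv4 forwarding
sudo sysctl -w net.ipv4.ip_forward=1

# NAT everything leaving through the (assumed) internet-facing interface
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# route between the two internal segments
sudo iptables -A FORWARD -s 172.30.0.0/24 -d 192.168.225.0/24 -j ACCEPT
sudo iptables -A FORWARD -s 192.168.225.0/24 -d 172.30.0.0/24 -j ACCEPT
sudo iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
```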
cluster.yml:

```yaml
nodes:
  # frontend nodes
  - address: 192.168.225.2
    role:
      - worker
    hostname_override: frontend01
    labels:
      tier: frontend
      environment: Production
    user: deployuser
    ssh_key_path: ./frontend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent
  # core nodes
  - address: 172.30.0.2
    role:
      - controlplane
      - etcd
      - worker
    hostname_override: core01
    labels:
      tier: core
      environment: Production
    user: deployuser
    ssh_key_path: ./backend.key
    # note: for support of a key with a passphrase see https://rancher.com/docs/rke/v0.1.x/en/config-options/#ssh-agent

# Cluster Level Options
cluster_name: production
ignore_docker_version: false
kubernetes_version: "v1.16.4-rancher1-1"

# SSH Agent
ssh_agent_auth: false # use the rke built-in agent

# deploy an ingress controller on all nodes
ingress:
  provider: nginx
  options:
    server-tokens: false
    ssl-redirect: false
```

Firewall rules:

| host | rule |
|---|---|
| FRONTEND01 | allow 8472/udp from 172.30.0.2 |
| FRONTEND01 | allow 10250/tcp from 172.30.0.2 |
| FRONTEND01 | allow ssh |
| CORE01 | allow 6443/tcp from 192.168.225.2 |
| CORE01 | allow 8472/udp from 192.168.225.2 |
| CORE01 | allow ssh |
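
A sketch of how these rules can be expressed, assuming ufw on Ubuntu (any equivalent firewall setup should do):

```bash
# on FRONTEND01
sudo ufw allow from 172.30.0.2 to any port 8472 proto udp    # overlay network (VXLAN)
sudo ufw allow from 172.30.0.2 to any port 10250 proto tcp   # kubelet
sudo ufw allow ssh

# on CORE01
sudo ufw allow from 192.168.225.2 to any port 6443 proto tcp # kube-apiserver
sudo ufw allow from 192.168.225.2 to any port 8472 proto udp # overlay network (VXLAN)
sudo ufw allow ssh
```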
- Deploy the cluster using `rke` (v1.0.8) and wait for it to be ready; a minimal invocation is sketched below.
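
Assuming the cluster.yml above sits in the current directory:

```bash
# provision the cluster; rke writes kube_config_cluster.yml next to cluster.yml
rke up --config cluster.yml
```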
- Launch a CentOS pod on one of the nodes, e.g. CORE01:
```bash
kubectl run -it centos1 --rm --image=centos --restart=Never --overrides='{"apiVersion":"v1","spec":{"affinity":{"nodeAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":{"nodeSelectorTerms":[{"matchFields":[{"key":"metadata.name","operator":"In","values":["core01"]}]}]}}}}}' --kubeconfig kube_config_cluster.yml -- /bin/bash
```
- ping your favourite external site:

```bash
for i in {1..100}; do ping -c 1 wikipedia.com; done
```

Notice that name resolution is very slow and often fails completely.
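
A sketch of checks that can narrow this down to the cross-node overlay path (assumptions: the default RKE cluster DNS ClusterIP 10.43.0.10 and the canal/flannel VXLAN port 8472):

```bash
# on CORE01: watch for VXLAN traffic towards FRONTEND01 while the pod resolves names
sudo tcpdump -ni any udp port 8472

# inside the centos pod: query the cluster DNS service directly,
# to separate DNS timeouts from ICMP problems
yum install -y bind-utils
dig +time=2 +tries=1 wikipedia.com @10.43.0.10
```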
- Stop FRONTEND01 and wait for the cluster to recognize the lost node
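
To see when the node is considered lost, watching the node list is enough (kubeconfig path as written by rke):

```bash
# wait until frontend01 switches to NotReady
kubectl get nodes -w --kubeconfig kube_config_cluster.yml
```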
- ping again
Name resolution works fast and ping succeeds every time.
- Reset all VMs and change the network configuration for GATEWAY and CORE01: put them into a 192.168.0.0/16 network segment (but not the same one as FRONTEND01!)
- Deploy the cluster
- ping some external site
Name resolution works fast and ping succeeds every time.

| component | version |
|---|---|
| OS | Ubuntu 16.04.6 |
| docker | 19.03.1 (docker-ce, docker-ce-cli) |
| k8s | 1.12.3-rancher1-1 (ok); 1.16.4-rancher1-1 (failed); 1.17.5-rancher1-1 (failed) |
| rke | 1.0.8 |
| kubectl | 1.16.1 |