FailedCreatePodSandBox pods - nodes can't reach the API; canal kube-flannel - 100.64.0.1:443: getsockopt: no route to host

I am running a cluster of 3 nodes and 1 master on AWS, created with kops.
The problem started after I manually stopped the master node in AWS; the EC2 instance was terminated and recreated.

  • Some pods are stuck in the ContainerCreating state with a FailedCreatePodSandBox warning:
Warning  FailedCreatePodSandBox  28m                    kubelet, ip-10-26-35-244.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container 
"f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to set up pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host, failed to clean up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to teardown pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host]
  Normal   SandboxChanged          3m19s (x100 over 28m)  kubelet, ip-10-26-35-244.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
  • Canal pods are failing on 2 nodes:
17:46 $ kubectl get pods -n kube-system
NAME                                                                READY   STATUS             RESTARTS   AGE
canal-4524l                                                         1/3     CrashLoopBackOff   80         2h
canal-5k54f                                                         2/3     Running            0          73d
canal-7lj5j                                                         3/3     Running            7          73d
canal-fm9r7                                                         3/3     Running            0          3h
  • The kube-flannel container can’t reach the API:
kubectl logs canal-5k54f -n kube-system kube-flannel
E0729 11:15:12.083011       1 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:284: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: no route to host
  • When I SSH onto a Kubernetes node, I can confirm that curl https://100.64.0.1:443 fails on 2 of the nodes and succeeds on the other two; the failing nodes match the failing canal pods (see the check below).
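To confirm the mapping between failing canal pods and unreachable nodes, the wide listing shows which node each pod runs on:

kubectl get pods -n kube-system -o wide | grep canal   # the NODE column shows each pod's host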

Cluster information:

Kubernetes version: 1.11.9
Kubectl version: 1.14.1
Kops version: 1.11.0
Cloud being used: AWS
Installation method: Kops
Host OS: Debian 9.5
CNI and version: Canal / Calico
CRI and version: Not sure
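(If it helps, I believe the runtime can be read from the node objects: kubectl get nodes -o wide prints a CONTAINER-RUNTIME column with the runtime and its version.)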

What could be the problem?
How can I find out why 100.64.0.1:443 reaches the API on some nodes but returns “no route to host” on others? How is traffic to that IP routed to the master node? I couldn’t find anything in the route tables on the servers, nor in AWS.
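From what I can tell, 100.64.0.1 should be the ClusterIP of the kubernetes service in the default namespace (the first address of the service range carved out of the 100.64.0.0/10 nonMasqueradeCIDR below), so it would never appear in an OS or AWS route table. A minimal sketch of how to confirm that, assuming a working kubectl context:

kubectl get svc kubernetes         # CLUSTER-IP should be 100.64.0.1
kubectl get endpoints kubernetes   # should list the master's real IP on port 443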

==
Kops YAML

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-05-17T09:48:15Z
  name: live.k8s.local
spec:
  additionalPolicies:
    node: |
      [
        {
           "Effect": "Allow",
           "Action": [
             "route53:ChangeResourceRecordSets"
           ],
           "Resource": [
             "arn:aws:route53:::hostedzone/*"
           ]
        },
        {
           "Effect": "Allow",
           "Action": [
             "route53:ListHostedZones",
             "route53:ListResourceRecordSets"
           ],
           "Resource": [
             "*"
           ]
        },
        {
          "Effect": "Allow",
          "Resource": [
            "*"
          ],
          "Action": [
            "ec2:AssociateAddress",
            "ec2:DisassociateAddress",
            "ec2:DescribeAddresses"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-state-store-live/live.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - Initializers
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - NodeRestriction
    - ResourceQuota
    - AlwaysPullImages
    - DenyEscalatingExec
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 34.34.34.34/32
  kubernetesVersion: 1.11.9
  masterInternalName: api.internal.live.k8s.local
  masterPublicName: api.live.k8s.local
  networkCIDR: 10.26.0.0/16
  networkID: vpc-26aa6343
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 34.34.34.34/32
  subnets:
  - cidr: 10.26.32.0/19
    name: eu-west-1a
    type: Public
    zone: eu-west-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-17T09:48:16Z
  labels:
    kops.k8s.io/cluster: live.k8s.local
  name: master-eu-west-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - eu-west-1a

---

apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-17T09:48:16Z
  labels:
    kops.k8s.io/cluster: live.k8s.local
  name: nodes
spec:
  additionalUserData:
  - content: |
      #!/bin/bash
      # Attach a free Elastic IP tagged Name=live-kops to this node at boot.
      apt update
      apt install -y jq awscli
      export AWS_DEFAULT_REGION=eu-west-1
      INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
      # Pick one EIP at random from those not yet associated with an instance.
      ALLOCATION_ID=$(aws ec2 describe-addresses --filters="Name=tag:Name,Values=live-kops" | jq -r '.Addresses[] | "\(.InstanceId) \(.AllocationId)"' | grep null | awk '{print $2}' | xargs shuf -n1 -e)
      if [ -n "$ALLOCATION_ID" ]; then
        aws ec2 associate-address --instance-id "$INSTANCE_ID" --allocation-id "$ALLOCATION_ID" --allow-reassociation > /tmp/assign-eip.sh.log
      fi
    name: assign-eip.sh
    type: text/x-shellscript
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: r5.xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 500
  rootVolumeType: gp2
  subnets:
  - eu-west-1a

Hi,

I don’t see any issue with the replication here, although I’m on a different platform. Could you please try restarting the cluster and check again?

If you still face the issue, do write back with the kops manifest details.

Thanks

What do you mean by a cluster restart? Is there an official way to restart a cluster?

The problem with a restart is that I have no guarantee the cluster won’t fall apart completely; there are live services running at the moment, and some PVCs.
The cluster is in a broken state now, but most of the pods are running fine.

Where does the request to 100.64.0.1:443 get served from? How is it routed from a single node to that IP? I believe that once I solve this I can restore my cluster state, as this is the obvious issue in the logs.
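My understanding, which may be wrong, is that the ClusterIP is not routed at all: kube-proxy rewrites traffic to it with iptables NAT rules on every node. So on a broken node the rules should show where 100.64.0.1:443 is actually being sent; a sketch of the check, assuming kube-proxy runs in the default iptables mode:

sudo iptables -t nat -L KUBE-SERVICES -n | grep 100.64.0.1
# The matching KUBE-SVC-* chain should end in a DNAT rule; if it still
# points at the old, terminated master's IP, that would explain the
# "no route to host" errors on those nodes.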

Thanks!

Right, so I had to suspend my operations for a while. I did a rolling restart of all nodes, but the validation kept failing with:
I0730 08:42:40.118682 10809 instancegroups.go:273] Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: kube-system pod "canal-c4wmp" is not healthy.

so I decided to use --cloudonly.
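For reference, a sketch of both variants, assuming the cluster name from the manifest above:

kops rolling-update cluster live.k8s.local --yes               # replaces nodes one by one, validating each
kops rolling-update cluster live.k8s.local --cloudonly --yes   # replaces the EC2 instances without API validation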

My live operations stopped for a while, but after the restart finished everything went back to normal.

Great to hear that all is fixed now. But to leverage K8s fully and avoid these hiccups, I’d recommend using a native container cloud. I run many nodes for my clients and had to move away from AWS.

Have a great day.

Thanks, I might consider this. Which one would you recommend?

You’re welcome. I’m using RealCloud [https://realcloud.in] and have been with them for more than a year now. K8s cluster deployment is one click, as are Jenkins and Maven. Highly recommended.

They are a “native” container cloud with a simple one-click cluster setup, and my costs have come down heavily compared to what I was paying on AWS, thanks to their native container granularity.