I am running a cluster of 3 nodes and 1 master on AWS, created with kops. After I manually stopped the master node in AWS, its EC2 instance was terminated and recreated, and the problems started.
- Some pods are stuck in the `ContainerCreating` state with a `FailedCreatePodSandBox` warning:
```
Warning  FailedCreatePodSandBox  28m  kubelet, ip-10-26-35-244.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to set up pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host, failed to clean up sandbox container "f661e658410e063e53bf2e554c0b3134a8b62e239b21982746f4fdc5ff94a95b" network for pod "project-api-db75bcb89-8j7gh": NetworkPlugin cni failed to teardown pod "project-api-db75bcb89-8j7gh_default" network: error getting ClusterInformation: Get https://[100.64.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 100.64.0.1:443: connect: no route to host]
Normal   SandboxChanged          3m19s (x100 over 28m)  kubelet, ip-10-26-35-244.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.
```
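For context, this is how I listed the stuck pods together with the nodes they are scheduled on, to check whether they all land on the same nodes (a quick sketch, nothing cluster-specific assumed):

```bash
# List all pods stuck in ContainerCreating, with the node each one is scheduled on
kubectl get pods --all-namespaces -o wide | grep ContainerCreating
```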
- Canal pods are failing on 2 nodes:
```
17:46 $ kubectl get pods -n kube-system
NAME          READY   STATUS             RESTARTS   AGE
canal-4524l   1/3     CrashLoopBackOff   80         2h
canal-5k54f   2/3     Running            0          73d
canal-7lj5j   3/3     Running            7          73d
canal-fm9r7   3/3     Running            0          3h
```
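To see which of the three containers in the failing pod is crashing (assuming the usual canal container names `calico-node`, `install-cni` and `kube-flannel`), something like:

```bash
# Per-container state and restart reasons for the CrashLoopBackOff pod (name from above)
kubectl describe pod canal-4524l -n kube-system

# Logs of the calico-node container; -p shows the previous (crashed) run
kubectl logs canal-4524l -n kube-system -c calico-node -p
```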
- The `kube-flannel` container can’t reach the API:
```
kubectl logs canal-5k54f -n kube-system kube-flannel

E0729 11:15:12.083011 1 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:284: Failed to list *v1.Node: Get https://100.64.0.1:443/api/v1/nodes?resourceVersion=0: dial tcp 100.64.0.1:443: getsockopt: no route to host
```
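`100.64.0.1` is the ClusterIP of the default `kubernetes` Service, so I also sanity-checked that the Service exists and points at the recreated master (a sketch; with a single master the endpoint should be the master's real 10.26.x.x address):

```bash
# The virtual ClusterIP the CNI pods are trying to reach
kubectl get svc kubernetes

# The real master address(es) behind that ClusterIP
kubectl get endpoints kubernetes
```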
- When I SSH onto a Kubernetes node, I can confirm that `curl https://100.64.0.1:443` is unreachable on 2 nodes and reachable on the other two. This matches exactly the nodes where the `canal` pods are failing.
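Since a ClusterIP is virtual and kube-proxy is what translates it into the master's real IP with iptables NAT rules, I compared those rules between a broken and a healthy node (a sketch, run as root on the node; the log path is where kops normally puts kube-proxy logs):

```bash
# DNAT/REJECT rules kube-proxy programs for the default/kubernetes Service
iptables-save -t nat | grep 'default/kubernetes'

# Check that kube-proxy itself is running and not logging errors
docker ps | grep kube-proxy
tail -n 50 /var/log/kube-proxy.log
```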
Cluster information:
Kubernetes version: 1.11.9
Kubectl version: 1.14.1
Kops version: 1.11.0
Cloud being used: AWS
Installation method: Kops
Host OS: Debian 9.5
CNI and version: Canal / Calico
CRI and version: Not sure
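(To fill in the CRI line above: as far as I know, the runtime and its version show up in the wide node listing.)

```bash
# The CONTAINER-RUNTIME column shows the CRI and its version on each node
kubectl get nodes -o wide
```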
What could be the problem? How can I find out why `100.64.0.1:443` reaches the API on some nodes but returns `no route to host` on others? How is traffic to this address routed to the master node? I couldn’t find anything in the route tables on the servers, nor in AWS.
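This is what I looked at on the nodes when searching for the route (I understand a ClusterIP is not expected to show up in the kernel routing table, which is why I also checked the iptables rules above):

```bash
# Kernel routing table on the node -- no entry covering 100.64.0.1
ip route show

# AWS-level route tables for the VPC, via the CLI (VPC ID from the kops spec below)
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-26aa6343"
```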
Kops YAML:

```yaml
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2019-05-17T09:48:15Z
  name: live.k8s.local
spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": [
            "route53:ChangeResourceRecordSets"
          ],
          "Resource": [
            "arn:aws:route53:::hostedzone/*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
            "route53:ListHostedZones",
            "route53:ListResourceRecordSets"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Resource": [
            "*"
          ],
          "Action": [
            "ec2:AssociateAddress",
            "ec2:DisassociateAddress",
            "ec2:DescribeAddresses"
          ]
        }
      ]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://kops-state-store-live/live.k8s.local
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - Initializers
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - NodeRestriction
    - ResourceQuota
    - AlwaysPullImages
    - DenyEscalatingExec
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 34.34.34.34/32
  kubernetesVersion: 1.11.9
  masterInternalName: api.internal.live.k8s.local
  masterPublicName: api.live.k8s.local
  networkCIDR: 10.26.0.0/16
  networkID: vpc-26aa6343
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 34.34.34.34/32
  subnets:
  - cidr: 10.26.32.0/19
    name: eu-west-1a
    type: Public
    zone: eu-west-1a
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-17T09:48:16Z
  labels:
    kops.k8s.io/cluster: live.k8s.local
  name: master-eu-west-1a
spec:
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - eu-west-1a
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: 2019-05-17T09:48:16Z
  labels:
    kops.k8s.io/cluster: live.k8s.local
  name: nodes
spec:
  additionalUserData:
  - content: |
      #!/bin/bash
      apt update
      apt install -y jq awscli
      export AWS_DEFAULT_REGION=eu-west-1
      INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
      # Pick a random Elastic IP tagged Name=live-kops that is not yet associated
      # (InstanceId is null) and attach it to this instance.
      ALLOCATION_ID=$(aws ec2 describe-addresses --filters="Name=tag:Name,Values=live-kops" | jq -r '.Addresses[] | "\(.InstanceId) \(.AllocationId)"' | grep null | awk '{print $2}' | xargs shuf -n1 -e)
      if [ -n "$ALLOCATION_ID" ]; then
        aws ec2 associate-address --instance-id "$INSTANCE_ID" --allocation-id "$ALLOCATION_ID" --allow-reassociation > /tmp/assign-eip.sh.log
      fi
    name: assign-eip.sh
    type: text/x-shellscript
  image: kope.io/k8s-1.11-debian-stretch-amd64-hvm-ebs-2018-08-17
  machineType: r5.xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 500
  rootVolumeType: gp2
  subnets:
  - eu-west-1a
```
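For completeness, this is how I validate the cluster after the master comes back (standard kops commands; the state store is the one from the spec above):

```bash
export KOPS_STATE_STORE=s3://kops-state-store-live
kops validate cluster --name live.k8s.local
```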