Kubespray: install hangs on kube-proxy restart task


#1

I'm trying to install Kubernetes using Kubespray on a small three-node lab environment. While running the playbook, it hangs on a task that tries to restart the kube-proxy pods. Does anybody know what the reason might be?

OS = Ubuntu 18.04
HW = 64 GB RAM, 6-core HP

kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

ansible-playbook -i hosts.ini --become --become-user=root cluster.yml -b -vvv


After starting the playbook, a lot of activity takes place, but it stops at this task:

TASK [kubernetes/kubeadm : Restart all kube-proxy pods to ensure that they load the new configmap] **********************************************************************

task path: /home/tom/Services/kubespray/roles/kubernetes/kubeadm/tasks/main.yml:135

Sunday 03 March 2019 01:33:42 +0000 (0:00:02.043) 0:12:02.490 **********

Using module file /usr/local/lib/python3.6/dist-packages/ansible/modules/commands/command.py

<10.0.1.11> ESTABLISH SSH CONNECTION FOR USER: tom

<10.0.1.11> SSH: EXEC sshpass -d12 ssh -o ControlMaster=auto -o ControlPersist=30m -o ConnectionAttempts=100 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o User=tom -o ConnectTimeout=10 -o ControlPath=/home/tom/.ansible/cp/4c47906b36 10.0.1.11 '/bin/sh -c '"'"'sudo -H -S -p "[sudo via ansible, key=wpxabqqcijihhvhkhegxzjygqeebatpe] password: " -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-wpxabqqcijihhvhkhegxzjygqeebatpe; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''

Escalation succeeded


kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-7dbc74fcf-c766q 0/1 Pending 0 86m
kube-system coredns-7dbc74fcf-xgj77 0/1 Pending 0 86m
kube-system kube-apiserver-server1 1/1 Running 5 86m
kube-system kube-controller-manager-server1 1/1 Running 5 86m
kube-system kube-proxy-bhz5c 1/1 Running 0 85m
kube-system kube-proxy-q7gmz 0/1 Terminating 0 86m
kube-system kube-scheduler-server1 1/1 Running 5 86m
kube-system nginx-proxy-server2 1/1 Running 0 86m

Docker instances :
root@server1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4ddef9a27785 gcr.io/google_containers/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-proxy-q7gmz_kube-system_3c23dd55-3d54-11e9-87a8-18a90554101c_0
103749aa9db8 3a6f709e97a0 "kube-scheduler --ad…" About an hour ago Up About an hour k8s_kube-scheduler_kube-scheduler-server1_kube-system_0303621df0e68163d195543d161f2308_5
e5c8f60b250a 0482f6400933 "kube-controller-man…" About an hour ago Up About an hour k8s_kube-controller-manager_kube-controller-manager-server1_kube-system_41b7130ce6b5a26df5697ed775e36b9f_5
33761bb92463 fe242e556a99 "kube-apiserver --al…" About an hour ago Up About an hour k8s_kube-apiserver_kube-apiserver-server1_kube-system_ad9c231985127c13fd9fbc178e652357_5
aa50e3a4de73 gcr.io/google_containers/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-scheduler-server1_kube-system_0303621df0e68163d195543d161f2308_5
4399df06b627 gcr.io/google_containers/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-controller-manager-server1_kube-system_41b7130ce6b5a26df5697ed775e36b9f_5
c9ef8fa4db7d gcr.io/google_containers/pause-amd64:3.1 "/pause" About an hour ago Up About an hour k8s_POD_kube-apiserver-server1_kube-system_ad9c231985127c13fd9fbc178e652357_5
486a36c74895 quay.io/coreos/etcd:v3.2.24 “/usr/local/bin/etcd” About an hour ago Up About an hour etcd1


#2

That line looks very strange. Run `kubectl -n kube-system describe pod kube-proxy-q7gmz` to see what is going on; maybe that can help you figure out the problem. A Pod should not need 86+ minutes to terminate. :smiley:
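Concretely, the diagnostics suggested above (the pod name is taken from the output in this thread; a pod stuck in Terminating often lives on a node that is down, so checking node status is worth doing as well):

```shell
# Show the stuck pod's status, conditions and recent events
kubectl -n kube-system describe pod kube-proxy-q7gmz

# Check whether the node hosting the pod is NotReady/unreachable
kubectl get nodes -o wide
```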


#3

For anybody seeing this issue: please have a look at mattymo's comment in this thread: https://github.com/kubernetes-sigs/kubespray/issues/4314
Pods on down/unresponsive nodes can't be deleted without
`--force --grace-period=0`.

I will test the fix and comment back here.


#4

This problem was solved by adding `--force --grace-period=0` to one of the Ansible tasks, as described in https://github.com/kubernetes-sigs/kubespray/issues/4314.
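For anyone hitting this before the upstream patch lands, the same idea can be applied by hand: force-delete the stuck kube-proxy pods so an unresponsive node cannot block termination. A sketch (the `k8s-app=kube-proxy` label selector is an assumption about how the pods are labeled, not copied from the kubespray source; note that force deletion only removes the API object and does not guarantee the container is gone on the down node):

```shell
# Force-delete kube-proxy pods immediately, skipping the graceful termination wait
kubectl -n kube-system delete pod -l k8s-app=kube-proxy --force --grace-period=0
```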