How to reschedule a pod onto another node if the node fails? How to speed up rescheduling?

Hi, I am on k8s version 1.14 and I am trying to get familiar with livenessProbes and readinessProbes.

I am using Elastic's Logstash and have defined the following probes:

    livenessProbe:
      httpGet:
        path: /
        port: 9600
      failureThreshold: 3
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10

    readinessProbe:
      httpGet:
        path: /
        port: 9600
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10

Now I am experimenting with a full node failure, so I simply stopped the docker service on the node; the kubelet stayed online.
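
For the record, stopping docker looked roughly like this (assuming docker is managed by systemd):

    # stop the container runtime on the node; the kubelet process itself keeps running
    sudo systemctl stop docker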

After shutting down docker, both probes were triggered and I could see the following when describing my pod:

    Type     Reason     Age                 From               Message
    ----     ------     ----                ----               -------
    Normal   Killing    22m                 kubelet, server    Container logstash failed liveness probe, will be restarted
    Normal   Pulled     22m (x2 over 24m)   kubelet, server    Container image "docker-registry:443/docker.elastic.co/logstash/logstash:7.1.1-plx_redis_0.8.4" already present on machine
    Normal   Created    22m (x2 over 24m)   kubelet, server    Created container logstash
    Normal   Started    22m (x2 over 24m)   kubelet, server    Started container logstash
    Normal   Scheduled  21m                 default-scheduler  Successfully assigned default/poc-logstash-5c89d6879-6nkgd to server
    Warning  Unhealthy  10m (x16 over 23m)  kubelet, server    Readiness probe failed: Get http://zzz.zzz.zzz.zzz:9600/: dial tcp zzz.zzz.zzz.zzz:9600: connect: connection refused
    Warning  Unhealthy  10m (x6 over 23m)   kubelet, server    Liveness probe failed: Get http://zzz.zzz.zzz.zzz:9600/: dial tcp zzz.zzz.zzz.zzz:9600: connect: connection refused

What I do not understand here: the event counters for the readiness and liveness failures stopped rising, but the unreadiness was shown correctly when getting the pod, as expected.

    NAME                           READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
    poc-logstash-5c89d6879-6nkgd   0/1     Running   1          19m   <none>   server   <none>           <none>

Shortly afterwards the node was shown with status NotReady, as expected.

But as I understand my probe config, the pod should be restarted after failureThreshold=3 x (periodSeconds=10 + timeoutSeconds=10) = 60s. So after 60 seconds I expected my pod to be rescheduled, and if the node is down I expect Kubernetes to pick another node!

But it took about 6 or 7 minutes until a new pod was created on another node and I saw the following:

    NAME                           READY   STATUS        RESTARTS   AGE   IP                NODE           NOMINATED NODE   READINESS GATES
    poc-logstash-5c89d6879-6nkgd   0/1     Terminating   1          30m   <none>            server         <none>           <none>
    poc-logstash-5c89d6879-dnz87   1/1     Running       0          10m   zzz.zzz.zzz.zzz   server2        <none>           <none>

Can you explain why it took so long to reschedule the pod onto another node? How can I speed this up?

Thanks a lot, Andreas

Hi again! :slight_smile:

The thing is: when a node stops reporting, there are timeouts before it is marked NotReady and, later, before the pods on it are evicted. Node status is handled by the kube-controller-manager component, IIRC.
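
You can watch both transitions while you run the experiment, for example (plain kubectl, using the names from your output):

    # watch the node condition flip from Ready to NotReady
    kubectl get nodes -w

    # watch the old pod being terminated and a replacement pod being created on another node
    kubectl get pods -o wide -w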

See the documentation here: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/

You may want to check:

--node-monitor-grace-period duration

--pod-eviction-timeout duration

And friends (there are TONs of flags :)).
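
As a rough sketch only (the file path assumes a kubeadm-style control plane, and the values are illustrations, not recommendations), those flags are typically passed to the kube-controller-manager static pod:

    # /etc/kubernetes/manifests/kube-controller-manager.yaml (kubeadm layout; your path may differ)
    spec:
      containers:
      - name: kube-controller-manager
        command:
        - kube-controller-manager
        # default 40s: how long without node status updates before the node is marked NotReady
        - --node-monitor-grace-period=20s
        # default 5m0s: how long pods may stay bound to a NotReady node before eviction
        - --pod-eviction-timeout=1m0s
        # ...keep whatever other flags your cluster already passes here

Keep in mind these are cluster-wide settings, so lowering them makes the control plane react faster but also more sensitive to short network blips.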

Also, if you are on a cloud provider with the cloud provider flag set, and the node is deleted in the cloud provider, I think it should be detected even faster. I think the cloud controller manager polls the cloud provider too, and if the cloud provider reports that the node does not exist, the node object is deleted quite fast (I don't remember the timeouts, but hopefully the defaults are on that documentation page too :)). In my experience with spot instances, it was detected within seconds.

Does playing with those flags improve the detection for you?

Oh, and I just found this issue, which seems to say something along these lines too: https://github.com/kubernetes/kubernetes/issues/65936. Maybe it is helpful for you too? :slight_smile:
