Hi, I am on Kubernetes version 1.14 and I am trying to get familiar with livenessProbes and readinessProbes.
I am using Elastic's Logstash and defined the following probes:
livenessProbe:
  httpGet:
    path: /
    port: 9600
  failureThreshold: 3
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
readinessProbe:
  httpGet:
    path: /
    port: 9600
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
Now I am experimenting with a full node failure, so I simply stopped the Docker service on the node; the kubelet stayed online.
After shutting down Docker, both probes were triggered, and I could see the following when describing my pod:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 22m kubelet, server Container logstash failed liveness probe, will be restarted
Normal Pulled 22m (x2 over 24m) kubelet, server Container image "docker-registry:443/docker.elastic.co/logstash/logstash:7.1.1-plx_redis_0.8.4" already present on machine
Normal Created 22m (x2 over 24m) kubelet, server Created container logstash
Normal Started 22m (x2 over 24m) kubelet, server Started container logstash
Normal Scheduled 21m default-scheduler Successfully assigned default/poc-logstash-5c89d6879-6nkgd to server
Warning Unhealthy 10m (x16 over 23m) kubelet, server Readiness probe failed: Get http://zzz.zzz.zzz.zzz:9600/: dial tcp zzz.zzz.zzz.zzz:9600: connect: connection refused
Warning Unhealthy 10m (x6 over 23m) kubelet, server Liveness probe failed: Get http://zzz.zzz.zzz.zzz:9600/: dial tcp zzz.zzz.zzz.zzz:9600: connect: connection refused
What I do not understand here: the counters for the Readiness and Liveness events are no longer rising, but the unreadiness was shown correctly when getting the pod, as expected:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
poc-logstash-5c89d6879-6nkgd 0/1 Running 1 19m <none> server <none> <none>
Shortly afterwards the node was shown with status NotReady (presumably once the controller manager's node-monitor-grace-period, 40s by default, had elapsed), as expected.
But as I understand my probe config, the pod should be restarted after failureThreshold=3 x (periodSeconds=10 + timeoutSeconds=10) = 60s at the latest. So after 60 seconds I expected my pod to be rescheduled, and if the node is down, I expect Kubernetes to pick another node!
But it took about 6 or 7 minutes until a new pod was created on another node, and I saw a picture like this:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
poc-logstash-5c89d6879-6nkgd 0/1 Terminating 1 30m <none> server <none> <none>
poc-logstash-5c89d6879-dnz87 1/1 Running 0 10m zzz.zzz.zzz.zzz server2 <none> <none>
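One thing I noticed while digging: kubectl get pod -o yaml shows two NoExecute tolerations on my pod that I never defined myself (added automatically, as far as I can tell, by the DefaultTolerationSeconds admission controller):

tolerations:
- key: node.kubernetes.io/not-ready     # taint set when the node condition is Ready=False
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300                # tolerated for 5 minutes before eviction
- key: node.kubernetes.io/unreachable   # taint set when the node condition is Ready=Unknown
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300

The 300 seconds here, plus the time until the node is marked NotReady, would roughly match the 6-7 minutes I observed. Are these tolerations what controls the rescheduling?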
Can you explain why it took so long to reschedule the pod on another node? How can I speed up things like this?
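For example, would shortening tolerationSeconds in my Deployment's pod template be the right approach, along these lines? This is just a sketch: 30 is a value I picked for testing, and I am assuming taint-based evictions are active on 1.14. Or is the controller manager's --pod-eviction-timeout flag the relevant knob instead?

spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30     # hypothetical: evict after 30s instead of the 300s default
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 30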
Thanks a lot, Andreas