How to detect readiness when a node becomes unavailable?

Cluster information:

Kubernetes version: 1.14.3
Cloud being used: bare-metal
Installation method: rpm
Host OS: CentOS 7.6
CNI and version: Calico
CRI and version: Docker CE 18.09

Hi, I am new to liveness and readiness probes. I have Elasticsearch running as a 3-node cluster; each Elasticsearch node is pinned to a different Kubernetes worker node because it uses local storage.
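
For context, the pinning works roughly like this: each data directory is a local PersistentVolume bound to one worker via nodeAffinity (a trimmed-down sketch, not my exact manifest; names, paths and sizes are illustrative):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: es-data-kubernetes03        # illustrative name
    spec:
      capacity:
        storage: 100Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local-storage
      local:
        path: /mnt/es-data              # illustrative path
      nodeAffinity:
        required:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - kubernetes03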

I am trying to get familiar with liveness and readiness probes. If a pod goes down, the service MUST NOT forward requests to the unavailable pod; otherwise the requests run into timeouts and customers are affected.

I am currently testing a complete node failure by stopping the Docker daemon and the kubelet on node kubernetes03.
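
Concretely, the test looks like this (CentOS 7, so systemd units, run on the worker itself):

    # on worker kubernetes03
    systemctl stop kubelet
    systemctl stop docker

    # from a master: the node flips to NotReady after the grace period
    kubectl get nodes -w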

When I stop the Docker daemon, Kubernetes seems to recognize the shutdown of the pod and fires my init container for Elasticsearch. After that, nothing happens any more; it looks as if neither the livenessProbe nor the readinessProbe is fired. The node is shown as NotReady, but the pod still shows Running with READY 1/1.
When I curl against the headless service of the Elasticsearch StatefulSet, I still get routed to the unavailable pod poc-es-master-0.
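
A quick way to see what the service is actually routing to (the service name below is a placeholder for the headless service of the StatefulSet):

    # placeholder name - substitute your headless service
    kubectl get endpoints poc-es-master -o wide

    # pod-level view: the READY column should drop to 0/1
    kubectl get pod poc-es-master-0 -o wide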

For readiness and liveness I run a shell script via an exec probe:

livenessProbe:
          failureThreshold: 2
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
          exec:
            command:
              - bash
              - -c
              - |
                # set basic auth once, so that both checks can use it
                if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                  BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                else
                  BASIC_AUTH=''
                fi

                IP=127.0.0.1
                CURL_TIMEOUT_CONNECT_SECONDS=2
                CURL_TIMEOUT_RESPONSE_SECONDS=3
                CURL_CONFIG="--connect-timeout $CURL_TIMEOUT_CONNECT_SECONDS --max-time $CURL_TIMEOUT_RESPONSE_SECONDS"

                http() {
                  local path="${1}"
                  curl -XGET -s -k --fail ${CURL_CONFIG} ${BASIC_AUTH} "http://${IP}:9200${path}"
                }

                transport() {
                  local path="${1}"
                  # just checking for a response. Elasticsearch answers
                  # "This is not an HTTP port" on 9300
                  curl -XGET -s -k --fail ${CURL_CONFIG} ${BASIC_AUTH} "http://${IP}:9300${path}"
                }

                # the probe passes no arguments, so URL stays empty and the
                # check goes against the root endpoint
                URL="${2:-}"

                HTTP_RETURN=0
                if http "${URL}"; then
                  echo "http is working"
                else
                  echo "http is NOT working"
                  HTTP_RETURN=1
                fi

                TRANSPORT_RETURN=0
                if transport; then
                  echo "transport is working"
                else
                  echo "transport is NOT working"
                  TRANSPORT_RETURN=1
                fi

                if [ "$HTTP_RETURN" -eq 0 ] && [ "$TRANSPORT_RETURN" -eq 0 ]; then
                  echo "http and transport are running"
                  exit 0
                else
                  echo "$(date) overall readiness: NO. Return codes: transport=$TRANSPORT_RETURN, http=$HTTP_RETURN"
                  exit 1
                fi
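
As a side note, while the kubelet on the node is still alive, failed probes do show up as pod events, which is a quick way to confirm the script itself behaves:

    # failed probes appear as "Unhealthy" events at the bottom of the output
    kubectl describe pod poc-es-master-0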

What can I do so that this pod is automatically marked unready (READY 0/1)?

Thanks, Andreas

Isn’t this what we discussed here: https://discuss.kubernetes.io/t/how-to-reschedule-pod-on-another-node-if-node-fails-how-to-speed-up-rescheduling/7193/1
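
One thing to keep in mind: exec probes are executed by the kubelet on the node itself, so once you stop the kubelet there is nothing left to run them, and the API server just keeps the last reported pod status. What takes over then is the node lifecycle controller with taint-based eviction, and that was the knob in the linked thread: give the pods explicit tolerations with a short tolerationSeconds so they are evicted (and dropped from the endpoints) sooner. A rough sketch for the StatefulSet pod template (the values are examples, not recommendations):

    spec:
      template:
        spec:
          tolerations:
            - key: "node.kubernetes.io/unreachable"
              operator: "Exists"
              effect: "NoExecute"
              tolerationSeconds: 30
            - key: "node.kubernetes.io/not-ready"
              operator: "Exists"
              effect: "NoExecute"
              tolerationSeconds: 30

Without explicit tolerations the default is 300 seconds for both taints, which would explain a multi-minute window where nothing appears to happen.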

Am I missing something?