Liveness probe reports a strange exit code

:wave: folks, I need your help to understand what is happening on the cluster at the moment.

We have a service whose probes normally pass, but its container terminated with a strange exit code of 137 and reason Error only, and I’m not sure how to debug further. It’s not an OOM issue, as the kernel shows nothing in sudo dmesg -wH on that node.

I’ll provide my analysis below and would appreciate any response on this topic :pray:

Cluster information:

Kubernetes version: v1.30.1-eks-49c6de4
Cloud being used: AWS
Installation method: EKS
Host OS: AMI 1.30.0-20240703
CNI and version: v1.18.2-eksbuild.1
CRI and version: containerd 1.7.11 (64b8a811b07ba6288238eefc14d898ee0b5b99ba)

Debug phase

  1. When the liveness probe fails, the events show that the container is restarted as it should be, since the service no longer responds:
    "lastState": {
      "terminated": {
        "containerID": "containerd://79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7",
        "exitCode": 137,
        "finishedAt": "2024-07-25T08:22:18Z",
        "reason": "Error",
        "startedAt": "2024-07-25T07:41:19Z"
      }
    }
  2. Since this is the kubelet’s flow, I next went to the node where this container lives and checked the kubelet logs:
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
prober.go:107] "Probe failed" probeType="Liveness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
kubelet.go:2536] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kuberuntime_manager.go:1027] "Message for Container of pod" containerName="platform-api" containerStatusID={"Type":"containerd","ID":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"} pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" containerMessage="Container platform-api failed liveness probe, will be restarted"
kuberuntime_container.go:779] "Killing container with a grace period" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" containerID="containerd://79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" gracePeriod=60
generic.go:334] "Generic (PLEG): container finished" podID="0137706b-7afa-4ad7-9925-f48053377f9b" containerID="79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" exitCode=137
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerDied","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"4954b402daf71b6bd737ea2c419d23e9370583ec71664ac12cc6ee97f32bfbf9"}
kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="started" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="ready" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"

As you can see, the same flow shows up here with exitCode=137 and "Type":"ContainerDied" :thinking:

  3. I’ve also checked the containerd logs to confirm the flow down to the container itself:
level=info msg="StopContainer for \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" with timeout 60 (s)"
level=info msg="Stop container \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" with signal terminated"
level=info msg="Kill container \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\""
level=info msg="shim disconnected" id=79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7 namespace=k8s.io
level=warning msg="cleaning up after shim disconnected" id=79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7 namespace=k8s.io
level=info msg="cleaning up dead shim" namespace=k8s.io
level=info msg="StopContainer for \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" returns successfully"

Above you can see that I’ve even tried bumping terminationGracePeriodSeconds to 60 seconds instead of the default 30, but it didn’t help much.

Do you have any clue why the liveness-probe restart ends with this exit code, or how I can debug this issue further? I’ve also raised the question on this thread here.

Thanks for looking at this!

Exit code 137 is Linux’s way of saying “signal #9”: bit 7 (value 128) means “terminated by a signal”, and 128 + 9 = 137.

Signal 9 is SIGKILL

SIGKILL is what we use to terminate a pod.

Your log shows a kill with grace (kuberuntime_container.go:779] “Killing container with a grace period” …) followed by a SIGKILL (generic.go:334] “Generic (PLEG): container finished” … exitCode=137).

It doesn’t, however, include timestamps.

So whether you got that grace period or not is unknown to us.
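
For illustration, here’s a minimal Java sketch of that mapping outside Kubernetes (assumptions: a Linux host with a JDK and coreutils’ sleep on the PATH; the class name is just for the demo). OpenJDK reports a child terminated by a signal as 128 + the signal number, which is exactly where the 137 comes from:

// Start a long-running child, force-kill it with SIGKILL, and print its exit code.
public class ExitCode137Demo {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder("sleep", "300").start();
        p.destroyForcibly();            // sends SIGKILL (signal 9) on Linux
        int code = p.waitFor();         // reported as 128 + 9 = 137
        System.out.println("child exit code: " + code);
    }
}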

Hey @thockin,

I appreciate the signal-flow explanation you posted.
I’m aware of those codes, but what confuses me (and I assume I’m misreading it) is this documentation → pod-termination

I read there that on a liveness failure the kubelet should first send SIGTERM and restart the container within the grace period (so, I would guess, not with a SIGKILL exit code)?

Apologies for cutting the log timestamps; below is the whole part for the same containerId.


Jul 25 07:41:20 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:41:20.710695    2246 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
Jul 25 07:41:20 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:41:20.734874    2246 pod_startup_latency_tracker.go:104] "Observed pod startup duration" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podStartSLOduration=2.734855196 podStartE2EDuration="2.734855196s" podCreationTimestamp="2024-07-25 07:41:18 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-07-25 07:41:20.733686775 +0000 UTC m=+225853.400008734" watchObservedRunningTime="2024-07-25 07:41:20.734855196 +0000 UTC m=+225853.401177105"
Jul 25 07:41:28 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:41:28.494303    2246 kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 07:41:28 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:41:28.494475    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 07:42:08 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:42:08.780561    2246 kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="started" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 07:42:08 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 07:42:08.830984    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="ready" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:20:56 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:20:56.308009    2246 prober.go:107] "Probe failed" probeType="Liveness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:20:56 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:20:56.324255    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:04 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:04.456216    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:04 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:04.456224    2246 prober.go:107] "Probe failed" probeType="Liveness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.118147    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.118163    2246 prober.go:107] "Probe failed" probeType="Liveness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.118940    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.119226    2246 kubelet.go:2536] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.120446    2246 kuberuntime_manager.go:1027] "Message for Container of pod" containerName="platform-api" containerStatusID={"Type":"containerd","ID":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"} pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" containerMessage="Container platform-api failed liveness probe, will be restarted"
Jul 25 08:21:17 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:17.120552    2246 kuberuntime_container.go:779] "Killing container with a grace period" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" containerID="containerd://79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" gracePeriod=60
Jul 25 08:21:21 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:21.165825    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:27 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:27.160017    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:36 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:36.828598    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:21:41 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:21:41.065150    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="ready" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:22:15 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:15.090427    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
Jul 25 08:22:18 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:18.911758    2246 prober.go:107] "Probe failed" probeType="Readiness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="Get \"http://10.30.105.222:8001/health/isAlive\": read tcp 10.30.106.125:48444->10.30.105.222:8001: read: connection reset by peer"
Jul 25 08:22:19 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:19.154736    2246 generic.go:334] "Generic (PLEG): container finished" podID="0137706b-7afa-4ad7-9925-f48053377f9b" containerID="79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" exitCode=137
Jul 25 08:22:19 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:19.154788    2246 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerDied","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
Jul 25 08:22:20 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:20.160462    2246 kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"4954b402daf71b6bd737ea2c419d23e9370583ec71664ac12cc6ee97f32bfbf9"}
Jul 25 08:22:28 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:28.494963    2246 kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:22:28 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:22:28.496371    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:23:08 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:23:08.820936    2246 kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="started" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
Jul 25 08:23:08 ip-10-30-106-125.us-east-2.compute.internal kubelet[2246]: I0725 08:23:08.884087    2246 kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="ready" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"

Thanks for looking at this!

This is resolved, so I’ll close this question.

For those who hit the same issue, here is the troubleshooting :point_down:

  1. I’ve looked at the service code (Java) and didn’t see any logic that handles SIGTERM (see the shutdown-hook sketch at the end of this post)
    • That immediately suggested pulling the image (or building a new one) and running it locally
      • I baked a custom compose file to bring up everything the service needs to start in dev mode
      • As soon as the compose stack was up, I ran the following
        • docker kill -s SIGTERM <container-name>
          • the service didn’t respond to the signal at all, which is not what I expected

You can see below that sending SIGKILL produces 137, just as the k8s traces showed above.
I ran the same SIGTERM test against the Redis container and, as you can see, it shut down gracefully with the right exit code :ok_hand:

<removed logs of the service>
api exited with code 137
redis  | 1:signal-handler (1721988133) Received SIGTERM scheduling shutdown...
redis  | 1:M 26 Jul 2024 10:02:13.970 # User requested shutdown...
redis  | 1:M 26 Jul 2024 10:02:13.970 * Saving the final RDB snapshot before exiting.
redis  | 1:M 26 Jul 2024 10:02:13.972 * DB saved on disk
redis  | 1:M 26 Jul 2024 10:02:13.972 # Redis is now ready to exit, bye bye...
redis exited with code 0

NOTE: This was a new service to me, so I wasn’t aware of this missing signal handling before.
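
For anyone closing the same gap, below is a minimal sketch of a JVM shutdown hook that reacts to SIGTERM (names are illustrative only, not the actual service code; frameworks such as Spring Boot register an equivalent hook when graceful shutdown is enabled). With the hook in place the JVM runs the cleanup and exits with 143 (128 + 15, i.e. “terminated by SIGTERM”) instead of ignoring the signal and being SIGKILLed (137) once the grace period runs out:

import java.util.concurrent.CountDownLatch;

public class GracefulShutdownDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch stopped = new CountDownLatch(1);

        // The JVM runs shutdown hooks when it receives SIGTERM or SIGINT.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            System.out.println("SIGTERM received, shutting down gracefully...");
            // real service: stop accepting new requests, drain in-flight ones, close pools
            stopped.countDown();
        }));

        System.out.println("service started, waiting for work (or a signal)...");
        stopped.await(); // stand-in for the real main loop
    }
}

Running the same docker kill -s SIGTERM <container-name> test against a container built from something like this should make the hook fire and the container exit promptly with 143, instead of hanging until the SIGKILL and the 137.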