Folks, I need your help to understand what's happening on our cluster at the moment.
We have a service that normally works fine on all probes, but one of its containers terminated with a strange exit code of 137 and reason "Error" only, so I'm not sure how to debug further. It's not an OOM issue, as the kernel shows nothing in sudo dmesg -wH on that node.
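For reference, this is what I've been watching on the node to rule out the OOM killer (nothing OOM-related ever shows up; if it were the OOM killer, I'd also expect reason "OOMKilled" rather than "Error" in the pod status):

# on the node: follow kernel messages and filter for OOM killer activity
sudo dmesg -wH | grep -iE 'oom|out of memory|killed process'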
I'll provide my analysis below and would appreciate any input on this topic.
Cluster information:
Kubernetes version: v1.30.1-eks-49c6de4
Cloud being used: AWS
Installation method: EKS
Host OS: AMI 1.30.0-20240703
CNI and version: v1.18.2-eksbuild.1
CRI and version: containerd 1.7.11 (64b8a811b07ba6288238eefc14d898ee0b5b99ba)
Debug phase
- When the liveness probe logic kicks in, we see in the pod events that a restart is performed as it should be, since the service no longer responds:
"lastState": {
"terminated": {
"containerID": "containerd://79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7",
"exitCode": 137,
"finishedAt": "2024-07-25T08:22:18Z",
"reason": "Error",
"startedAt": "2024-07-25T07:41:19Z"
}
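The status above is taken straight from the pod object, e.g. with something like:

kubectl get pod platform-api-6684f5fdbf-8fh9q -n platform-api-qa -o json \
  | jq '.status.containerStatuses[] | select(.name == "platform-api") | .lastState'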
- As this is the kubelet's flow, I next went to the node where this container lives and pulled the kubelet logs (command right below), which show the whole restart sequence:
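On an EKS node the kubelet runs as a systemd unit, so I filtered its journal by the pod name and the container ID, along these lines:

# on the node: grep the kubelet journal for this pod/container
sudo journalctl -u kubelet -o cat | grep -E 'platform-api-6684f5fdbf-8fh9q|79923472ec40'

The relevant lines: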
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
prober.go:107] "Probe failed" probeType="Liveness" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" probeResult="failure" output="HTTP probe failed with statuscode: 500"
kubelet.go:2536] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kuberuntime_manager.go:1027] "Message for Container of pod" containerName="platform-api" containerStatusID={"Type":"containerd","ID":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"} pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" containerMessage="Container platform-api failed liveness probe, will be restarted"
kuberuntime_container.go:779] "Killing container with a grace period" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" podUID="0137706b-7afa-4ad7-9925-f48053377f9b" containerName="platform-api" containerID="containerd://79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" gracePeriod=60
generic.go:334] "Generic (PLEG): container finished" podID="0137706b-7afa-4ad7-9925-f48053377f9b" containerID="79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7" exitCode=137
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerDied","Data":"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7"}
kubelet.go:2464] "SyncLoop (PLEG): event for pod" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q" event={"ID":"0137706b-7afa-4ad7-9925-f48053377f9b","Type":"ContainerStarted","Data":"4954b402daf71b6bd737ea2c419d23e9370583ec71664ac12cc6ee97f32bfbf9"}
kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="unhealthy" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="startup" status="started" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
kubelet.go:2536] "SyncLoop (probe)" probe="readiness" status="ready" pod="platform-api-qa/platform-api-6684f5fdbf-8fh9q"
As you can see, the kubelet side shows the same flow ending in exitCode=137 and "Type":"ContainerDied". My reading is that 137 = 128 + 9 (SIGKILL), i.e. the container was killed rather than exiting on its own within the grace period.
- I've also checked the containerd logs to confirm the flow down to the container itself (pulled the same way, see below):
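Same approach as for the kubelet, just a different systemd unit:

# on the node: grep the containerd journal for the container ID
sudo journalctl -u containerd -o cat | grep -F 79923472ec40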
level=info msg="StopContainer for \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" with timeout 60 (s)"
level=info msg="Stop container \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" with signal terminated"
level=info msg="Kill container \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\""
level=info msg="shim disconnected" id=79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7 namespace=k8s.io
level=warning msg="cleaning up after shim disconnected" id=79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7 namespace=k8s.io
level=info msg="cleaning up dead shim" namespace=k8s.io
level=info msg="StopContainer for \"79923472ec40e3cb9d41322a4d139471e8a25edc490808407778c37302da46a7\" returns successfully"
Above you can also see that I've already tried bumping terminationGracePeriodSeconds to 60 seconds instead of the default 30, but it didn't help.
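For reference, the bump was applied on the Deployment spec, roughly like this (again assuming the deployment is named platform-api):

kubectl patch deployment platform-api -n platform-api-qa --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'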
Do you have any clue why the container ends up with this exit code after the liveness probe fails, or how I can debug this issue further? I've also raised this question on this thread here.
Thanks for looking at this!