Exit code 137 - Pods terminated

Hello Team! I am relatively new to k8s and am hoping I can learn a lot from you all!

I need some advice on an issue I am facing with k8s 1.14 while running GitLab pipelines on it. Many jobs are failing with exit code 137, which I found means that the container is being terminated abruptly.

Cluster information:

Kubernetes version: 1.14
Cloud being used: AWS EKS
Installation method: EKS
Host OS: Amazon Linux
Node: c5.4xlarge

After digging in, I found the below logs:

kubelet: I0114 03:37:08.639450 4721 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 95% which is over the high threshold (85%). Trying to free 3022784921 bytes down to the low threshold (80%).
kubelet: E0114 03:37:08.653132 4721 kubelet.go:1282] Image garbage collection failed once. Stats initialization may not have completed yet: failed to garbage collect required amount of images. Wanted to free 3022784921 bytes, but freed 0 bytes
kubelet: W0114 03:37:23.240990 4721 eviction_manager.go:397] eviction manager: timed out waiting for pods runner-u4zrz1by-project-12123209-concurrent-4zz892_gitlab-managed-apps(d9331870-367e-11ea-b638-0673fa95f662) to be cleaned up
kubelet: W0114 00:15:51.106881 4781 eviction_manager.go:333] eviction manager: attempting to reclaim ephemeral-storage
kubelet: I0114 00:15:51.106907 4781 container_gc.go:85] attempting to delete unused containers
kubelet: I0114 00:15:51.116286 4781 image_gc_manager.go:317] attempting to delete unused images
kubelet: I0114 00:15:51.130499 4781 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
kubelet: I0114 00:15:51.130648 4781 eviction_manager.go:362] eviction manager: pods ranked for eviction:

  1. runner-u4zrz1by-project-10310692-concurrent-1mqrmt_gitlab-managed-apps(d16238f0-3661-11ea-b638-0673fa95f662)
  2. runner-u4zrz1by-project-10310692-concurrent-0hnnlm_gitlab-managed-apps(d1017c51-3661-11ea-b638-0673fa95f662)
  3. runner-u4zrz1by-project-13074486-concurrent-0dlcxb_gitlab-managed-apps(63d78af9-3662-11ea-b638-0673fa95f662)
  4. prometheus-deployment-66885d86f-6j9vt_prometheus(da2788bb-3651-11ea-b638-0673fa95f662)
  5. nginx-ingress-controller-7dcc95dfbf-ld67q_ingress-nginx(6bf8d8e0-35ca-11ea-b638-0673fa95f662)
  6. alertmanager-768d89dcc8-4hxj6_prometheus(d4e6f161-3651-11ea-b638-0673fa95f662)
  7. kube-proxy-bpqm7_kube-system(4e307fee-35c6-11ea-b638-0673fa95f662)
  8. aws-node-rc8rw_kube-system(4e30a734-35c6-11ea-b638-0673fa95f662)

And then the pods get terminated, resulting in the exit code 137s. Can anyone help me understand the reason and a possible solution to overcome this?
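For anyone reading along, here is a minimal sketch of how exit code 137 can be decoded. It assumes the usual convention that an exit code above 128 means the process was killed by signal number (code - 128), and that it runs on Linux. The function name `describe_exit_code` is made up for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

137 is 128 + 9, i.e. SIGKILL. The code alone does not say who sent the signal; in this thread the kubelet logs point at the eviction manager.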


It seems that your applications (started by GitLab Runner) write a lot of data (logs, artifacts, cache?) and the node can’t hold it all, so the eviction manager evicts some of the pods: “must evict pod(s) to reclaim ephemeral-storage”.
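As a rough sanity check, the numbers in the image_gc_manager line can be turned into an estimate of the image filesystem size. This is only a sketch: it assumes the amount the kubelet wants to free is approximately the gap between the reported usage (95%) and the low threshold (80%), and the reported percentage is rounded, so the result is approximate. The function name is hypothetical.

```python
# Values copied from the image_gc_manager log line in the question.
BYTES_TO_FREE = 3_022_784_921
USAGE_PERCENT = 95
LOW_THRESHOLD_PERCENT = 80


def estimate_capacity_bytes(bytes_to_free: int,
                            usage_percent: int,
                            low_threshold_percent: int) -> float:
    """Estimate filesystem capacity from the GC target (approximation)."""
    # Assumption: bytes_to_free ~= capacity * (usage - low threshold).
    fraction_to_free = (usage_percent - low_threshold_percent) / 100
    return bytes_to_free / fraction_to_free


capacity = estimate_capacity_bytes(
    BYTES_TO_FREE, USAGE_PERCENT, LOW_THRESHOLD_PERCENT
)
print(f"estimated capacity: {capacity / 1e9:.1f} GB")
```

The estimate comes out at roughly 20 GB, so the log appears to describe a disk of about that size that is almost full, with no unused images left to delete.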

As a solution you can try using a bigger disk for the nodes, attaching an additional volume to the pods (https://docs.gitlab.com/runner/executors/kubernetes.html#using-volumes), or reducing the number of parallel runners.

Hello Tomasz,

The nodes initially had a 20G EBS volume on a c5.4xlarge. I increased it to 50G and then 100G, but that did not help, so I was not sure whether a bigger disk was supposed to solve the problem. After you advised the same, I changed the instance type to c5d.4xlarge, which has 400GB of cache storage, and gave it 300GB of EBS. This solved the error.

Thanks for confirming the solution.