Cluster information:
Kubernetes version: not captured (the `kubectl version` output below only shows Kustomize v5.0.1, not the server version)
Cloud being used: Google Cloud Kubernetes
Installation method:
Host OS: Linux
CNI and version:
CRI and version:
Hello,
I’m troubleshooting recurring OOM kills (Exit Code 137) on GKE for a Next.js application. Looking for insights on memory management and debugging strategies.
Environment:
- Platform: GKE (Google Kubernetes Engine)
- Node pool: custom (node-pool=frontend)
- Container: Node.js 20.5.0 Alpine
- Image: Next.js 15.3.3 standalone build
Pod Configuration:
- Resources:
  - Limits: cpu 2800m, memory 3500Mi
  - Requests: cpu 2000m, memory 3000Mi
- Environment: NODE_OPTIONS=--max-old-space-size=2048
- Probes:
  - Liveness: http-get :3000/api/health (delay=20s, period=10s)
  - Readiness: http-get :3000/api/health (delay=20s, period=10s)
Problem:
Pods consistently OOM kill with Exit Code 137 after 5-11 hours of runtime. Pattern is accelerating (was 11h, now 5h).
Timeline:
Dec 3: 13:39 Start → 18:43 OOM (5h) → 05:49 OOM (11h)
Dec 4: 08:44 Start → 13:46 OOM (5h)
Restart Count: 8 (doubled in 24 hours)
Observations:
- Memory grows steadily:
  - Starts: ~800MB
  - After 3h: ~2GB
  - After 5h: ~3.4GB → OOM
- Consistent pattern:
  - Always Exit Code 137
  - No other errors in logs
  - Health checks pass until OOM
- High traffic:
  - ~1,500-3,000 requests/hour
  - Mostly API calls to external service
  - Server-side rendering
Logs Show:
Frequent API calls to the external service; no errors, just normal operation until the sudden OOM.
Questions:
- Is a 3.5GB limit reasonable for a Next.js SSR app? Should I increase it, or is this masking a leak?
- Heap limit vs. memory limit: currently heap=2GB, memory=3.5GB. Is this ratio correct?
- QoS class is Burstable: should I make it Guaranteed by setting requests=limits?
- Node.js memory management: are there better NODE_OPTIONS for production?
- Monitoring: what metrics should I track to catch this earlier?
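On the heap-vs-limit and monitoring questions, it may help to expose both the V8 heap and the whole-process RSS from the health endpoint, since `--max-old-space-size` only caps the V8 heap while the cgroup OOM killer acts on total process memory. A minimal sketch using Node's built-in `process.memoryUsage()` (the function name and response shape here are illustrative, not the app's actual code):

```javascript
// Sketch: report V8 heap vs. whole-process memory in MiB.
// process.memoryUsage() is a built-in Node API; memoryReport() is an
// illustrative helper you could wire into the /api/health handler.
function memoryReport() {
  const m = process.memoryUsage();
  const toMiB = (b) => Math.round(b / 1024 / 1024);
  return {
    rssMiB: toMiB(m.rss),             // total process memory (what OOM kill is based on)
    heapTotalMiB: toMiB(m.heapTotal), // V8 heap reserved (capped by --max-old-space-size)
    heapUsedMiB: toMiB(m.heapUsed),   // V8 heap actually in use
    externalMiB: toMiB(m.external),   // Buffers and native memory outside the V8 heap
  };
}

console.log(memoryReport());
```

Tracking `rssMiB` against the 3500Mi limit over time would show whether the growth is inside the heap (which the 2048 cap should bound) or outside it in Buffers/native memory, which `--max-old-space-size` does not limit at all.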
What I’ve Tried:
- Increased memory from 3GB → 3.5GB (just delayed the problem)
- Added health endpoint with memory metrics
- Checked for memory leaks in application code
- Reviewed logs (no errors, just normal traffic)
- Cannot reproduce in local Docker (different traffic patterns)
Questions for Community:
- Are there GKE-specific memory management settings I should check?
- Should I enable memory profiling in production?
- Are there better probe configurations for memory-intensive apps?
- Should I implement pod disruption budgets?
- Any recommended monitoring/alerting strategies?
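On alerting: if Prometheus with the standard cAdvisor and kube-state-metrics metrics is available (an assumption; the namespace and container labels below are illustrative), an alert on working-set memory approaching the limit would give hours of warning at the observed ~500MB/h growth rate. A sketch of such a rule:

```yaml
# Sketch of a Prometheus alerting rule; group/alert names and label
# selectors (container="frontend") are assumptions, not actual config.
groups:
  - name: frontend-memory
    rules:
      - alert: PodMemoryNearLimit
        # Working-set memory (what the OOM killer acts on) as a fraction
        # of the configured container memory limit.
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{container="frontend"}
          )
          /
          max by (namespace, pod, container) (
            kube_pod_container_resource_limits{resource="memory", container="frontend"}
          )
          > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container memory above 85% of its limit for 10 minutes"
```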
Current Mitigation:
Considering:
- Increase memory to 5GB (temporary)
- Scale out to 6 replicas (distribute load)
- Implement pod auto-scaling based on memory

But want to fix root cause, not just mask it.
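One caveat on the memory-based auto-scaling idea: if this is a per-pod leak, each replica will still grow to its limit regardless of replica count, so scaling out only buys time. For completeness, a minimal `autoscaling/v2` sketch of what that would look like (the Deployment name `frontend` is an assumption):

```yaml
# Sketch only: scales on average memory utilization across replicas.
# "frontend" is an assumed Deployment name; replica bounds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```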
Any advice appreciated!
What did you expect to happen?
The pods should keep running once deployed, not be OOM-killed and restarted every few hours.
How can we reproduce it (as minimally and precisely as possible)?
NA
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
# Kustomize Version: v5.0.1
Cloud provider
Google Cloud Kubernetes
OS version
# On Linux:
$ cat /etc/os-release
# NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
$ uname -a
# Linux bastion 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Wed Dec 22 13:25:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux