Hi everyone, I’m running my restaurant-focused website (which features Texas Roadhouse menu pages, reviews, and coupon listings) inside a Kubernetes cluster on a managed cloud service. The setup uses Nginx as a reverse proxy, a Node.js backend API, and a MySQL database deployed via StatefulSet. Everything was running fine for a few months, but recently, I’ve started noticing random pod restarts, failed health checks, and unstable traffic routing through my services.
The main issue is that my Node.js application pods occasionally fail the liveness probe. Kubernetes automatically restarts them, but the logs don’t show any critical application-level errors, just normal request handling followed by a sudden termination. Checking the pod events with kubectl describe pod, I see “Readiness probe failed: connection refused” shortly before each restart. Increasing the initial delay and timeout values helped temporarily, but the issue returns every few hours.
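For context, the kind of probe tuning I’ve been experimenting with looks roughly like this (the /healthz path, port 3000, and all timing values below are illustrative, not my exact config):

```yaml
# Illustrative probe settings for the Node.js Deployment.
# /healthz, port 3000, and every timing value are example numbers.
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30   # give the app time to boot before liveness kicks in
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # ~30s of consecutive failures before a restart
readinessProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2       # drop the pod from Service endpoints quickly
```

My understanding is that readiness should be quicker to fail (so traffic stops) while liveness should be slower to fail (so a brief stall doesn’t trigger a restart), but I’d appreciate a sanity check on that.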
I also noticed that CPU and memory usage spike right before a restart. I’ve already added resource limits and requests to the deployment YAML, but the metrics in kubectl top pods show short random bursts that push the pods against their CPU limits and trigger throttling. The Node.js app handles dynamic data (menu updates, images, and API calls), so I wonder if these bursts are linked to higher traffic or inefficient memory usage. However, the same code runs smoothly in a standalone Docker environment, so I suspect this might be related to Kubernetes resource management or autoscaling behavior.
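For what it’s worth, the resource block on the Deployment is shaped roughly like this (the numbers are illustrative, not my real values). As I understand it, CPU over the limit is throttled rather than killed, while exceeding the memory limit gets the container OOM-killed, and V8 doesn’t size its heap from the cgroup limit on its own, so I’ve been considering capping it explicitly:

```yaml
# Illustrative requests/limits for the Node.js container;
# all numbers are example values, not my production settings.
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"       # CPU above this is throttled, not killed
    memory: "512Mi"   # exceeding this OOM-kills the container
env:
  - name: NODE_OPTIONS
    value: "--max-old-space-size=384"  # keep the V8 heap below the memory limit
```

If the liveness failures coincide with throttling, maybe the probe is simply timing out while the container is being throttled, rather than the app actually hanging?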
Another concern is with persistent storage. My MySQL pod is using a PersistentVolumeClaim with an SSD-backed disk, but occasionally the application logs show “database connection lost” errors for a few seconds. The database container doesn’t crash, but it seems like there’s a temporary I/O freeze. I’ve verified that the PVC isn’t being rescheduled, so I’m unsure if this is a networking issue between pods or something related to storage throttling.
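In the meantime, I’ve been thinking about making the app tolerate these brief drops instead of surfacing them to users. A generic retry-with-backoff wrapper like the sketch below is what I have in mind for wrapping query calls (plain Node.js, no specific MySQL client assumed; the names withRetry and baseDelayMs are mine):

```javascript
// Retry an async operation with exponential backoff.
// Intended for transient "database connection lost" errors during
// short I/O stalls; function and parameter names are illustrative.
async function withRetry(fn, { retries = 3, baseDelayMs = 100 } = {}) {
  let lastErr;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === retries) break;
      // Wait 100ms, 200ms, 400ms, ... between attempts.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastErr;
}
```

Wrapping each database call this way would at least paper over a few-second freeze, at the cost of some added latency, though it obviously doesn’t fix the underlying storage or networking issue.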
I also noticed that when one pod restarts, the load balancer doesn’t immediately remove it from the Service endpoints, which causes a few 502 errors during user requests. I’m using a standard ClusterIP service with an Nginx Ingress controller. I thought readiness probes were supposed to handle that gracefully, but maybe my configuration isn’t tuned properly. I’ve been experimenting with shorter probe intervals and longer thresholds, but I haven’t found a balance that avoids downtime completely.
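One pattern I’ve read about for the 502s is delaying container shutdown so the endpoint removal has time to propagate to the Ingress before the process stops accepting connections, roughly like this (the 10-second sleep and grace period are arbitrary example values):

```yaml
# Illustrative: delay SIGTERM so endpoint updates reach the Ingress
# before the pod stops serving; values are examples only.
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]
```

Does anyone know if this is still the recommended approach, or whether tuning the readiness probe alone should be enough?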
Has anyone else experienced similar issues where pods intermittently fail health checks and restart even though the app itself isn’t crashing? I’d love some advice on best practices for configuring readiness and liveness probes for Node.js apps, handling storage latency in MySQL StatefulSets, and ensuring smoother traffic routing during pod restarts. My goal is to make the website’s Kubernetes deployment as stable and resilient as it used to be before these issues began. Sorry for the long post!