Network issue on kube pods

  1. First, I'm familiar with k8s and how it works;
  2. second, I'll describe my app and how it runs on k8s;
  3. third, I'll ask my question:

This app consists of three pods, each with its own service:

1. nginx with PHP-FPM (called sample-app)

2. redis

3. queue (based on Node.js)

The app works like this:

When a request comes into sample-app, it stores the request as a job in Redis, while the queue pod watches Redis for unprocessed jobs and processes them (for example, sending an email).
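To make that hand-off concrete, one way to watch it from the Redis side is sketched below; the pod name and the list key `jobs` are only placeholders, since the real key depends on how the app enqueues work:

```sh
# Placeholder pod/key names -- adjust to the actual deployment.
# If jobs sit in a Redis list, its length shows whether the worker keeps up:
kubectl exec -it <redis-pod> -- redis-cli LLEN jobs
# Watch commands hitting Redis live; a healthy worker keeps popping jobs:
kubectl exec -it <redis-pod> -- redis-cli MONITOR
```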

Now, my issue:

The queue pod stops working for no apparent reason, at no predictable time, and with no logs. To be clear: there is no error or warning in the k8s logs either; the pod just seems stuck and does nothing! :thinking: The app worked fine on minikube, and on a VM before moving to k8s.
Every time I delete the queue pod, the newly created one starts working normally and processes jobs from Redis.
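A sketch of the workaround with a placeholder pod name; capturing describe/logs/events before the delete is probably worthwhile too, even though the logs show nothing:

```sh
# Placeholder pod name; capture state before recreating the pod
kubectl describe pod <queue-pod>
kubectl logs <queue-pod> --tail=200
kubectl get events --sort-by=.metadata.creationTimestamp
# Current workaround: delete the pod so a fresh one is created
kubectl delete pod <queue-pod>
```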

I tested this app on minikube with benchmark tools (sending requests from outside the cluster with the ab command) with no problems and no errors. But when I move to the real k8s cluster and test with a high request rate, the queue pod stops working again, which is my second issue with this app.
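For reference, the load test was along these lines; the endpoint and numbers below are illustrative, not the exact values used:

```sh
# Illustrative numbers/endpoint; requests are sent from outside the cluster with ab
ab -n 10000 -c 100 http://<sample-app-ingress-or-nodeport>/<endpoint>
```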

Extra information about my cluster:
k8s version 1.15, Docker version 19.3.1, the CNI plugin is Calico, and I use Helm.

Is there any way to troubleshoot this and figure out where the problem comes from?

Can you connect and strace it, for example?

Yes, rata. As I said, the queue pod runs without any restarts. The pod is based on pm2, and if pm2 hits any exception it restarts itself, which would restart the pod too, but in `kubectl get` the restart count for this pod is zero! However, when the pod is stuck I can still connect to it, ping Redis, execute commands, and restart the pm2 process (which restarts the pod).
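Roughly what I check when it happens; pod/service names are placeholders, and this assumes ping is available in the image:

```sh
kubectl get pod <queue-pod>                                 # RESTARTS stays at 0
kubectl exec -it <queue-pod> -- pm2 list                    # pm2 still reports the worker as online
kubectl exec -it <queue-pod> -- ping -c 3 <redis-service>   # Redis is still reachable
kubectl exec -it <queue-pod> -- pm2 restart all             # restarting pm2 brings processing back
```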

Are you sure that it is a network issue and not an app issue?

I would try looking at it from the following angles; a rough command sketch for all three follows the list.

  1. Can you attach to the redis pod and try to ping the service name of the queue pod? This would tell you whether there is a reachability issue.

  2. If the pods are reachable via their service names, does heavy traffic cause a network issue? Do you observe any egress drops on the source container or source node interface, or ingress drops on the destination node or destination container?

  3. When the queue pod freezes, what is the CPU and memory utilization of the container?
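A rough command sketch for the three checks above; pod, service, and interface names are placeholders, and it assumes nslookup/ping exist in the redis image and metrics-server is installed for `kubectl top`:

```sh
# 1. Reachability: resolve and reach the queue pod's service from inside the redis pod
kubectl exec -it <redis-pod> -- nslookup <queue-service>
kubectl exec -it <redis-pod> -- ping -c 3 <queue-service>

# 2. Drops under load: interface counters on the source/destination node
ip -s link show <node-interface>        # RX/TX "dropped" counters
netstat -s | grep -iE 'drop|retrans'    # protocol-level drops and retransmits

# 3. Resource usage of the frozen container
kubectl top pod <queue-pod> --containers
```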

Great. Sorry I wasn’t clear; I meant whether you can attach strace to that process and see what it is doing. strace will show you whether it is waiting on some syscall, and might help you understand what is happening.
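With Docker as the runtime, one way to do that from the node the queue pod is scheduled on (the container ID is a placeholder):

```sh
# On the node running the queue pod (Docker runtime)
docker ps | grep queue                                   # find the container ID
PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)
strace -f -p "$PID"     # shows whether the process is parked in a syscall (e.g. epoll_wait, read)
```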

CPU and memory wise, I guess it is fine and not hitting any limits either?
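To confirm, the configured requests/limits can be checked against what the pod actually uses (placeholder pod name):

```sh
# Show configured requests/limits for the queue container
kubectl describe pod <queue-pod> | grep -A 6 -E 'Limits|Requests'
```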