Spikes in TargetResponseTime on an Ingress-created AWS Application Load Balancer

We have a Kubernetes cluster on EKS. In front of it is an AWS Application Load Balancer created by an Ingress.
The TargetResponseTime metric in AWS strangely shows spikes every half an hour. There are no cron jobs on the application side either. Where should we look to understand this? We have run out of options: CloudWatch, Container Insights, HPA, logs, etc. :frowning: Does anyone have a hint?

Cluster information:

Kubernetes version: 1.18
Cloud being used: AWS EKS
Installation method: eksctl
Host OS: Amazon Linux 2

Start working to isolate which part of your application has the latent response. A general assumption here is that CloudWatch is monitoring a load balancer for you and that’s what is being observed. Figure out ways to observe your components behind the load balancer and aggregate more data on each piece of the application. Without better metrics, you’re just making guesses.

Thank you! But we do have metrics for the EKS components as well. None of the graphs for pods or containers show anything abnormal. Can you please shed some light on what kind of metrics we should enable, and on which components, to understand the issue?

So if the response of the service behind the load balancer is not latent, then perhaps the problem is with the load balancer or the network linking the load balancer to the thing it’s polling? That seems like something AWS would need to look into.

Edit: Also, it could be the network between the load balancer and CloudWatch.

No, we have a different dashboard which indicates the same pattern. So it can’t be about CloudWatch.

Regarding the network between load balancer and the backend, what do you recommend to look at? Because we already talked with the AWS support folks and they themselves couldn’t hint at anything :frowning:

Thanks for being active on this case, by the way!

Here’s how I would investigate this.

If you kind of merge this with the steps of the scientific method, I think you should be able to get somewhere on this. I wanted to keep it short, however I feel like I could keep expanding this forever and you probably just need somewhere to start.

1. Visualize a Line in the Architecture

Where does data flow from and to?

In real-world cases there could be branches leading to other deployments and pods, as they depend on other services or are depended on. What I like to do is start by picking one line in the flow and narrowing in on a pod-by-pod basis. Sometimes things overlap.

Here’s a simple example of a path I would want to troubleshoot.

Containers in Pod → Pod → Deployment → Service (Type LoadBalancer) → AWS Load Balancer (Managed by CCM)

2. Getting Information

2. Check Logs

Checking the Containers in Pods

If the pod is a part of a deployment, check out the deployment selectors.

kubectl -n kube-system get deployment coredns -o json | jq '.spec.selector.matchLabels'

Find pods in the deployment via the selectors. To match on several labels at once, comma-separate them in a single -l (for example -l key1=value1,key2=value2). Here’s an example of looking up k8s-app=kube-dns:

$ kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-sdlb6   1/1     Running   3          12d
coredns-f9fd979d6-xngwr   1/1     Running   3          12d

Get all the containers in a pod:

$ kubectl -n kube-system get pod coredns-f9fd979d6-sdlb6 -o json | jq '.spec.containers[].name'
"coredns"

Check out the logs for each of the containers in the list. In the prior example there’s only one container in the pod, but some pods have multiple. If there are multiple containers, you must use -c.

$ kubectl -n kube-system logs pod/coredns-f9fd979d6-sdlb6
$ kubectl -n kube-system logs pod/coredns-f9fd979d6-sdlb6 -c coredns
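Since the spikes recur on a schedule, it can help to pull only the log lines from around a spike, with timestamps, so they can be lined up against the TargetResponseTime graph. Here is a minimal sketch, assuming a bash-like shell. logs_since is a hypothetical helper name, and the namespace, selector, and time window in the example are placeholders:

```shell
# Hypothetical helper: print timestamped logs from every container of every
# pod matching a label selector, limited to a recent time window.
logs_since() {
  ns=$1 selector=$2 since=$3
  for pod in $(kubectl -n "$ns" get pods -l "$selector" -o name); do
    echo "== $pod"
    kubectl -n "$ns" logs "$pod" --all-containers --timestamps --since="$since"
  done
}

# Example (placeholders): logs_since kube-system k8s-app=kube-dns 45m
```

Running it shortly after a spike, with a window that covers the spike, keeps the output small enough to read by eye.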

Lastly, check logs that don’t show up through kubectl logs. Many containers run applications that write to several log files instead of standard output, so the standard output alone can be unreliable. This is usually a development oversight, where the developer wasn’t thinking like a system administrator.

For those, you will want to open a shell inside the container and check the places where the application would typically write logs.

Knowing Linux is kind of a core dependency at this point. There’s a book out there called The Linux Command Line by William Shotts. His website sells hard copies of the book, but also provides a free PDF.

Here is how to open a shell in a container. What we’re doing is running /bin/sh with interactive and tty flags. It’s important to note, some containers have no shells. Some developers have made containers that have just the bare minimum (probably most notable being kube-system components). I have no advice on how to get into those yet.

kubectl exec -it pod-name -- /bin/sh
kubectl exec -it pod-name -c container-name -- /bin/sh

Checking Pods

At this point you’ve reviewed the components of the pod.

Check out the status for the pod.

kubectl -n kube-system describe pod coredns-f9fd979d6-sdlb6
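The Events section at the bottom of describe can also be pulled on its own and sorted by time, which makes a recurring pattern easier to spot. This is a sketch with a hypothetical helper name (pod_events). Note that the API server only retains events for a limited time (one hour unless configured otherwise), so it is worth running soon after a spike:

```shell
# Hypothetical helper: list the events recorded for one pod, oldest first.
pod_events() {
  ns=$1 pod=$2
  kubectl -n "$ns" get events \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# Example: pod_events kube-system coredns-f9fd979d6-sdlb6
```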

Checking Deployments

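Check out the deployment logs (kubectl picks a single pod from the deployment here, so check the other pods individually if there are several):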

kubectl -n kube-system logs deployment/coredns

Check out the status for the deployment.

kubectl -n kube-system describe deployment/coredns

Check out the logs of the service:

kubectl -n kube-system logs svc/kube-dns

Checking the CCM

I wouldn’t assume problems with the CCM here, but it doesn’t hurt to check out its logs. There are a few out there that were just kind of slapped together, but the AWS CCM is probably solid. There’s a healthy bug/patch cycle for it because so many people use it.

I’m intentionally not going to expand on this. What I think might not be correct.

My assumption is that troubleshooting the CCM just involves checking the logs of the pods related to the CCM. I don’t think there are any additional API resources to check for this, but I haven’t gotten deep into how to implement this yet.

Check the Load Balancer Logs

You can also check the load balancer’s access logs. I’m not sure if they have error logs per se, but I bet the AWS support team could point you to further documentation on that.
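ALB access log entries are space-separated, and each entry records the target that served the request along with a target_processing_time, which should show whether the spikes come from one pod or from all of them. Here is a rough sketch for filtering slow requests out of a downloaded log file. slow_requests is a hypothetical helper name and the file name in the example is a placeholder:

```shell
# Hypothetical helper: read ALB access log lines on stdin and print the
# timestamp, target, and target_processing_time of slow requests.
# Fields used: $2 = time, $5 = target:port,
# $7 = target_processing_time in seconds (-1 if no target handled the request).
slow_requests() {
  awk -v limit="$1" '$7 + 0 > limit + 0 { print $2, $5, $7 }'
}

# Example (placeholder file name): gunzip -c access-log.gz | slow_requests 1.0
```

If the slow entries all point at the same target address, that narrows the search to one pod or node. If they are spread evenly, the cause is more likely something shared.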

Metrics as a Tool

The goal is to see where in the line the problem starts. This can be done with synthetic checks at every stop. Here are just some brief ideas on what you can do. When in doubt, refer to the steps of the scientific method, which I didn’t bother to document here.

Some Ideas for Where & What to Collect Metrics

Pods

  • sidecar container (the network is shared between containers)
  • APM modules loaded into the app
  • synthetically check response over kubectl port-forward

Note: Kubernetes provides liveness probes, and they are worth looking into.

Deployment

  • scan for stuff in logs and aggregate that

Can’t really think of anything else here without doing further research.

Service

  • synthetically check response over kubectl port-forward
  • check at the load balancer associated with the service
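A synthetic check over kubectl port-forward could look something like the following sketch. probe_once and probe_loop are hypothetical helper names, and the namespace, service name, ports, and URL are placeholders. Keep in mind that port-forward adds latency of its own, so compare changes over time rather than absolute numbers:

```shell
# Hypothetical helper: print one "epoch_seconds total_time" sample for a URL.
probe_once() {
  printf '%s %s\n' "$(date +%s)" \
    "$(curl -s -o /dev/null -w '%{time_total}' "$1")"
}

# Hypothetical helper: sample a URL every N seconds until interrupted.
probe_loop() {
  while true; do
    probe_once "$1"
    sleep "$2"
  done
}

# Example (placeholders), with a port-forward running in another terminal:
#   kubectl -n my-namespace port-forward svc/my-service 8080:80
#   probe_loop http://localhost:8080/ 10 >> probe.log
```

Running the same loop against a single pod and against the load balancer's own address gives comparable series for each stop in the line.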

What to do with the data

Once there is data after the event has happened, you can continue.

If the pod is slow to respond, and there is nothing in the data gathered, investigate the dependencies.

Things like the CSI (storage) and the CNI (networking) are both dependencies in addition to things like databases. You should implement metrics to check their response times to figure out if the application is the problem or not.
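Once response times have been collected at one stop, the spacing between spikes can be checked to see whether that stop shows the same half-hour pattern as the load balancer. This is a minimal sketch, assuming the samples were recorded as lines of "epoch_seconds latency_seconds" (an assumed format). spike_gaps is a hypothetical helper name:

```shell
# Hypothetical helper: read "epoch_seconds latency" lines on stdin and, for
# each sample above the threshold, print it with the number of seconds since
# the previous spike ("-" for the first one).
spike_gaps() {
  awk -v limit="$1" '
    $2 + 0 > limit + 0 {
      gap = (seen ? $1 - prev : "-")
      print $1, $2, gap
      prev = $1
      seen = 1
    }'
}

# Example (placeholder file name): spike_gaps 1.0 < probe.log
```

Gaps close to 1800 seconds at a given stop would suggest the problem is already present there. If a stop shows no spikes at all, the problem starts further along the line.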

Thanks a ton for the hints, will try them and come back. Really appreciate this :blush:

Well, the only logs for coredns pods seem to be these

Don’t think these are the cause?
And on a similar note, may I please know why there are only 3 lines in the logs?