Here’s how I would investigate this.
If you merge this with the steps of the scientific method, you should be able to get somewhere. I wanted to keep it short, but I feel like I could keep expanding this forever, and you probably just need somewhere to start.
1. Visualize a Line in the Architecture
Where does data flow from and to?
In real-world cases there could be branches leading to other deployments and pods, since they depend on other services or are depended on themselves. What I like to do is start by picking one line in the flow and narrowing in on a pod-by-pod basis. Sometimes things overlap.
Here’s a simple example of a path I would want to troubleshoot.
Containers in Pod → Pod → Deployment → Service (Type LoadBalancer) → AWS Load Balancer (Managed by CCM)
2. Getting Information
Check Logs
Checking the Containers in Pods
If the pod is a part of a deployment, check out the deployment selectors.
kubectl -n kube-system get deployment coredns -o json | jq '.spec.selector.matchLabels'
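Find the pods in the deployment via the selectors. To match on several labels, put them in a single -l flag separated by commas (for example -l key1=value1,key2=value2); repeating the -l flag does not combine them. Here’s an example of looking up the CoreDNS pods: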
$ kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-sdlb6   1/1     Running   3          12d
coredns-f9fd979d6-xngwr   1/1     Running   3          12d
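When a deployment selects on more than one label, the lookup above can be wrapped in a small helper. This is a minimal sketch, not anything kubectl ships: the function names selector_from_labels and pods_matching are made up for illustration, and it assumes the labels are passed as key=value arguments.

```shell
# Join key=value label pairs into one comma-separated selector.
# Labels in a comma-separated selector are ANDed together.
selector_from_labels() {
    selector=""
    for pair in "$@"; do
        if [ -z "$selector" ]; then
            selector="$pair"
        else
            selector="$selector,$pair"
        fi
    done
    printf '%s\n' "$selector"
}

# List the pods in a namespace that carry every given label.
pods_matching() {
    namespace="$1"
    shift
    kubectl -n "$namespace" get pods -l "$(selector_from_labels "$@")"
}
```

For the example in this post, `pods_matching kube-system k8s-app=kube-dns` should run the same query as the command above.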
Get all the containers in a pod:
kubectl -n kube-system get pod coredns-f9fd979d6-sdlb6 -o json | jq '.spec.containers[].name'
Check the logs for each container in the list. In the prior example there’s only one container in the pod, but some pods have several. If there are multiple containers, you must use -c to name the one you want.
kubectl -n kube-system logs pod/coredns-f9fd979d6-sdlb6
kubectl -n kube-system logs pod/coredns-f9fd979d6-sdlb6 -c coredns
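For pods with several containers, the two steps above (list the containers, then fetch logs for each) can be combined in a loop. A rough sketch follows; logs_for_all_containers is a hypothetical helper name, it uses kubectl's jsonpath output instead of jq, and it only covers regular containers, not init containers.

```shell
# Print the last lines of logs for every container in a pod.
logs_for_all_containers() {
    namespace="$1"
    pod="$2"
    tail_lines="${3:-50}"
    # jsonpath prints the container names separated by spaces.
    containers="$(kubectl -n "$namespace" get pod "$pod" \
        -o jsonpath='{.spec.containers[*].name}')"
    for container in $containers; do
        printf '=== %s/%s ===\n' "$pod" "$container"
        kubectl -n "$namespace" logs "pod/$pod" -c "$container" --tail="$tail_lines"
    done
}
```

Example: `logs_for_all_containers kube-system coredns-f9fd979d6-sdlb6`. Since the pods above show restarts, it may also be worth running `kubectl logs` with `--previous` to see the output of the container instance that exited.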
Lastly, check logs that the commands above don’t account for. A lot of containers run applications that write to multiple log files, so standard output alone is unreliable. This is just a development mistake where the developer wasn’t thinking like a system administrator.
To find those logs, open a shell inside the container and check the places where the application would typically write them.
Knowing Linux is kind of a core dependency at this point. There’s a book out there called The Linux Command Line by William Shotts. His website sells hard copies of the book and also provides a free PDF.
Here is how to open a shell in a container. What we’re doing is running /bin/sh with the interactive and tty flags. It’s important to note that some containers have no shell. Some developers build containers with just the bare minimum (the most notable probably being the kube-system components). I have no advice on how to get into those yet.
kubectl exec -it pod-name -- /bin/sh
kubectl exec -it pod-name -c container-name -- /bin/sh
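If you only need a quick look at log files rather than a full shell, the same exec mechanism can run a single command. The sketch below is illustrative only: the helper names are invented, /var/log is just a common default and not necessarily where your application writes, and it assumes the image ships ls and tail.

```shell
# List a directory inside a container, newest files first.
# Assumption: the image ships `ls`; /var/log is only a common default.
list_container_dir() {
    namespace="$1"; pod="$2"; container="$3"
    dir="${4:-/var/log}"
    kubectl -n "$namespace" exec "$pod" -c "$container" -- ls -lt "$dir"
}

# Print the last lines of a file inside a container.
# Assumption: the image ships `tail`.
tail_container_file() {
    namespace="$1"; pod="$2"; container="$3"; file="$4"
    lines="${5:-50}"
    kubectl -n "$namespace" exec "$pod" -c "$container" -- tail -n "$lines" "$file"
}
```

Usage would look like `list_container_dir my-namespace pod-name container-name /app/logs`, where the namespace and path are placeholders for your own.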
At this point you’ve reviewed the components of the pod.
Check out the status for the pod.
kubectl -n kube-system describe pod coredns-f9fd979d6-sdlb6
Check out the deployment logs:
kubectl -n kube-system logs deployment/coredns
Check out the status for the deployment.
kubectl -n kube-system describe deployment/coredns
Check out the logs of the service:
kubectl -n kube-system logs svc/kube-dns
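Since the point is to compare what each stop on the line reports, it can help to save the output of the commands above to files so you can diff them before and after an incident. Here is one possible sketch; collect_line_info and the file names are my own choices, not a standard tool.

```shell
# Save describe and logs output for a deployment and its service.
collect_line_info() {
    namespace="$1"
    deployment="$2"
    service="$3"
    outdir="$4"
    mkdir -p "$outdir" || return 1
    kubectl -n "$namespace" describe "deployment/$deployment" > "$outdir/deployment.describe.txt"
    kubectl -n "$namespace" logs "deployment/$deployment" > "$outdir/deployment.logs.txt"
    kubectl -n "$namespace" describe "svc/$service" > "$outdir/service.describe.txt"
    kubectl -n "$namespace" logs "svc/$service" > "$outdir/service.logs.txt"
}
```

For the example path: `collect_line_info kube-system coredns kube-dns ./snapshot-1`. Note that `kubectl logs` on a deployment or service reads from one of the matching pods, so the per-pod checks earlier still matter.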
Checking the CCM
I wouldn’t assume problems with the CCM here, but it doesn’t hurt to check out its logs. There are a few CCMs out there that were just kind of slapped together, but the AWS CCM is probably solid. There’s a healthy bug/patch cycle for it because so many people use it.
I’m intentionally not going to expand on this, since what I think might not be correct.
My assumption is that troubleshooting the CCM just involves checking the logs of the pods related to it. I don’t think there are any additional API resources to check, but I haven’t gotten deep into how it is implemented yet.
Check the Load Balancer Logs
You can also check the load balancer’s access logs. I’m not sure if they have error logs per se, but I bet the AWS support team would be able to point you toward further documentation.
Metrics as a Tool
The goal is to see where the problem starts in the line. This can be done with synthetic checks at every stop. Here are some brief ideas on what you can do. When in doubt, refer to the steps of the scientific method that I didn’t bother to document here.
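As a rough idea of what a synthetic check at every stop could look like, here is a sketch built on curl. It assumes each stop answers HTTP and that you run it from somewhere that can reach the addresses; the function names are made up for this example.

```shell
# Print "<url> <http status> <total seconds>" for one request.
# A status of 000 means curl could not complete the request.
probe() {
    url="$1"
    result="$(curl -s -o /dev/null --max-time 5 \
        -w '%{http_code} %{time_total}' "$url")"
    printf '%s %s\n' "$url" "$result"
}

# Probe each stop on the line, in the order given.
probe_line() {
    for stop in "$@"; do
        probe "$stop"
    done
}
```

You would call it with one URL per stop, for example `probe_line http://POD_IP:8080/ http://my-service.my-namespace.svc/ http://LOAD_BALANCER_HOSTNAME/`, where all three addresses are placeholders. The first stop that is slow or failing is where to dig.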
Some Ideas for Where & What to Collect Metrics
- sidecar container (the network is shared between containers)
- APM modules loaded into the app
- synthetically check response over
Note: Kubernetes provides liveness probes; they are worth looking into.
- scan for stuff in logs and aggregate that
Can’t really think of anything else here without doing further research.
- synthetically check response over
- checking at the load balancer associated with the service
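For the "scan for stuff in logs and aggregate that" idea, a very small version can be done with standard tools. This sketch assumes each log line starts with the timestamp that `kubectl logs --timestamps` prepends; errors_per_minute is a hypothetical name and "error" is only a default pattern.

```shell
# Count log lines per minute that match a pattern (default: "error").
# Assumes each line begins with a timestamp like 2021-01-05T12:34:56...Z,
# so the first 16 characters identify the minute.
errors_per_minute() {
    pattern="${1:-error}"
    grep -i -- "$pattern" | cut -c1-16 | sort | uniq -c
}
```

Example: `kubectl -n kube-system logs pod/coredns-f9fd979d6-sdlb6 --timestamps | errors_per_minute timeout`. A jump in the count for a given minute gives you a time window to compare against the other stops.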
What to do with the data
Once you have data that covers the time of the event, you can continue.
If the pod is slow to respond, and there is nothing in the data gathered, investigate the dependencies.
Things like the CSI (storage) and the CNI (networking) are both dependencies in addition to things like databases. You should implement metrics to check their response times to figure out if the application is the problem or not.
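A crude way to start measuring a dependency's response time is to time a command that exercises it. The helper below is only a sketch with whole-second resolution; elapsed_seconds is an invented name, and a real setup would export these numbers to whatever metrics system you use.

```shell
# Run a command, discard its output, and report how long it took.
# Resolution is whole seconds, so this only catches coarse slowness.
elapsed_seconds() {
    start="$(date +%s)"
    "$@" > /dev/null 2>&1
    status=$?
    end="$(date +%s)"
    printf '%s seconds (exit %s)\n' "$((end - start))" "$status"
}
```

For networking you might run `elapsed_seconds kubectl -n my-namespace exec pod-name -- nslookup kubernetes.default`, assuming the image includes nslookup; the namespace and pod name are placeholders. Keep in mind that the measured time includes the overhead of `kubectl exec` itself.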