Hi!
I’m trying to debug an issue that is solved by using hostNetwork: true. The k8s installation is using kubenet, and the k8s version is 1.9.8.
The installation is done with kops on AWS, using m4.xlarge and c4.xlarge instances.
The problem is the following:
When we migrated this application to kubernetes, the 95th percentile of the response time for a certain endpoint increased by about 20-30%.
This issue is solved, though, when using hostNetwork: true in the yaml. The performance is then the same as it was on VMs for this endpoint, i.e. the 95th percentile of the response time is the same.
I asked about this in the kubernetes office hours on July 18th (yeah, a while ago!) and the hostNetwork: true workaround came up there. They told me to cc them if I created an issue on github, but I’m starting here as the clues are really vague. Just using the tag
The pod has 3 containers:
- Nginx
- A log collector
- The app (a ruby app running with unicorn)
The same three components also run together in the VM setup.
What I tried:
- Found a way to reproduce it using ab (apache benchmark): ab -c 1 -n 1000 'https://...
- The same happens with http, instead of https
- I tried removing the nginx container, but it doesn’t change anything.
- The log collector is used over localhost, and the very same thing is done on the VMs that do not exhibit the problem
- I tried using unix sockets between nginx and the app, instead of localhost, but it didn’t change anything either.
- Tried using the same instances (m4.xlarge) with EKS: the same happens, although the performance cost of not using hostNetwork: true is smaller, about 10%. Please note that EKS does not use kubenet and uses its own network overlay, based on some open source project.
- Tried using another endpoint that just returns a string (puts “Ok”) and the issue does not happen
- Tried using an endpoint that returns a few MBs (like "Die" * 10 * 1024 * 1024), and the issue does not happen either
- Tried the same endpoint that has the issue with different query string params, so the response is big (9MB) or small (130KB), and both reliably reproduce the issue
- Tried a nodejs application that returns similar JSON from similar sources, and the issue is not present (with either short or long responses)
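For comparing the two setups outside of ab, the sequential ab run above can be approximated with a small script. This is only a rough sketch under some assumptions: the two URLs are placeholders passed on the command line (one for the pod running with hostNetwork: true, one for the pod without it), the function names are made up for illustration, and the percentile uses the nearest-rank definition, which may differ slightly from what ab reports.

```python
import math
import sys
import time
import urllib.request


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def relative_increase(baseline, candidate):
    """Increase of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def time_requests(url, count):
    """Issue `count` sequential GETs (one at a time, like ab -c 1)
    and return the latency of each request in milliseconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as response:
            response.read()  # include the body transfer in the timing
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Usage (both URLs are placeholders): script.py HOSTNETWORK_URL PODNETWORK_URL
if __name__ == "__main__" and len(sys.argv) == 3:
    host_p95 = percentile(time_requests(sys.argv[1], 1000), 95)
    pod_p95 = percentile(time_requests(sys.argv[2], 1000), 95)
    print(f"hostNetwork p95: {host_p95:.1f} ms")
    print(f"pod network p95: {pod_p95:.1f} ms")
    print(f"increase: {relative_increase(host_p95, pod_p95):.1f}%")
```

Each call to urlopen opens a new connection, which is close to what ab does when keep-alive (-k) is not enabled.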
What I might do next:
So, I’m trying to debug this issue to understand what it is and, hopefully, stop using hostNetwork: true. There seem to be a few paths to dig further:
- Try other CNIs (EKS showed less performance degradation) to see if the performance changes
- See what this endpoint does or how it interacts with unicorn and the whole stack. One big difference is that unicorn handles one request per process at a time (synchronous) and nodejs does not.
- Try newer machines (m5/c5) to see if they mitigate the performance hit. But as this issue is not present with the current instances when they are used as VMs, it seems that if it helps, it will only hide the problem
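One way to narrow down where the extra time goes would be to split each request into connection setup, waiting for the first byte, and body transfer, and compare the phases with and without hostNetwork: true. The following is a minimal sketch of that idea, not something already tried here: the URL is a placeholder taken from the command line, the function names are hypothetical, and the "first byte" phase really measures the time until the status line and headers have been read.

```python
import http.client
import sys
import time
from urllib.parse import urlsplit


def timed_get(url):
    """Return (connect_ms, first_byte_ms, transfer_ms) for one GET."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=30)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    try:
        t0 = time.perf_counter()
        conn.connect()                 # TCP handshake (plus TLS for https)
        t1 = time.perf_counter()
        conn.request("GET", path)
        response = conn.getresponse()  # returns once status and headers are read
        t2 = time.perf_counter()
        response.read()                # body transfer
        t3 = time.perf_counter()
    finally:
        conn.close()
    return ((t1 - t0) * 1000.0, (t2 - t1) * 1000.0, (t3 - t2) * 1000.0)


def mean_phases(samples):
    """Average each of the three phases over a list of timed_get results."""
    count = len(samples)
    return tuple(sum(sample[i] for sample in samples) / count for i in range(3))


# Usage (the URL is a placeholder): script.py URL
if __name__ == "__main__" and len(sys.argv) == 2:
    results = [timed_get(sys.argv[1]) for _ in range(100)]
    connect_ms, first_byte_ms, transfer_ms = mean_phases(results)
    print(f"connect:    {connect_ms:.2f} ms")
    print(f"first byte: {first_byte_ms:.2f} ms")
    print(f"transfer:   {transfer_ms:.2f} ms")
```

If only the first-byte phase grows, that would point more at the app and what it talks to than at the connection setup or the transfer of the response.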
The endpoint that has the perf problem is a ruby endpoint that reads from a database and returns JSON. The database, the host and the network all seem fine (monitoring CPU, disk IO, swap, etc. with vmstat, our regular tools and the AWS console, and also checking kern.log, syslog and that kind of stuff)
By any chance, did you have a similar experience? Or do you have any other ideas on how to continue to debug this issue?
Any ideas or any kind of help is more than welcome!
Rodrigo