I’ll describe the background first; my question is at the end.
We have several clusters running Kubernetes 1.24 (recently upgraded from 1.13).
In those clusters, our services are NodePort services (we’re considering changing, but that’s down the road).
This results in a random port in the range 30000-32767 being assigned for each service’s “http” and “https” nodeport. This is all fine.
Some of our services define “custom” nodeport entries for various reasons, and they use a hardcoded port number in those cases. I’ve noticed that some of those services use a port value inside the reserved 30000-32767 range. That’s obviously the wrong thing to do, but Kubernetes doesn’t block this, and I didn’t think to put in mechanisms to prevent it.
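For what it’s worth, the kind of prevention mechanism I have in mind is roughly the sketch below. It is only an illustration: the function name and the example manifest are made up, and it assumes the deployment code has the Service manifest available as a parsed dict before it is submitted, so any nodePort present at that point was hardcoded by the service owner. The bounds are parameters because the actual range on a cluster is set by the API server’s --service-node-port-range setting; the defaults below are just the range mentioned above.

```python
# Hypothetical pre-deployment check; names and defaults are illustrative only.
DEFAULT_RANGE = (30000, 32767)


def hardcoded_node_ports_in_range(manifest, port_range=DEFAULT_RANGE):
    """Return (port name, nodePort) pairs whose explicit nodePort lies in port_range.

    `manifest` is a Service manifest parsed into a dict *before* submission,
    so a nodePort key is only present when it was hardcoded.
    """
    low, high = port_range
    hits = []
    for port in manifest.get("spec", {}).get("ports", []):
        node_port = port.get("nodePort")
        if node_port is not None and low <= node_port <= high:
            hits.append((port.get("name", "<unnamed>"), node_port))
    return hits


# Made-up manifest: "http" is left for Kubernetes to assign, "custom" is hardcoded.
example_manifest = {
    "kind": "Service",
    "metadata": {"name": "reports", "namespace": "team-a"},
    "spec": {
        "type": "NodePort",
        "ports": [
            {"name": "http", "port": 80},
            {"name": "custom", "port": 9000, "nodePort": 31000},
        ],
    },
}

print(hardcoded_node_ports_in_range(example_manifest))  # prints [('custom', 31000)]
```

A deployment pipeline could fail or warn whenever this list is non-empty, depending on how strict we decide to be.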
I know that if one of these services attempts a deployment where its hardcoded port number has already been randomly allocated by the k8s controller in a previous deployment, the deployment will fail with an error saying the port is already in use.
My question is about the opposite case. When the k8s controller randomly picks a port in its reserved range, does it check whether that port is already in use (not by another random assignment, obviously, but by a hardcoded value in a custom nodeport) and try another random port if so? Or does it make the (reasonable) assumption that nothing else should be using those ports?
If the k8s controller doesn’t check whether that port is already in use, then that rollout will fail with a similar error, but this case will be much harder to diagnose than a failed rollout of a “custom” nodeport. In the custom nodeport case, we know the failure is caused by a mistake in that service. In the random assignment case, all we know is that SOME service has already claimed that port. The error won’t tell me which service it is, so I would have to look at all the service objects in the entire cluster to find the custom nodeport with the bad port number.
It would actually be a good thing if the controller doesn’t check, as that would definitely increase the urgency of this and make it easier for me to argue for adding special checks in our deployment code to prevent this from happening in the first place.
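To give an idea of the cluster-wide search I would be stuck doing, here is a rough sketch. It assumes the input is the parsed JSON from `kubectl get services --all-namespaces -o json`, which is a list object with an `items` array; the function name and the sample data are made up for illustration.

```python
def services_using_node_port(service_list, node_port):
    """Return "namespace/name" for every Service that has a port with this nodePort.

    `service_list` is assumed to be the parsed JSON output of
    `kubectl get services --all-namespaces -o json`.
    """
    owners = []
    for svc in service_list.get("items", []):
        meta = svc.get("metadata", {})
        for port in svc.get("spec", {}).get("ports", []):
            if port.get("nodePort") == node_port:
                owners.append(f"{meta.get('namespace')}/{meta.get('name')}")
                break  # one match per service is enough
    return owners


# Made-up sample standing in for real kubectl output.
sample_services = {
    "kind": "List",
    "items": [
        {
            "metadata": {"name": "reports", "namespace": "team-a"},
            "spec": {"type": "NodePort",
                     "ports": [{"name": "custom", "port": 9000, "nodePort": 31000}]},
        },
        {
            "metadata": {"name": "billing", "namespace": "team-b"},
            "spec": {"type": "NodePort",
                     "ports": [{"name": "http", "port": 80, "nodePort": 30412}]},
        },
    ],
}

print(services_using_node_port(sample_services, 31000))  # prints ['team-a/reports']
```

With real data, the dict would come from something like `json.load()` on the saved kubectl output. Note that this only tells me which service currently holds the port, not whether the value was hardcoded or randomly assigned, so I would still have to check that service’s source manifest.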