How does kubelet DNS resolution before pod creation work?

Can someone enlighten me on how DNS resolution occurs before pod creation on a worker node?

My problem is the following:

I have a cluster-wide docker registry deployed, sitting behind a Service that looks like this:

pidocker-docker-registry.default.svc.cluster.local:5000

However, this address doesn’t resolve BEFORE pod creation if I try to use it, i.e.:

k run -it my-app --image=pidocker-docker-registry:5000 (or pidocker-docker-registry.default.svc.cluster.local:5000) --command -- /bin/bash

What happens is I get an immediate ImagePull error because it can’t resolve the docker registry hostname.

But if I create a Pod first (say dnsutils) and then do a dig, everything works (as it should, since I noticed that the dnsutils Pod has the right /etc/resolv.conf and search/domain entries).

I noticed in the doc under DNS troubleshooting there is the following line:

Kubernetes installs do not configure the nodes’ resolv.conf files to use the cluster DNS by default, because that process is inherently distribution-specific. This should probably be implemented eventually.

Is this what I’m running into? The only way for me to “fix” this is to add the CoreDNS IP address to /etc/resolv.conf on the worker node itself which then allows the kubelet to resolve the FQDN of my docker registry and pull successfully.
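For reference, the node's /etc/resolv.conf after that workaround ends up looking roughly like this (the addresses are only examples from my setup; the first is whatever your CoreDNS/kube-dns Service IP is, the second is the node's original resolver):

nameserver 10.96.0.10    # CoreDNS Service IP, added so the kubelet can resolve *.cluster.local
nameserver 192.168.1.1   # the node's original upstream resolver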

Am I doing something wrong?

You really don’t want the nodes depending on Kube services - that’s circular. If the registry pod goes down, how do you bootstrap it again?

It is for internal images only - no images critical to the cluster itself. That’s not so bad (particularly for a learning environment).

Not to mention that this is a fairly standard way of getting minikube to use internal images (port-forwarding 5000 to a registry service).

Best to do that explicitly, then. I don’t think we’d want to do this on all clusters.

So you could add cluster DNS to your resolvers list or you could expose that registry as a NodePort service and always access localhost:30000 or something like that.
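Something like this, as a sketch (the label app: pidocker-docker-registry and the ports are assumptions, so match them to however your registry is actually deployed):

apiVersion: v1
kind: Service
metadata:
  name: pidocker-docker-registry-nodeport
  namespace: default
spec:
  type: NodePort
  selector:
    app: pidocker-docker-registry   # assumed label, match your registry pods
  ports:
  - port: 5000
    targetPort: 5000
    nodePort: 30000

Then every node can pull from localhost:30000/<image> and node-side DNS never enters the picture.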

I did the former and added the CoreDNS server to the resolvers list of the workers and that does the trick.

TBH, you still haven't made a very strong case for why the kubelet doesn't at least consult CoreDNS as a secondary resolver at pod creation time. It seems extremely logical to me (rely on the worker's resolv.conf and use CoreDNS as a backup before giving up).

DNS is fiddly. You rarely want to be consulting 2 DNS servers that have different information, because then they are not interchangeable. In most cases the cluster DNS's upstream is the node's resolver anyway, so that's cyclical. IMO it's better to special-case this one thing, but there's a huge diversity of use-cases, so other solutions have their place.

You always consult multiple DNS servers with different information. That's the whole point of DNS! When you find the one that says it's authoritative, you get a response.

I really do not see why the kubelet doesn't use the CoreDNS server as a secondary DNS source before pod creation, especially since it's authoritative for anything under .cluster.local.

Btw, from the official doc:

Kubernetes installs do not configure the nodes’ resolv.conf files to use the cluster DNS by default, because that process is inherently distribution-specific. This should probably be implemented eventually.

Source: Debugging DNS Resolution | Kubernetes

Yeah, I think it really should. Once you install the kubelet and run it, you are saying this is a worker node and consequently should know about the CoreDNS nameserver during run-time.

You always consult multiple DNS servers with different information.

Beg to differ, or at least to clarify. If you have a local config (e.g. resolv.conf) with 2 different nameservers which return different answers for the same query, you will eventually have chaos. There is no spec for how clients are expected to consume resolv.conf. Some libcs try them in sequence. Some try them in parallel, and some randomize. Some check all responses for one that succeeded, others take the first response (including NXDOMAIN).

It does not always behave as you would expect. We chased COUNTLESS bugs in Kubernetes back when we used to set up pods as you describe.

You’re going to have best client compat by having the local-est resolver forward or pass-thru queries to their upstream.

Btw, from the official doc:

I wrote that. I was probably wrong. To do what you want, we would really want kubelet to have a different config than the rest of the machine (to scope the impact) and probably we’d need to tightly spec and verify the resolver behavior we need and make sure that kubelet ALWAYS has that behavior.

Couple of things:

a) You are talking about broken DNS. k8s can't avoid that, and trying to "work" around it or make more design decisions based on it is not what I would do.

b) Are you really taking the position that a .cluster.local address is going to resolve differently on two different nameservers on a worker node? What exactly are you trying to avoid?

c) What is “upstream” in your case? The Core DNS server? That makes no sense to me.

Look, you made a design decision. I don’t have to agree with it and I accept my work around.

But it does seem absolutely silly to me that the kubelet won't even consider using the CoreDNS server at pod creation time automatically (or even optionally) - especially if I give an FQDN for the registry with the .cluster.local domain.

You are talking about broken DNS

There is NO SPEC (that I can find?) for how client resolvers are supposed to operate in the face of multiple nameservers. To a first approximation, all DNS clients are broken.

Are you really taking the position that a .cluster.local address is going to resolve differently on two different nameservers on a worker node?

I am saying that one nameserver might respond NXDOMAIN while the other successfully answers the query. SOME clients will issue those queries in the order you specify nameservers, giving the answer you expect. Some clients will issue those queries in parallel and take whichever returns first, giving you utterly non-deterministic results. Some clients will randomize the nameserver order, also giving you non-deterministic results.

I am saying that, unless you CAREFULLY control ALL of the clients, it will explode in your face. We know this because we TRIED doing this in Kubernetes and were beaten into submission by a relentless stream of bug reports, which ended at exactly the same time we stopped doing this.

What is “upstream” in your case?

I meant that clients ask cluster DNS exclusively. Cluster DNS is canonical for cluster.local, and any non-result there is NXDOMAIN. All other queries are forwarded to $someone, usually the node's own DNS nameserver(s). That means clients always get a consistent view of names, and there's no ambiguity from asking 2 nameservers and getting 2 responses.
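That's roughly what a stock CoreDNS Corefile already does (trimmed here; yours may differ):

.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf   # anything that isn't cluster.local goes to the node's resolvers
    cache 30
}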

But it does seem absolutely silly to me that the kubelet won't even consider using the CoreDNS server at pod creation time automatically (or even optionally) - especially if I give an FQDN for the registry with the .cluster.local domain.

I didn’t actually say that. What I said was that you have to control the clients. Hypothetically, kubelet could have a distinct resolv.conf or could internally query DNS differently for the cluster suffix (which it knows). It’s just not as simple as throwing another nameserver line into /etc/resolv.conf on the node. If someone wanted to write a KEP about this, to explore how to make it happen, I’d help review that KEP.

Tim

Tim, we are in violent agreement. That was MY point: I shouldn't have to muck with the client's resolv.conf. I am saying the kubelet should have its own predefined ordering for name server resolution which, as you said here, is EXACTLY my thinking on the matter:

Bingo! That's exactly what I was thinking. The kubelet uses the cluster DNS ALWAYS as the primary DNS server for .cluster.local, i.e. it's authoritative there and kicks everything else upstream, not the reverse, which is what is happening now (or in my case, I'm working around it by flipping the nameserver order manually in resolv.conf - yikes!).

I would love to write this…how do we make this happen? Some general guidance would be appreciated (yeah, yeah, I can Google too but sometimes that yields conflicting paths to salvation).

I am saying the kubelet should have its own predefined ordering for name server resolution

Kubelet’s DNS is defined by /etc/resolv.conf. We don’t have a mechanism for kubelet to have its own ordering. That would need a KEP.

The kubelet uses the cluster DNS ALWAYS as the primary DNS server for .cluster.local

Unfortunately there’s no way (that I know of) to switch nameservers based on suffix match. DNS resolution is not something Kubelet ever does explicitly (today) so we’d need some significantly clever work to make this happen.

I’m working around it by flipping the nameserver order manually in resolv.conf

…which is exactly what I am saying we CAN’T do in general, though it may work for your specific case :slight_smile:

I would love to write this…how do we make this happen?

You write a KEP and lay out the problem statement, some possible implementations, the pros/cons/risks, and propose a path forward. Not too hard, except the “possible solutions” part :slight_smile:

We are in agreement. Do you have an example of a well written KEP I could use as a template?

There's a KEP template. This idea doesn't need anything as large as https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20180612-ipv4-ipv6-dual-stack.md, but here are some good KEPs of similar scope:

https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/0030-nodelocal-dns-cache.md

https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20191104-iptables-no-cluster-cidr.md

https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190226-pod-overhead.md

Thank you for this. Sorry for the delay (bad cold). I will look and write something up this week.

Hello, I know this discussion is old, but I'm facing the exact same problem with an AKS cluster. I want to pull pod images from a registry deployed on another cluster. I configured the CoreDNS ConfigMap for the FQDN of the container registry, and I can ping it from inside my pods but not from the nodes. I tried to edit resolv.conf on the node, but the file is managed by systemd-resolved and the changes are not saved.
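The closest thing I've found is telling systemd-resolved to send the registry's domain to the cluster DNS, something like the following, but I'm not sure it's right or whether it survives node reboots/upgrades (10.0.0.10 is just the usual AKS DNS service IP, and eth0 and registry.example.com are placeholders for my interface and registry domain):

sudo resolvectl dns eth0 10.0.0.10
sudo resolvectl domain eth0 '~registry.example.com'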
Sorry, I'm new to networking, so any help would be greatly appreciated.