Problem Summary
I’m experiencing DNS resolution issues in my Talos Linux Kubernetes cluster, deployed on DigitalOcean Droplets with a load balancer in front of the control-plane nodes, similar to the setup suggested by Talos which can be found here. The cluster itself appears to be running correctly, but pods cannot resolve either internal Kubernetes services or external domains.
Cluster information
Kubernetes version: 1.33.3
Cloud being used: DigitalOcean
Installation method: Ansible using talosctl, kubectl, …
Host OS: talos v1.10.6
CNI and version: cilium v1.18.0
CRI and version: containerd v2.0.5
Initial Symptoms
When testing DNS resolution from within a pod:
kubectl exec -ti dnsutils -- nslookup kubernetes.default
The command times out with no response, indicating DNS resolution is not working.
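To narrow down whether the failure is in the Service VIP path or in CoreDNS itself, the resolvers can also be queried directly (10.96.0.10 is the kube-dns ClusterIP in my cluster; 10.244.0.85 is one of my CoreDNS pod IPs, adjust for yours):

```shell
# Query via the kube-dns Service VIP (this is what times out for me)
kubectl exec -ti dnsutils -- nslookup kubernetes.default 10.96.0.10

# Query a CoreDNS pod IP directly, bypassing the Service path entirely
kubectl exec -ti dnsutils -- nslookup kubernetes.default 10.244.0.85
```

If the direct pod query works but the VIP query does not, the problem is most likely in the service load-balancing layer (here, Cilium's kube-proxy replacement) rather than in CoreDNS.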
Investigation Steps Taken
1. Verified CoreDNS Pods Status
kubectl get pods -n kube-system -l k8s-app=kube-dns
Both CoreDNS pods are running and healthy.
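For completeness, I also confirmed the Service has endpoints behind it and checked the CoreDNS logs (standard kubectl commands; the label and namespace match the default Talos CoreDNS deployment):

```shell
# Confirm the kube-dns Service has CoreDNS endpoints backing it
kubectl get endpointslices -n kube-system -l k8s-app=kube-dns

# Look for errors, SERVFAILs, or upstream timeouts in CoreDNS itself
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```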
2. Checked Pod DNS Configuration
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
Output shows correct configuration:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
3. Verified Network Connectivity
kubectl exec -ti dnsutils -- ping -c 3 10.244.0.85
ICMP connectivity to the CoreDNS pod IP works fine, so basic pod-to-pod networking is up. Note that ping only proves ICMP reachability, not that port 53 is actually answering.
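Since ping only exercises ICMP, port 53 should be tested explicitly over both UDP and TCP (dig is available in the dnsutils image; using the FQDN avoids the ndots:5 search-list expansion):

```shell
# UDP query straight at the CoreDNS pod
kubectl exec -ti dnsutils -- dig +timeout=2 @10.244.0.85 kubernetes.default.svc.cluster.local

# Same query over TCP, in case only UDP/53 is being dropped
kubectl exec -ti dnsutils -- dig +tcp +timeout=2 @10.244.0.85 kubernetes.default.svc.cluster.local
```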
4. Checked CoreDNS Configuration
kubectl get configmap -n kube-system coredns -o yaml
CoreDNS is configured to forward to 1.1.1.1 and 1.0.0.1.
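Since kube-proxy is disabled (see proxy.disabled in the config below) and Cilium provides the service load balancing, it may also be worth checking that Cilium has actually programmed the kube-dns ClusterIP. A sketch, assuming the cilium-dbg binary shipped in the v1.18 agent image:

```shell
# List Cilium's service table and look for the kube-dns ClusterIP
kubectl -n kube-system exec ds/cilium -- cilium-dbg service list | grep 10.96.0.10
```

If 10.96.0.10 is missing from the service table, DNS traffic to the VIP has nowhere to go, regardless of CoreDNS health.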
Questions
- Has anyone experienced similar DNS issues with Talos on DigitalOcean? If not, would you mind sharing your working config?
- Are there any specific DigitalOcean networking considerations I should be aware of?
- Are there any additional debugging steps I should try?
Current Configuration
My Talos controlplane configuration (Ansible template):
version: v1alpha1 # Indicates the schema used to decode the contents.
debug: false # Enable verbose logging to the console.
persist: true
# Provides machine specific configuration options.
machine:
type: controlplane # Defines the role of the machine within the cluster.
token: "{{ ansible_vars_for_talos_templates['MACHINE_TOKEN'] }}"
# The root certificate authority of the PKI.
ca:
crt: "{{ ansible_vars_for_talos_templates['MACHINE_CA_CRT'] }}"
key: "{{ ansible_vars_for_talos_templates['MACHINE_CA_KEY'] }}"
# Extra certificate subject alternative names for the machine's certificate.
certSANs:
- "{{ control_plane_endpoint }}" # Add the LB IP
- "{{ node.private_ip }}"
- "{{ node.public_ip }}"
# Used to provide additional options to the kubelet.
kubelet:
image: ghcr.io/siderolabs/kubelet:{{ kubernetes_version }} # The `image` field is an optional reference to an alternative kubelet image.
nodeIP:
validSubnets: ["{{ node.private_ip }}/16"]
defaultRuntimeSeccompProfileEnabled: true # Enable container runtime default Seccomp profile.
disableManifestsDirectory: true # The `disableManifestsDirectory` field configures the kubelet to *not* load static pod manifests from the /etc/kubernetes/manifests directory.
# Provides machine specific network configuration options.
# `interfaces` is used to define the network interface configuration.
network:
interfaces:
- interface: eth1 # Use eth1 for private networking on DigitalOcean
addresses:
- {{ node.private_ip }}/16 # Assign static private IP, /16 is common for DigitalOcean private nets
# vip: # Virtual IP, if needed for HA on this interface
# ip: {{ node.private_ip }} # Example, adjust if you have a specific VIP strategy
dhcp: false
- interface: eth0 # Public network interface
dhcp: true
dhcpOptions: # Request DigitalOcean to assign public IP and gateway
routeMetric: 100 # Lower metric for default route via public if desired, adjust as needed
nameservers:
- 1.1.1.1
- 1.0.0.1
disableSearchDomain: true
# Used to provide instructions for installations.
install:
disk: /dev/sda # The disk used for installations.
image: ghcr.io/siderolabs/installer:{{ talos_version }} # Allows for supplying the image used to perform the installation.
wipe: false # Indicates if the installation disk should be wiped at installation time.
# Used to configure the machine's container image registry mirrors.
registries: {}
# Features describe individual Talos features that can be switched on or off.
features:
rbac: true # Enable role-based access control (RBAC).
stableHostname: true # Enable stable default hostname.
kubernetesTalosAPIAccess:
enabled: true
allowedRoles: ["os:admin"]
allowedKubernetesNamespaces: ["actions-runner-system", "system-upgrade"]
apidCheckExtKeyUsage: true # Enable checks for extended key usage of client certificates in apid.
diskQuotaSupport: true # Enable XFS project quota support for EPHEMERAL partition and user disks.
# KubePrism - local proxy/load balancer on the defined port that will distribute requests across all control plane endpoints.
kubePrism:
enabled: true # Enable KubePrism support - will start local load balancing proxy.
port: 7445 # KubePrism port.
# Configures host DNS caching resolver.
hostDNS:
enabled: true # Enable host DNS caching resolver.
forwardKubeDNSToHost: false # Disable forwarding kube-dns to host DNS to avoid issues. https://github.com/siderolabs/talos/issues/9784
# Configures the node labels for the machine.
nodeLabels:
node.kubernetes.io/exclude-from-external-load-balancers: ""
topology.kubernetes.io/region: "{{ node.region_slug }}"
topology.kubernetes.io/zone: "{{ node.zone }}"
# Allows the addition of user specified files.
files:
- # Spegel
op: create
path: /etc/cri/conf.d/20-customization.part
content: |
[plugins."io.containerd.cri.v1.images"]
discard_unpacked_layers = false
# Used to configure the machine's sysctls.
sysctls:
fs.inotify.max_user_watches: 1048576 # Watchdog
fs.inotify.max_user_instances: 8192 # Watchdog
net.core.rmem_max: 67108864 # 10Gb/s | Cloudflared / QUIC
net.core.wmem_max: 67108864 # 10Gb/s | Cloudflared / QUIC
vm.nr_hugepages: 1024 # PostgreSQLs
# Provides cluster specific configuration options.
cluster:
id: "{{ ansible_vars_for_talos_templates['CLUSTER_ID'] }}"
secret: "{{ ansible_vars_for_talos_templates['CLUSTER_SECRET'] }}"
# Provides control plane specific configuration options.
controlPlane:
endpoint: https://{{ control_plane_endpoint }}:443
clusterName: {{ cluster_name }}
# Provides cluster specific network configuration options.
network:
dnsDomain: cluster.local # The domain used by Kubernetes DNS.
podSubnets: ["10.244.0.0/16"]
serviceSubnets: ["10.96.0.0/12"]
# The CNI used.
cni:
name: none # Name of CNI to use. Set to none because it is deployed by helmfile.
coreDNS:
disabled: false
token: "{{ ansible_vars_for_talos_templates['CLUSTER_TOKEN'] }}"
secretboxEncryptionSecret: "{{ ansible_vars_for_talos_templates['CLUSTER_SECRETBOXENCRYPTIONSECRET'] }}"
# The base64 encoded root certificate authority used by Kubernetes.
ca:
crt: "{{ ansible_vars_for_talos_templates['CLUSTER_CA_CRT'] }}"
key: "{{ ansible_vars_for_talos_templates['CLUSTER_CA_KEY'] }}"
# The base64 encoded aggregator certificate authority used by Kubernetes for front-proxy certificate generation.
aggregatorCA:
crt: "{{ ansible_vars_for_talos_templates['CLUSTER_AGGREGATORCA_CRT'] }}"
key: "{{ ansible_vars_for_talos_templates['CLUSTER_AGGREGATORCA_KEY'] }}"
# The base64 encoded private key for service account token generation.
serviceAccount:
key: "{{ ansible_vars_for_talos_templates['CLUSTER_SERVICEACCOUNT_KEY'] }}"
# API server specific configuration options.
apiServer:
image: registry.k8s.io/kube-apiserver:{{ kubernetes_version }} # The container image used in the API server manifest.
extraArgs:
enable-aggregator-routing: true
# Extra certificate subject alternative names for the API server's certificate.
certSANs: # For Kubernetes API (port 6443)
- "{{ control_plane_endpoint }}"
- "{{ node.private_ip }}"
- "{{ node.public_ip }}"
- "kubernetes"
- "kubernetes.default"
- "kubernetes.default.svc"
- "kubernetes.default.svc.cluster.local"
- "127.0.0.1"
- "localhost"
disablePodSecurityPolicy: true # Disable PodSecurityPolicy in the API server and default manifests.
# Configure the API server admission plugins.
admissionControl:
- name: PodSecurity # Name is the name of the admission controller.
# Configuration is an embedded configuration object to be used as the plugin's
configuration:
apiVersion: pod-security.admission.config.k8s.io/v1alpha1
defaults:
audit: restricted
audit-version: latest
enforce: baseline
enforce-version: latest
warn: restricted
warn-version: latest
exemptions:
namespaces:
- kube-system
runtimeClasses: []
usernames: []
kind: PodSecurityConfiguration
# Configure the API server audit policy.
auditPolicy:
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
- level: Metadata
# Controller manager server specific configuration options.
controllerManager:
image: registry.k8s.io/kube-controller-manager:{{ kubernetes_version }} # The container image used in the controller manager manifest.
extraArgs:
bind-address: 0.0.0.0
# Kube-proxy server-specific configuration options
proxy:
disabled: true # Disable kube-proxy: Cilium will provide kube-proxy replacement functionality.
image: registry.k8s.io/kube-proxy:{{ kubernetes_version }} # The container image used in the kube-proxy manifest.
# Scheduler server specific configuration options.
scheduler:
image: registry.k8s.io/kube-scheduler:{{ kubernetes_version }} # The container image used in the scheduler manifest.
extraArgs:
bind-address: 0.0.0.0 # Bind to all interfaces.
config: # Configure the scheduler profiles.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread # Configure the PodTopologySpread plugin.
args:
defaultingType: List
defaultConstraints: # Evenly spread pods across nodes and zones.
- maxSkew: 1
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
plugins:
score: # Disable ImageLocality plugin as it is not needed for this setup.
disabled:
- name: ImageLocality
# Configures cluster member discovery.
discovery:
enabled: true # Enable the cluster membership discovery feature.
# Configure registries used for cluster member discovery.
registries:
# Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information
kubernetes:
disabled: true # Disable Kubernetes discovery registry.
# Service registry is using an external service to push and pull information about cluster members.
service: {}
# Etcd specific configuration options.
etcd:
# The `ca` is the root certificate authority of the PKI.
ca:
crt: "{{ ansible_vars_for_talos_templates['CLUSTER_ETCD_CA_CRT'] }}"
key: "{{ ansible_vars_for_talos_templates['CLUSTER_ETCD_CA_KEY'] }}"
# A list of urls that point to additional manifests.
extraManifests: []
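Before applying, I validate the rendered config with talosctl (mode cloud, matching the DigitalOcean guide; controlplane.yaml here stands for the rendered output of the template above):

```shell
# Validate the rendered machine config against the v1alpha1 schema
talosctl validate --config controlplane.yaml --mode cloud
```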
Cilium config (values.yaml for helm chart):
---
# Talos v1.10 + DigitalOcean droplets configuration
# Enable direct node routes for better performance (routes pod CIDRs between nodes directly, without encapsulation)
autoDirectNodeRoutes: true
# BPF configuration for Talos compatibility
bpf:
masquerade: false # Disable masquerade to avoid conflicts with DO load balancer
# Required for Talos compatibility - see https://github.com/siderolabs/talos/issues/10002
hostLegacyRouting: true
cni:
exclusive: false
cgroup:
automount:
enabled: false
hostRoot: /sys/fs/cgroup
# NOTE: devices might need to be set if you have more than one active NIC on your hosts
# devices: eno+ eth+
dashboards:
enabled: true
endpointRoutes:
enabled: true
envoy:
rollOutPods: true
prometheus:
serviceMonitor:
enabled: true
gatewayAPI:
enabled: true
hubble:
enabled: false
ipam:
mode: kubernetes
ipv4NativeRoutingCIDR: "10.244.0.0/16"
k8sServiceHost: 127.0.0.1
k8sServicePort: 7445
kubeProxyReplacement: true
kubeProxyReplacementHealthzBindAddr: 0.0.0.0:10256
l2announcements:
enabled: true
loadBalancer:
algorithm: maglev
mode: "snat"
localRedirectPolicy: true
operator:
dashboards:
enabled: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
replicas: 2
rollOutPods: true
prometheus:
enabled: true
serviceMonitor:
enabled: true
trustCRDsExist: true
rollOutCiliumPods: true
routingMode: native
securityContext:
capabilities:
ciliumAgent:
- CHOWN
- KILL
- NET_ADMIN
- NET_RAW
- IPC_LOCK
- SYS_ADMIN
- SYS_RESOURCE
- PERFMON
- BPF
- DAC_OVERRIDE
- FOWNER
- SETGID
- SETUID
cleanCiliumState:
- NET_ADMIN
- SYS_ADMIN
- SYS_RESOURCE
socketLB:
hostNamespaceOnly: true
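Finally, on the CNI side I plan to run the cilium-cli connectivity suite, which includes pod-to-pod, pod-to-service, and DNS checks (this assumes the cilium CLI is installed locally and kubeconfig points at the cluster):

```shell
# Sanity-check the Cilium install, then run the end-to-end connectivity tests
cilium status --wait
cilium connectivity test
```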