Kubernetes version: v1.21.14 (I know it’s quite outdated; an upgrade is in progress, but I need to maintain it for now)
Cloud being used: AWS
Installation method: kOps
Host OS: Linux - Ubuntu
CNI and version: “0.3.0” (weaveworks/weave-npc:2.8.1)
CRI and version: containerd://1.4.9
Recently, my cluster ran into an issue where 5 of the 6 target instances registered with the ingress-nginx (helm.sh/chart=ingress-nginx-4.9.0) Classic ELB went ‘Out of Service’. I’ve checked that all the Pods are in the Running state and all the Nodes are Ready, and there is no OOM or CPU exhaustion issue.
The ingress-nginx Classic ELB was listening on 443:30443 (randomly assigned NodePort), and the health check is the default TCP ping. When I SSHed into one of the Nodes, I wasn’t able to telnet to port 30080, and I also couldn’t find any process listening on that port (as described in the AWS ELB health check troubleshooting doc).
I’ve tried deregistering and then re-registering the ‘Out of Service’ Nodes; it didn’t fix the issue. I’ve tried terminating those Nodes; it didn’t fix it either. The final fix was to manually remove the ingress-nginx Classic ELB and recreate a new one. However, I couldn’t find the root cause of this issue, so I can’t prevent it from happening again. Since k8s is the one maintaining the LoadBalancer-type Service and mapping all the ports to Nodes, I’m looking for documentation describing how k8s creates/updates/maintains an AWS Classic ELB, especially how the LoadBalancer/NodePort mapping works.
[Google DeepMind LLM Assist]
In a Kubernetes cluster running on AWS, the interaction between Kubernetes Services of type LoadBalancer and AWS Elastic Load Balancers (ELBs) is handled by the Kubernetes cloud provider integration for AWS. This integration automates the creation, update, and deletion of AWS ELBs when a Kubernetes Service of type LoadBalancer is deployed. Here’s an overview of how this process works, particularly for Kubernetes v1.21.14, which you mentioned you’re currently running:
- Service Creation: When you create a Kubernetes Service of type LoadBalancer, Kubernetes communicates with the AWS API to create an ELB in your AWS account. The specifics of the ELB (such as whether it’s a Classic ELB, an Application Load Balancer (ALB), or a Network Load Balancer (NLB)) depend on annotations within the Service definition or the default configuration of your Kubernetes cluster.
- Node Registration: Kubernetes automatically registers all nodes within the cluster as targets in the ELB. This means that the ELB will forward traffic to these nodes on the specified NodePort(s). Your case involves a Classic ELB, which typically uses instance-based targeting.
- Health Checks: The ELB performs health checks on the nodes to ensure they can properly handle the incoming traffic. The default health check is a TCP ping, as you mentioned. If a health check fails, the ELB marks the instance as “Out of Service,” which seems to be the issue you encountered.
- Port Mapping: For Services of type LoadBalancer, Kubernetes allocates a NodePort for each service port (in your case, 30080 and 30443) and forwards traffic arriving on that port of every node to the service’s pods. This is done using iptables or IPVS rules programmed by kube-proxy, depending on the cluster’s configuration. These ports must be open and accessible for the ELB to successfully route traffic to the nodes.
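A consequence of the iptables approach worth noting: because kube-proxy programs NAT rules rather than binding a socket, the NodePort will not appear in netstat or lsof output even when it is reachable. A minimal sketch of how to verify this on a node, assuming kube-proxy is in iptables mode (the default) and using the 30443 port from the thread:

```shell
# kube-proxy (iptables mode) programs NAT rules instead of binding a socket,
# so the NodePort never appears in netstat/lsof. The rules are visible here:
sudo iptables-save -t nat | grep 30443

# The KUBE-NODEPORTS chain matches traffic arriving on the NodePort and
# DNATs it to the backing pod IP (the ingress-nginx-controller pod here):
sudo iptables -t nat -L KUBE-NODEPORTS -n
```

If the grep returns nothing on an “Out of Service” node, kube-proxy has not programmed (or has lost) the rule for that port, which would explain the failed TCP health check.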
Given the issue you described, where nodes were marked as “Out of Service” by the ELB despite the pods and nodes being healthy, it’s possible that the problem lies in the network path between the ELB and the nodes, specifically with the NodePort mappings or the security groups/firewall rules allowing traffic on those ports.
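To see the concrete port mapping Kubernetes is maintaining for the Service, kubectl can print it directly. A sketch, assuming the default namespace and Service name from a standard ingress-nginx Helm install (adjust both to your setup):

```shell
# List each service port alongside the NodePort the ELB registers as the
# instance port. The namespace/name below are assumptions from a default
# ingress-nginx Helm install.
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{" -> "}{.nodePort}{"\n"}{end}'
```

The NodePort values shown here should match the instance ports configured on the Classic ELB’s listeners; a mismatch after a Service update is one way nodes end up failing health checks.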
To troubleshoot and prevent this issue from recurring, consider the following steps:
- Security Groups: Ensure that the security groups associated with your nodes allow inbound traffic on the NodePort ranges (by default, 30000-32767) from the ELB.
- Node Health: Verify that the nodes are configured to respond to the ELB’s health check requests on the specified NodePort. This could involve ensuring that the kube-proxy component is running correctly and that there are no network policies or firewall rules blocking the health check traffic.
- ELB Configuration: Check the ELB’s health check configuration to ensure it matches the expectations for your services. For example, if your service is HTTP-based, consider using an HTTP health check instead of a TCP ping.
- Kubernetes Network Configuration: Investigate the network configuration within Kubernetes, including the CNI plugin (Weave Net in your case) and kube-proxy, to ensure they are correctly routing and forwarding traffic.
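One quick way to exercise the second point is to emulate the ELB’s TCP health check yourself: it is nothing more than a TCP connect to the instance port. A minimal sketch in plain bash (host and port are placeholders; the NodePort value is from the thread):

```shell
# Emulate the ELB's default TCP health check: a plain TCP connect to the
# instance port (the NodePort). Only the connect's exit status matters.
probe() {
  # $1 = host, $2 = port; /dev/tcp is a bash pseudo-device
  if timeout 3 bash -c "</dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

# Usage against a registered node (NodePort number is from the thread):
#   probe <node-ip> 32516
```

Running this from another node inside the VPC (rather than from your workstation) keeps security-group effects out of the picture, so a “closed” result points at kube-proxy or iptables on the node itself.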
For documentation specific to Kubernetes’ AWS cloud provider integration, the Kubernetes official documentation is a good starting point, though it may not provide in-depth details for each Kubernetes version. The AWS documentation on Elastic Load Balancing also provides valuable insights into ELB’s behavior and configuration options.
Since Kubernetes and its ecosystem, including cloud provider integrations, are open source, the actual source code for the AWS cloud provider in Kubernetes can also serve as a form of documentation. This code is available in the Kubernetes GitHub repository, under the legacy-cloud-providers/aws directory for Kubernetes versions prior to the extraction of cloud providers into external components.
Given the complexity of this issue and the potential for subtle configuration nuances, you might also consider engaging with the Kubernetes community, such as the Kubernetes Slack channels or forums, where you can share details about your setup and get advice from experts who have faced similar challenges.
Hi @timwolfe94022 ,
One thing that confuses me the most: when the AWS ELB is created by the ingress-nginx deployment, the Service is configured as:
Port: http 80/TCP
NodePort: http 32516/TCP #<- wasn't able to telnet when node 'out of service'
Endpoints: 100.124.0.22:80 #<- ingress-nginx-controller pod IP
Port: https 443/TCP
NodePort: https 32451/TCP
When a node is in service, I’m able to telnet to every registered node on port 32516; when a node is out of service, I’m not. Yet port 32516 never shows up in netstat or lsof, whether the node is in service or out of service. I don’t understand why the port is open and reachable via telnet when in service even though I can’t find any process listening on it, which makes it hard to trace down the root cause.
Also, which part of k8s controls opening the port?