Istio-ingressgateway and istio-pilot containers stop working on one specific network

Hello,

I’m very new to Kubernetes and Kubeflow, and have been working with a MicroK8s cluster installed on my Ubuntu 20.04 laptop for about a month now. I installed microk8s 1.22/stable with Charmed Kubeflow on top of it. Everything was working fine until, all of a sudden and without any change in settings, two of the pods stopped working: istio-pilot-0 and istio-ingressgateway. I’ve done quite a lot of troubleshooting but haven’t been able to fix the issue yet. Any help or guidance on this would be highly appreciated.

The strange part is that it stops working only while the laptop is connected to my home broadband Wi-Fi; it works fine on any other network. As a workaround I’m currently working over my mobile hotspot, but I would like to fix this for good.

Pasting a section of the error log below, which shows an SSL protocol violation:

2022-12-29T09:09:12.048Z [container-agent] 2022-12-29 09:09:12 ERROR juju-log Uncaught exception while in charm code:
2022-12-29T09:09:12.048Z [container-agent] Traceback (most recent call last):
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/connectionpool.py", line 703, in urlopen
2022-12-29T09:09:12.048Z [container-agent] httplib_response = self._make_request(
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/connectionpool.py", line 386, in _make_request
2022-12-29T09:09:12.048Z [container-agent] self._validate_conn(conn)
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/connectionpool.py", line 1042, in _validate_conn
2022-12-29T09:09:12.048Z [container-agent] conn.connect()
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/connection.py", line 414, in connect
2022-12-29T09:09:12.048Z [container-agent] self.sock = ssl_wrap_socket(
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
2022-12-29T09:09:12.048Z [container-agent] ssl_sock = _ssl_wrap_socket_impl(
2022-12-29T09:09:12.048Z [container-agent] File "/var/lib/juju/agents/unit-istio-pilot-0/charm/venv/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
2022-12-29T09:09:12.048Z [container-agent] return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
2022-12-29T09:09:12.048Z [container-agent] File "/usr/lib/python3.8/ssl.py", line 500, in wrap_socket
2022-12-29T09:09:12.048Z [container-agent] return self.sslsocket_class._create(
2022-12-29T09:09:12.048Z [container-agent] File "/usr/lib/python3.8/ssl.py", line 1040, in _create
2022-12-29T09:09:12.048Z [container-agent] self.do_handshake()
2022-12-29T09:09:12.048Z [container-agent] File "/usr/lib/python3.8/ssl.py", line 1309, in do_handshake
2022-12-29T09:09:12.048Z [container-agent] self._sslobj.do_handshake()
2022-12-29T09:09:12.048Z [container-agent] ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:1131)
2022-12-29T09:09:12.048Z [container-agent]
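
The traceback ends in an SSLEOFError during do_handshake(), so something appears to be closing the connection in the middle of the TLS handshake. Below is a minimal sketch of how the handshake could be tested from the host itself, outside the charm; the host and port are placeholders (I’m not certain which endpoint the charm is actually contacting, it may be the Kubernetes API service).

```python
import socket
import ssl

# Placeholder endpoint: replace with whatever the charm is actually contacting
# (for example the Kubernetes API service IP shown by
# "microk8s kubectl get svc kubernetes").
HOST = "10.152.183.1"  # assumption: MicroK8s default kubernetes service IP
PORT = 443

context = ssl.create_default_context()
# The goal is only to see whether the handshake completes at all,
# so certificate verification is relaxed for this test.
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=HOST) as tls:
            print("Handshake OK:", tls.version(), tls.cipher())
except ssl.SSLEOFError as exc:
    print("Same EOF during handshake as in the charm traceback:", exc)
except (ssl.SSLError, OSError) as exc:
    print("Connection failed for a different reason:", exc)
```

Running this on both the Wi-Fi and the hotspot should at least show whether the failure is reproducible outside the charm.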

I ran microk8s inspect while connected to the Wi-Fi (non-working state) and to the hotspot (working state) and compared the two reports. One particular log (snap.microk8s.daemon-containerd/journal.log) shows a port-forwarding failure only while connected to the Wi-Fi; pasting the error below.

Jan 03 14:10:21 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: E0103 14:10:21.145297 6496 httpstream.go:257] error forwarding port 17070 to pod 9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0, uid : failed to execute portforward in network namespace "/var/run/netns/cni-9a6a040e-d392-f6c9-2bd1-97b16ad44b69": EOF
Jan 03 14:10:25 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:25.012980831+05:30" level=info msg="Portforward for \"9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0\" port []"
Jan 03 14:10:25 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:25.013146528+05:30" level=info msg="Portforward for \"9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0\" returns URL \"http://127.0.0.1:45027/portforward/nkIPm0f1\""
Jan 03 14:10:25 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:25.029844697+05:30" level=info msg="Executing port forwarding in network namespace \"/var/run/netns/cni-9a6a040e-d392-f6c9-2bd1-97b16ad44b69\""
Jan 03 14:10:27 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:27.455110271+05:30" level=info msg="Finish port forwarding for \"9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0\" port 17070"
Jan 03 14:10:30 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:30.455777423+05:30" level=info msg="Portforward for \"9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0\" port []"
Jan 03 14:10:30 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:30.455965625+05:30" level=info msg="Portforward for \"9146e4518f289f7bad79dccd4c0930c44c87bcb3ec74d241c404a07930a70ef0\" returns URL \"http://127.0.0.1:45027/portforward/h3pcltC_\""
Jan 03 14:10:30 ign-blr-lp-0488 microk8s.daemon-containerd[6496]: time="2023-01-03T14:10:30.470547158+05:30" level=info msg="Executing port forwarding in network namespace \"/var/run/netns/cni
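
Comparing the two inspection reports by hand is tedious, so here is a rough sketch of how the port-forward failures could be pulled out of the containerd journal of each report for a side-by-side comparison; the directory names are placeholders for wherever the two tarballs are extracted.

```python
import re
from pathlib import Path

# Placeholder paths: wherever the two "microk8s inspect" tarballs were extracted.
REPORTS = {
    "wifi (not working)": Path("inspection-report-wifi/snap.microk8s.daemon-containerd/journal.log"),
    "hotspot (working)": Path("inspection-report-hotspot/snap.microk8s.daemon-containerd/journal.log"),
}

# Matches lines like: "error forwarding port 17070 to pod 9146e45... : EOF"
PATTERN = re.compile(r"error forwarding port (\d+) to pod (\w+)")

for label, path in REPORTS.items():
    failures = []
    for line in path.read_text(errors="replace").splitlines():
        match = PATTERN.search(line)
        if match:
            failures.append((match.group(1), match.group(2)[:12]))
    print(f"{label}: {len(failures)} port-forward failures")
    for port, pod in failures:
        print(f"  port {port} -> pod {pod}")
```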

Then I also did a describe of the failing pod in the failing state (Wi-Fi) and in the working state (hotspot).
There I see a difference in the cluster IP between the two networks (which is expected), but I’m not sure whether there is a conflict between the IP ranges used by the pods and the router’s network. Is there a way to figure that out? I couldn’t really zero in on the root cause. Initially I suspected a faulty Wi-Fi router and got it replaced, but I’m still in the same state. Please help.
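
Regarding the possible IP range conflict mentioned above, a simple way to check would be to compare the cluster’s pod and service CIDRs against the subnet my router hands out. The values below are only assumptions (the usual MicroK8s defaults plus an example home subnet), not something I’ve confirmed on my setup.

```python
import ipaddress

# Assumed MicroK8s defaults; worth confirming against the actual cluster config.
# The home LAN subnet is just an example and should be replaced with the subnet
# the Wi-Fi router actually hands out.
cluster_ranges = {
    "pod CIDR": ipaddress.ip_network("10.1.0.0/16"),
    "service CIDR": ipaddress.ip_network("10.152.183.0/24"),
}
home_lan = ipaddress.ip_network("192.168.1.0/24")

for name, network in cluster_ranges.items():
    if network.overlaps(home_lan):
        print(f"CONFLICT: {name} {network} overlaps home LAN {home_lan}")
    else:
        print(f"ok: {name} {network} does not overlap {home_lan}")
```

If the Wi-Fi router hands out addresses inside one of the cluster ranges, that overlap would at least explain why only this one network misbehaves.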