We are running an AKS cluster behind a firewall. The firewall severs inactive TCP connections after a few minutes, so we’d like to modify the default TCP keepalive configuration. (The default for Linux is to wait 2 hours, which is way too long.) We tried to configure net.ipv4.tcp_keepalive_time, etc. on the nodes, but unfortunately Kubernetes ignores this and our pods continue to use the original Linux defaults.
It seems our only option is to use securityContext.sysctls in every pod spec. Is that correct? Unfortunately, the TCP keepalive sysctls are not considered “safe”, so it seems this would require passing --allowed-unsafe-sysctls to kubelet. Are these particular sysctls actually “unsafe”? If so, why? If not, can they be added to the default allowlist?
Note: I know we can also configure TCP keepalive in the application itself via socket options. However, some third-party libraries/applications (e.g., boto3) do not offer any way to set these, so changing the system defaults is the only option for us.
https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
Bumping this. Does anyone have an answer?
Did you ever find a solution for this issue?
The really ugly workaround, which may not help everyone, is to run a privileged: true initContainer that sets these flags.
initContainers:
  - name: sysctls
    image: alpine
    command:
      - "sh"
      - "-c"
      - |-
        set -x
        cd /proc/sys/net/ipv4
        echo 240 > tcp_keepalive_time
        echo 3 > tcp_keepalive_probes
        echo 10 > tcp_keepalive_intvl
        cat tcp_keepalive*
    securityContext:
      privileged: true
      runAsUser: 0
… or alternatively, using hostPath (if your PSP allows hostPath but not privileged: true):
volumes:
  - name: proc-sys-net-ipv4
    hostPath:
      path: "/proc/sys/net/ipv4"
initContainers:
  - name: sysctls
    image: alpine:3.15
    command:
      - "sh"
      - "-c"
      - |-
        set -x
        cd /mnt/proc-sys-net-ipv4
        echo 240 > tcp_keepalive_time
        echo 3 > tcp_keepalive_probes
        cat tcp_keepalive*
    volumeMounts:
      - name: proc-sys-net-ipv4
        mountPath: /mnt/proc-sys-net-ipv4
        readOnly: false  # readOnly is a volumeMounts field; hostPath itself has no such field
    securityContext:
      runAsUser: 0
I do not understand why the people who decided the “safe” sysctls consider these 3 flags “unsafe”. It’s very easy to verify that they are network-namespace-scoped with a simple unshare --map-root-user --net test on a Linux VM.
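To check which values a given network namespace actually sees (for example from inside a pod, or inside the unshare'd shell), something like the following rough sketch could be run there. The function names are made up for illustration, and the "time until a dead peer is dropped" figure is the usual idle time + interval × probes estimate, not something measured.

```python
from pathlib import Path

# The three keepalive sysctls discussed in this thread.
KEEPALIVE_KEYS = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive_sysctls(base="/proc/sys/net/ipv4"):
    """Return the keepalive sysctls visible from the current network namespace.

    `base` is a parameter only so the function can be pointed at another
    directory (e.g. a hostPath mount); the helper itself is hypothetical.
    """
    return {key: int((Path(base) / key).read_text().strip()) for key in KEEPALIVE_KEYS}


def seconds_until_dead_peer_dropped(values):
    """Idle time before the first probe plus all unanswered probe intervals."""
    return (
        values["tcp_keepalive_time"]
        + values["tcp_keepalive_intvl"] * values["tcp_keepalive_probes"]
    )


if __name__ == "__main__":
    current = read_keepalive_sysctls()
    for key, value in current.items():
        print(f"net.ipv4.{key} = {value}")
    print("dead peer dropped after ~", seconds_until_dead_peer_dropped(current), "seconds")
```

If the pod still prints 7200 for tcp_keepalive_time after the node was reconfigured, the pod's namespace did not pick up the node-level setting.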
Everything you can accomplish with these flags you can also accomplish with setsockopt syscalls from an unprivileged program, but many libraries do not nicely expose the kernel API due to lazy implementation (hence the need to be able to set this in the network namespace of the kernel instead).
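For code where you do control the socket, the per-socket equivalent looks roughly like this in Python. This is a minimal sketch, assuming Linux (the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT constants are not available on every platform); enable_keepalive is a hypothetical helper name, and the values just mirror the ones used in the initContainer above. Note that the sysctls only supply defaults for sockets that have SO_KEEPALIVE turned on, so the first call matters either way.

```python
import socket


def enable_keepalive(sock, idle=240, interval=10, probes=3):
    """Turn on TCP keepalive for one socket and override the system defaults.

    Hypothetical helper; the defaults mirror the sysctl values from this thread.
    """
    # Keepalive is off unless the application asks for it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket counterpart of net.ipv4.tcp_keepalive_time
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Per-socket counterpart of net.ipv4.tcp_keepalive_intvl
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Per-socket counterpart of net.ipv4.tcp_keepalive_probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    # No connection is needed to set or read back the options.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        print("idle:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE))
        print("interval:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL))
        print("probes:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT))
```

None of this needs any privileges, which is the point: the same three knobs are available to any process, just per socket instead of per namespace.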
I am guessing the people who implemented sysctls for Kubernetes did not do a study of each net.ipv4.* flag individually, just picked a few that the stakeholders of the feature wanted and left the rest in the dust as “unsafe”.