Configuring TCP Keepalive

We are running an AKS cluster behind a firewall. The firewall severs inactive TCP connections after a few minutes, so we’d like to modify the default TCP keepalive configuration. (The default for Linux is to wait 2 hours, which is way too long.) We tried to configure net.ipv4.tcp_keepalive_time, etc. on the nodes, but unfortunately Kubernetes ignores this and our pods continue to use the original Linux defaults.

It seems our only option is to use securityContext.sysctls in every pod spec. Is that correct? The TCP keepalive sysctls are not considered “safe”, so this would apparently require passing --allowed-unsafe-sysctls to the kubelet. Are these particular sysctls actually “unsafe”? If so, why? If not, can they be added to the default allowlist?
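
For reference, a minimal sketch of what that per-pod configuration might look like. The pod name, image, and values are made up for illustration, and the pod will only start if the kubelet on its node has these sysctl names in --allowed-unsafe-sysctls.

```yaml
# Hypothetical pod: name, image, and values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: keepalive-example
spec:
  securityContext:
    sysctls:
    # Sysctl values must be quoted strings.
    - name: net.ipv4.tcp_keepalive_time    # idle seconds before the first probe
      value: "240"
    - name: net.ipv4.tcp_keepalive_intvl   # seconds between probes
      value: "10"
    - name: net.ipv4.tcp_keepalive_probes  # unanswered probes before the connection is dropped
      value: "3"
  containers:
  - name: app
    image: alpine:3.15
    command: ["sleep", "3600"]
```

The sysctls are set at pod level, so every container in the pod shares them.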

Note: I know we can also configure TCP keepalive in the application itself via socket options. However, some third-party libraries/applications (e.g., boto3) do not offer any way to set these, so changing the system defaults is the only option for us.

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

Bumping this. Does anyone have an answer?

Did you ever find a solution for this issue?

No, we did not.

A really ugly workaround, which may not help everyone, is to run an initContainer with privileged: true that sets these sysctls.

      initContainers:
      - name: sysctls
        image: alpine
        command:
        - "sh"
        - "-c"
        - |-
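          set -x
          # Write the keepalive settings through /proc/sys
          cd /proc/sys/net/ipv4
          echo 240 > tcp_keepalive_time    # idle seconds before the first probe
          echo 3   > tcp_keepalive_probes  # unanswered probes before the connection is dropped
          echo 10  > tcp_keepalive_intvl   # seconds between probes
          cat tcp_keepalive*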
        securityContext:
          privileged: true
          runAsUser: 0

… or alternatively, using hostPath (if your PSP allows hostPath but not privileged: true):

      volumes:
      - name: proc-sys-net-ipv4
        hostPath:
          path: "/proc/sys/net/ipv4"
      initContainers:
      - name: sysctls
        image: alpine:3.15
        command:
        - "sh"
        - "-c"
        - |-
          set -x
          cd /mnt/proc-sys-net-ipv4
          echo 240 > tcp_keepalive_time
          echo 3   > tcp_keepalive_probes
          echo 10  > tcp_keepalive_intvl
          cat tcp_keepalive*
        volumeMounts:
        - name: proc-sys-net-ipv4
          mountPath: /mnt/proc-sys-net-ipv4
        securityContext:
          runAsUser: 0

I do not understand why the people who decided the “safe” sysctls consider these 3 flags “unsafe”. It’s very easy to verify that they are network-namespace-scoped with a simple unshare --map-root-user --net test on a Linux VM.

Everything you can accomplish with these flags you can also accomplish with setsockopt syscalls from an unprivileged program (TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT on Linux), but many libraries do not nicely expose the kernel API due to lazy implementation. Hence the need to set the defaults in the pod's network namespace instead.

I am guessing the people who implemented sysctls for Kubernetes did not study each net.ipv4.* flag individually; they just picked a few that the stakeholders of the feature wanted and left the rest in the dust as “unsafe”.