Design compatibility between `kubectl debug` and Pod Security Standards

At $WORK*, we are in the process of upgrading our k8s clusters up to and past 1.23, which brings in two exciting new features: ephemeral containers and pod security standards. Unfortunately, it seems that the two features won’t work well with each other as designed. This is explicit in the design of PSS:

“Ephemeral containers will be subject to the same policy restrictions, and adding or updating ephemeral containers will require a full policy check” (KEP-2579)

I’d like to lay out the following points, showing why this situation does not work well for us:

  • Problems solved by PSS (for us)
  • Problems solved by ephemeral containers
  • Our desired security policy for containers (regular and ephemeral)

Then, I’d like to propose a modification to the PSS specification to allow us to achieve our goals.

Problems solved by PSS

Our current mechanism for container security enforcement uses a third-party webhook provider (concretely Gatekeeper) and a self-curated list of policies to enforce. We developed this solution during the interregnum between PSP and PSS. However, it has the following drawbacks:

  • the policy selection is ad-hoc and it’s more difficult to communicate with our security teams and auditors (versus using an off-the-shelf policy collection such as PSS restricted profile)
  • the policy engine is used to enforce a variety of other (non-security) policies as well, and due to reliability concerns we have not been able to enable the deny-on-fail mode for its webhook (which we would be able to do for a single-purpose component like the PSA webhook)
  • we believe that a core Kubernetes component is likely to be more stable in the long term, in terms not only of operational stability but also of the API, support lifecycle, etc.

For these reasons, we’d keenly like to migrate our security policy enforcement to PSS.

Problems solved by ephemeral containers

Our clusters are large and multi-tenanted, and only a small proportion of the developers with access to a cluster are trusted with admin-level access. Each non-admin is trusted only in a small slice of the cluster. We’d like to allow developers to run debugging workloads (tcpdump, strace, …) in their slice of the cluster on an as-needed basis. To accomplish this in a pre-ephemeral world, our developers have needed to deploy a privileged sidecar container, with the debugging binaries in it, alongside their app. This approach has some limitations: our security teams are (rightly) nervous about privileged containers running in prod environments on a permanent basis, and the alternative (redeploying an app with the sidecar enabled only when there’s a live problem to debug) makes it harder to avoid disruption during a live incident.

Therefore, we’d also like to be able to provide the ability to use ephemeral containers to our platform users for debugging.

Desired security policy for containers

Our ultimate desired configuration would be to apply the restricted PSS to all workload containers (i.e. non-ephemeral ones), using the PSA controller (for simplicity, clarity, and better community support/integration). We’d continue to use Gatekeeper or a similar third-party solution to enforce policies around ephemeral containers (basically restricting them to one of the kubectl debug profiles, and not the sysadmin one). Other types of security boundaries would also contribute to achieving our organization’s overall security goals (for instance, audit logging of the creation of privileged ephemeral containers at the k8s API level, and the use of endpoint security tools such as CrowdStrike to detect suspicious activity in such containers).
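To make the intended split concrete, here is a minimal Python sketch of it. It is not how the PSA controller or Gatekeeper is implemented. The function names are hypothetical, and the checks cover only a small, illustrative subset of the restricted profile. Workload containers (regular and init) get the restricted checks, while ephemeral containers are only collected so that a separate policy engine can judge them.

```python
from typing import Any, Dict, List, Tuple


def restricted_violations(container: Dict[str, Any]) -> List[str]:
    """Check an illustrative SUBSET of the restricted profile (not the full list)."""
    sc = container.get("securityContext") or {}
    problems = []
    if sc.get("privileged"):
        problems.append("privileged must not be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    caps = sc.get("capabilities") or {}
    if "ALL" not in (caps.get("drop") or []):
        problems.append("capabilities.drop must include ALL")
    return problems


def check_pod(pod_spec: Dict[str, Any]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Apply the restricted checks to workload containers only.

    Returns (violations by container name, names of ephemeral containers
    deferred to a separate policy engine). Hypothetical helper for this post.
    """
    workload = (pod_spec.get("containers") or []) + (pod_spec.get("initContainers") or [])
    violations = {}
    for container in workload:
        problems = restricted_violations(container)
        if problems:
            violations[container["name"]] = problems
    # Ephemeral containers are deliberately not checked here.
    deferred = [c["name"] for c in pod_spec.get("ephemeralContainers") or []]
    return violations, deferred


pod = {
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
    "ephemeralContainers": [
        {"name": "debugger", "securityContext": {"privileged": True}},
    ],
}
violations, deferred = check_pod(pod)
```

In this sketch the privileged debugger does not cause the pod to be rejected by the restricted checks; whether it is allowed at all would be decided by whatever policy handles the deferred list.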

Suggested modification

I propose allowing three additional settings for the PSA controller: ephemeral-enforce, ephemeral-audit, and ephemeral-warn. These would be accepted both in the controller configuration file and as namespace labels, alongside the existing enforce, audit, and warn settings. If present, each would define an alternate PSS level to be applied to ephemeral containers. If absent, the general level (enforce etc.) would apply. (There’s an open question of what to do when a namespace has a general label like enforce but no ephemeral-enforce, while ephemeral-enforce is present in the config file. I haven’t fully thought this through, but the least surprising behaviour is probably to stop at the namespace level and use the value from the general label for an ephemeral container.)
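The fallback order described above could be sketched roughly as follows. This is only an illustration of the proposal: the ephemeral-* modes do not exist in Kubernetes today, the function name is hypothetical, the keys are shortened to the bare mode names used in this post, and the final "privileged" fallback is an assumption about what applies when nothing is configured.

```python
from typing import Mapping


def resolve_level(
    mode: str,
    is_ephemeral: bool,
    namespace_labels: Mapping[str, str],
    config_defaults: Mapping[str, str],
) -> str:
    """Return the PSS level for `mode` ("enforce", "audit" or "warn")."""
    ephemeral_mode = f"ephemeral-{mode}"  # proposed setting, hypothetical

    # 1. An ephemeral-specific label on the namespace wins for ephemeral containers.
    if is_ephemeral and ephemeral_mode in namespace_labels:
        return namespace_labels[ephemeral_mode]

    # 2. Otherwise stop at the namespace level if a general label exists,
    #    even when the config file carries an ephemeral-specific default.
    if mode in namespace_labels:
        return namespace_labels[mode]

    # 3. No namespace labels at all: fall back to the config file,
    #    preferring the ephemeral-specific default for ephemeral containers.
    if is_ephemeral and ephemeral_mode in config_defaults:
        return config_defaults[ephemeral_mode]

    # 4. Assumption: "privileged" applies when nothing is configured.
    return config_defaults.get(mode, "privileged")


labels = {"enforce": "restricted", "ephemeral-enforce": "privileged"}
workload_level = resolve_level("enforce", False, labels, {})
debug_level = resolve_level("enforce", True, labels, {})
```

With these labels, regular containers in the namespace would be held to restricted while ephemeral debug containers would be evaluated against the alternate level. Step 2 encodes the "least surprising" choice from the open question; swapping steps 2 and 3 would give the other possible behaviour.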

This modifies the stipulation from KEP-2579 quoted above, but it would give us a way to reap the benefit of the PSS/PSA initiative while also being able to fully utilize ephemeral containers for debugging (including with privileged binaries such as strace). The status quo effectively forces us to choose one feature or the other.

I’m posting this topic here for initial feedback from the community on this idea, and whether it could prove viable as a modification to the existing PSS design (I guess through a KEP, although any guidance about how to carry this forward would be gratefully received!) Thanks very much. :slight_smile:

*all views in this post are my own and not endorsed by / should not be taken to reflect on my employer

@aecay thanks for the write up, honestly one of the more detailed feature requests I’ve seen :slight_smile:

I don’t want to just point you elsewhere, but your best bet for next steps would be to send your request to the SIG Auth mailing list and join their slack channel to discuss / potentially put it on their next meeting agenda. This forum isn’t monitored by many maintainers…mostly just me and @thockin ^^;;;