Cluster API: Extension mechanism workstream


Welcome to the extension mechanism workstream! This workstream is about how cluster-api providers interact with cluster-api. Longer-form async discussions will be held in this thread.

Some useful links:

Always feel free to reach out on the k8s slack (@cha).


Vote for possible meeting times here: https://doodle.com/poll/uix76acth4hvnfcq


Looks like the highest-voted slot is April 16th at 1:00 PM New York (10:00 AM San Francisco, 6:00 PM London).

I’ll be sending out a meeting invite later today.

Thanks for coming to the meeting today everyone!

Here is a summary of the meeting (copy-and-pasted notes).

Summary

Agenda

  • [cha] Introduction/Kickoff
    • What is this workstream?
      • How does the Cluster API interact with Cluster API providers?
    • What will this workstream produce?
      • Proposal for how cluster-api and cluster-api-providers should interact
    • What do we have today?
    • Problems
      • Can’t ask the system to tell us what providers are running/installed; we have to detect them manually
      • No type safety upon Cluster/Machine create
      • Fairly annoying to program with; lots of deserialize/reserialize. It can be worked around, though, so it’s not that bad.
  • Existing discussions around extension mechanisms:
    • Webhooks
    • gRPC
      • Proposal, example implementation
      • Each provider implements CreateMachine, DeleteMachine, and ListMachines (see the interface sketch after these notes)
      • Pros: the entire Machine controller is shared; providers don’t need to vendor the entire cluster-api project and don’t have to create/maintain their own controllers.
    • Provider-specific CRDs with a common CRD to link them to the cluster-api core (dhellmann needs to write this up)
    • [pablochacin] Use an object reference to the provider specific CRD which has a reconcile loop in the provider (draft proposal in Issue #833)
    • Golang plugins
    • Code-generator for a “framework” that someone can use to plug in the relevant calls. This would allow each provider to easily choose the implementation details.
  • Use Cases for Cluster API -> Pros & Cons for extension mechanisms
  • [michaelgugino] Suggests reconciliation loop at provider level only
    • [michaelgugino] The rpc/webhook model requires implementing control loops in all providers in addition to the control loop at the top level. More duplication, IMO.
    • [jasondetiberus] Keeping reconciliation at the top-level is better for bubbling up status
    • [dhellmann] creating a proxy grpc/rest layer feels like overkill for just translating an imperative request from one format to another. Most providers already have some sort of client that is going to use the network to communicate with the provider infrastructure.
  • [jasondetiberus] we may end up deciding that different extension mechanisms make sense for different use cases
    • [justinsb] also in favor of the possibility of many approaches if it makes things fun
  • [pablochacin] What problem are we trying to solve? We’re talking about opaque data and implementation details, but we should take a step back and talk about what we want the solutions to be able to do.
  • Goals:
    • Cluster API

    • Infrastructure Providers

    • Tooling / UX

    • Enable out of tree development; make it easy

      • [dwat] Add-ons as operator / webhooks / grpc -> all allow out of tree development, but maybe one is easier or more familiar than others. Emphasis on easier.
    • Decoupling providers from upstream

      • [ilya] Providers are already out-of-tree, but they have to vendor Cluster API and are closely coupled to changes.
        • [michaelgugino] We should assess what couplings are pain-points for this model and adapt for easier reusability.
      • [jasondetiberus] You end up deploying vendored CRDs alongside the upstream Cluster API CRDs, and trying to deploy providers alongside each other makes it very easy for them to break each other (depending on which CRD version gets deployed last).
      • [ilya] upstream and providers need to be able to move at their own pace, allowing either one to move ahead of the other (in terms of feature support)
    • Reduce code duplication/increase code reuse across infrastructure providers

    • Make it easier to discover which providers are being used without looking inside the provider-specific data in Cluster API objects [pablochacin]

      • [michaelgugino] We could achieve this easily by adding a ‘ProviderName’ field to the existing model.
    • Allow multiple providers to run at the same time [loc]

    • [pablo] We should also consider that clusterctl should not be forced into a different binary build for each provider, since the Cluster API should support multiple providers.

    • [detiber] Users should be able to introspect high-level information about their deployed clusters without having to traverse into provider-specific CRDs.

      • [michaelgugino] I think this goal needs some specific examples and more enumeration.
    • Prevent version mismatch between Cluster API and providers [loc]

      • [michaelgugino] This is an artifact of using a webhook/RPC extension mechanism; it would not be a problem for provider-as-reconciler
    • [michaelgugino] We should treat the reusable bits of code as a library, and move them out of the repo with the bits that are meant to be the single implementation (machine set, etc.). [dhellmann +1]

    • Provide status feedback to users (e.g. invalid provider config) [loc]

  • Send out doc for proposals [cha]
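
To make the gRPC option in the notes a bit more concrete, here is a minimal sketch of the kind of interface each provider would implement and serve. The method names come straight from the notes; the Machine type and the exact signatures below are placeholders for illustration, not an agreed API.

```go
// Sketch only: a provider-facing machine service along the lines of the
// gRPC option above. Types and signatures are illustrative placeholders.
package extension

import "context"

// Machine stands in for whatever serialized form of the Cluster API
// Machine object the shared controller hands to a provider.
type Machine struct {
	Name           string
	ClusterName    string
	ProviderConfig []byte // opaque, provider-specific payload
}

// MachineProvider is the contract each provider would expose (e.g. over
// gRPC). The shared Machine controller in cluster-api calls it, so
// providers never have to vendor or reimplement the controller itself.
type MachineProvider interface {
	CreateMachine(ctx context.Context, machine *Machine) error
	DeleteMachine(ctx context.Context, machine *Machine) error
	ListMachines(ctx context.Context, clusterName string) ([]*Machine, error)
}
```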

Beyond that, I’ve created a document for folks to help flesh out proposals. I’ve added some relevant requirements and use-cases, but they are by no means authoritative.

https://docs.google.com/document/d/1oS14nPniZg_r3-vmgydFKri6sx_590XQVxH4Yu4wNt0/edit#heading=h.qvzy2dqdm9oj

Along with the recording


Some extension-mechanism problems I’ve been thinking about this week:

  • Can we leverage a mechanism to remove the need for providers to publish their own clusterctl?
  • How will providers register themselves with cluster-api so we can ask “what providers does this cluster support?”
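
On the second question, one hypothetical shape for this is a small registration object that each provider creates when it is installed, so asking “what providers does this cluster support?” becomes a plain list call. The ProviderRegistration type and every field below are illustrative only, not something that has been proposed or agreed:

```go
// Sketch only: a hypothetical ProviderRegistration CRD that a provider
// would create at install time. Nothing here is an agreed design.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ProviderRegistrationSpec describes one installed provider.
type ProviderRegistrationSpec struct {
	// ProviderName identifies the provider, e.g. "aws", "gcp", "openstack".
	ProviderName string `json:"providerName"`
	// Version of the provider controller that is currently running.
	Version string `json:"version"`
	// Capabilities the provider supports, e.g. "machine", "cluster".
	Capabilities []string `json:"capabilities,omitempty"`
}

// ProviderRegistration is the registration object itself.
type ProviderRegistration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ProviderRegistrationSpec `json:"spec"`
}
```

With something like that in place, kubectl get providerregistrations (or clusterctl querying the same API) would answer the question without inspecting provider-specific data.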

I’d love other folks to weigh in and check out the proposal doc

Can we leverage a mechanism to remove the need for providers to publish their own clusterctl?

Is there a model where we can move the pivoting logic outside of clusterctl? The main reason clusterctl gets its own per-provider compilation right now is to get the kubeconfig via a provider-specific mechanism.

Per-provider clusterctl is a bad UX, imho. There are designs we can use that allow us to have a single clusterctl (assuming we still want to keep it) that works with all infrastructure providers. I’m not sure that moving pivoting logic out of clusterctl is the solution, though… why do you think that’s necessary?


I think both of these are very important.


I would second that. In my mind it would be about making initial cluster bootstrap/tear-down the responsibility of a provider-specific tool and making the rest the responsibility of kubectl. If we think a common CLI is needed, we can consider kubectl plugins for that and discuss what common flags most providers can support. But that would only be needed for bootstrap/tear-down of the initial cluster (and in the single-cluster use case, of course, that would be about the single cluster).

I think a webhook would be pretty good at solving this problem. If clusterctl made a call to /getKubeconfig then we could still keep the pivoting logic in clusterctl but remove the need to register a provider using Go code; we could register it through some other mechanism (a CRD, perhaps) instead. This would make clusterctl reusable.
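
A minimal sketch of what that provider-side endpoint might look like follows; the /getKubeconfig path, the query parameter, and the response shape are all just illustrative, not an agreed interface:

```go
// Sketch only: a provider-side endpoint a webhook-style mechanism could
// expose. The /getKubeconfig path and response shape are illustrative.
package main

import (
	"log"
	"net/http"
)

// getKubeconfig would contain the provider-specific logic for fetching
// the kubeconfig of a newly created cluster (e.g. from the cloud API).
func getKubeconfig(w http.ResponseWriter, r *http.Request) {
	clusterName := r.URL.Query().Get("cluster")
	if clusterName == "" {
		http.Error(w, "missing cluster parameter", http.StatusBadRequest)
		return
	}
	// Placeholder: a real provider would look up the cluster and return
	// its kubeconfig here.
	w.Header().Set("Content-Type", "text/plain")
	w.Write([]byte("# kubeconfig for " + clusterName + " would go here\n"))
}

func main() {
	http.HandleFunc("/getKubeconfig", getKubeconfig)
	log.Fatal(http.ListenAndServe(":8443", nil))
}
```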

This sounds interesting, but I want to make sure I’m following. When you say initial cluster do you mean the bootstrap cluster or the first management cluster? I think an example might help. In your mind what can/should clusterctl do and what would it look like and what might a kubectl plugin do and look like?

I think a webhook would be pretty good at solving this problem. If clusterctl made a call to /getKubeconfig then we could still keep the pivoting logic in clusterctl…

I wouldn’t expect an external tool such as clusterctl to be able to invoke in-cluster webhooks.

I wouldn’t expect an external tool such as clusterctl to be able to invoke in-cluster webhooks.

Yup, agreed. clusterctl might be able to do some interaction with the API server that, underneath, relies on a controller going out to a webhook, but the CLI itself cannot call the webhook.

ok, fair point. I wonder if clusterctl has too big of a scope, but maybe that’s what @errordeveloper was getting at

I do agree that clusterctl shouldn’t be able to call directly out to the webhook. That said, if part of the Cluster and/or ControlPlane data model is to expose a kubeconfig for the created cluster in some way, then there would be no need for a webhook to get the kubeconfig for the cluster and clusterctl can just wait for the kubeconfig to be available from the data model.
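
As a rough illustration of that last point: if providers published the kubeconfig through the data model, say as a Secret named <cluster-name>-kubeconfig (that naming and the "value" data key are purely assumptions for this sketch, not something that has been decided), clusterctl could simply wait for it:

```go
// Sketch only: clusterctl waiting for a kubeconfig surfaced through the
// data model rather than calling a provider-specific hook. The Secret
// name "<cluster>-kubeconfig" and data key "value" are assumptions.
package clusterctl

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// WaitForKubeconfig polls the management cluster until the provider has
// published the workload cluster's kubeconfig, then returns its contents.
func WaitForKubeconfig(ctx context.Context, c kubernetes.Interface, namespace, clusterName string) ([]byte, error) {
	secretName := clusterName + "-kubeconfig"
	var kubeconfig []byte
	err := wait.PollImmediate(10*time.Second, 20*time.Minute, func() (bool, error) {
		s, err := c.CoreV1().Secrets(namespace).Get(ctx, secretName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // not published yet; keep waiting
		}
		if err != nil {
			return false, err
		}
		kubeconfig = s.Data["value"]
		return len(kubeconfig) > 0, nil
	})
	if err != nil {
		return nil, fmt.Errorf("kubeconfig for cluster %q not available: %v", clusterName, err)
	}
	return kubeconfig, nil
}
```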

When you say initial cluster do you mean the bootstrap cluster or the first management cluster?

I mean the management cluster, which may or may not have other clusters. In my mind, the idea of a bootstrap cluster is very much an implementation detail to some particular providers.

Say I want to create an EKS cluster. I could use eksctl create cluster --config-file=cluster.yaml to do it, and next I could launch any number of controllers inside of that cluster that let me create EKS or really any kind of cluster. At that point I have a controller and I can launch new clusters by creating objects via the Kubernetes API (e.g. kubectl).

So that first management cluster requires some kind of provider-aware thing; in this example it’s eksctl. The EKS provider controller (or actuator, if you will) would probably reuse code, but that’s a separate matter. To begin with, there must exist a program that the user starts up that has provider-specific logic. That program is also needed to destroy the cluster.
In other words, we can only have a generic tool once we have a cluster with a Kubernetes API and controllers that have provider-specific logic. Before or after that, a tool must be provider-specific.

Yes, I do think we should take provider-specific logic for creating the management cluster (“from scratch”) out of scope.

It may, at a later point, make sense to discuss whether there is a common pattern of CLIs that we should agree on (e.g. something like cluster-api-<provider> [create|delete] --name=<name> --config=<path>), but I think that’d be a separate conversation.
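
For what it’s worth, here is the kind of skeleton that pattern implies. The binary name, flags, and behavior below are purely illustrative, not a proposal:

```go
// Sketch only: a hypothetical per-provider CLI following the
// cluster-api-<provider> [create|delete] --name=<name> --config=<path>
// shape mentioned above. All names and flags are illustrative.
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

func main() {
	var name, config string

	root := &cobra.Command{Use: "cluster-api-example"}
	root.PersistentFlags().StringVar(&name, "name", "", "name of the management cluster")
	root.PersistentFlags().StringVar(&config, "config", "", "path to the provider config file")

	root.AddCommand(
		&cobra.Command{
			Use: "create",
			RunE: func(cmd *cobra.Command, args []string) error {
				// Provider-specific bootstrap logic would go here.
				fmt.Printf("creating cluster %q from %q\n", name, config)
				return nil
			},
		},
		&cobra.Command{
			Use: "delete",
			RunE: func(cmd *cobra.Command, args []string) error {
				// Provider-specific tear-down logic would go here.
				fmt.Printf("deleting cluster %q\n", name)
				return nil
			},
		},
	)

	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```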


I don’t necessarily agree that clusterctl should be completely out of scope. Not all providers will want to write their own bootstrapping logic. For the case of Machine-based implementations, there is a general common pattern that will apply, just as it does today.

That said, I do not think clusterctl should be a requirement for using cluster-api.

That’s what I’ve been thinking, but I wanted someone else to say it first. I want to see some of the work clusterctl does right now move into the controllers, if at all possible, and perhaps see clusterctl itself turn into a kubectl plugin.

If that’s a way forward, which behaviours of the current clusterctl could potentially be moved?