Scope of v1alpha2

For those who attended the SIG face-to-face during the KubeCon contributor summit: we went around the room and asked, if there was one and only one feature you could get in the next release, what would it be? For reference, here is that list:

  • Remote node ref
  • Health Status
    • Better error reporting.
  • Dedicated api group for worker machines
  • Decoupling bootstrapping config from provider implementation
  • Default test-bed using something like (kind)
    • One already exists in the repo as the example provider.
    • Or possibly libvirt
  • Cluster Required
  • Cluster Optional
  • Optional vs. required fields
    • E.g. cluster reference on machine/node object.
  • Orchestrating upgrades.

At this week’s Cluster API meeting, I proposed that we decrease the scope of v1alpha2 from “anything and everything” (:smile:) to something smaller and more tangible. Additionally, we should strive to have multiple, frequent, iterative releases, which is aligned with yesterday’s release cadence proposal. I’d like to propose that we focus on “Decoupling bootstrapping config from provider implementation” along with an initial formalization and realization of machine states (what the node lifecycle workstream was working on - https://docs.google.com/document/d/14Ec1aKjFtbhrGykLHtw8uPXa4TPtKIR2QiNbXArLVuI/edit#heading=h.c0njve9f8kbn).

It seems like decoupling bootstrapping from machine infrastructure is a good first step that can open up a lot of possibilities going forward. Once this is implemented, we can work on creating a component to manage the control plane lifecycle, leveraging the bootstrapping/infrastructure separation work as a foundation. We can also use the bootstrapping work to help with upgrades.

Some of us spent some time at a whiteboard at KubeCon working through a possible design for the bootstrap/infrastructure separation. We’d like to present a draft proposal early next week and hope to get buy-in from the community.

How does everyone feel about focusing on the bootstrapping work as the one major feature of v1alpha2?

Andy

I am not sure about starting with bootstrapping, because it is a complicated area, and when we debated it at KubeCon it was clear to me that not everyone has the same understanding of the CRD and/or webhook extension models. I had planned to give a presentation on this at some point…

That said, I may be wrong.

One aside: I am the one that proposed making the Cluster resource mandatory. This was a mistake which I regret. All I really want is a central point of control so that an operator can say “Please delete this cluster” and et voila, it is so. If there are no objections, I think it would simplify the discussion if we removed the implementation detail of whether there is a Cluster/ControlPlane object and instead see if any of the other implementation plans will address my goal…

I noticed some comments in the KubeCon docs regarding immutable machine objects. Is that in scope for the state machine? We’ve been discussing the machine states internally, and we’ve decided that the state machine today doesn’t really make sense. In particular, there are assumptions of behavior that don’t hold up across cloud providers, and I don’t see a way to cleanly unify provider behavior when it comes to reconciling an already created machine in various states. This is particularly concerning because it makes creating a provider-independent remediation layer problematic.

Are we using the linked workstream doc as a baseline for the state machine? I think that doc has some fundamental flaws, notably tracking the state machine across the various ‘hooks’ and passing variable information from one ‘hook’ to another in a repeatable way (everything will need to be persisted as an annotation or providerSpec-like field, and we’ll need to read arbitrary annotations/fields to pass into the various providers for each ‘hook’). It’s unlikely that the ‘preboot’ hook would be idempotent, imperative, and also return no data we need to feed into the later part of the process. Most likely, these phases still belong under the Create() interface, and state and data management is a function of the provider.
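To make the concern concrete, here is a minimal Python sketch of what passing data between hooks through annotations could look like. The hook names, the annotation key, the `Machine` class, and the `fake_allocate` helper are all hypothetical and are not taken from the workstream doc. The only point is that whatever an earlier hook returns has to be persisted on the object so that a later hook, possibly running in a different reconcile pass, can read it back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical annotation key; not part of any real Cluster API version.
PREBOOT_KEY = "example.invalid/preboot-instance-id"


@dataclass
class Machine:
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)


def preboot_hook(machine: Machine, allocate: Callable[[str], str]) -> None:
    """Idempotent only because the result is persisted and checked first."""
    if PREBOOT_KEY in machine.annotations:
        return  # already ran in an earlier reconcile pass
    machine.annotations[PREBOOT_KEY] = allocate(machine.name)


def boot_hook(machine: Machine) -> str:
    """A later hook can only see what an earlier hook wrote to the object."""
    instance_id = machine.annotations.get(PREBOOT_KEY)
    if instance_id is None:
        raise RuntimeError("preboot has not run yet")
    return f"booting {instance_id}"


# Stand-in for a provider call with a side effect (hypothetical).
calls: List[str] = []


def fake_allocate(name: str) -> str:
    calls.append(name)
    return f"{name}-instance-{len(calls)}"


m = Machine(name="worker-0")
preboot_hook(m, fake_allocate)
preboot_hook(m, fake_allocate)  # second pass is a no-op thanks to the annotation
result = boot_hook(m)
```

Without the annotation check, the second `preboot_hook` call would allocate again, which is the kind of repeatability problem described above.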

As far as separating the machine-provisioning from the machine-bootstrapping, I agree. But, what is the actual scope? Does this imply changes to the data model? If it does, then what we’re really talking about changing is the machine-object data model. We could isolate bootstrapping from provisioning without changing the data model; we could preserve provider spec and bootstrapping would be handled by ‘some other component.’ I would be open to preserving providerSpec and creating an additional, optional field bootstrapRef, but I’m not keen on the machine-controller operating on the bootstrapRef itself.

We are still working on completing our initial draft of the bootstrapping proposal. It will be a separate document.

We will be proposing changes to the Machine data model which include adding fields for bootstrap configuration. The machine controller will not be responsible for processing the bootstrap fields. That will be left to other controllers whose responsibility is bootstrap configuration.
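As a rough illustration of that split (not the proposal itself, which is still being drafted), the following Python sketch models a machine with a provider-owned spec plus an optional bootstrap reference, borrowing the `bootstrapRef` name floated earlier in this thread. The field names, the `instanceType` key, the `kubeadm-join` config name, and both reconcile functions are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Machine:
    name: str
    provider_spec: Dict[str, str]        # infrastructure details, provider-owned
    bootstrap_ref: Optional[str] = None  # hypothetical optional reference
    status: Dict[str, str] = field(default_factory=dict)


def machine_controller_reconcile(machine: Machine) -> None:
    """Handles infrastructure only; it never reads bootstrap_ref."""
    instance_type = machine.provider_spec["instanceType"]
    machine.status["infrastructure"] = f"provisioned:{instance_type}"


def bootstrap_controller_reconcile(machine: Machine, configs: Dict[str, str]) -> None:
    """A separate controller whose only responsibility is bootstrap config."""
    if machine.bootstrap_ref is None:
        return  # the field is optional, so there may be nothing to do
    machine.status["bootstrap"] = configs[machine.bootstrap_ref]


with_bootstrap = Machine("worker-0", {"instanceType": "small"}, bootstrap_ref="kubeadm-join")
machine_controller_reconcile(with_bootstrap)
bootstrap_controller_reconcile(with_bootstrap, {"kubeadm-join": "join-config-data"})

without_bootstrap = Machine("worker-1", {"instanceType": "small"})
machine_controller_reconcile(without_bootstrap)
bootstrap_controller_reconcile(without_bootstrap, {})
```

In this sketch each controller touches only its own part of the object, so a machine without a bootstrap reference still reconciles cleanly.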

Hopefully when we share the proposal, it either addresses your needs or we’ll be able to work together on revisions that work for everyone. Stay tuned!

Andy

Sounds promising. “That will be left to other controllers whose responsibility is bootstrap configuration” is all I ever wanted, lol.

A note on nomenclature:

We may want to distinguish between explicit states and emergent states. The former, “explicit states”, present a problem if they are transitive because they make it difficult to evolve the API. The latter, “emergent states”, exist and are extremely useful for an operator to understand what is happening.
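One possible reading of that distinction, sketched in Python: an explicit state would be a stored field that the API promises and controllers transition between, while an emergent state is computed on demand from whatever has been observed. The observation fields and state names below are invented for illustration and are not part of any API.

```python
from dataclasses import dataclass


@dataclass
class Observed:
    """Facts a controller might observe; all field names are hypothetical."""
    instance_exists: bool = False
    node_registered: bool = False
    node_ready: bool = False
    deletion_requested: bool = False


def emergent_state(o: Observed) -> str:
    """Derive a human-readable state from observations instead of storing it."""
    if o.deletion_requested:
        return "Deleting"
    if not o.instance_exists:
        return "Pending"
    if not o.node_registered:
        return "Provisioned"
    return "Running" if o.node_ready else "NotReady"


fresh = emergent_state(Observed())
booted = emergent_state(Observed(instance_exists=True))
healthy = emergent_state(Observed(instance_exists=True, node_registered=True, node_ready=True))
going_away = emergent_state(Observed(instance_exists=True, deletion_requested=True))
```

Because the label is derived rather than stored, new observations or labels can be added later without changing a persisted field that clients already depend on.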

I feel like the word “state” is going to become the new “provider”… :wink:

Thanks for the write-up. I’m definitely interested in working on the bootstrapping work.

Regarding the test bed using kind, I can share a draft of the ideas discussed with the kind project and also with kubeadm, because they have a similar case.

However, depending on the scope of the bootstrapping work, kind may prove insufficient or complicated for testing such features (because it just simulates nodes as containers), and libvirt could be a more comprehensive solution.

Maybe I’m not understanding you, but I can’t recall that we discussed immutable machine objects. What we discussed was immutable machines, in the sense that we are not modeling states for upgrading the machine once its creation has finished, and that changes in the machine specs will trigger a logical replacement of the machine. That doesn’t rule out a provider implementing this as an in-place update, but the required states are provider-specific and not visible from the Cluster API.

Here’s the link to the specific comment: https://docs.google.com/document/d/1Gmc7LyCIL_148a9Tft7pdhdee0NBHdOfHS1SAF0duI4/edit?disco=AAAAC-bMowc

“Machines should be immutable but clusters themselves are mutable.”

We should spend some time discussing this idea in a meeting soon. I think this will help guide what the next data model will look like, and whether we need to change the interface between the machine controller and the provider.

You are right.

IIRC, the only way to change the machine specs would be by changing the corresponding machine set’s specs, which will trigger the reconcile process to create a new set of machines that match that spec. So my previous comment is wrong and machine objects are effectively immutable.

Now, it is up to the provider how to implement this process, and whether to make an in-place update or create new machines. My guess is that the machine deployment used to update a machine set could take care of this strategy.

So the provider can make machines effectively mutable even if this is not modeled in the API.
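A deliberately simplified Python sketch of the replacement logic described in this post: machines whose spec no longer matches the machine set’s spec are replaced rather than edited. It illustrates the idea only and is not how the actual machine set or machine deployment controllers are implemented; `plan_replacement`, `freeze`, and the `version` key are hypothetical. Whether a provider satisfies a "replacement" with an in-place update is outside what this models.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

FrozenSpec = Tuple[Tuple[str, str], ...]


def freeze(spec: Dict[str, str]) -> FrozenSpec:
    """Turn a spec dict into a hashable, order-independent snapshot."""
    return tuple(sorted(spec.items()))


@dataclass(frozen=True)
class Machine:
    name: str
    spec: FrozenSpec  # the spec the machine was created with; never edited


def plan_replacement(
    desired_spec: Dict[str, str], replicas: int, machines: List[Machine]
) -> Tuple[List[str], int]:
    """Return (names of machines to delete, number of machines to create)."""
    want = freeze(desired_spec)
    keep = [m for m in machines if m.spec == want][:replicas]
    keep_names = {m.name for m in keep}
    to_delete = [m.name for m in machines if m.name not in keep_names]
    to_create = replicas - len(keep)
    return to_delete, to_create


existing = [
    Machine("a", freeze({"version": "1.13"})),
    Machine("b", freeze({"version": "1.14"})),
]
to_delete, to_create = plan_replacement({"version": "1.14"}, 2, existing)
```

Here machine `a` is scheduled for deletion and one new machine is requested, while `b` is left alone because it already matches the desired spec.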

Yes, we had the same problem with the term “immutable” in the room. It took us several minutes to figure out we were using “resource” in different ways. As Pablo says, the API object would be mutable but there would be no expectation that the controller would modify the “real world” thing backing that database object (cloud VM, bare metal host, etc.).