Cluster API: Node lifecycle management workstream

#1

Welcome to the Cluster API Node Lifecycle Management Workstream.

Next Meeting:
2019/05/03 - Friday @ 11am PDT
Put agenda items in the Google Doc, or reply here.

Some useful links:

Current sub-workstreams in progress are:

Always feel free to reach out to Tim St. Clair on the k8s slack (@timothysc).

#2

Meeting minutes for 2019/04/19:

2019/04/19 - Friday @ 11am PDT

Recording

Attending

  • Jason DeTiberus, Vince Prignano, Tim St. Clair, Loc Nguyen, Harish Udaiya Kumar, Chuck Ha, David vonThenen, Sidharth Surana - VMware
  • Marko Mudrinić - Loodse
  • Justin SB, John Belamaric - Google
  • Danny Berger, Leah Hanson, Shatarupa Nandi - Pivotal
  • Hardik Dodiya - SAP
  • Daniel Lipovetsky - Platform9

Agenda

  • What are the node lifecycle behaviors/operations?
    • Create
      • Instance Creation and Configuration
        • Call to a provider to give me an instance.
        • Optional: Install some software
        • Optional: Ability to inject configuration
      • Conversion to a Node
        • Join the cluster
        • Optional: Inject node config/policies such as taints, certs, etc.
    • Update / mutability (do we even want to allow mutability?)
      • We should discourage it, but keep an escape hatch?
      • Node certificate management? (a portion of node-lifecycle cert management may belong in k/k, or in a different project)
    • Reconfigure ~= Destroy + Create
    • Hibernate/sleep/suspend? (punt/re-evaluate later due to complexity)
      • Should it be the same node or a different node?
      • Does k8s even support this behavior?
      • Could also affect the control plane (quorum).
      • There are use cases here, but we are deliberately punting on this for now.
    • Destroy/(Delete?)
      • Do we want things like disruption budgets on machines/machine-sets?
        • In k/k we baked defaults into controllers
      • Strategy (-f / --graceful)
        • Mark for destroy ~= cordon+drain (see the sketch after this list)
        • Timeout (configurable, ~5 minutes)
        • Fill it with fire
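
A minimal sketch of the "mark for destroy ~= cordon+drain, timeout, then force" flow above, using client-go. The function name, the flow, and the 5-minute default are illustrative assumptions, not a settled Cluster API interface:

```go
package lifecycle

import (
	"context"
	"fmt"
	"time"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// drainTimeout is the hypothetical "~5 minutes" graceful window from the
// notes above; once it expires we fall through to forced deletion.
const drainTimeout = 5 * time.Minute

// destroyNode marks a node for destruction: cordon, drain, then delete.
func destroyNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	// 1. Cordon: mark the node unschedulable so no new pods land on it.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := cs.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}

	// 2. Drain: evict the pods on the node via the Eviction API, which
	//    respects PodDisruptionBudgets, until the timeout elapses.
	deadline := time.Now().Add(drainTimeout)
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return fmt.Errorf("list pods on %s: %w", nodeName, err)
	}
	for _, p := range pods.Items {
		if time.Now().After(deadline) {
			break // graceful window exhausted; fall through to deletion
		}
		eviction := &policyv1beta1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: p.Name, Namespace: p.Namespace},
		}
		// A real controller would retry evictions blocked by a
		// PodDisruptionBudget; for brevity we move on.
		_ = cs.CoreV1().Pods(p.Namespace).Evict(ctx, eviction)
	}

	// 3. "Fill it with fire": delete the Node object; the infrastructure
	//    provider would tear down the backing instance afterwards.
	return cs.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
}
```
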
#3

Meeting minutes for 2019/04/26:

2019/04/26 - Friday @ 11am PDT

Recording

Attending

  • Vince Prignano, Tim St. Clair, Andy Goldstein, Naadir Jeewa, Jason DeTiberus, Sidharth Surana - VMware
  • Andrew Sauber - Linode
  • Ace Eldeib - Microsoft
  • Michael Gugino, Michael Hrivnak - Red Hat
  • Dominik Zyla - The Scale Factory
  • David Watson - Samsung

Agenda

  • Talk about what is in scope and out of scope
    • In Scope
      • APIs and behavior related to the list below.
        • Disruption budgets - set a rate for a destroy queue.
          • There is some code in MachineDeployments/Sets that can control how many old nodes are destroyed as new ones are created (see the rolling-update sketch at the end of these minutes).
      • Controller defaults and possible primitives to control macro behavior.
      • Bootstrapping
        • [Naadir] Image stamping
          • Software provisioning
        • Image stamping - cloud-init
        • [Andrew] At Linode, only private disk images exist at the moment.
        • [tsc] The Cluster API GCP provider had a script apparatus for bootstrapping; an idempotent tool would make more sense and could be leveraged across providers. For providers that don’t support pre-built images, the tool would do the same things.
        • [andrew] We have a 100-line script that runs kubeadm. Do we want shell scripts or tools?
        • An entrypoint shell script that takes bootstrap information in and then calls the script.
        • [andrew] We also have issues with networking where interfaces need to be configured. Believes this should be shell scripts via cloud-init.
        • [spencer] In Talos, we don’t have real cloud-init, only a subset.
        • [tsc] On-prem PXE could use Kickstart.
        • Three things that would be required:
          • User Data / metadata
          • Utility that can bootstrap (script)
            • Software provisioning tool ~= image minter
          • Kick off the node join (see the user-data sketch at the end of these minutes)
        • [Dan] Would like the controller to determine which image boots; this will be specific to that controller/provider.
        • [tsc] It will work almost like Docker images: you layer your customizations on top.
        • [detiber] We may not be able to avoid supporting pre-provisioned machines, where an init system is not available at all.
        • [andrew] Where this gets confusing is where the provisioning tool stops and the init data starts.
      • AI [tstclair]: We need user stories for the boundaries between the two; when we create the KEPs for “bootstrap sequence” (run time) and “image stamping” (build time), these should be distinct.
        • [dan] Happy to provide use cases for pre-provisioned machines.
        • [andrew] Can write one for disk images (VMs) (+1 Naadir).
    • Out of Scope
      • Cluster Autoscaler (api-consumer)
        • Autoscaler is a consumer of Cluster API, but CAPI is not an autoscaler.
        • [detiber] Need to take care that the Autoscaler CAN be a consumer of CAPI. There are some issues with scale from 0, and there are some cases where CA needs information about provider resources. Currently, the MachineClass has no provider extension point (though we can bring this up as part of the data model discussion).
      • Updates, or node/machine mutability.
        • Do that out of band. CAPI will support cluster-wide upgrades through destroy & create.
          • [andrew] If you have a proper cloud controller manager, we don’t even need to hook into the machine.
            • [daniel] May not be helpful where there isn’t an infrastructure API (the API is people) and there’s an inventory of resources. We may not want to impose this requirement.
            • [michael hrivnak] Perhaps pushing the boundaries of “cloud provider”.
            • [detiber] Would it be onerous to make a cloud controller manager a requirement?
            • [sidharth] Previously (1.11), when you deleted a machine the node got deleted, but this appears to have changed so nodes are not deleted but marked “NotReady”; this was related to storage. There’s a KEP for this (link here). Some cloud providers may be doing this, but it’s not consistent. As consumers of CAPI, users will expect node deletion behavior to be consistent.
            • [detiber] Worried about giving more permissions to the CAPI provider. Can we reduce the scope of permissions that CAPI has by deferring to another tool?
            • Action: please confirm this behavior.
      • Certificate management
        • Can be managed via custom hooks, or out of band.
        • Our responsibility is to provide the hook
      • Data gravity, data management, and storage lifecycle - what to do with data on a machine’s local storage, as opposed to network-attached storage. There are also backup and restore options that exist.
        • May want hooks
    • Action items
      • KEPlet 1, User stories for build time vs run time
      • KEPlet 2, User stories for Node lifecycle hooks
        • May need a KEP to define which hooks will be provided.
        • [Daniel] User stories for e.g. deleting a node?
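
To make the rolling-update point from the in-scope discussion concrete: a hypothetical helper showing how maxSurge/maxUnavailable (the pattern MachineDeployments/Sets borrow from apps/v1 Deployments) bound how many old machines may be destroyed as new ones come up. This is a simplified illustration, not actual controller code:

```go
package main

import "fmt"

// scaleStep reports how many new machines may be created and how many old
// machines may be destroyed right now, keeping at most desired+maxSurge
// machines running and at least desired-maxUnavailable ready. Simplified:
// all old machines are assumed ready.
func scaleStep(desired, oldMachines, newReady, maxSurge, maxUnavailable int) (create, destroy int) {
	total := oldMachines + newReady
	// Surge: we may run up to desired+maxSurge machines at once.
	if room := desired + maxSurge - total; room > 0 {
		create = room
	}
	// Availability: we must keep at least desired-maxUnavailable ready.
	if excess := total - (desired - maxUnavailable); excess > 0 {
		destroy = excess
		if destroy > oldMachines {
			destroy = oldMachines
		}
	}
	return create, destroy
}

func main() {
	// 5 machines, surge 1, unavailable 0: bring one new machine up first...
	fmt.Println(scaleStep(5, 5, 0, 1, 0)) // 1 0
	// ...and destroy an old one only once the new machine is ready.
	fmt.Println(scaleStep(5, 5, 1, 1, 0)) // 0 1
}
```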
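
And a minimal sketch of the three bootstrap pieces listed above (user data/metadata, a bootstrap utility, and kicking off the node join), rendered as cloud-init user data. The script layout and kubeadm join parameters are illustrative assumptions, and providers without cloud-init (PXE/Kickstart, pre-provisioned hosts) would need an equivalent entrypoint:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// cloudInit writes a bootstrap script to disk and runs it on first boot.
// The script is the "utility" step; the kubeadm join line kicks the node
// join. Both are placeholders for illustration.
const cloudInit = `#cloud-config
write_files:
  - path: /usr/local/bin/bootstrap.sh
    permissions: "0755"
    content: |
      #!/bin/sh
      # Software provisioning would go here for providers that do not
      # support pre-built ("stamped") images.
      kubeadm join {{.APIEndpoint}} --token {{.Token}} \
        --discovery-token-ca-cert-hash {{.CACertHash}}
runcmd:
  - [/usr/local/bin/bootstrap.sh]
`

// joinParams is the per-machine user data/metadata.
type joinParams struct {
	APIEndpoint string // control plane endpoint, e.g. "10.0.0.1:6443"
	Token       string // bootstrap token minted for this machine
	CACertHash  string // pinned CA hash, "sha256:..."
}

func renderUserData(p joinParams) (string, error) {
	t, err := template.New("userdata").Parse(cloudInit)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	ud, err := renderUserData(joinParams{
		APIEndpoint: "10.0.0.1:6443",
		Token:       "abcdef.0123456789abcdef",
		CACertHash:  "sha256:<hash>",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(ud)
}
```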