I’m currently running an on-premises Kubernetes cluster on version 1.29, and I plan to upgrade to 1.30, with the eventual goal of moving to the latest stable version. My primary concern is ensuring a smooth upgrade process while maintaining the ability to roll back in case of any technical issues during the upgrade.
Here’s the full scenario:
Cluster Details:
On-premises environment.
Deployed using kubeadm (though I’m open to other tools if necessary).
Multiple worker and control plane nodes.
Critical applications are running on the cluster, so downtime must be minimal.
Upgrade Goals:
Upgrade to Kubernetes 1.30 with all associated components (e.g., kubeadm, kubelet, kube-proxy, etc.).
Test application compatibility and performance after the upgrade.
Ensure that any breaking changes in the new version can be mitigated.
Rollback Requirement:
I’d like a reliable way to roll back to version 1.29 if issues arise, without breaking the cluster or applications.
Data integrity (e.g., persistent volumes, etcd snapshots) must remain intact during the rollback process.
Questions:
Is there a specific tool or strategy you recommend for handling Kubernetes upgrades and rollbacks in an on-prem environment?
What’s the best way to back up the cluster state (etcd, cluster configuration, etc.) to facilitate a rollback?
Are there any tools or practices that allow a blue-green upgrade or canary-style testing for Kubernetes clusters?
If you’ve faced a similar situation, what challenges should I expect, and how can I best prepare for them?
Things I’ve Considered:
Using etcdctl to back up and restore etcd snapshots for rollback.
Staging the upgrade in a test environment, but replicating production traffic is difficult.
Tools like Cluster API or Velero for backup and migration, though I’m not sure they can handle complete cluster rollback scenarios.
I’d greatly appreciate your insights on the best practices, tools, or workflows to achieve a reliable upgrade process with rollback capabilities. Thanks in advance for your advice!
Do NOT roll back. There is no good way to ‘downgrade’ a cluster. The best practice is to take a snapshot prior to the upgrade and, if you encounter issues, restore from that snapshot. Otherwise you’re looking at rebuilding or conducting surgery on your environment, which isn’t really supported by any tool I know of that actually works and that I haven’t seen cause more issues than it resolved.
Your consideration of using etcdctl to snapshot and restore is correct. Personally, I work at Rancher, so I use our tools for these things. I find RKE2 helps a lot in these situations because it manages much of the process, but I’m not trying to plug our tool or get roped into supporting it out of hours.
One consideration to add to the list: since clusters are cattle, not pets, you can just deploy a new environment with the upgraded versions and migrate your DHCP address, persistent storage (with something like Velero), and CI/CD/GitOps pipelines to the new env. It’s a bit more work than a straight upgrade, but I would advise getting good at this process and employing it once every few years, because you generally don’t want an old K8s environment floating around. They have a tendency to build up cruft like an old Windows computer, and like a Windows environment suffering from ‘weirdness’, a lot of the time a reformat is the best resolution.
So I realize I didn’t actually address your questions as such.
At the risk of plugging our tooling, I do quite like Rancher for managing K8s on-prem.
Take an etcd snapshot off site (I prefer sending it to an S3 bucket somewhere separate from the cluster env, like a MinIO deployment in another rack/location) and a backup of persistent storage with something like Velero; a rough sketch of the snapshot-and-ship step is below. With RKE2, at least, I think that’s generally enough for a standalone cluster.
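Roughly, that looks like this on a kubeadm-built control plane node. Only a sketch: the cert paths are the kubeadm defaults, and the backup path, bucket, and MinIO endpoint are placeholders for wherever your off-site target actually lives.

    # Snapshot etcd locally (cert paths are the kubeadm defaults)
    ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-pre-upgrade.db \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key

    # Ship it off site to an S3-compatible endpoint (MinIO here; endpoint and bucket are
    # placeholders, and this assumes the AWS CLI is configured with credentials for that MinIO)
    aws --endpoint-url https://minio.example.internal:9000 \
      s3 cp /backup/etcd-pre-upgrade.db s3://cluster-backups/etcd/etcd-pre-upgrade.db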
For blue-green and canary deployments (exactly the right idea!), I would just deploy the new cluster alongside and use some DNS witchery or a load balancer to trickle traffic into the new environment. When advising customers about Rancher Manager cluster recoveries, I tell them to deploy a new K8s cluster somewhere, restore a Rancher Backup using the Backup Operator (specific to restoring Rancher Manager), and then just move over the FQDN that was previously pointing to the old cluster.
You can expect to see things like:
Actual K8s env failures (practice your recovery path by taking a snapshot right now, restoring it in an isolated test environment, and confirming everything works as expected)
API deprecations affecting deployments (developers need to keep their deployments up to date. A test/dev environment where they deploy first is key; upgrade it first and make sure everyone can deploy without issue for at least a few weeks before upgrading prod. See the sketch after this list for a quick way to inventory the API versions you’re using.)
CVEs or general instability that will require troubleshooting (hopefully your test env exposes these issues and you can investigate each as needed)
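On the deprecation point, a quick-and-dirty way to see which API versions your manifests actually use before you upgrade (the directory path is just an example); compare the output against the 1.30 deprecation guide:

    # Count apiVersion usage across your manifests (path is an example)
    grep -rh "apiVersion:" ./manifests | sort | uniq -c | sort -rn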
Hope this helps! It’s a new way of thinking about things for a lot of folks and can seem resource-heavy, but when you really get things working, it’s worth it to do it right. Don’t waste time nursing along failing environments; shoot them in the head and move on. Nothing in K8s is persistent except maybe your PVs, and you should back those up… off site.
Backup First – Take an etcd snapshot using etcdctl snapshot save and back up the cluster configuration (the kubeadm-config ConfigMap in kube-system plus /etc/kubernetes; the old kubeadm config view command was removed in recent releases). Ensure persistent volumes are protected. A short sketch is below.
Test Before Prod – If a staging cluster isn’t feasible, use canary deployments or blue-green upgrades with additional nodes.
Upgrade Step-by-Step – Follow the kubeadm upgrade process carefully: upgrade the control plane nodes first, then the worker nodes (sequence sketched below).
Rollback Plan – If issues arise, revert by restoring the etcd snapshot with etcdctl snapshot restore and reinstalling the Kubernetes components at 1.29 (rollback sketch below).
Tools to Consider – Velero for backup and restore (example below), Kured for automated node reboots, and phased rollouts with kubeadm upgrades.
Key challenges include API deprecations, downtime risks, and compatibility issues—be sure to test workloads thoroughly after upgrading!
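To make the backup step concrete, a minimal sketch on a kubeadm control plane node (paths are placeholders; the etcd snapshot itself is the same etcdctl snapshot save call shown in the reply above):

    # Export the kubeadm cluster configuration from the kubeadm-config ConfigMap
    kubectl -n kube-system get configmap kubeadm-config -o yaml > /backup/kubeadm-config.yaml

    # Keep a copy of the static pod manifests and PKI as well
    cp -a /etc/kubernetes /backup/etc-kubernetes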
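The upgrade sequence, roughly, assuming Debian/Ubuntu nodes with the pkgs.k8s.io packages and 1.30.0 as the target patch release (adjust for your distro and whichever patch version you have validated):

    # On the first control plane node
    apt-mark unhold kubeadm && apt-get update && apt-get install -y kubeadm='1.30.0-*' && apt-mark hold kubeadm
    kubeadm upgrade plan
    kubeadm upgrade apply v1.30.0

    # On the remaining control plane nodes and each worker (after upgrading their kubeadm package)
    kubeadm upgrade node

    # On every node: drain, upgrade kubelet/kubectl, restart kubelet, uncordon
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
    apt-mark unhold kubelet kubectl && apt-get install -y kubelet='1.30.0-*' kubectl='1.30.0-*' && apt-mark hold kubelet kubectl
    systemctl daemon-reload && systemctl restart kubelet
    kubectl uncordon <node-name>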
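And the rollback path, roughly: restore the pre-upgrade snapshot into a fresh data directory, downgrade the packages back to 1.29.x on each node, and point etcd’s static pod manifest at the restored directory (the data-dir path is a placeholder; newer etcd releases prefer etcdutl for restores, but etcdctl still works):

    # Restore the pre-upgrade snapshot into a new data directory
    ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-pre-upgrade.db \
      --data-dir=/var/lib/etcd-from-backup
    # For a multi-member etcd, also pass --name, --initial-cluster, and
    # --initial-advertise-peer-urls for each member when restoring.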
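For the workload and PV side, a Velero sketch, assuming Velero is already installed with a storage provider plugin that can snapshot your volumes (backup name and namespace are placeholders):

    # Back up an application namespace, including its PVs where the provider supports it
    velero backup create pre-upgrade-apps --include-namespaces my-app

    # Restore it later if needed
    velero restore create --from-backup pre-upgrade-apps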