Cluster upgrade strategy

When you run a production cluster and new releases come out frequently, upgrading can be a challenge: you already have lots of applications running, and you probably have some customized components installed.

I can see there are 2 approaches:

  1. Stay 2 versions behind the latest release, and upgrade at the same frequency as releases come out. For example, I start with 1.7 when 1.9 is released, go up to 1.8 when 1.10 is released, go up to 1.9 when 1.11 is released, etc. The advantage of this approach is that each upgrade is only one minor version up, which is usually well supported and thus relatively easy and reliable. However, this means upgrading the clusters every 2-3 months, or even more frequently, which poses risks to SLAs, and if you have multiple clusters in multiple regions, you’re pretty much constantly upgrading.

  2. Skip a few releases and perform long jumps every six months to a year. For example, I start with 1.7 and now upgrade to 1.10 directly. The advantage of this approach is that you have more time to plan upgrades and cause fewer disruptions to services. But with the long gap, a lot of things could have changed, so it’s probably not safe to upgrade in place; you’ll have to build a new cluster and migrate services over, which could be a big task that involves a lot of application testing.
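To make approach 1 concrete, here is a minimal sketch of the "stay 2 versions behind" rule. The function name `lagged_target` and the `lag` parameter are made up for illustration; the only real point is that minor versions must be compared as integers, since 1.10 comes after 1.9.

```python
def lagged_target(latest, lag=2):
    """Return the minor release that sits `lag` versions behind `latest`.

    Hypothetical helper: versions are 'major.minor' strings, parsed as
    integers so that '1.10' is treated as newer than '1.9'.
    """
    major, minor = (int(part) for part in latest.split(".")[:2])
    if minor < lag:
        raise ValueError("lag would cross a major version boundary")
    return f"{major}.{minor - lag}"


# Reproduces the example above: 1.9 -> 1.7, 1.10 -> 1.8, 1.11 -> 1.9
for latest in ("1.9", "1.10", "1.11"):
    print(latest, "->", lagged_target(latest))
```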

Has anyone out there tried or explored these different methods?

Thank you in advance for any input!

I started with Kubernetes 1.1. We used that and created a new cluster for 1.2 to go to production.

Then, it was just too painful to upgrade. There were even security fixes that were not backported by default to Kubernetes 1.2. I asked for a backport, and they kindly released a new 1.2 version. But this is a big risk with option number 2: security bugs can be a problem if you stay way behind the supported versions. They can even force you to change plans and rush an upgrade.

Now we use kops and upgrade once a quarter. For me, this is way better. Kops has a policy of not running bleeding edge while staying up to date. That works great for us.

We upgrade in staging, wait one week and upgrade in production. Never had problems, so far.
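As a rough sketch of that staging-first rule (not our actual tooling), the one-week wait can be written as a simple gate. The function name, the `incidents` counter, and the dates are all hypothetical.

```python
from datetime import date, timedelta

# One-week soak period between the staging and production upgrades.
SOAK_PERIOD = timedelta(days=7)


def production_ready(staging_upgraded_on, today, incidents=0):
    """Allow the production upgrade only after a clean soak in staging.

    `incidents` is an assumed count of problems seen in staging since
    the upgrade; any non-zero value blocks the rollout.
    """
    return incidents == 0 and (today - staging_upgraded_on) >= SOAK_PERIOD


# Hypothetical dates, for illustration only.
print(production_ready(date(2018, 5, 1), date(2018, 5, 5)))  # False: too early
print(production_ready(date(2018, 5, 1), date(2018, 5, 8)))  # True: a full week
```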

Hope it helps!

Thanks Rata! Definitely helpful information.

Are you running your clusters on AWS? As I understand it, kops works well with cloud providers, but not necessarily on bare metal. In our case, we run clusters on our own hardware, so we’re using kubespray with Ansible.

Anyway, if you upgrade once a quarter, did you ever have to jump versions? If so, did kops handle that well?

As for security bugs, even if you upgrade often, wouldn’t they still hit you at some point? For example, right now I’m thinking of upgrading from 1.7 to 1.10. I could go 1.7->1.8->1.9->1.10, one at a time, and if there are security bugs in between, they would affect one of the upgrades. Or I could build a 1.10 cluster, start migrating all the services over, and deal with the same security bug, which would probably require some application code changes?
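For the one-at-a-time option, the list of intermediate hops is easy to compute. This is only a sketch with a made-up helper name (`upgrade_path`); it assumes both versions share the same major number and says nothing about how each hop is actually performed.

```python
def upgrade_path(current, target):
    """List every minor release to step through, one hop at a time.

    Hypothetical helper: takes 'major.minor' strings and parses them as
    integers so that 1.10 is ordered after 1.9.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles a single major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.7", "1.10"))  # ['1.8', '1.9', '1.10']
```

Three hops means three rounds of testing, which is the cost to weigh against building a fresh 1.10 cluster and migrating.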

Upgrading in staging is a great idea. What release are you on today?

Thanks a lot!