100,000+ K8s nodes

SimplySeth · January 24, 2019, 4:02pm

Greetings,
I work with a company that is going to be doing a deployment similar to the below blog.

Our deployment will be much larger though, easily 100,000+ devices/nodes.

Has anyone come close? Is there a blog, how-to, README, etc.?
Thanks.

llarsson · January 25, 2019, 9:45am

You are very much on your own if you want a single cluster to support that many nodes.

The official target threshold is set to 5000 nodes/cluster.

Among other things, etcd does not scale that well, even if you do the recommended “put events on their own separate etcd cluster” configuration for large clusters.

You will likely have to re-think this one. You might wind up with multiple clusters, via e.g. federation-v2 or something like kube-applier. The former lets you manage them all in a way that makes them aware of each other, whereas the latter is essentially “one git repo per cluster, they all pull changes from their own repo periodically and just apply it”.
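
To make the “pull and apply” pattern concrete, here is a minimal Python sketch of one sync pass for a single cluster. It is not how kube-applier itself is implemented; it only illustrates the idea. The repo path and kubectl context name are hypothetical, and it assumes `git` and `kubectl` are on the PATH and that the manifests sit at the top level of the checkout.

```python
import subprocess
from typing import List


def sync_commands(repo_dir: str, kube_context: str) -> List[List[str]]:
    """Build the two commands for one sync pass: update the checkout, then apply it."""
    return [
        # Fast-forward the local checkout of this cluster's repo.
        ["git", "-C", repo_dir, "pull", "--ff-only"],
        # Apply the manifests in the checkout to this cluster.
        ["kubectl", "--context", kube_context, "apply", "-f", repo_dir],
    ]


def sync_once(repo_dir: str, kube_context: str, dry_run: bool = True) -> None:
    """Run one sync pass; with dry_run=True the commands are only printed."""
    for cmd in sync_commands(repo_dir, kube_context):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Hypothetical repo path and context name, printed rather than executed.
sync_once("/srv/manifests/cluster-01", "cluster-01")
```

Running this periodically (cron, a systemd timer, or a loop with a sleep) per cluster gives the “each cluster follows its own repo” behaviour, with no cluster aware of any other.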

Microsoft wants you to wind up with something like this, which is based on virtual-kubelet and their own IoT services for the heavy lifting.

Please update this thread with your thoughts on the matter, as they may be of interest to others, as well.

SimplySeth · January 25, 2019, 6:55pm

Looks like we’ll have to fill each cluster up to the maximum node count, then create a new cluster whenever that maximum is reached.

So for 100,000+ nodes, we’ll need 20+ clusters, using federation-v2 so that all clusters are aware of each other.
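
The arithmetic behind the “20+” figure, as a small Python sketch. The 5000 nodes/cluster value is the target quoted above; the headroom parameter is my own assumption, added to show how the count grows if clusters are not filled to the limit.

```python
def clusters_needed(total_nodes: int,
                    max_nodes_per_cluster: int = 5000,
                    headroom_percent: int = 0) -> int:
    """Number of clusters needed if each is filled to (100 - headroom_percent)% of its limit."""
    usable = max_nodes_per_cluster * (100 - headroom_percent) // 100
    # Integer ceiling division: round up so no node is left without a cluster.
    return -(-total_nodes // usable)


print(clusters_needed(100_000))                       # 20
print(clusters_needed(100_000, headroom_percent=20))  # 25
```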

This is good info. Thanks for your time @llarsson

yomateod · January 26, 2019, 7:40am

What is your workload exactly? You may well find that Kubernetes is not the answer for some, if not all, of your workload requirements at this scale.

llarsson · January 28, 2019, 12:58pm

If you do wind up running a huge number of cluster nodes in a rather large federation, please be aware that you will run into strange issues that you never had to think of before. For instance, the v2 federation control plane uses a lot of etcd storage, due to how Overrides are implemented. You might run into issues when a single etcd object takes up more than (the default) 1MB of space.
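
A rough way to keep an eye on this is to measure the serialized size of an object before it is written. The Python sketch below uses compact JSON as an approximation, which is an assumption: the bytes actually stored in etcd depend on the encoding in use. The limit constant is the “1MB” figure from this post, and the resource layout is a made-up example, not the exact federation-v2 schema.

```python
import json

# Assumed limit, taken from the "1MB" figure above; check your own etcd settings.
OBJECT_LIMIT_BYTES = 1_000_000


def serialized_size(obj: dict) -> int:
    """Size in bytes of the object encoded as compact UTF-8 JSON."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def fits_in_etcd(obj: dict, limit: int = OBJECT_LIMIT_BYTES) -> bool:
    """True if the approximate serialized size is within the limit."""
    return serialized_size(obj) <= limit


# Hypothetical federated resource carrying one override per member cluster.
resource = {
    "spec": {
        "overrides": [
            {
                "clusterName": f"cluster-{i:02d}",
                "clusterOverrides": [{"path": "/spec/replicas", "value": i}],
            }
            for i in range(20)
        ]
    }
}

print(serialized_size(resource), fits_in_etcd(resource))
```

The point is that per-cluster overrides grow with the number of member clusters, so an object that is fine with 3 clusters may not be with 20+.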

Make sure you are very comfortable with operating and managing etcd, as it will be a limiting factor for you when you make a deployment of this size.

Like @yomateod, I would love to learn more about your workload here. Massive scalability of Kubernetes is something I am very interested in, and it would be very interesting to learn about your thoughts and challenges in making this come true.

mrbobbytables · January 28, 2019, 2:31pm

Just to chime in, I would suggest hopping into both the sig-multicluster and sig-scalability channels and/or calls. They both like to get actual use-cases fleshed out and want to hear more from end-users as to what they’re looking for.

SimplySeth · January 29, 2019, 1:52pm

We are still very early in the planning stages.
We have an idea of what the end results should be but no clear path on getting there, yet.

llarsson · January 29, 2019, 2:35pm

What are your requirements? Does it really have to be 100,000+ nodes? How big and how (geographically) distributed are they?

llarsson · January 29, 2019, 2:50pm

The main reason I ask is that etcd is designed to work well in a relatively low-latency environment. So even if you manage to make a cluster with 5,000 nodes in it, those are typically expected to be within a single data center, not “1 per customer” as it kind of sounds like you want to do in some kind of IoT fashion.

You may also be interested in a guide for tuning etcd (as I said before, I hope you are really good with managing and operating etcd, because it will be crucial for supporting any huge environment).
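
As a rough illustration of why latency matters: the usual rule of thumb for etcd is a heartbeat interval around the round-trip time between members, and an election timeout of at least 10 times that round-trip time. The Python sketch below applies that rule and never goes below etcd's defaults of 100 ms and 1000 ms. The function name and the example RTT values are my own; treat the output as a starting point to measure against, not a recommendation.

```python
import math

DEFAULT_HEARTBEAT_MS = 100   # etcd default for --heartbeat-interval
DEFAULT_ELECTION_MS = 1000   # etcd default for --election-timeout


def suggest_etcd_timing(rtt_ms: float) -> dict:
    """Rule-of-thumb timing: heartbeat ~ member RTT, election timeout >= 10x RTT."""
    heartbeat = max(DEFAULT_HEARTBEAT_MS, math.ceil(rtt_ms))
    election = max(DEFAULT_ELECTION_MS, 10 * heartbeat)
    return {"heartbeat_ms": heartbeat, "election_ms": election}


def as_flags(timing: dict) -> str:
    """Render the suggestion as etcd command-line flags."""
    return (f"--heartbeat-interval={timing['heartbeat_ms']} "
            f"--election-timeout={timing['election_ms']}")


print(as_flags(suggest_etcd_timing(5)))    # same data center: defaults are fine
print(as_flags(suggest_etcd_timing(150)))  # members spread across regions
```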

SimplySeth · January 29, 2019, 10:58pm

Actually, what we are aiming at is IoT devices that will be at US intersections.
We may start small, with only a certain section of a city, but we want to be prepared to scale.
Imagine Austin, Atlanta, San Francisco, etc., backed by PostGIS databases.

Each city might use its own datacenter.
Low latency is definitely key.

I’m guessing we’ll be using a 5-node etcd cluster on SAN storage with 500+ sequential IOPS, per Kubernetes cluster.
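
For reference, what a 5-member etcd cluster buys in terms of failures, as a quick Python sketch. etcd needs a majority of members (a quorum) to accept writes, so the number of members that can be lost follows directly from the cluster size.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with the given member count."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a quorum is still available."""
    return members - quorum(members)


for n in (3, 5, 7):
    print(n, quorum(n), failures_tolerated(n))
# 3 2 1
# 5 3 2
# 7 4 3
```

So 5 members per cluster tolerate 2 failures, at the cost of every write being replicated to at least 3 members.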
