Extended StatefulSet for workloads that require one pod per PCIe card

I’m currently setting up a cluster of approximately 30 nodes, with each node being a PC with up to 4 PCIe cards that do one of the following:

  • Crypto hardware acceleration. StatefulSet Name: ss-crypto-*
  • Image analysis hardware acceleration. StatefulSet Name: ss-graphics-*
  • Audio analysis hardware acceleration. StatefulSet Name: ss-audio-*

The distribution of these cards is usually uniform (i.e. a single node will have 1-4x of the same card installed), and we intend to run a single pod per card.

The problem we’re encountering is as follows: we only seem to be able to schedule a single pod from a given StatefulSet on each node. This means that, for example, if a node has 4x crypto cards installed, we only get a single pod from that StatefulSet running on it (e.g. ss-crypto-1), when we’d really like 4x pods running on it. Additionally, we’d like to (if possible) extend the naming schema for the dynamically-generated pod names so that, instead of ss-crypto-${POD_NUMBER}, they would be something like ss-crypto-${POD_NUMBER}-${SUB_POD_NUMBER}.

Consider a more complex scenario: we have 3 nodes:

  • Node 1: 4x crypto cards.
  • Node 2: 2x crypto cards, 2x graphics cards.
  • Node 3: 1x crypto card, 1x graphics card, 2x audio cards.

Ideally, the pod names would be something like so:

  • Node 1: ss-crypto-0-0, ss-crypto-0-1, ss-crypto-0-2, ss-crypto-0-3.
  • Node 2: ss-crypto-1-0, ss-crypto-1-1, ss-graphics-1-0, ss-graphics-1-1.
  • Node 3: ss-crypto-2-0, ss-graphics-2-0, ss-audio-2-0, ss-audio-2-1.

Is something like this possible? In short, we’re attempting to run multiple pods from the same StatefulSet on a single node (unless there’s an easier way to implement/manage this), in such a way that it’s obvious from the pod name which pods share a node.

Thank you for your time and assistance.

You won’t really be able to get that sort of naming convention out of the box. You could develop your own controller to do that sort of thing for you, but you should be able to get pretty close by properly labeling each node in the cluster and using DaemonSets or Deployments to get the desired number of pods scheduled per node.
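For instance (the label key, node name, and image below are just placeholders, not anything built in), you could label each node with the card types it carries and pin the matching Deployment to those nodes:

```yaml
# Illustrative sketch only: label keys, node names, and images are assumptions.
# Label each node with the card type(s) it carries, e.g.:
#   kubectl label node node-1 cards.example.com/crypto=present
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crypto-workers
spec:
  replicas: 7                      # total crypto pods wanted across the cluster
  selector:
    matchLabels:
      app: crypto-worker
  template:
    metadata:
      labels:
        app: crypto-worker
    spec:
      nodeSelector:
        cards.example.com/crypto: present
      containers:
      - name: crypto-worker
        image: registry.example.com/crypto-worker:latest
```

Note that labels alone won’t cap the number of pods a node receives at the number of cards it actually has installed.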

Thank you. In that case, is there some mechanism by which I could set up a taint/toleration per node based on the number of these proprietary PCIe cards present on the node, and then use a Deployment in place of a StatefulSet on the node? i.e. set up a tuple/list of taints on a per-node basis, based on the number of these PCIe devices (and their respective types) present, and then set up a Deployment with the corresponding tolerations to act on it?
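To make that concrete, something roughly like this is what I have in mind (the taint keys, values, and node names below are purely illustrative):

```yaml
# Rough sketch only: taint keys/values and node names are made up.
# Taint each node according to the cards installed in it, e.g.:
#   kubectl taint node node-2 cards.example.com/crypto=2:NoSchedule
#   kubectl taint node node-2 cards.example.com/graphics=2:NoSchedule
# ...and give the matching Deployment a toleration for "its" card type:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crypto-workers
spec:
  replicas: 7
  selector:
    matchLabels:
      app: crypto-worker
  template:
    metadata:
      labels:
        app: crypto-worker
    spec:
      tolerations:
      - key: cards.example.com/crypto
        operator: Exists
        effect: NoSchedule
      containers:
      - name: crypto-worker
        image: registry.example.com/crypto-worker:latest
```

I realise a toleration only permits scheduling rather than forcing a particular count per node, which is part of why I’m asking whether there’s a better mechanism.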

If I could approximate the above example, even without the naming schema being identical, that would be sufficient.

Not exactly… but you might be able to get close with device plugins, referencing the resources they expose in your resource limits.

The device plugin framework is most commonly used with GPUs at the moment, but you’d need to develop a plugin for your other devices as a means to register them and advertise the number available per node. That way you could just scale your Deployments to the total number of cards you want to consume, and the scheduler should handle the placement.
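As a sketch (the resource name here is an assumption; in practice it would be whatever your plugin registers, much as NVIDIA’s plugin registers nvidia.com/gpu), consuming a card then becomes an ordinary resource limit:

```yaml
# Illustrative only: "example.com/crypto-card" stands in for whatever
# extended resource your own device plugin would register and advertise.
apiVersion: v1
kind: Pod
metadata:
  name: crypto-worker-example
spec:
  containers:
  - name: crypto-worker
    image: registry.example.com/crypto-worker:latest
    resources:
      limits:
        example.com/crypto-card: 1   # only schedulable onto a node with a free card
```

Put that limit in a Deployment’s pod template, scale the Deployment to the total card count, and the scheduler will place as many pods on each node as that node advertises cards.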

Great. One more question: is it possible, through the use of these plugins, to schedule N StatefulSet instances per node, rather than N simple pods handled through a Deployment? If that’s the case, it’d be acceptable for me to drop the naming schema requirement and “just know” how many stateful instances are running on each node.

You won’t be able to schedule a specific number per node directly, but you should be able to achieve it by scheduling the total number desired and letting the scheduler figure out the placement.
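For example (again assuming a hypothetical example.com/crypto-card resource exposed by your device plugin), a StatefulSet scaled to the total number of crypto cards in the cluster would look roughly like this:

```yaml
# Rough sketch: the extended resource name is an assumption.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ss-crypto
spec:
  serviceName: ss-crypto
  replicas: 7                        # total crypto cards cluster-wide (4 + 2 + 1 in your example)
  selector:
    matchLabels:
      app: ss-crypto
  template:
    metadata:
      labels:
        app: ss-crypto
    spec:
      containers:
      - name: crypto-worker
        image: registry.example.com/crypto-worker:latest
        resources:
          limits:
            example.com/crypto-card: 1   # each replica claims one card, so pods only land where a card is free
```

The pods come out as ss-crypto-0 through ss-crypto-6 rather than encoding the node in the name, but each node ends up running as many replicas as it has cards advertised.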