Hi all
Sorry, I'm a little new to K8 and this forum, so apologies if this is not following the correct etiquette.
I was looking around trying to find examples of people using K8 for HPC-style workloads. I can’t seem to find many, though, and I'm wondering if this is because I’m not looking in the right place or if there’s a good reason not to use K8 for this type of workload.
I’ve come across a couple of examples of sharing hardware across HPC and K8 workloads, such as Shifter and Univa, but neither really attempts to run the HPC workload natively on K8.
This is right up my and @jeefy's alley. We both work for the University of Michigan’s research arm, and this space is something we’re actively trying to improve.
There are quite a few hurdles, generally stemming from the classic multi-user system design ingrained in those types of schedulers and workloads.
Most classic HPC systems (Slurm, Moab + Torque) do not have any good way of scaling dynamically. They expect to be managing the host system, and there is a static number of those systems. Both ‘cloud’ offerings from those companies rely on machines already being provisioned; they can just turn them on or off.
They rely on SSH and on users logging into a system. This means that each user must have a valid POSIX-compatible identity, and jobs generally load a shared file system (e.g. user home directories) or scratch space for things like MPI workloads. This tends to be problematic in a container environment.
They use specialty hardware or require elevated access (e.g. InfiniBand verbs for MPI).
The list can go on, but things are definitely improving. I’ll list some of the stuff we’ve been tossing around to deal with the above.
For scaling: deploy an instance of the batch scheduler of choice for each research group in its own namespace. This maps back well to the standard allocation model and can be done easily with namespace resource quotas. The allocation can then be subdivided into ‘nodes’ of designated sizes.
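A minimal sketch of the namespace-per-group idea, assuming a CPU and memory allocation per group. The group name, the sizes, and the `group_allocation` helper are made up for illustration; only the `Namespace` and `ResourceQuota` object shapes come from Kubernetes itself. It builds the two manifests and works out how many fixed-size scheduler ‘nodes’ fit inside the quota.

```python
import json


def group_allocation(group, cpu_cores, memory_gib, node_cpu, node_mem_gib):
    """Build Namespace and ResourceQuota manifests for one research group.

    Also returns how many scheduler 'nodes' of the given size fit in the quota.
    All arguments are illustrative; pick values that match your allocations.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": group},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{group}-allocation", "namespace": group},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
                "limits.cpu": str(cpu_cores),
                "limits.memory": f"{memory_gib}Gi",
            }
        },
    }
    # A 'node' only fits if both its CPU and its memory fit in what is left.
    nodes = min(cpu_cores // node_cpu, memory_gib // node_mem_gib)
    return namespace, quota, nodes


# Hypothetical group with 64 cores and 256 GiB, carved into 16-core/64-GiB 'nodes'.
namespace, quota, nodes = group_allocation(
    "physics-lab", cpu_cores=64, memory_gib=256, node_cpu=16, node_mem_gib=64
)
print(json.dumps(quota, indent=2))
print(f"scheduler nodes available: {nodes}")
```

The JSON output can be fed to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.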
User identity has been a bit of a challenge: while you can run pods as a specific uid:gid, it’s challenging to do dynamically, and it still doesn’t do anything for Kerberos tickets. Our solution is loosely based on an earlier sshd gateway PoC project that spins up pods with explicit user uid:gids and does some other magic to inject the user into the user container. The next iteration of this is a mutating admission webhook that will perform the user lookup and modify the pod resource with the needed information.
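As a rough illustration of what a mutating admission webhook for uid:gid injection could look like, here is a sketch of just the mutation logic (no HTTP server or TLS). It is not the project described above. It assumes the identity comes from the requesting user's name in the AdmissionReview `userInfo`, and `USER_DIRECTORY` is a hypothetical stand-in for a real lookup such as LDAP. The response shape (base64-encoded JSONPatch in an `admission.k8s.io/v1` AdmissionReview) is the standard one.

```python
import base64
import json

# Hypothetical directory; a real webhook would query LDAP or a similar service.
USER_DIRECTORY = {"alice": {"uid": 10001, "gid": 20001}}


def mutate(admission_review):
    """Return an AdmissionReview response that pins the pod to the user's uid:gid."""
    request = admission_review["request"]
    pod = request["object"]
    # Assumption: the pod should run as whoever submitted it.
    username = request.get("userInfo", {}).get("username")
    response = {"uid": request["uid"], "allowed": True}

    identity = USER_DIRECTORY.get(username)
    if identity is None:
        # Reject rather than silently running as an unknown identity.
        response["allowed"] = False
        response["status"] = {"message": f"unknown user: {username!r}"}
    else:
        # Preserve any existing pod-level securityContext fields.
        security_context = dict(pod["spec"].get("securityContext", {}))
        security_context["runAsUser"] = identity["uid"]
        security_context["runAsGroup"] = identity["gid"]
        # JSON Patch 'add' on an existing object member replaces its value.
        patch = [
            {"op": "add", "path": "/spec/securityContext", "value": security_context}
        ]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }


# Example request, trimmed to the fields the sketch reads.
review = {
    "apiVersion": "admission.k8s.io/v1",
    "kind": "AdmissionReview",
    "request": {
        "uid": "abc-123",
        "userInfo": {"username": "alice"},
        "object": {"metadata": {"name": "job"}, "spec": {"containers": []}},
    },
}
result = mutate(review)
print(json.loads(base64.b64decode(result["response"]["patch"])))
```

This only covers the numeric identity; it does nothing for Kerberos tickets or for making the user resolvable by name inside the container.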
Specialty hardware is still a challenge. There’s work being done in the NFV space that should hopefully improve it, but for now it means mapping through host SR-IOV devices and using something like Multus.
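For a sense of what the SR-IOV plus Multus approach looks like from the pod side, here is a small sketch that builds a pod manifest. It assumes a NetworkAttachmentDefinition already exists and that an SR-IOV device plugin advertises virtual functions as an extended resource. The attachment name, resource name, and image below are all hypothetical; only the `k8s.v1.cni.cncf.io/networks` annotation key is the standard one that Multus reads.

```python
import json


def mpi_pod(name, image, network_attachment, sriov_resource):
    """Build a pod manifest that asks Multus for an extra SR-IOV interface."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Multus attaches the named network in addition to the default one.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network_attachment},
        },
        "spec": {
            "containers": [
                {
                    "name": "worker",
                    "image": image,
                    "resources": {
                        # Extended resources must have requests equal to limits.
                        "requests": {sriov_resource: "1"},
                        "limits": {sriov_resource: "1"},
                    },
                }
            ]
        },
    }


# All names below are placeholders, not real cluster objects.
pod = mpi_pod(
    name="mpi-worker-0",
    image="registry.example.edu/mpi-worker:latest",
    network_attachment="sriov-ib",
    sriov_resource="example.com/sriov_vf",
)
print(json.dumps(pod, indent=2))
```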
In the long run, though, I think we’ll see a shift away from these classic systems towards more Kubernetes-native HPC workload systems and projects. A good example would be the kube-arbitrator project, which is working on bringing batch-style scheduling to Kubernetes. I know @jeefy can expand on this one more than I can.
If you’d like to chat more directly, there’s a good chunk of research HPC folk in the #academia channel in the CNCF Slack. It’s quiet, but we do kick up good discussions now and then.
I run a small company that does a lot of HPC work for the financial sector (risk calculations and the like). Many of our clients use products like IBM’s Platform Symphony and TIBCO GridServer, and curiously the challenges I see in trying to move their workloads to K8 are different from the sorts of things you’ve outlined. Happy to share more on what they are if that’s of interest.
The kube-arbitrator project you’ve pointed me to looks very interesting, especially as it alludes to being derived from experience with products like Symphony. I’ve taken a quick look at the README and the tutorial, but if someone could give me a high-level idea of the architecture and of the direction you’re hoping to move it in, that would be very helpful. It is potentially something we might be able to help with too.
I certainly agree that we’ll see a shift away from traditional HPC workloads, and in fact many of my existing and previous clients are actively exploring this (hence me asking these questions!). There is a definite desire to be able to use K8 to provide a generic compute resource which can be used equally by applications and HPC alike (and personally I have grander visions of where I’d like to see that go… must put pen to paper on that).
I’ve joined the slack channel too, so thanks for the pointer.
We aren’t active in that project, just have keen eyes on it since it directly applies to our day job.
If you’re interested in diving more into it, there’s a working group trying to bridge some gaps between traditional HPC/Research Computing and Kubernetes.
I think it’s a natural progression that these types of workloads migrate onto Kubernetes. I had heard at the last Supercomputing conference that SchedMD was working on Slurm/k8s integration, but I hadn’t heard anything since then.
Its target is ML, but there’s quite a bit of bleed-over. The kube-arbitrator project, along with things like automatic NUMA placement, has been a topic in there.