Challenges & Kubernetes Solutions for Dynamic Node Participation in a Distributed System
Hi everyone,
I’m architecting a Split Learning system deployed on Kubernetes. A key characteristic is that the client-side training components are intended to run on nodes that join and leave the cluster dynamically and frequently (e.g., edge devices, temporary workers acting as clients).
This dynamic membership raises fundamental challenges for system reliability and coordination:
- Discovery & Availability: How can the central server/coordinator reliably discover which client nodes are currently active and available to participate in a training round?
- Workload Allocation: What are effective strategies for dynamically scheduling the client-side training workloads (Pods) onto these specific, ephemeral nodes, possibly considering their available resources?
- State & Coordination: How can I manage the overall training state (e.g., tracking participants per round, handling partial results) and coordinate actions when the set of available clients changes constantly between, or even during, rounds?
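To make the discovery and coordination questions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a working design: it operates on Node objects as plain dicts (the JSON shape the Kubernetes API returns, with `metadata.labels`, `spec.unschedulable`, and `status.conditions`), the label key `split-learning/role` is hypothetical, and the `Round` class with its `min_results` quorum is one possible way to track participants and partial results.

```python
from dataclasses import dataclass, field

# Hypothetical label key marking client-capable nodes (not a Kubernetes built-in).
CLIENT_LABEL = "split-learning/role"


def is_available(node: dict) -> bool:
    """True if the node has the client label, is schedulable, and reports Ready."""
    labels = node.get("metadata", {}).get("labels") or {}
    if labels.get(CLIENT_LABEL) != "client":
        return False
    if node.get("spec", {}).get("unschedulable", False):
        return False
    conditions = node.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def available_clients(nodes: list) -> list:
    """Sorted names of the nodes that can take part in the next round."""
    return sorted(n["metadata"]["name"] for n in nodes if is_available(n))


@dataclass
class Round:
    """Tracks one training round whose participant set may shrink mid-round."""
    number: int
    participants: set
    min_results: int  # assumed quorum: results needed for the round to count
    results: dict = field(default_factory=dict)

    def submit(self, client: str, update) -> None:
        # Ignore results from clients that are not (or no longer) participants.
        if client in self.participants:
            self.results[client] = update

    def drop(self, client: str) -> None:
        # The client left the cluster; any result it already sent is kept.
        self.participants.discard(client)

    def status(self) -> str:
        if len(self.results) >= self.min_results:
            return "complete"
        pending = self.participants - set(self.results)
        if len(self.results) + len(pending) < self.min_results:
            return "failed"  # quorum can no longer be reached
        return "running"
```

In this sketch the coordinator would snapshot `available_clients(...)` at the start of each round and call `drop` whenever a node goes NotReady or disappears, so a round fails early once the quorum becomes unreachable instead of waiting for a timeout.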
Currently, I’m exploring a custom Kubernetes controller approach: watching Node labels/events to manage a dedicated Deployment and custom resource per client node. However, I’m seeking broader insights and potential alternatives.
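For reference, the core of such a controller can be sketched as a reconcile step that compares the nodes currently available with the nodes that already have a client workload. The Python below is a rough sketch under assumptions: the names `plan` and `client_deployment`, the image `example.com/sl-client:latest`, and the `sl-client` labels are all hypothetical, and it assumes each node's `kubernetes.io/hostname` label matches its node name and is short enough to be a label value. It only builds plain manifest dicts and does not call the Kubernetes API.

```python
def plan(available: set, existing: set) -> tuple:
    """Return (nodes needing a new Deployment, nodes whose Deployment should go)."""
    to_create = sorted(available - existing)
    to_delete = sorted(existing - available)
    return to_create, to_delete


def client_deployment(node_name: str, image: str = "example.com/sl-client:latest") -> dict:
    """Build an apps/v1 Deployment manifest pinned to a single node.

    The image and the label names are placeholders for illustration.
    """
    labels = {"app": "sl-client", "sl-node": node_name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"sl-client-{node_name}", "labels": labels},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Pin the Pod to one node via the well-known hostname label.
                    "nodeSelector": {"kubernetes.io/hostname": node_name},
                    "containers": [{"name": "client", "image": image}],
                },
            },
        },
    }


def reconcile(available: set, existing: set) -> dict:
    """One reconcile pass: manifests to apply and Deployment names to remove."""
    to_create, to_delete = plan(available, existing)
    return {
        "apply": [client_deployment(n) for n in to_create],
        "delete": [f"sl-client-{n}" for n in to_delete],
    }
```

A real controller would run this pass on every Node event and on a periodic resync, then apply the result through the API server; the sketch leaves that part out.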
Thanks for sharing your expertise!