Distributed System

Minh_Bui · March 27, 2025, 3:04am

Challenges & Kubernetes Solutions for Dynamic Node Participation in a Distributed System

Hi everyone,

I’m architecting a Split Learning system deployed on Kubernetes. A key characteristic is that the client-side training components are intended to run on nodes that join and leave the cluster dynamically and frequently (e.g., edge devices, temporary workers acting as clients).

This dynamic membership raises fundamental challenges for system reliability and coordination:

1. Discovery & Availability: How can the central server/coordinator reliably discover which client nodes are currently active and available to participate in a training round?
2. Workload Allocation: What are effective strategies for dynamically scheduling the client-side training workloads (Pods) onto these specific, ephemeral nodes, possibly taking their available resources into account?
3. State & Coordination: How should the system manage the overall training state (e.g., tracking participants per round, handling partial results) and coordinate actions when the set of available clients changes constantly between rounds, or even during a round?
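
To make the State & Coordination question more concrete, here is a minimal sketch of the per-round bookkeeping involved. Everything in it is an assumption for illustration: the `RoundState` class, its field names, and the quorum rule (`min_results`) are hypothetical and not taken from any existing framework.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoundState:
    """Bookkeeping for one training round (hypothetical structure)."""

    round_id: int
    expected: set[str]   # clients selected when the round started
    min_results: int     # assumed quorum needed for the round to count
    results: dict[str, bytes] = field(default_factory=dict)
    dropped: set[str] = field(default_factory=set)

    def submit(self, client: str, payload: bytes) -> None:
        # Ignore results from clients that were never selected or that
        # were already marked as having left.
        if client in self.expected and client not in self.dropped:
            self.results[client] = payload

    def mark_left(self, client: str) -> None:
        # A client that leaves before submitting is dropped; a result it
        # already submitted is kept as a partial contribution.
        if client in self.expected and client not in self.results:
            self.dropped.add(client)

    def status(self) -> str:
        pending = self.expected - self.dropped - set(self.results)
        if not pending and len(self.results) >= self.min_results:
            return "complete"
        if len(self.results) + len(pending) < self.min_results:
            return "failed"  # quorum can no longer be reached
        return "waiting"
```

With three selected clients and a quorum of two, the round survives one client leaving mid-round but fails if two leave before submitting. Whether a round should instead close as soon as the quorum is met is exactly the kind of design choice the question is about.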

Currently, I’m exploring a custom Kubernetes controller approach: watching Node labels and events, then managing a dedicated Deployment and a custom resource (defined via a CRD) for each client node. However, I’m seeking broader insights and potential alternatives.
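
For reference, the core of that controller idea is a reconcile step, which can be sketched in plain Python without a cluster. This is only a sketch under stated assumptions: the label key `example.com/split-learning-client` and the function names are hypothetical, and the nodes are plain dicts shaped like Kubernetes Node objects (`metadata.labels`, `spec.unschedulable`, `status.conditions`).

```python
from __future__ import annotations

# Hypothetical label that marks a node as a Split Learning client.
CLIENT_LABEL = "example.com/split-learning-client"


def is_ready(node: dict) -> bool:
    """True if the node reports a 'Ready' condition with status 'True'."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def eligible_clients(nodes: list[dict]) -> set[str]:
    """Names of labeled nodes that are Ready and not cordoned."""
    names = set()
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        cordoned = node.get("spec", {}).get("unschedulable", False)
        if labels.get(CLIENT_LABEL) == "true" and is_ready(node) and not cordoned:
            names.add(node["metadata"]["name"])
    return names


def reconcile(nodes: list[dict], existing: set[str]) -> tuple[list[str], list[str]]:
    """Compare desired per-node workloads with existing ones.

    `existing` holds the node names that already have a client workload.
    Returns (to_create, to_delete), each sorted for stable output.
    """
    desired = eligible_clients(nodes)
    return sorted(desired - existing), sorted(existing - desired)
```

In a real controller the node list would come from a watch on the Node API rather than a static list, and the two returned lists would drive creation and deletion of the per-node Deployments.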

Thanks for sharing your expertise!
