How to solve "Remote Memory Access" latency on 128-core EPYC nodes?

mechosilver · May 1, 2026, 3:55pm

Hi all! We are running a heavy HPC-style workload (CFD simulations) on a bare-metal K8s cluster with 128-core EPYC nodes. Even with the CPU Manager static policy enabled, we see a huge performance degradation due to inter-socket latency. It seems the standard scheduler doesn’t take L3 cache locality or NUMA distances into account.

Does anyone use a specific tool or a custom scheduler to “glue” pods to specific NUMA domains and minimize L3 cache misses? The native Topology Manager feels too limited for many-core systems.
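
For anyone who wants to check what the L3 and NUMA layout looks like on a node before picking a tool, here is a minimal sketch that reads it from Linux sysfs. It is only a diagnostic, not a scheduler or a fix. The function names (`parse_cpu_list`, `l3_domains`, `numa_distances`) are made up for illustration, and it assumes the usual sysfs layout under `/sys/devices/system/cpu` and `/sys/devices/system/node`.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a Linux cpulist string such as '0-3,64-67' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(cpu_root="/sys/devices/system/cpu"):
    """Return the distinct sets of CPUs that share one L3 cache.

    Hypothetical helper: walks cpuN/cache/index*/ and keeps the entries
    whose 'level' file reads 3, so it does not assume L3 is index3.
    """
    domains = set()
    for cpu_dir in Path(cpu_root).glob("cpu[0-9]*"):
        for index_dir in (cpu_dir / "cache").glob("index*"):
            level = index_dir / "level"
            shared = index_dir / "shared_cpu_list"
            if level.is_file() and shared.is_file():
                if level.read_text().strip() == "3":
                    domains.add(tuple(parse_cpu_list(shared.read_text())))
    return sorted(domains)


def numa_distances(node_root="/sys/devices/system/node"):
    """Return {node_id: [distance to node 0, node 1, ...]} as the kernel reports it."""
    result = {}
    for node_dir in Path(node_root).glob("node[0-9]*"):
        dist = node_dir / "distance"
        if dist.is_file():
            node_id = int(node_dir.name[len("node"):])
            result[node_id] = [int(x) for x in dist.read_text().split()]
    return result


if __name__ == "__main__":
    for cpus in l3_domains():
        print("L3 domain:", cpus)
    for node, row in sorted(numa_distances().items()):
        print(f"node{node} distances:", row)
```

Run on the host (or in a privileged pod with the host's `/sys` visible), this shows how many CPUs share each L3 and how far apart the NUMA nodes are, which helps when comparing against the CPU sets that containers actually end up with.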

0xnode · May 2, 2026, 4:23pm
