Kubeflow MPI communication issue

Issue: Silent Failure in Kubernetes MPIJob for Distributed Inference on 3-GPU Cluster

### **Cluster Setup**

I have successfully set up a Kubernetes cluster with the following configuration:

- **3 Nodes:** Dell Precision 3660 machines, each with an NVIDIA RTX A4000 GPU (16 GB) and an Intel Core i7-13700
- **Roles:** 1 master/launcher node, 2 worker nodes
- **Installed Components:** NVIDIA device plugin, MPI Operator

My goal is to run distributed inference on this cluster using MPI to leverage the combined power of the GPUs.
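
For context, the NVIDIA device plugin does expose one GPU per node to the scheduler; I checked that roughly like this (a sketch; the exact output layout depends on the kubectl version):

```bash
# Confirm every node advertises its GPU to the Kubernetes scheduler.
# "nvidia.com/gpu: 1" should appear under Capacity/Allocatable for all three nodes.
kubectl describe nodes | grep -E "Hostname:|nvidia.com/gpu"
```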

### **Inference Task**

I’m trying to run inference for Tailor3D, a generative model. The Dockerfile and MPIJob YAML I’m using are included below.

### **Issue Faced**

When I run the inference with two workers (`-np 2`), the following happens:

- The job starts, and all pods enter the Running state.
- No logs, no output, no error messages: just a silent failure.
- The job remains in this state for hours and eventually fails (I found this out after letting it run overnight).

However, when I change the MPI command to use only one worker (`-np 1`), the inference runs: logs appear, and it eventually fails with an out-of-memory error, which is expected in this case since it is running on a single 16 GB GPU.

This suggests that the issue is likely MPI-related, but I’m unsure how to debug it effectively.
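
To separate the MPI layer from the inference code, the first thing I can do while the launcher pod is still in the Running state is exec into it and run a bare MPI job that only prints hostnames (a sketch; `<launcher-pod>` is a placeholder for the actual launcher pod name, and `/etc/mpi/hostfile` is where the MPI Operator mounts the generated hostfile on my setup):

```bash
# If this bare run also hangs, the problem is MPI/SSH startup between pods,
# not the Tailor3D/OpenLRM code itself.
kubectl exec -it <launcher-pod> -- cat /etc/mpi/hostfile
kubectl exec -it <launcher-pod> -- \
  mpirun --allow-run-as-root -np 2 -hostfile /etc/mpi/hostfile hostname
```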


### **Dockerfile & MPIJob YAML**

**Dockerfile:**

```dockerfile
FROM horovod/horovod:latest

# Set non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Install Python 3.11
RUN apt-get update && apt-get install -y software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update && apt-get install -y python3.11 python3.11-distutils && \
    ln -sf /usr/bin/python3.11 /usr/bin/python && \
    ln -sf /usr/bin/python3.11 /usr/bin/python3 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

# Install pip manually
RUN curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
RUN pip install --upgrade pip

# Install PyTorch 2.2 with CUDA 11.8 support
RUN pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu118

# Install system dependencies
RUN apt-get update && apt-get install -y git && apt-get clean

# Copy the requirements.txt
COPY requirements.txt /app/requirements.txt

# Install other Python dependencies
RUN pip3 install -r /app/requirements.txt

# Configure SSH for container communication
RUN echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

# Set environment variables
ENV EXPORT_VIDEO=true
ENV EXPORT_MESH=true
ENV DOUBLE_SIDED=true
ENV HUGGING_FACE=true
ENV INFER_CONFIG="./configs/all-large-2sides.yaml"
ENV MODEL_NAME="alexzyqi/tailor3d-large-1.0"
ENV PRETRAIN_MODEL_HF="zxhezexin/openlrm-mix-large-1.1"
ENV IMAGE_INPUT="./assets/sample_input/demo"

# Set working directory and copy project files
WORKDIR /app
COPY . /app/

# Default command
CMD ["/bin/bash"]
```
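
Since Open MPI's launcher reaches the workers over SSH, I can also check that path directly from the running pods (a sketch; `<launcher-pod>` and `<worker-pod>` are placeholders, and the worker hostname is whatever appears in the generated hostfile):

```bash
# Is sshd actually running inside a worker pod?
kubectl exec <worker-pod> -- ps aux | grep -i sshd

# Can the launcher reach a worker over SSH without a password prompt?
kubectl exec -it <launcher-pod> -- \
  ssh -o StrictHostKeyChecking=no <worker-hostname-from-hostfile> hostname
```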

**MPIJob YAML:**

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tailor3d-mpi-job
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          nodeSelector:
            role: launcher
          restartPolicy: Never
          containers:
          - image: taha2509/tailor3d-official:latest
            imagePullPolicy: IfNotPresent
            name: tailor3d-launcher
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_DEBUG_SUBSYS=ALL
            - -x
            - PTP_VERBOSE=1
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -x
            - MASTER_ADDR
            - -x
            - MASTER_PORT
            - -x
            - OMP_NUM_THREADS=8
            - -x
            - MKL_NUM_THREADS=8
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python3
            - -m
            - openlrm.launch
            - infer.lrm
            - --infer
            - ./configs/all-large-2sides.yaml
            - model_name=alexzyqi/tailor3d-large-1.0
            - pretrain_model_hf=zxhezexin/openlrm-mix-large-1.1
            - image_input=./assets/sample_input/demo
            - export_video=true
            - export_mesh=true
            - double_sided=true
            - inferrer.hugging_face=true
            env:
            - name: MASTER_ADDR
              value: "tailor3d-mpi-job-launcher"
            - name: MASTER_PORT
              value: "12345"
    Worker:
      replicas: 2
      template:
        spec:
          nodeSelector:
            role: worker
          containers:
          - image: taha2509/tailor3d-official:latest
            imagePullPolicy: IfNotPresent
            name: tailor3d-worker
            resources:
              limits:
                nvidia.com/gpu: 1
            env:
            - name: MASTER_ADDR
              value: "tailor3d-mpi-job-launcher"
            - name: MASTER_PORT
              value: "12345"

### **Things I Have Tried So Far**

1. Checked logs with `kubectl logs -f <pod-name>` → **No logs appear for the failing job.**
2. Used `kubectl describe pod <pod-name>` → **No obvious errors.**
3. Ensured that the NVIDIA device plugin and drivers are correctly set up.
4. Ran a single-worker MPI job (`-np 1`) → **Inference runs, but hits OOM.**

### **Possible Causes (But Not Sure)**

- **MPI configuration issue?**
- **Inter-node communication problem?**
- **NCCL/SSH misconfiguration?**
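
To narrow these down, the next thing I plan to try is turning up Open MPI's own startup verbosity so mpirun reports where it stalls (a sketch; these are standard Open MPI debug switches spliced into the launcher command, shown here with `hostname` in place of the full inference command):

```bash
# Run from inside the launcher pod (or patch the same flags into the MPIJob YAML).
# --debug-daemons and the plm/btl verbosity should show whether the remote
# orted daemons ever start and which transport is being negotiated.
mpirun --allow-run-as-root -np 2 -hostfile /etc/mpi/hostfile \
       --debug-daemons \
       --mca plm_base_verbose 10 \
       --mca btl_base_verbose 30 \
       hostname
```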

### **Request for Help**

I’m not sure how to debug this silent failure. Any guidance from those experienced with **Kubernetes + MPI + GPUs** would be greatly appreciated!