After command "kubectl create -f train.yaml" no pods are created! How to troubleshoot it?

roberto_bruzzese · July 18, 2023, 1:54pm

After issuing the command "kubectl create -f train.yaml" I realized that no pods were created.
How can I troubleshoot the cause?
Thanks, bye

Cluster information:

Kubernetes version: Client Version: v1.26.2-eks-a59e1f0
Kustomize Version: v4.5.7
Server Version: v1.26.6-eks-a5565ad
Cloud being used: AWS

The YAML file is the following:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: cifar10-train
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    rdzvPort: 2379
    minReplicas: 1
    maxReplicas: 128
    maxRestarts: 100
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: 374402744818.dkr.ecr.us-west-2.amazonaws.com/pytorch-cpu:latest
              imagePullPolicy: IfNotPresent
              env:
              - name: PROCESSOR
                value: "cpu"
              command:
                - python3
                - -m
                - torch.distributed.run
                - /workspace/cifar10-model-train.py
                - "--epochs=10"
                - "--batch-size=128"
                - "--workers=15"
                - "--model-file=/efs-shared/cifar10-model.pth"
                - "/efs-shared/cifar-10-batches-py/"
              volumeMounts:
                - name: efs-pv
                  mountPath: /efs-shared
                # The following enables the worker pods to use increased shared memory 
                # which is required when specifying more than 0 data loader workers
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: efs-pv
              persistentVolumeClaim:
                claimName: efs-pvc
            - name: dshm
              emptyDir:     
                medium: Memory
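
A minimal sketch of first checks for a PyTorchJob like the one above that produces no pods. It only runs read-only kubectl commands. The names cifar10-train, efs-pvc and etcd-service come from the manifest. Everything else is an assumption and may differ on your cluster: that the job was created in the "default" namespace, and that the Kubeflow Training Operator runs in the "kubeflow" namespace as a Deployment named "training-operator". The function name troubleshoot_pytorchjob is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read-only checks for a PyTorchJob that created no pods.
# Assumed (not from the post): operator Deployment "training-operator"
# in namespace "kubeflow"; job created in namespace "default".

troubleshoot_pytorchjob() {
  local job="${1:-cifar10-train}"   # job name from train.yaml
  local ns="${2:-default}"          # namespace the job was created in (assumed)
  local op_ns="${3:-kubeflow}"      # operator namespace (assumed)

  # 1. Is the PyTorchJob CRD installed? Without it, nothing reconciles the job.
  kubectl get crd pytorchjobs.kubeflow.org

  # 2. Does the job object exist, and what do its status conditions say?
  kubectl get pytorchjob "$job" -n "$ns" -o yaml
  kubectl describe pytorchjob "$job" -n "$ns"

  # 3. Recent events in the job's namespace, oldest first.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp

  # 4. Is the operator running, and does it log errors for this job?
  kubectl get pods -n "$op_ns"
  kubectl logs -n "$op_ns" deployment/training-operator --tail=100

  # 5. Do the objects the manifest depends on exist?
  kubectl get pvc efs-pvc -n "$ns"
  kubectl get svc etcd-service -n "$ns"
}
```

Calling `troubleshoot_pytorchjob cifar10-train default kubeflow` prints the output of each check in order. A failing command does not stop the later ones, so a missing CRD or a missing operator shows up alongside the other results.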
