Error when pulling large image (>5 GB)

Cluster information:

Kubernetes version: 1.26.1
Cloud being used: bare-metal
Installation method: talos
Host OS: talos linux
CNI and version: 1.0.0
CRI and version: 2?

Hi, I’m trying to pull the image redislabs/redisai:edge-gpu-bionic via Kubernetes. In most cases it works, but every now and then (more often lately) I get the following error message:

Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed

Output of kubectl describe pod redis-ai-deployment-559898ffd-rn6tb:

Name:                redis-ai-deployment-559898ffd-rn6tb
Namespace:           default
Priority:            0
Runtime Class Name:  nvidia
Service Account:     default
Node:                ml/192.168.1.62
Start Time:          Thu, 12 Oct 2023 10:20:31 +0200
Labels:              app=redis-ai
                     pod-template-hash=559898ffd
Annotations:         <none>
Status:              Pending
IP:                  10.244.0.45
IPs:
  IP:           10.244.0.45
Controlled By:  ReplicaSet/redis-ai-deployment-559898ffd
Containers:
  redis-ai:
    Container ID:  
    Image:         redislabs/redisai:edge-gpu-bionic
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/redis-server
    Args:
      --loadmodule
      /usr/lib/redis/modules/redisai.so
      --save
      
      --appendonly
      no
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jcrn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-9jcrn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  38m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   Scheduled         36m                  default-scheduler  Successfully assigned default/redis-ai-deployment-559898ffd-rn6tb to ml
  Normal   Pulling           14m (x4 over 36m)    kubelet            Pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            8m19s (x4 over 29m)  kubelet            Error: ErrImagePull
  Normal   BackOff           7m23s (x9 over 29m)  kubelet            Back-off pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            7m23s (x9 over 29m)  kubelet            Error: ImagePullBackOff
  Warning  Failed            11s (x5 over 29m)    kubelet            Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed

The error message (authorization failed) does not make sense to me, because this is a public image and I can pull other images without any problems. The image name is also correct. I suspect that some kind of timeout is preventing the pull from completing. The image is quite big at about 5.72 GB, and most of the failures happen when I am on a slow internet connection. When I use the CPU tag of the image (redislabs/redisai:edge-cpu-bionic) I have no problems; the CPU image is only about 550.48 MB.
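The timeout hypothesis can be checked with some back-of-the-envelope arithmetic. The sketch below estimates how long each image takes to download and compares that with a time limit. The image sizes are the ones quoted above; the 20 Mbit/s link speed and the 300-second limit are assumptions chosen purely for illustration, not measured or documented values. The calculation also ignores parallel layer downloads and protocol overhead.

```python
# Rough estimate: does a pull outlast a hypothetical time limit?
GPU_IMAGE_MB = 5.72 * 1000   # redislabs/redisai:edge-gpu-bionic, about 5.72 GB
CPU_IMAGE_MB = 550.48        # redislabs/redisai:edge-cpu-bionic

ASSUMED_LINK_MBIT_S = 20.0   # assumption: a slow internet connection
ASSUMED_LIMIT_S = 300.0      # assumption: hypothetical time limit, not a documented value


def download_seconds(size_mb: float, link_mbit_s: float) -> float:
    """Seconds needed to transfer size_mb megabytes at link_mbit_s megabits per second."""
    return size_mb * 8 / link_mbit_s


def exceeds_limit(size_mb: float,
                  link_mbit_s: float = ASSUMED_LINK_MBIT_S,
                  limit_s: float = ASSUMED_LIMIT_S) -> bool:
    """True if the estimated download time is longer than the assumed limit."""
    return download_seconds(size_mb, link_mbit_s) > limit_s


for name, size_mb in [("gpu", GPU_IMAGE_MB), ("cpu", CPU_IMAGE_MB)]:
    secs = download_seconds(size_mb, ASSUMED_LINK_MBIT_S)
    print(f"{name}: ~{secs:.0f} s, exceeds assumed limit: {exceeds_limit(size_mb)}")
```

Under these assumed numbers the GPU image would need well over half an hour while the CPU image finishes in a few minutes, which would match the pattern of only the large image failing on slow connections.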

I have already adjusted the kubelet's runtimeRequestTimeout setting (--runtime-request-timeout), although image pulls should not be affected by it. I tested it anyway because of this 6-year-old comment. That didn’t fix the problem either. My questions: Is there a timeout in Kubernetes that interrupts the pull process if it takes too long? If so, where can I set its length? Am I looking in the wrong place? Maybe containerd has a timeout for pulling images? Or does my problem lie somewhere else? Does anyone have an idea?
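One candidate for such a timeout sits outside Kubernetes. Registries that use token authentication hand out short-lived bearer tokens, and invalid_token is the standard bearer-token error for a token that is expired or otherwise no longer accepted. Whether that is what happens here is an assumption and not confirmed. As a minimal sketch, if a registry token can be captured (for example from a manual token request), the helper below reads its lifetime from the JWT claims iat and exp. The function name token_lifetime_seconds is made up for this example, and the signature is deliberately not verified.

```python
import base64
import json


def token_lifetime_seconds(jwt_token: str) -> int:
    """Return exp - iat from a JWT payload (hypothetical helper, no signature check)."""
    parts = jwt_token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload_b64 = parts[1]
    # base64url encoding in JWTs drops the '=' padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"]) - int(claims["iat"])
```

If the reported lifetime is much shorter than the time a full pull of the large image takes on a slow link, that would be consistent with the pull failing partway through.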

Thanks

Hi skei0, I have the exact same issue. How did you manage to solve it?