Kubernetes version: 1.26.1
Cloud being used: bare-metal
Installation method: talos
Host OS: talos linux
CNI and version: 1.0.0
CRI and version: 2?
Hi, I’m trying to pull the image redislabs/redisai:edge-gpu-bionic via Kubernetes. In most cases it works, but every now and then (more often lately) I get the following error message:
```
Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
```
Output of `kubectl describe pod redis-ai-deployment-559898ffd-rn6tb`:
```
Name:                 redis-ai-deployment-559898ffd-rn6tb
Namespace:            default
Priority:             0
Runtime Class Name:   nvidia
Service Account:      default
Node:                 ml/192.168.1.62
Start Time:           Thu, 12 Oct 2023 10:20:31 +0200
Labels:               app=redis-ai
                      pod-template-hash=559898ffd
Annotations:          <none>
Status:               Pending
IP:                   10.244.0.45
IPs:
  IP:  10.244.0.45
Controlled By:  ReplicaSet/redis-ai-deployment-559898ffd
Containers:
  redis-ai:
    Container ID:
    Image:         redislabs/redisai:edge-gpu-bionic
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/redis-server
    Args:
      --loadmodule
      /usr/lib/redis/modules/redisai.so
      --save
      --appendonly
      no
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jcrn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-9jcrn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  38m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   Scheduled         36m                  default-scheduler  Successfully assigned default/redis-ai-deployment-559898ffd-rn6tb to ml
  Normal   Pulling           14m (x4 over 36m)    kubelet            Pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            8m19s (x4 over 29m)  kubelet            Error: ErrImagePull
  Normal   BackOff           7m23s (x9 over 29m)  kubelet            Back-off pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            7m23s (x9 over 29m)  kubelet            Error: ImagePullBackOff
  Warning  Failed            11s (x5 over 29m)    kubelet            Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
```
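For completeness, here is the relevant part of my Deployment, reconstructed from the pod spec above (field values as shown by `kubectl describe`; the surrounding metadata is abbreviated):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-ai-deployment
spec:
  selector:
    matchLabels:
      app: redis-ai
  template:
    metadata:
      labels:
        app: redis-ai
    spec:
      runtimeClassName: nvidia          # NVIDIA container runtime for GPU access
      containers:
        - name: redis-ai
          image: redislabs/redisai:edge-gpu-bionic
          command: ["/usr/local/bin/redis-server"]
          args:
            - --loadmodule
            - /usr/lib/redis/modules/redisai.so
            - --save
            - --appendonly
            - "no"
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
```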
The error message (authorization failed) does not make sense to me: it is a public image, and I can pull other images without any problems. The image name is also correct. My suspicion is that some kind of timeout is interrupting the pull. The image is quite big at about 5.72 GB, and most of the failures happen when I am connected to slow internet. When I use the CPU tag of the image (redislabs/redisai:edge-cpu-bionic), which is only about 550.48 MB, I have no problems at all.
I have already increased the kubelet's runtimeRequestTimeout, although image pulls should not be affected by it; the reason I tested it anyway was this 6 years old comment. That didn't fix the problem either. My questions: Is there a timeout in Kubernetes that interrupts the pull process if it takes too long? If so, where can I set its length? Am I looking in the wrong place? Maybe containerd has a timeout for pulling images? Or does my problem lie somewhere else? Does anyone have an idea?
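For context, this is roughly how I raised the kubelet timeout on Talos, as a sketch using Talos's `machine.kubelet.extraConfig` mechanism (runtimeRequestTimeout is the KubeletConfiguration field behind the `--runtime-request-timeout` flag; the 30m value here is just an arbitrary large value I tried, not a recommendation):

```yaml
# Talos machine config patch (sketch): raise the kubelet's runtime request timeout.
# Note: the kubelet treats image pulls as long-running requests, so this
# setting may not govern the pull at all -- hence my question above.
machine:
  kubelet:
    extraConfig:
      runtimeRequestTimeout: "30m"   # default is 2m
```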