Kubernetes version: 1.26.1
Cloud being used: bare-metal
Installation method: talos
Host OS: talos linux
CNI and version: 1.0.0
CRI and version: 2?
Hi, I’m trying to pull the image redislabs/redisai:edge-gpu-bionic via Kubernetes. In most cases it works, but every now and then (more often lately) I get the following error message:
```
Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
```
Output of `kubectl describe pod redis-ai-deployment-559898ffd-rn6tb`:
```
Name:                 redis-ai-deployment-559898ffd-rn6tb
Namespace:            default
Priority:             0
Runtime Class Name:   nvidia
Service Account:      default
Node:                 ml/192.168.1.62
Start Time:           Thu, 12 Oct 2023 10:20:31 +0200
Labels:               app=redis-ai
                      pod-template-hash=559898ffd
Annotations:          <none>
Status:               Pending
IP:                   10.244.0.45
IPs:
  IP:  10.244.0.45
Controlled By:  ReplicaSet/redis-ai-deployment-559898ffd
Containers:
  redis-ai:
    Container ID:
    Image:         redislabs/redisai:edge-gpu-bionic
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/redis-server
    Args:
      --loadmodule
      /usr/lib/redis/modules/redisai.so
      --save
      --appendonly
      no
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9jcrn (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-9jcrn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  38m                  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
  Normal   Scheduled         36m                  default-scheduler  Successfully assigned default/redis-ai-deployment-559898ffd-rn6tb to ml
  Normal   Pulling           14m (x4 over 36m)    kubelet            Pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            8m19s (x4 over 29m)  kubelet            Error: ErrImagePull
  Normal   BackOff           7m23s (x9 over 29m)  kubelet            Back-off pulling image "redislabs/redisai:edge-gpu-bionic"
  Warning  Failed            7m23s (x9 over 29m)  kubelet            Error: ImagePullBackOff
  Warning  Failed            11s (x5 over 29m)    kubelet            Failed to pull image "redislabs/redisai:edge-gpu-bionic": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/redislabs/redisai:edge-gpu-bionic": failed to copy: httpReadSeeker: failed open: server message: invalid_token: authorization failed
```
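For completeness, here is the relevant part of my Deployment, reconstructed from the pod spec above (field values as shown by `kubectl describe`; the surrounding metadata is abbreviated):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-ai-deployment
spec:
  selector:
    matchLabels:
      app: redis-ai
  template:
    metadata:
      labels:
        app: redis-ai
    spec:
      runtimeClassName: nvidia          # NVIDIA container runtime for GPU access
      containers:
        - name: redis-ai
          image: redislabs/redisai:edge-gpu-bionic
          command: ["/usr/local/bin/redis-server"]
          args:
            - --loadmodule
            - /usr/lib/redis/modules/redisai.so
            - --save
            - --appendonly
            - "no"
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              nvidia.com/gpu: 1
```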
The error message (authorization failed) does not make sense to me: it is a public image, and I can pull other images without any problems. The image name is also correct. My suspicion is that some kind of timeout is interrupting the pull. The image is quite big at about 5.72 GB, and most of the failures happen when I am connected to slow internet. When I use the CPU tag of the image (redislabs/redisai:edge-cpu-bionic), which is only about 550.48 MB, I have no problems at all.
I have already increased the kubelet's runtimeRequestTimeout, although image pulls should not be affected by it; the reason I tested it anyway was this 6 years old comment. That didn't fix the problem either. My questions: Is there a timeout in Kubernetes that interrupts the pull process if it takes too long? If so, where can I set its length? Am I looking in the wrong place? Maybe containerd has a timeout for pulling images? Or does my problem lie somewhere else? Does anyone have an idea?
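For context, this is roughly how I raised the kubelet timeout on Talos, as a sketch using Talos's `machine.kubelet.extraConfig` mechanism (runtimeRequestTimeout is the KubeletConfiguration field behind the `--runtime-request-timeout` flag; the 30m value here is just an arbitrary large value I tried, not a recommendation):

```yaml
# Talos machine config patch (sketch): raise the kubelet's runtime request timeout.
# Note: the kubelet treats image pulls as long-running requests, so this
# setting may not govern the pull at all -- hence my question above.
machine:
  kubelet:
    extraConfig:
      runtimeRequestTimeout: "30m"   # default is 2m
```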