Cluster information:
Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud being used: bare-metal
Installation method: K3S
Host OS:
CNI and version:
CRI and version:
Client: Docker Engine - Community
Version: 20.10.11
API version: 1.41
Go version: go1.16.9
Git commit: dea9396
Built: Thu Nov 18 00:37:08 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.11
API version: 1.41 (minimum version 1.12)
Go version: go1.16.9
Git commit: 847da18
Built: Thu Nov 18 00:35:16 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.12
GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
runc:
Version: 1.0.2
GitCommit: v1.0.2-0-g52b36a2
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Hi,
I built a poc python device plugin based on the template in this repo and I was able to deploy it as a daemonset on my k3s cluster.
example-device-plugin-ds-r85dg 1/1 Running 15 19h
The plugin registration is successful and ListAndWatch is called, shown by device plugin pod logs
kubectl logs example-device-plugin-ds-nffh6 -n kube-system
DEBUG:asyncio:Using selector: EpollSelector
INFO:example-device-plugin.server:DevicePluginServicer::__init__
INFO:example-device-plugin.server:Listening to gRPC request from /var/lib/kubelet/device-plugins/example-plugin.sock
INFO:example-device-plugin.server:Registering Device Plugin to Kubelet
DEBUG:example-device-plugin.server:Trying to connect to gRPC Channel /var/lib/kubelet/device-plugins/kubelet.sock
DEBUG:example-device-plugin.server:Created gRPC Client
DEBUG:example-device-plugin.server:Trying to send request version: "v1beta1"
endpoint: "example-plugin.sock"
resource_name: "example.ai/foo"
DEBUG:example-device-plugin.server:Device plugin registered. Closing channel
INFO:example-device-plugin.server:gRPC::ListAndWatch
INFO:example-device-plugin.server:foo: 8
DEBUG:example-device-plugin.server:gRPC::ListAndWatch device:
DEBUG:example-device-plugin.server:ID: "668f39ae-4ef9-4fc6-826a-2e4af4594685"
health: "Healthy"
and from the node description I can also see that the resource is allocatable
Capacity:
cpu: 128
ephemeral-storage: 921378636Ki
hugepages-1Gi: 264Gi
memory: 1055395216Ki
pods: 110
example.ai/foo: 8
Allocatable:
cpu: 128
ephemeral-storage: 896317136398
hugepages-1Gi: 264Gi
memory: 778571152Ki
pods: 110
example.ai/foo: 8
but when I submit a job that requires 1 foo device, the pod would be scheduled successfully but the container would not start and stuck at pending state
test.yaml:
apiVersion: batch/v1
kind: Job
metadata:
generateName: hello-node-
spec:
template:
spec:
containers:
- name: hello-node
image: k8s.gcr.io/echoserver:1.4
command: ["pwd"]
resources:
limits:
example.ai/foo: "1"
restartPolicy: Never
pod would be pending
hello-node-hfvrx-2snfn 0/1 Pending 0 49s
description:
Name: hello-node-hfvrx-2snfn
Namespace: default
Priority: 0
Node: host/
Labels: controller-uid=540b61e1-894e-438a-af55-700a79dd6bfa
job-name=hello-node-hfvrx
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: Job/hello-node-hfvrx
Containers:
hello-node:
Image: k8s.gcr.io/echoserver:1.4
Port: <none>
Host Port: <none>
Command:
pwd
Limits:
example.ai/foo: 1
Requests:
example.ai/foo: 1
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qszlx (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
kube-api-access-qszlx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 98s default-scheduler Successfully assigned default/hello-node-hfvrx-2snfn to host
The same pod would run to completion fine without the example.ai/foo: 1
resource requirement. so it is not the problem with cluster setup.
kubectl describe node shows:
Taints: <none>
Unschedulable: false
And another interesting thing is that, after one pod that requires the resource goes in the pending state. If I submit additional jobs that DO NOT require the resources, they would also be stuck at pending state. Which led me to believe that the scheduler is somehow hung by the plugin.
And once the job is scheduled, I was expecting additional logs to show up in the device plugin pod, especially calls to the Allocate
function. But the device plugin pod continues to be alive and running without showing any new logs.
Which leads me to suspect kubelet is somehow not calling the Allocate
function or ran into issues. However, there is no additional info from the kubectl get events or pod description that shows any error has occurred.
when the hello pod is pending, node description shows that 1 foo device is allocated, but total allocatable remains 8
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 100m (0%) 0 (0%)
memory 70Mi (0%) 170Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
example.ai/foo 1 1
then I noticed, that the most current api.proto is slightly different from what’s listed in the repo with the addition of GetPreferredAllocation
and the supporting request/response protobuf msg definitions.
So I tried to update the api.proto to match this and regenerated the grpc files. But then I still ran into the exact same situation.
I’m using k3s to manage my 1-node cluster and I can’t seem to find a way to enable debug kubelet logs to see where it got stuck.
The system is Ubuntu 18.04.3 LTS
and I’m using python3.7 base image in my dockerfile
for other dependencies I used pip install:
grpclib==0.4.2
grpcio==1.42.0
grpcio-tools==1.42.0
google-api-python-client==2.31.0
Any pointers or advice would be apprciated.
Thank you!