Static pod problem: mirror pod not created if imagePullSecrets set

Dear Colleagues,

I’ve created a static pod as described in Create static Pods | Kubernetes and I can see the mirror pod in kubectl get pods. However, if I add a list of imagePullSecrets to the static pod spec, the mirror pod is not created, though the pod itself seems to be running (I see it in docker container ls output on its node). As soon as I remove “imagePullSecrets” from the static pod manifest, I can see the mirror pod again.
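For reference, the manifest is essentially the following (the image and Secret names are placeholders for my real ones):

# placed in /etc/kubernetes/manifests/ on the node
apiVersion: v1
kind: Pod
metadata:
  name: my-static-pod
spec:
  imagePullSecrets:
  - name: my-registry-secret   # removing this list makes the mirror pod appear again
  containers:
  - name: app
    image: registry.example.com/app:latest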

What am I doing wrong and where do I look for debug info?

K8s v1.21.1 on Debian Buster amd64, non-cloud setup with kubeadm.

I could probably use a regular pod with a nodeSelector instead of a static pod. However, I cannot make a regular pod restart after its node has been down for a while. I think the control plane forgets about it (the pod gets stuck in the Terminating state) and it never comes back when its node is alive again. I have to reapply the pod definition to make it run again.

Static pods bypass admission control and don’t have access to Secrets; they are managed by the kubelet directly rather than by a higher-level controller. Their direct use is highly discouraged unless you have to control something completely out of band from Kubernetes.

Have you looked at DaemonSets? They are designed to run on a per-node basis and might fit your use case better.

Or a Deployment or StatefulSet – but don’t use a static Pod…

All right, I won’t use a static pod. But questions remain:

  1. Can a Deployment be bound to one particular node, like a pod with a nodeSelector?
  2. Is a regular pod not supposed to be restarted when its node is alive again after a downtime?

I thought DaemonSets were designed to run on every node of the cluster. My purpose is kind of the opposite: restrict a pod to a particular node and make it survive the node’s downtime.

  1. Yes, you can. However, that kind of defeats the purpose of Kubernetes (implicit container orchestration – a sort of built-in HA). If you have workers with different hardware and you want the Deployment to prefer a particular set of nodes, you can use nodeSelector or affinities:

Even though the example shows static pods, it’ll work with Deployments as well, and then you don’t have to manually do a bunch of things that Deployments handle for you automatically.

  2. If you specify it as a Deployment, think of it as a declaration to Kubernetes that you always want the pod(s) defined in the Deployment to be up. So let’s assume you have a Deployment specified for a single pod of some application. Once deployed successfully, Kubernetes will run that pod on one of your workers. If something happens to that worker, the Deployment will automatically spin up the pod on another available worker (constrained by your other question) – another available worker that meets the selector labels you apply. If there are none available, the pod will fail.

However, let’s assume you don’t apply labels and all of your workers are equal – if the node dies or the pod dies on a worker, the Deployment will automatically spin up the pod on the same worker (if it hasn’t died) or another worker (based on a bunch of different algorithms). In this way, your pod will only be offline for however long it takes your pod to spin up + a little time for Kubernetes to figure out it’s dead.

If the pod itself dies, the spin-up will be pretty much immediate. If the worker dies that the pod is running on, I’ve run into situations where it takes a few minutes for K8s to figure out it has died to spin it back up on another worker.


At what level do I place a nodeSelector in a Deployment declaration? In the Deployment spec or in the template spec?

If the worker that the pod is running on dies, I’ve run into situations where it takes a few minutes for K8s to figure out it has died

More than 5 minutes in my experience.

Here’s an example with affinities using Deployments: Implement Node and Pod Affinity/Anti-Affinity in Kubernetes: A Practical Example – The New Stack
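For reference, the node affinity stanza from that kind of example goes into the pod template spec of the Deployment and looks roughly like this (excerpt only; the hostname value is a placeholder):

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname   # built-in node label
                operator: In
                values:
                - my-node-name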

Yes, ours was about 5 minutes also. However, there are add-ons to address this unique situation (i.e. your workers shouldn’t be dying often). In the more common case where the pod dies, this is not a problem and the container spins back up immediately.

5 min is the default. You can tune the settings for how quickly a node is reported unhealthy and for pod eviction.
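On a kubeadm cluster those settings would be kube-controller-manager flags in /etc/kubernetes/manifests/kube-controller-manager.yaml, roughly along these lines (defaults shown, excerpt of the command only):

    - --node-monitor-grace-period=40s   # how long a node may be unresponsive before it is marked NotReady
    - --pod-eviction-timeout=5m0s       # how long to wait before evicting pods from a NotReady node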

Looks like it goes in the template spec.
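I.e. something like this (all names are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      nodeSelector:
        kubernetes.io/hostname: my-node-name   # pin the pod to one particular node
      containers:
      - name: app
        image: registry.example.com/app:latest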

If you know a way to change the pod-eviction-timeout (are you talking about this parameter?), can you please share how to do it in a working cluster?

I’ve tried adding --pod-eviction-timeout to kube-controller-manager, but it did not affect anything: there’s the same 5-minute timeout after a node poweroff.

Okay, I had to do a little digging and the k8s docs should be updated. >_>

pod-eviction-timeout works IF the TaintBasedEvictions feature gate is set to false.

This is because the newer, preferred way to control these settings is to use taints and tolerations. These let you control it on a per-pod level, e.g.:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-eviction
spec:
  replicas: 2
  selector:
    matchLabels:
      eviction: "true"
  template:
    metadata:
      labels:
        eviction: "true"
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10   # evict this pod 10s after the node taint is applied (default is 300s)
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      containers:
      - image: busybox
        command:
        - sleep
        - "3600"
        name: busybox

Alternatively, you CAN set defaults at the cluster level, but this is controlled by the kube-apiserver since it’s a setting of the DefaultTolerationSeconds admission controller.

The two settings are:

--default-not-ready-toleration-seconds int     Default: 300
Indicates the tolerationSeconds of the toleration for notReady:NoExecute
that is added by default to every pod that does not already have such a
toleration.

--default-unreachable-toleration-seconds int     Default: 300
Indicates the tolerationSeconds of the toleration for unreachable:NoExecute
that is added by default to every pod that does not already have such a
toleration.
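On a kubeadm cluster these would typically be added to the kube-apiserver command in its static pod manifest, /etc/kubernetes/manifests/kube-apiserver.yaml, roughly like this (abridged):

spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=60
    - --default-unreachable-toleration-seconds=60
    # ...keep the rest of the existing flags as they are...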

I have set kube-apiserver --default-not-ready-toleration-seconds=60 --default-unreachable-toleration-seconds=60 ... but the time before K8s begins to recreate pods is still over 5 minutes.

I had to delete all deployments and create them anew for this setting to work. Still, with --default-not-ready-toleration-seconds=60 --default-unreachable-toleration-seconds=60 it takes almost 2 minutes (instead of the expected 1 minute) for the services to come online again.