In a previous post, you learned about DaemonSets in Kubernetes. In this article, you will learn how the Kubernetes scheduler works. The kube-scheduler is a core control plane component that helps schedule Pods across various worker nodes of a Kubernetes cluster. Scheduling in Kubernetes is the process of assigning Pods to nodes so that the kubelet service can run them.

Understanding the kube-scheduler

The kube-scheduler is the default scheduler that watches for newly created Pods that have no node assigned. It inserts them into a queue and finds feasible nodes that meet the Pod's requirements by taking into account various factors, such as resource requirements, node availability, node affinity, taints and tolerations, and other constraints. Node selection happens in two steps: node filtering and node scoring.

Node filtering

In this step, feasible nodes are identified, and other nodes that are not suitable for the Pod are filtered out. A few default filters that help in this phase are as follows:

  • PodFitsResources: This filter checks whether a node has enough resources to meet the Pod's requirements.
  • PodFitsHost: This filter checks whether a Pod specifies a particular node by its hostname.
  • PodFitsHostPorts: This filter checks whether a node has free ports for the Pod to avoid conflict.

At the end of this step, you get a list of feasible nodes. If no node passes the filters, the Pod remains unscheduled.
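For example, the values these filters examine come straight from the Pod spec. The following is a minimal sketch (the Pod name, resource figures, and host port are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 500m        # PodFitsResources: nodes without 500m CPU free are filtered out
          memory: 256Mi
      ports:
        - containerPort: 80
          hostPort: 8080   # PodFitsHostPorts: nodes with port 8080 already in use are filtered out
```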

Node scoring

In this step, each feasible node is assigned a score from 0 to 100, where a higher score means the node is more suitable for running the Pod. The scoring rules are implemented as functions (also known as scoring plugins) that are plugged into the Kubernetes scheduler framework. Some built-in scoring plugins that help in this phase are as follows:

  • NodeResourcesLeastAllocated: This plugin favors nodes that have low resources allocated.
  • NodeAffinity: This plugin implements node selector and node affinity preferences for the Pod.
  • TaintToleration: This plugin prepares the priority list for each node based on the taints and tolerations.
  • ImageLocality: This plugin favors the nodes that have the container image already cached locally.

You can even add your own custom filters and scoring functions using the scheduler framework. At the end of the scoring phase, the node with the highest score is selected; if multiple nodes share the top score, one of them is picked at random. The scheduler then notifies the API server by setting the nodeName property in the Pod configuration (a step called binding), and the kubelet on the corresponding node takes over and runs the Pod.
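After binding, you can read the chosen node back from the API server. This is just a quick check; the Pod name is a placeholder:

```shell
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'
```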

Manual scheduling

But what will happen if there is no kube-scheduler in the cluster? In this case, the Pod remains in the Pending state unless you manually schedule it.

Stopping the kube-scheduler

To demonstrate how manual scheduling works, we will stop the kube-scheduler. Please note that stopping the default kube-scheduler is a sensitive operation that could cause unexpected downtime in a production cluster. Therefore, I strongly recommend performing this task only in a test lab environment.

If you have read my earlier post on static Pods, you know that kubeadm runs all control plane components as static Pods, and the configuration files are located in the /etc/kubernetes/manifests directory on the control plane node.

View the static Pods and their configuration files on the control plane node


To view the TCP port used by the default kube-scheduler, run the netstat command as shown below:

sudo netstat -tlnp
View the TCP port used by the default kube scheduler


The screenshot shows that the default kube-scheduler listens on TCP port 10259, which is defined in the static Pod config file (/etc/kubernetes/manifests/kube-scheduler.yaml).

To stop the default kube-scheduler, I will temporarily move the kube-scheduler.yaml file to another location.

sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml ~
kubectl get pods -n kube-system
Stopping the default kube scheduler static Pod


As soon as the file is moved, the kube-scheduler static Pod is automatically removed by the kubelet service. Since there is no default scheduler, you need to manually schedule your Pods.

Let's create a new Pod configuration file named webapp-pod.yaml.

apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp-container
      image: nginx
Create a simple Pod configuration file


The screenshot shows a simple Pod configuration. Let's apply this file to create a Pod and see what happens.

Create and view the Pod status in Kubernetes


You can see that the Pod is stuck in the Pending state. Why is that? Because there is no scheduler in our cluster. Let's discuss some ways to schedule Pods manually.
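You can confirm that the Pod is Pending for lack of a scheduler, rather than because of an image or resource problem, by inspecting its events. With no scheduler running, not even a FailedScheduling event is recorded:

```shell
kubectl describe pod webapp-pod
```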

Node name field

First, use the kubectl get nodes command to see how many nodes exist in a Kubernetes cluster.

View the Kubernetes cluster nodes


You can see there are two worker nodes in my cluster: kube-srv2.testlab.local and kube-srv3.testlab.local. Let's see how to use the nodeName field to constrain a Pod to run on a particular node. Open the Pod configuration file we created earlier and add a nodeName field, as shown below:

apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp-container
      image: nginx
  nodeName: kube-srv3.testlab.local
Defining the nodeName field in the Pod configuration to constrain it to a particular node


Usually, the default kube-scheduler sets this field for your Pods, but since we don't have a default scheduler, we defined this field manually to control where this Pod will be placed. I specified the kube-srv3.testlab.local worker node to run this Pod. Now, delete the pending Pod, and apply the updated configuration. You will notice that the Pod ends up running on the desired node.

kubectl delete pods webapp-pod
kubectl apply -f webapp-pod.yaml
kubectl get pods -o wide
Verify that the Pod is placed on the desired node set through the nodeName field


The nodeName field works for certain use cases, but remember that a Pod with nodeName set bypasses the scheduler entirely. If the specified node ceases to exist in the cluster, the Pod will not run, even if other suitable worker nodes are available. Therefore, it is important to use the nodeName field carefully.

Node selector field

Another simple way of constraining a Pod to a node is the nodeSelector field. Let's use it to manually schedule a Pod on a labeled node (kube-srv2.testlab.local, in this case). To view the node labels, run the following command:

kubectl get nodes --show-labels
View node labels in Kubernetes

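If a node does not carry the desired label yet, you can add it yourself (and remove it again later). The node name and label below come from this lab setup:

```shell
# add the label to the node
kubectl label nodes kube-srv2.testlab.local disk=ssd
# remove the label again by appending a minus sign to the key
kubectl label nodes kube-srv2.testlab.local disk-
```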

To use the nodeSelector field, open the Pod configuration file we created earlier and replace the nodeName field with a nodeSelector, as shown below:

apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp-container
      image: nginx
  nodeSelector:
    disk: ssd
Defining the nodeSelector field in the Pod configuration to constrain it to a particular node


Here, I used the nodeSelector field with disk: ssd, which essentially says that the Pod should be scheduled on a node carrying the disk=ssd label. Read my previous post to learn about node selectors in more detail. Since multiple nodes can carry the same label, nodeSelector lets you schedule your Pods on nodes with a specific type of hardware (such as a GPU or an SSD). Let's delete the previously created Pod and create a new one with the updated configuration.

kubectl delete pods webapp-pod
kubectl apply -f webapp-pod.yaml
kubectl get pods -o wide
Verify that the Pod is placed on the desired node set through the nodeSelector field


The Pod is now placed on the kube-srv2.testlab.local worker node, which has the disk=ssd label. Remember, the nodeSelector field is more resilient than the nodeName field because it depends on node labels rather than node names. Labels are easy to set and update, whereas node names rarely change once set.

In the same manner, you can use node affinity and taints and tolerations to schedule your Pods manually instead of just relying on the default kube-scheduler.
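For example, the nodeSelector used above can be rewritten as a required node affinity rule, which additionally supports operators such as In and NotIn. This sketch is based on the same disk=ssd label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
    - name: webapp-container
      image: nginx
  affinity:
    nodeAffinity:
      # hard requirement: only nodes matching the expression are considered
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disk
                operator: In
                values:
                  - ssd
```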

When you are done, don't forget to move the default scheduler config file back to the original location because you don't always want to schedule all the Pods manually.

sudo mv ~/kube-scheduler.yaml /etc/kubernetes/manifests/
kubectl get pods -n kube-system
Ensuring that the default kube scheduler is available in the Kubernetes cluster


The default kube-scheduler in a cluster ensures that Pods are automatically assigned to a suitable node. You may choose to manually schedule certain Pods, though.

Multiple schedulers in Kubernetes

Kubernetes supports multiple schedulers, which means you can define a custom scheduler in a cluster in addition to the default kube-scheduler. Now, you might ask why you would need a custom scheduler when a default scheduler is already there. There could be situations in which the default scheduler doesn't meet your needs. With a custom scheduler, you can implement your policies and scheduling algorithms to schedule your Pods the way you like.

Create a custom scheduler

First, let's create a kube-scheduler configuration file at /etc/kubernetes/my-scheduler-config.yaml.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
Create a kube scheduler configuration for the custom scheduler


A brief explanation of the fields:

  • kind: KubeSchedulerConfiguration is the type of Kubernetes object for a custom scheduler.
  • profiles: Specifies the custom scheduler name (my-scheduler, in this case).
  • leaderElection: In a Kubernetes cluster, if there are multiple copies of the same scheduler, such as in the case of highly available control plane nodes, the leaderElect property is set to true by default. This means that only one scheduler can be active at a time in performing all the scheduling tasks in the cluster. We set this property to false since there is only one control plane node in our cluster.
  • clientConnection: Specifies the configuration file that connects to the Kubernetes API server. Here, we have specified the config file that is used by the default kube-scheduler.

Since my Kubernetes cluster is set up using the kubeadm tool, all control plane components are running as static Pods. So, we can simply duplicate the static Pod configuration file of the default kube-scheduler (i.e., /etc/kubernetes/manifests/kube-scheduler.yaml), make the necessary changes, and run a custom scheduler as a static Pod. To do so, run the following commands:

sudo cp /etc/kubernetes/manifests/kube-scheduler.yaml ~/my-scheduler.yaml
sudo nano ~/my-scheduler.yaml

Remember, the kubelet is constantly monitoring the designated directory (/etc/kubernetes/manifests/), so I copied the file to my home directory instead with a new name, my-scheduler.yaml. Now, open this file in a text editor and make the necessary changes as shown below.

apiVersion: v1
kind: Pod
metadata:
  labels:
    component: kube-scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --config=/etc/kubernetes/my-scheduler-config.yaml
    - --secure-port=10282
    image: registry.k8s.io/kube-scheduler:v1.27.3
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10282
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: my-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10282
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/my-scheduler-config.yaml
      name: config
      readOnly: true
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/my-scheduler-config.yaml
      type: FileOrCreate
    name: config
Change custom scheduler settings in the static Pod manifest


There are many settings in the configuration file, but you only need to change a few of them; you can keep the remaining settings the same as they are in the default kube-scheduler manifest. Let's briefly discuss the fields that we change.

  • name: Defines the name of the static Pod that runs our custom scheduler.
  • namespace: Defines the namespace where the static Pod will run.
  • secure-port: Defines the secure TCP port that our custom scheduler will listen on. We have already seen that the default scheduler listens on TCP port 10259, so make sure you specify an unused port here. You can use the netstat command to find out which ports are in use on the control plane node.
  • Liveness and startup probe ports: Make sure you specify the same TCP port that you defined with the --secure-port option.
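Before settling on a port such as 10282, you can verify that nothing on the control plane node is listening on it; an empty result means the port is free:

```shell
sudo netstat -tlnp | grep 10282
```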

The screenshot below shows how to mount the kube-scheduler configuration file (/etc/kubernetes/my-scheduler-config.yaml) into the static Pod:

Mount the kube scheduler configuration file as a volume in the Pod


You can see that there is already a hostPath volume defined for the kubeconfig. We used the same approach to mount the kube-scheduler configuration file located at /etc/kubernetes/my-scheduler-config.yaml on the control plane node and made this file available to the static Pod.

Once you have properly updated the my-scheduler.yaml file, move it to the /etc/kubernetes/manifests/ directory so that the kubelet can create a static Pod for our custom scheduler.

sudo mv ~/my-scheduler.yaml /etc/kubernetes/manifests/
kubectl get pods -n kube-system
Create and view the custom scheduler running as a static Pod


You can see that my-scheduler is running as a static Pod. The READY column should show 1/1, which means that one out of one containers in the Pod is up and running. Now we have two schedulers running in the cluster.

Schedule a Pod with a custom scheduler

To schedule a Pod using the custom scheduler, you need to specify the scheduler name under the spec.schedulerName field in the Pod or Deployment configuration file.

apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  schedulerName: my-scheduler
  containers:
    - name: webapp-container
      image: nginx
Pod configuration file for using a custom scheduler


Now, create the Pod with this configuration, and view its status.

kubectl apply -f webapp-pod.yaml
kubectl get pods
Create a Pod with a custom scheduler


You can see that the Pod is running. Now, how would you verify whether this Pod was scheduled by the custom scheduler? Run the kubectl describe pods command and take a look at the events section at the end.

View the name of the scheduler that scheduled a Pod


The first event shows the scheduler name, and the message proves that the Pod was assigned to the kube-srv3 node by my-scheduler, which is our custom scheduler.

If the status of your Pod shows Pending, there is a problem with your custom scheduler. In this case, you need to double-check the scheduler configuration and make sure you followed all the steps correctly. You can also use the kubectl events and kubectl logs <pod-name> commands to see what's going on.
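For example, the following commands help narrow the problem down; the static Pod name follows the pattern my-scheduler-&lt;node-name&gt; and will differ in your cluster:

```shell
# recent cluster events, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp
# logs of the custom scheduler static Pod in the kube-system namespace
kubectl logs my-scheduler-<node-name> -n kube-system
```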

Conclusion

You just learned how Pod scheduling works in a Kubernetes cluster. You also saw that Kubernetes is highly extensible. If you don't want to use the default scheduler, you can create and use a custom scheduler. Moreover, Kubernetes lets you control which Pods are scheduled by a custom scheduler, whereas the other Pods are scheduled by the default kube-scheduler.
