Understanding the kube-scheduler
The kube-scheduler is the default scheduler that watches for newly created Pods that have no node assigned. It inserts them into a queue and finds feasible nodes that meet the Pod's requirements by taking into account various factors, such as resource requirements, node availability, node affinity, taints and tolerations, and other constraints. Node selection happens in two steps: node filtering and node scoring.
In the first step, node filtering, feasible nodes are identified, and nodes that are not suitable for the Pod are filtered out. A few default filters that help in this phase are as follows:
- PodFitsResources: This filter checks whether a node has enough resources to meet the Pod's requirements.
- PodFitsHost: This filter checks whether a Pod specifies a particular node by its hostname.
- PodFitsHostPorts: This filter checks whether a node has free ports for the Pod to avoid conflict.
At the end of this step, you get a list of one or more feasible nodes. If the list is empty, the Pod remains unscheduled.
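As a sketch, the filters listed above evaluate Pod spec fields like the following (the Pod name and values here are purely illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: filter-demo            # hypothetical Pod name for illustration
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:                # PodFitsResources: the node must have this much unallocated
        cpu: 250m
        memory: 128Mi
    ports:
    - containerPort: 80
      hostPort: 8080           # PodFitsHostPorts: this host port must be free on the node
```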
In the second step, node scoring, each feasible node is assigned a score from 0 to 100, where the highest score marks the node most suitable for running the Pod. The scoring rules are implemented as functions (also known as scoring plugins) that plug into the Kubernetes scheduler framework. Some built-in scoring plugins that help in this phase are as follows:
- NodeResourcesLeastAllocated: This plugin favors nodes that have low resources allocated.
- NodeAffinity: This plugin implements node selector and node affinity preferences for the Pod.
- TaintToleration: This plugin prepares the priority list for each node based on the taints and tolerations.
- ImageLocality: This plugin favors the nodes that have the container image already cached locally.
You can even add your own custom filters and scoring functions using the scheduler framework. At the end of the scoring phase, if multiple nodes share the highest score, one of them is picked at random to schedule the Pod. After both steps, the best node is determined, and the scheduler sets the nodeName property in the Pod configuration via the API server, which directs the corresponding node's kubelet to run the Pod.
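As an example of this extensibility, the scheduler framework lets you re-weight the built-in scoring plugins through a KubeSchedulerConfiguration file. A minimal sketch (the weight value is illustrative, not a recommendation):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: ImageLocality    # favor nodes that already have the image cached
        weight: 5              # illustrative weight; plugins default to weight 1
```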
But what will happen if there is no kube-scheduler in the cluster? In this case, the Pod remains in the Pending state unless you manually schedule it.
Stopping the kube-scheduler
To demonstrate how manual scheduling works, we will stop the kube-scheduler. Please note that stopping the default kube-scheduler is a sensitive operation that could cause unexpected downtime in a production cluster. Therefore, I strongly recommend performing this task only in a test lab environment.
If you have read my earlier post on static Pods, you know that kubeadm runs all control plane components as static Pods, and the configuration files are located in the /etc/kubernetes/manifests directory on the control plane node.
View the static Pods and their configuration files on the control plane node
To view the TCP port used by the default kube-scheduler, run the netstat command as shown below:
sudo netstat -tlnp
The output shows that the default kube-scheduler listens on TCP port 10259, which is defined in the static Pod config file (/etc/kubernetes/manifests/kube-scheduler.yaml).
To stop the default kube-scheduler, I will temporarily move the kube-scheduler.yaml file to another location.
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml ~
kubectl get pods -n kube-system
As soon as the file is moved, the kube-scheduler static Pod is automatically removed by the kubelet service. Since there is no default scheduler, you need to manually schedule your Pods.
Let's create a new Pod configuration file named webapp-pod.yaml.
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
  - name: webapp-container
    image: nginx
This is a simple Pod configuration. Let's apply the file to create a Pod and see what happens.
You can see that the Pod is stuck in the Pending state. Why is that? Because there is no scheduler in our cluster. Let's discuss some ways to schedule Pods manually.
Node name field
First, use the kubectl get nodes command to see how many nodes exist in a Kubernetes cluster.
You can see there are two worker nodes in my cluster: kube-srv2.testlab.local and kube-srv3.testlab.local. Let's see how to use the nodeName field to constrain a Pod to run on a particular node. Open the Pod configuration file we created earlier and add a nodeName field, as shown below:
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
  - name: webapp-container
    image: nginx
  nodeName: kube-srv3.testlab.local
Usually, the default kube-scheduler sets this field for your Pods, but since we don't have a default scheduler, we defined this field manually to control where this Pod will be placed. I specified the kube-srv3.testlab.local worker node to run this Pod. Now, delete the pending Pod, and apply the updated configuration. You will notice that the Pod ends up running on the desired node.
kubectl delete pods webapp-pod
kubectl apply -f webapp-pod.yaml
kubectl get pods -o wide
The nodeName field works for certain use cases, but remember that if the specified node ceases to exist in the cluster, the Pods will remain in the Pending state, even if there are other suitable worker nodes. Therefore, it is important to use the nodeName field carefully.
Node selector field
Another super-simple way of constraining a Pod to a node is the nodeSelector field. Let's use it to manually schedule a Pod on a node labeled disk=ssd (i.e., kube-srv2.testlab.local). To view the node labels, run the following command:
kubectl get nodes --show-labels
To use the nodeSelector field, open the Pod configuration file we created earlier and replace the nodeName field with a nodeSelector, as shown below:
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  containers:
  - name: webapp-container
    image: nginx
  nodeSelector:
    disk: ssd
Here, I used the nodeSelector field with disk: ssd, which essentially says that the Pod should be scheduled on a node having a disk=ssd label. Read my previous post to learn about node selectors in more detail. Since there can be multiple nodes with the same label, it allows you to schedule your Pods on the nodes with a specific type of hardware (such as GPU or SSD). Let's delete the previously created Pod and create a new one with the updated configuration.
kubectl delete pods webapp-pod
kubectl apply -f webapp-pod.yaml
kubectl get pods -o wide
The Pod is now placed on the kube-srv2.testlab.local worker node, which has the disk=ssd label. Note that, unlike nodeName, the nodeSelector field is evaluated by a scheduler, so a scheduler must be running in the cluster for such a Pod to be placed. The nodeSelector field is also more resilient than the nodeName field because it depends on node labels rather than node names: labels are easy to set and update, and the same label can be applied to several nodes, whereas a node name ties the Pod to exactly one node.
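If you need more expressive placement rules than a single label match, the same constraint can also be written with node affinity, which the NodeAffinity plugin mentioned earlier implements. A sketch, assuming the same disk=ssd label:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard requirement, like nodeSelector
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk
            operator: In
            values:
            - ssd
  containers:
  - name: webapp-container
    image: nginx
```

Node affinity additionally supports soft (preferred) rules and operators such as NotIn, which a plain nodeSelector cannot express.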
When you are done, don't forget to move the default scheduler config file back to the original location because you don't always want to schedule all the Pods manually.
sudo mv ~/kube-scheduler.yaml /etc/kubernetes/manifests/
kubectl get pods -n kube-system
The default kube-scheduler in a cluster ensures that Pods are automatically assigned to a suitable node. You may choose to manually schedule certain Pods, though.
Multiple schedulers in Kubernetes
Kubernetes supports multiple schedulers, which means you can define a custom scheduler in a cluster in addition to the default kube-scheduler. Now, you might ask why you would need a custom scheduler when a default scheduler is already there. There could be situations in which the default scheduler doesn't meet your needs. With a custom scheduler, you can implement your policies and scheduling algorithms to schedule your Pods the way you like.
Create a custom scheduler
First, let's create a kube-scheduler configuration file at /etc/kubernetes/my-scheduler-config.yaml.
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: my-scheduler
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
A brief explanation of the fields:
- kind: KubeSchedulerConfiguration is the type of Kubernetes object for a custom scheduler.
- profiles: Specifies the custom scheduler name (my-scheduler, in this case).
- leaderElection: In a Kubernetes cluster with multiple copies of the same scheduler, such as highly available control plane nodes, the leaderElect property is set to true by default, which means only one scheduler instance is active at a time and performs all the scheduling tasks in the cluster. We set this property to false because there is only one control plane node in our cluster.
- clientConnection: Specifies the configuration file that connects to the Kubernetes API server. Here, we have specified the config file that is used by the default kube-scheduler.
Since my Kubernetes cluster is set up using the kubeadm tool, all control plane components are running as static Pods. So, we can simply duplicate the static Pod configuration file of the default kube-scheduler (i.e., /etc/kubernetes/manifests/kube-scheduler.yaml), make the necessary changes, and run a custom scheduler as a static Pod. To do so, run the following commands:
sudo cp /etc/kubernetes/manifests/kube-scheduler.yaml ~/my-scheduler.yaml
sudo nano ~/my-scheduler.yaml
Remember, the kubelet is constantly monitoring the designated directory (/etc/kubernetes/manifests/), so I copied the file to my home directory instead with a new name, my-scheduler.yaml. Now, open this file in a text editor and make the necessary changes as shown below.
apiVersion: v1
kind: Pod
metadata:
  labels:
    component: kube-scheduler
    tier: control-plane
  name: my-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --config=/etc/kubernetes/my-scheduler-config.yaml
    - --secure-port=10282
    image: registry.k8s.io/kube-scheduler:v1.27.3
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10282
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: my-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10282
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/my-scheduler-config.yaml
      name: config
      readOnly: true
  hostNetwork: true
  priority: 2000001000
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/my-scheduler-config.yaml
      type: FileOrCreate
    name: config
There are many settings in the configuration file, but you only need to change a handful of lines; the remaining settings can stay the same as in the default kube-scheduler manifest. Let's briefly discuss the fields that we change.
- name: Defines the name of the static Pod that runs our custom scheduler.
- namespace: Defines the namespace where the static Pod will run.
- secure-port: Defines the secure TCP port that our custom scheduler will listen on. We have already seen that the default scheduler listens on TCP port 10259. Make sure you specify an unused port here. You can use the netstat command to find out which ports are in use on the control plane node.
- Liveness and startup probe ports: Make sure you specify the same TCP port that you defined with the --secure-port option.
Note how the kube-scheduler configuration file (/etc/kubernetes/my-scheduler-config.yaml) is mounted into the static Pod: there is already a hostPath volume defined for the kubeconfig, and we used the same approach to make the configuration file on the control plane node available to the static Pod.
Once you have properly updated the my-scheduler.yaml file, move it to the /etc/kubernetes/manifests/ directory so that the kubelet can create a static Pod for our custom scheduler.
sudo mv ~/my-scheduler.yaml /etc/kubernetes/manifests/
kubectl get pods -n kube-system
You can see that my-scheduler is running as a static Pod. The READY column should show 1/1, which means that one out of one containers in the Pod is up and running. We now have two schedulers running in the cluster.
Schedule a Pod with a custom scheduler
To schedule a Pod using the custom scheduler, you need to specify the scheduler name under the spec.schedulerName field in the Pod or Deployment configuration file.
apiVersion: v1
kind: Pod
metadata:
  name: webapp-pod
spec:
  schedulerName: my-scheduler
  containers:
  - name: webapp-container
    image: nginx
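As an aside, for a Deployment, the same field goes under the Pod template spec, so every replica is handled by the custom scheduler. A minimal sketch (the Deployment name and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deploy          # hypothetical Deployment name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      schedulerName: my-scheduler   # each replica is scheduled by the custom scheduler
      containers:
      - name: webapp-container
        image: nginx
```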
Now, create the Pod with this configuration, and view its status.
kubectl apply -f webapp-pod.yaml
kubectl get pods
You can see that the Pod is running. Now, how would you verify whether this Pod was scheduled by the custom scheduler? Run the kubectl describe pods command and take a look at the events section at the end.
The first event shows the scheduler name, and the message proves that the Pod was assigned to the kube-srv3 node by my-scheduler, which is our custom scheduler.
If the status of your Pod shows Pending, there is a problem with your custom scheduler. In this case, double-check the scheduler configuration and make sure you followed all the steps correctly. You can also use the kubectl get events and kubectl logs <pod-name> commands to see what's going on.
You just learned how Pod scheduling works in a Kubernetes cluster. You also saw that Kubernetes is highly extensible. If you don't want to use the default scheduler, you can create and use a custom scheduler. Moreover, Kubernetes lets you control which Pods are scheduled by a custom scheduler, whereas the other Pods are scheduled by the default kube-scheduler.