Dancing with Kubernetes part I
Introduction
In this series of Kubernetes posts we will share the issues we faced with our clusters, our analysis, and the solutions.
For the last couple of weeks we had been facing unexplained problems with one of our Kubernetes clusters. The problem got so strange that even Azure support decided to put it in their X-Files.
Once upon a time we noticed moments of glitches in our applications; after a while everything was running fine again. When we checked the status of the K8s cluster, we saw that one node was marked as NotReady:
kubectl get node -o wide
NAME                              STATUS     ROLES   AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-default-14442969-vmss000018   NotReady   agent   29h   v1.14.8   172.20.8.206   <none>        Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://Unknown
aks-default-14442969-vmss000019   Ready      agent   28h   v1.14.8   172.20.8.4     <none>        Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://3.0.7
aks-default-14442969-vmss00001a   Ready      agent   53m   v1.14.8   172.20.8.105   <none>        Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://3.0.7
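A natural first check for a NotReady node is to describe it (node name taken from the listing above) and look at its conditions and recent events:

kubectl describe node aks-default-14442969-vmss000018
# check the Conditions section and the Events at the bottom of the output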
When a node goes into the NotReady state, the Kubernetes Controller Manager will monitor it for 5 minutes (the default value of the pod-eviction-timeout parameter of kube-controller-manager) before taking any action. If the node does not transition back to Ready during this time, the pods that were running on it are evicted and kube-scheduler tries to reschedule them on other nodes.
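For reference, this timeout is a flag on kube-controller-manager. On AKS the control plane is managed, so it cannot be tuned there; on a self-managed control plane (assuming a kubeadm-style static pod manifest) you could check it roughly like this:

grep pod-eviction-timeout /etc/kubernetes/manifests/kube-controller-manager.yaml
# e.g. --pod-eviction-timeout=5m0s (the default)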
Our cluster was composed of 2 nodes with the Cluster Autoscaler enabled, so that in case of higher demand for resources additional nodes would be provisioned automatically. What we didn't realize is that the autoscaler also added a self-healing mechanism to our cluster. Node vmss000018 went offline, so the available resources in the cluster decreased by 50%. The pods that had previously run on vmss000018 had to be rescheduled somewhere, so vmss00001a was instantiated by the Cluster Autoscaler.
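For completeness, enabling the Cluster Autoscaler on an AKS cluster looks roughly like this (the resource group, cluster name and node counts below are placeholders):

az aks update --resource-group <resource-group> --name <cluster-name> \
  --enable-cluster-autoscaler --min-count 2 --max-count 5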
Pods that need to be scheduled while there are not enough resources in the cluster go into the Pending state.
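A quick way to list such pods and see why they are stuck (the pod name below is a placeholder):

kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl describe pod <pod-name>
# the Events section shows something like: 0/2 nodes are available: 2 Insufficient memory.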
So we thought to ourselves: OK, this is the cloud… things can break and fix themselves at any time… let's not get crazy about it. We removed the faulty node from the VMSS pool and went back to normal life. However, the scenario repeated itself the next day and kept coming back. Moreover, there were no kube events which could indicate the root cause. This is how our cluster was behaving:

When we started to look closer at the impacted clusters, we noticed that the workload on individual nodes was not balanced.
Kube Scheduler
Our AKS cluster consisted of 2 nodes. If the pods were distributed evenly, the load would be split 50/50 across both nodes. The kube-scheduler algorithm tries to place a pod in two steps:
- Filtering – nodes which are unable to run the pod are filtered out. A node can end up on this list for multiple reasons: lack of resources, unavailability, taints, missing required labels, etc.
- Ranking – the remaining nodes are scored with multiple priority functions and ordered so that the "best" node is on top. The scoring takes into account available resources, membership of pods in the same deployment (spreading them by minimizing the number of pods belonging to the same service on the same node), etc.
Node Score = (weight1 * priorityFunc1) + (weight2 * priorityFunc2) + …
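As an illustration (the numbers are made up, not taken from our cluster): with both weights equal to 1, a node that scores 8 on the least-requested-resources priority and 5 on the spreading priority ends up with Node Score = (1 * 8) + (1 * 5) = 13, and the pod lands on the node with the highest total.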
More insight into the prioritization can be found in the kube-scheduler code (https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/registry.go#L116).
The algorithm is able to spread the load across the nodes relatively well, but there is one caveat – all nodes should be available at the time of scheduling.
Visualization
For simplicity, let's assume all pods have the same memory request. The animations below show how each node's memory utilization evolved in two different scenarios:
1. The scheduler spreads pods across two available nodes.
2. One node fails, so the existing pods have to be rescheduled. When the remaining node runs out of available memory, kube-scheduler leaves the pods in the Pending state, the Cluster Autoscaler creates an additional node, and kube-scheduler resumes scheduling the Pending pods.
As a result we are left with two nodes, one with close to its maximum memory allocated and the other one underutilized:
kubectl top nodes
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-default-14442969-vmss00001a   3426m        43%    21795Mi         90%
aks-default-14442969-vmss00001b   618m         7%     5233Mi          21%
Of course our pods were working very hard, and as a result their memory consumption grew over time. After some time, memory utilization on one node reached 100% and the story repeated itself:
kubectl top nodes
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-default-14442969-vmss00001a   5093m        65%    24036Mi         100%
aks-default-14442969-vmss00001b   661m         8%     5860Mi          24%
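To check how pods and their memory requests are spread between the nodes, something along these lines helps (the node name is a placeholder):

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
kubectl describe node <node-name>
# the "Allocated resources" section sums the requests of the pods scheduled on the node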
We were able to connect the node outages with memory consumption thanks to Azure Container Insights charts. Whenever a node crossed 25 GB of memory usage, it became unresponsive and went into the NotReady state. We suspect that under the hood our nodes were facing heavy swapping.
The final evidence was gathered when a failing node managed to send its last status to the Kube API. It was a scream for help, informing us about memory pressure on the node. This is the output of kubectl describe node for a failing node:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 31 Mar 2020 07:58:59 +0200 Tue, 31 Mar 2020 03:13:53 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 31 Mar 2020 07:58:59 +0200 Fri, 27 Mar 2020 11:53:22 +0100 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 31 Mar 2020 07:58:59 +0200 Fri, 27 Mar 2020 11:53:22 +0100 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 31 Mar 2020 07:58:59 +0200 Fri, 27 Mar 2020 11:53:22 +0100 KubeletReady kubelet is posting ready status. AppArmor enabled
Final dance
Looking at all the evidence, it became clear to us that our two-node cluster simply could not handle the failure of one node. Permanently adding a third machine to the cluster gives kube-scheduler a chance to distribute pods across two healthy nodes when one fails, preventing the cluster from becoming unbalanced; the simulation below shows the failure of one node in a three-node cluster. We also plan to review our pods' requests and limits so that we don't overcommit memory so heavily.
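A rough sketch of both follow-up actions (resource group, cluster and deployment names, as well as the node counts and memory values, are placeholders):

az aks update --resource-group <resource-group> --name <cluster-name> \
  --update-cluster-autoscaler --min-count 3 --max-count 5
kubectl set resources deployment <deployment-name> \
  --requests=memory=512Mi --limits=memory=1Gi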