Dancing with Kubernetes part I

Introduction

In this series of Kubernetes posts we will share the issues we ran into with our clusters, along with our analysis and the solutions we applied.

For the last couple of weeks we had been facing unexplained problems with one of our Kubernetes clusters. The problem got so strange that even Azure support decided to put it in their X-Files.

Every once in a while we noticed brief glitches in our applications; after a moment everything was running fine again. When we checked the status of the Kubernetes cluster, we saw that one node was in the NotReady state:

kubectl get node -o wide                           
NAME                              STATUS     ROLES   AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-default-14442969-vmss000018   NotReady   agent   29h   v1.14.8   172.20.8.206           Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://Unknown
aks-default-14442969-vmss000019   Ready      agent   28h   v1.14.8   172.20.8.4             Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://3.0.7
aks-default-14442969-vmss00001a   Ready      agent   53m   v1.14.8   172.20.8.105           Ubuntu 16.04.6 LTS   4.15.0-1061-azure   docker://3.0.7

When a node goes into the NotReady state, the Kubernetes Controller Manager monitors it for 5 minutes (the default value of the pod-eviction-timeout parameter of kube-controller-manager) before taking any action. If the node does not transition back to Ready within that time, the pods that were running on it are evicted and kube-scheduler tries to place their replacements on other nodes.
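On AKS the control plane is managed by Azure, so this timeout is not something we can tune ourselves; for reference, on a self-managed control plane it is the following kube-controller-manager flag (5m0s being the default):

kube-controller-manager --pod-eviction-timeout=5m0s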

Our cluster was composed of 2 nodes with the Cluster Autoscaler enabled, so that in case of higher demand for resources additional nodes would be provisioned automatically. What we didn't realize is that the autoscaler also added a self-healing mechanism to our cluster. Node vmss000018 went offline, so the available resources in the cluster dropped by 50%. The pods that previously ran on vmss000018 had to be rescheduled somewhere, so vmss00001a was provisioned by the Cluster Autoscaler.
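For reference, this is roughly how the Cluster Autoscaler is enabled on an AKS cluster – the resource group, cluster name and node counts below are placeholders, not our actual configuration:

az aks update \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5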

Pods that need to be scheduled while there are not enough resources in the cluster go into the Pending state.
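A quick way to list pods stuck in that state:

kubectl get pods --all-namespaces --field-selector=status.phase=Pending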

So we thought to ourselves: OK, this is the cloud… things can break and fix themselves at any time… let's not go crazy about it. We removed the faulty node from the VMSS pool and went back to normal life. However, this scenario repeated itself the next day and kept coming back. Moreover, there were no kube events that could indicate the root cause. This is how our cluster was behaving:

When we started looking more closely into the impacted clusters, we noticed that the workload on individual nodes was not balanced.

Kube Scheduler

Our AKS cluster consisted of 2 nodes. If the pods were distributed evenly across the nodes, the load would be split 50/50 between them. The kube-scheduler algorithm allocates a pod in two steps:

  • Filtering out nodes which are unable to run the pod. A node can be filtered out for multiple reasons – lack of resources, unavailability, taints, missing required labels, etc.
  • Ranking the remaining nodes – using multiple scoring functions, kube-scheduler builds a list ordered so that the “best” node is on top. It takes into account available resources, membership of pods in the same deployment (spreading them out by minimizing the number of pods belonging to the same service on one node), etc.

Node Score = (weight1 * priorityFunc1) + (weight2 * priorityFunc2) + …

More insight into the prioritization can be found in the kube-scheduler code (https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/algorithmprovider/registry.go#L116).
The algorithm is able to spread the load across the nodes relatively well, but there is one caveat – all nodes should be available at the time of scheduling.

Visualization

For simplicity let's assume all pods have the same memory request. The animations below show how memory utilization on each node looked in two different scenarios.

1. The scheduler spreads pods across the two available nodes.

2. One node fails, so the existing pods have to be rescheduled. When the remaining node runs out of available memory, kube-scheduler leaves the pods in the Pending state, the Cluster Autoscaler creates an additional node, and kube-scheduler resumes scheduling the Pending pods (this chain can also be spotted in the event log, as shown below).
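If the events are still retained, one way to spot the sequence is to filter the event log for the reasons we would expect here (NodeNotReady from the node controller, FailedScheduling from kube-scheduler, TriggeredScaleUp from the Cluster Autoscaler):

kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | grep -E 'NodeNotReady|FailedScheduling|TriggeredScaleUp'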

As a result we are left with two nodes, one of them with close to its maximum memory allocated and the other underutilized.

kubectl top nodes  
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%      
aks-default-14442969-vmss00001a   3426m        43%    21795Mi         90%          
aks-default-14442969-vmss00001b   618m         7%     5233Mi          21%        

Of course our pods were working very hard, and as a result their memory consumption grew over time. After some time memory utilization on one node reaches 100% and the story repeats itself:

kubectl top nodes
NAME                              CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
aks-default-14442969-vmss00001a   5093m        65%       24036Mi         100%      
aks-default-14442969-vmss00001b   661m         8%        5860Mi          24%  
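With a reasonably recent kubectl it is also possible to sort pods by memory usage, to see which workloads contribute most to a node filling up:

kubectl top pods --all-namespaces --sort-by=memory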

We were able to connect the node outages with memory consumption thanks to Azure Container Insights charts. Whenever a node crossed 25 GB of memory usage it became unresponsive and went into the NotReady state. We suspect that under the hood our nodes were struggling with heavy swapping.
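Normally the kubelet's eviction thresholds should start evicting pods before the node itself becomes unresponsive. For illustration, this is what such thresholds look like as kubelet flags – the values below are examples, not necessarily what AKS configures:

kubelet --eviction-hard="memory.available<750Mi,nodefs.available<10%" \
        --eviction-soft="memory.available<1Gi" \
        --eviction-soft-grace-period="memory.available=1m30s"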

The final piece of evidence was gathered when a failing node managed to send its last status to the Kube API. It was a scream for help, reporting memory pressure on the node. This is the output of kubectl describe node for a failing node:

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Tue, 31 Mar 2020 07:58:59 +0200   Tue, 31 Mar 2020 03:13:53 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Tue, 31 Mar 2020 07:58:59 +0200   Fri, 27 Mar 2020 11:53:22 +0100   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Tue, 31 Mar 2020 07:58:59 +0200   Fri, 27 Mar 2020 11:53:22 +0100   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Tue, 31 Mar 2020 07:58:59 +0200   Fri, 27 Mar 2020 11:53:22 +0100   KubeletReady                 kubelet is posting ready status. AppArmor enabled

Final dance

Looking at all the evidence, it became clear to us that our two-node cluster simply could not handle the failure of a single node. Permanently adding a third machine to the cluster gives kube-scheduler a chance to distribute pods across two healthy nodes, preventing the cluster from becoming unbalanced. We also plan to review our pods' requests and limits so that we no longer overcommit memory so heavily – a minimal example of explicit values is sketched after the simulation. Below is a simulation of one node failure in a 3-node cluster.
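As for the requests and limits mentioned above, this is a minimal sketch of what explicit values look like in a pod spec – the names and numbers are illustrative, not taken from our actual workload:

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # illustrative name
spec:
  containers:
  - name: app
    image: example/app:latest  # placeholder image
    resources:
      requests:
        memory: "512Mi"        # what the scheduler reserves on the node
        cpu: "250m"
      limits:
        memory: "1Gi"          # hard cap – exceeding it gets the container OOM-killed
        cpu: "500m"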
