How we hacked a densely packed GKE cluster

Tarasov Aleksandr
5 min read · Dec 12, 2019

Disclaimer: This story is based on work I did at ANNA Money together with Denis Arslanbekov, who helped me write it.

Photo by Tim Mossholder on Unsplash

Hey folks! Let us share our experience with GKE. We have been using it for a while, and we are happy not to have to worry about managing Kubernetes clusters ourselves.

At the moment, all our test environments and a separate infrastructure cluster run under GKE control. We are in the middle of migrating the production environment, but that is another story. Today we want to tell you how we ran into an issue on the test cluster; we will be happy if this article saves you some time.

To understand the root cause of the issue, we should first provide some background on our test infrastructure.

Let’s look at the numbers first. We have more than fifteen test environments. The number of pods is around 2,000 and tends to grow. We pack pods very tightly to save cost, because the load is volatile and overselling resources is the best strategy for us. As a result, we have only 20 nodes to serve them all.

This configuration worked fine until the day we received an alert and could not delete a namespace.

The error about namespace deletion was:

> kubectl delete namespace tarasoff

Error from server (Conflict): Operation cannot be fulfilled on namespaces "tarasoff": The system is ensuring all content is removed from this namespace.  Upon completion, this namespace will automatically be purged by the system.

and even force deletion did not help:

> kubectl get namespace tarasoff -o yaml

apiVersion: v1
kind: Namespace
metadata:
...
spec:
  finalizers:
  - kubernetes
status:
  phase: Terminating

We fixed the stuck namespace using this guide, but doing this by hand was not sustainable, because our developers create their environments on demand (each as a namespace) and delete them just as often.
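For reference, the manual workaround for a stuck namespace boils down to clearing its finalizers through the finalize subresource. A minimal sketch (the namespace name is our example; `jq` is required):

```shell
NS=tarasoff

# Dump the namespace and drop its finalizers list.
kubectl get namespace "$NS" -o json | jq '.spec.finalizers = []' > ns.json

# Submit the edited object to the namespace's finalize subresource,
# which is the only endpoint allowed to change finalizers.
kubectl replace --raw "/api/v1/namespaces/$NS/finalize" -f ns.json
```

Note that this only hides the symptom: the namespace disappears, but whatever blocked its deletion is still broken.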

We decided to dig a little deeper. The alert contained a hint that something was wrong with metrics, and we confirmed it with the following command:

> kubectl api-resources --verbs=list --namespaced -o name

error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

We saw that the metrics-server pod was running out of memory, and a panic in its logs:

apiserver panic'd on GET /apis/metrics.k8s.io/v1beta1/nodes: killing connection/stream because serving request timed out and response had been started
goroutine 1430 [running]:
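A few diagnostic commands help locate the failure at both levels, the aggregated API and the pod itself (the label and namespace below are the GKE defaults, so verify them in your cluster):

```shell
# Is the aggregated metrics API available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Is the pod restarting, and why did the last container die (OOMKilled)?
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl -n kube-system describe pod -l k8s-app=metrics-server

# Logs of the previous (crashed) container instance.
kubectl -n kube-system logs -l k8s-app=metrics-server -c metrics-server --previous
```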

The reason was the pod's resource limits. The container simply hit them, because it has the following definition:

resources:
  limits:
    cpu: 51m
    memory: 123Mi
  requests:
    cpu: 51m
    memory: 123Mi

51m is roughly 0.05 of one core, and that is not enough to process metrics from such a large number of pods, especially since CPU limits are enforced by CFS throttling.

Typically we fix such issues in a clear and straightforward way: just add more resources to the pod and be happy. But this option is not available in the GKE UI or via the gcloud CLI (at least for version 1.14.8-gke.17). It is reasonable to protect system resources from modification, since they are managed entirely on the Google Cloud side. If we were Google, we would do the same.

We discovered that we were not alone and found a similar (but not identical) problem where the author tried to change the pod definition manually. He was lucky; we were not. We changed the resource limits in the YAML, and GKE rolled them back very quickly.

We should have found another way.

The first thing we wanted to understand was why the limits have these values. The pod consists of two containers: metrics-server and addon-resizer. The latter is responsible for adjusting resources as the cluster adds or removes nodes, acting like a nanny (vertical autoscaling). It has the following command-line definition:

command:
- /pod_nanny
- --config-dir=/etc/config
- --cpu=40m
- --extra-cpu=0.5m
- --memory=35Mi
- --extra-memory=4Mi

...

Here cpu and memory are the baseline, and extra-cpu and extra-memory are the additional resources per node. The calculation for 21 nodes:

40m + 0.5m * 21 ≈ 51m

It works the same way for memory.

We were unhappy to find that increasing the number of nodes was the only supported way to increase the resources. We didn't want to do that, so we tried something else.

We could not fix the root problem right away, so we decided to first stabilize the deployment as quickly as possible. We read that some properties in the YAML definition can be changed without GKE rolling them back. We increased the number of replicas from 1 to 5, added a healthcheck, and fixed the rollout strategy according to this article.

These actions helped us decrease the load on each metrics-server instance and guaranteed that, at any given time, at least one working pod could serve metrics.
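The stabilization step can be sketched as a single patch. The deployment name below is an assumption (GKE versions the metrics-server deployment, so check `kubectl -n kube-system get deploy` for the exact name); the strategy values are one zero-downtime choice, not necessarily the ones from the article:

```shell
# Raise replicas and make rollouts zero-downtime:
# never take a pod away before its replacement is ready.
kubectl -n kube-system patch deployment metrics-server-v0.3.1 --type=merge -p '
spec:
  replicas: 5
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
'
```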

This bought us time to rethink the problem and clear our heads.

The solution was simple, stupid, but unobvious.

We dived deeper into the addon-resizer internals and found out that it can be configured through a config file as well as the command line. At first sight, command-line params should override config values… but they do not!

We checked that the config file is wired into the addon-resizer container via its command-line params:

--config-dir=/etc/config

The config file was mounted from a ConfigMap named metrics-server-config in the system namespace. And GKE does not roll back this configuration!
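You can inspect the default nanny configuration directly (namespace and ConfigMap name as described above):

```shell
# Look at the nanny configuration GKE ships by default.
kubectl -n kube-system get configmap metrics-server-config -o yaml
```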

We added resources via this config as follows:

apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 100m
    cpuPerNode: 5m
    baseMemory: 100Mi
    memoryPerNode: 5Mi
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
And it worked! That was a victory!
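Applying the change is straightforward; a sketch, assuming the ConfigMap above is saved to a local file (deleting the pod forces the nanny to re-read the config and resize immediately):

```shell
# Apply the edited nanny configuration.
kubectl -n kube-system apply -f metrics-server-config.yaml

# Restart metrics-server so addon-resizer picks up the new baseline.
kubectl -n kube-system delete pod -l k8s-app=metrics-server
```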

In the end, we left two pods with healthchecks and a zero-downtime rollout strategy to cover cluster resizes, and we have not received any alerts since these changes.

Outcomes:

  1. If your GKE cluster is densely packed, you can run into problems with the metrics-server pod: its default resources are not enough when the number of pods per node approaches the limit (110 per node).
  2. GKE protects its system resources, including system pods. We cannot control them directly, but sometimes we get lucky and find a way to hack them.
  3. There is no guarantee that the workaround will survive the next update. Still, we have these problems only in test environments that use a resource-overselling strategy, so it is painful, but we can live with it.


Tarasov Aleksandr

Principal Platform Engineer @ Cenomi. All thoughts are my own.