Interested in Kubernetes scaling?

A holistic perspective

Dejanu Alex
5 min read · Apr 15, 2024

Scaling 101: Stating the obvious

Generally speaking, vertical scaling (increasing the size of individual instances) is rather limited, because you can quickly hit an upper bound on resources. Besides having a hard limit, vertical scaling also offers no failover or redundancy, which is why horizontal scaling is more desirable for large-scale applications.

scale-up vs. scale-out

Organizations typically find themselves with infrastructure that is either under-provisioned or over-provisioned. The tradeoff is between cost and performance: if the infrastructure is under-provisioned, your solution will experience instability; on the other hand, if it is over-provisioned, you will most probably run into cost overruns.

The answer lies in traffic: as a rule of thumb, when traffic is low, vertical scaling is a good option, but as the number of users increases, horizontal scaling becomes a better fit, providing fault tolerance as well.

Scaling in Kubernetes

This problem of automatically scaling in Kubernetes is addressed by:

Cluster Autoscaler, which adjusts the cluster’s nodepool based on the following conditions:

  • increase the number of nodes: when pods fail to be scheduled (due to insufficient resources).
  • decrease the number of nodes: when nodes are underutilized for an extended period and their pods can be placed on other existing nodes.

Cluster Autoscaler 0.5.X is the official version shipped with Kubernetes 1.6, and of course, there are flavors for the various cloud providers.

One of the main drawbacks of Cluster Autoscaler is that you don't really have control over the size of the nodes that are created. Moreover, Cluster Autoscaler doesn't make scaling decisions based on actual resource usage; it only checks the pods' requests and limits for CPU and memory.

A project that addresses the aforementioned drawbacks is Karpenter, which aims to improve the efficiency and cost of running workloads in Kubernetes. Karpenter is an open-source node-provisioning project built for Kubernetes that automatically provisions new nodes in response to unschedulable pods.

Horizontal Pod Autoscaler (HPA), which is a Kubernetes API resource and a controller that automatically updates a workload (e.g. a Deployment or StatefulSet) by deploying more Pods.

HPA scales the workload's ReplicaSet based on the observed average CPU or memory utilization (or other custom metrics) to match the target specified by the user.
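The target-matching behaviour follows the scaling rule from the Kubernetes HPA documentation:

```
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
```

For example, with 3 replicas at 80% average CPU utilization against a 50% target, the HPA would scale out to ceil(3 × 1.6) = 5 replicas.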

HPA

An HPA can be created imperatively. The command below sets the target average CPU utilization across all Pods of the <deployment> Deployment to 50%; if the average CPU utilization exceeds this target, the HPA will create new Pods (up to the configured maximum).

kubectl autoscale deployment <deployment> --cpu-percent=50 --min=1 --max=10
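The declarative equivalent is an HorizontalPodAutoscaler manifest using the autoscaling/v2 API (the names below are placeholders for your own workload):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-hpa            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment   # the Deployment to scale
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # same 50% target as the imperative command
```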

Vertical Pod Autoscaler (VPA) is an autoscaler that enables automatic CPU and Memory requests and limits adjustments based on historical resource usage measurements.

The VPA only manages pods that run under a controller, and it does not interfere with spec updates: it does not modify the workload template (i.e. the Deployment); instead, the actual resource requests of the running Pods are updated.

VPA has three components: the Recommender (monitors resource utilization and computes target values) → the Updater (if updateMode: Auto is defined, it will apply the Recommender's recommendations by evicting pods so they can be recreated with the new values) → the Admission Controller (an admission webhook that applies the actual changes to the CPU and memory settings).
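A minimal VerticalPodAutoscaler manifest with updateMode: Auto might look like this (the names are placeholders):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-vpa            # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment   # the workload whose pods get resized
  updatePolicy:
    updateMode: "Auto"    # Updater applies the Recommender's values
```

With updateMode set to "Off", the VPA only produces recommendations without evicting anything, which is a safe way to evaluate it first.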

VPA flow

Important mentions

  • As a prerequisite, both HPA and VPA need the Metrics Server deployed within the cluster.
  • Do not use the VPA together with the HPA on the same resource metric (CPU or memory): when that metric reaches its defined threshold, a scaling event will be triggered by both the VPA and the HPA at the same time, which may have unknown side effects and may lead to issues.
HPA vs. VPA

As a rule of thumb: horizontal scaling is best suited for stateless workloads and vertical scaling for stateful ones.

Other interesting projects

Goldilocks: A utility that helps you identify a starting point for resource requests and limits. It has two components:

  • a Controller, which creates VPA objects for your workloads
  • a Dashboard, which summarizes the data and provides visualization.

As prerequisites, Goldilocks needs the Metrics Server deployed in your cluster and the VPA installed (side note: Goldilocks only needs the Recommender component of the VPA).

Kubecost: A solution that provides real-time cost visibility and insights for your Kubernetes cluster. Kubecost's strongest points are that it is fully deployed in your infrastructure (no need to egress any data to a remote service) and that it is built on the CNCF OpenCost OSS project.

KEDA: An event-driven autoscaler that scales workloads based on the number of events needing to be processed.
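As a sketch of how KEDA is configured, a ScaledObject ties a workload to an event-source trigger; the resource names, queue name, and environment variable below are illustrative placeholders, not values from this article:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-consumer-scaler   # placeholder name
spec:
  scaleTargetRef:
    name: queue-consumer        # the Deployment to scale
  minReplicaCount: 0            # KEDA can scale down to zero
  maxReplicaCount: 10
  triggers:
  - type: rabbitmq              # one of many supported event sources
    metadata:
      queueName: tasks
      hostFromEnv: RABBITMQ_URL # connection string read from the workload's env
```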

KubeEye: An open-source diagnostic tool built on top of Polaris and Node Problem Detector. KubeEye detects misconfigurations, unhealthy components, and node failures in your Kubernetes cluster. Why? Because Kubernetes misconfigurations are costly.

Closing notes

When autoscaling, be mindful of the minimum number of Pods your application needs to remain functional during disruptions; consider specifying a PodDisruptionBudget for your application.
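A minimal PodDisruptionBudget might look like this (the name and label selector are placeholders for your own workload):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb          # placeholder name
spec:
  minAvailable: 2       # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: my-app       # must match your workload's pod labels
```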

If you encounter Pod evictions due to a node running out of resources, remember that node overcommitment is not a bad thing in itself, but it relies on the assumption that not all pods will claim all of their usable resources at the same time.

Last but not least, it might be handy to keep individual namespaces from monopolizing cluster resources, so don't hesitate to use resource quotas to track usage and limit both the number of objects and the total amount of resources that may be consumed in a particular namespace.
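A ResourceQuota sketch for a hypothetical team namespace (names and values are placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota      # placeholder name
  namespace: team-a     # the namespace being capped
spec:
  hard:
    pods: "20"              # object-count limit
    requests.cpu: "4"       # total CPU requests across the namespace
    requests.memory: 8Gi
    limits.cpu: "8"         # total CPU limits across the namespace
    limits.memory: 16Gi
```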


Written by Dejanu Alex

Seasoned DevOps engineer — Jack of all trades master of None
