February 29, 2024
Decision Nodes

The Secret Weapon of Modern Apps: Unveiling Kubernetes Autoscaling

Brad Morgan
Senior DevOps Engineer

Brad is an Engineer who specializes in designing and architecting Kubernetes clusters for high availability, scalability, and frictionless delivery. He also shares insights on Software Development and DevOps through his YouTube channel.


Imagine being the administrator of a Kubernetes cluster hosting a new online game that's taken the world by storm. Servers are buckling under the weight of millions of eager players, and you need to scale up fast. Can you handle the surge without crashes or lag? This article is your guide to Kubernetes autoscaling: how it dynamically adapts your cluster's resources to meet the demands of your users, which scaling methods Kubernetes offers, and how those methods can be implemented together for seamless scalability, cost efficiency, and rock-solid uptime.

Scaling Methods in Kubernetes

If you are unfamiliar with scaling in Kubernetes, there are three primary components to understand:

  • Horizontal Pod Autoscaling
  • Vertical Pod Autoscaling
  • Cluster Autoscaler

Each of these can be used independently, but they work best when they are combined. I will go through each of these and how they work, but if you are already familiar with these autoscaling objects you can skip to the implementation section where I discuss how they can be implemented together.

Horizontal Pod Autoscaling

In Kubernetes, applications run in an object known as a "Pod". To scale an application, more "replicas" of the Pod are created. This can be done manually by an administrator or dynamically through a Horizontal Pod Autoscaler (HPA), a Kubernetes object that monitors your application's metrics, such as CPU usage, and adjusts the number of Pod replicas based on those metrics. This method of scaling is referred to as "scaling out" the application.
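To make the scaling decision concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: the desired count is the current count multiplied by the ratio of the observed metric to the target, rounded up. The function name and the default bounds are my own choices for illustration. The 10% tolerance matches the controller's documented default, but treat the sketch as a simplification, since the real controller also handles missing metrics and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the replica count an HPA would request (simplified)."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # The result is always clamped to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, four replicas averaging 75% CPU against a 50% target give `desired_replicas(4, 75, 50)`, which returns 6.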


Vertical Pod Autoscaling

VPA stands for Vertical Pod Autoscaler, and it's another tool in your Kubernetes autoscaling arsenal. Unlike the HPA, it adjusts the CPU and memory requests and limits of individual Pods instead of changing the number of replicas. This gives you finer-grained control over the resource allocation of your application. VPAs are Kubernetes' strategy for "scaling up" applications.


Cluster Autoscaling

Kubernetes clusters are constrained by the CPU and memory capacity of their worker nodes. As applications within a cluster scale, they will eventually exceed the cluster's available capacity. The solution to this problem is Cluster Autoscaling, which focuses on the infrastructure layer instead of individual Pods. A Cluster Autoscaler dynamically adds or removes worker nodes depending on the cluster's current needs. It lives inside the Kubernetes cluster as a controller that communicates with your cloud provider via API calls to provision new servers.
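As a rough illustration of the capacity problem, the sketch below estimates how many extra nodes a set of pending Pods would need, using a simple first-fit-decreasing packing on CPU requests alone. This is a toy model of my own, not the Cluster Autoscaler's actual algorithm, which simulates the scheduler and also accounts for memory, affinity rules, and node groups. The function name and the millicore units are assumptions for the example.

```python
def nodes_needed(pending_cpu_millis, node_cpu_millis):
    """Estimate how many new nodes are needed to fit the pending Pods."""
    free = []  # remaining CPU (millicores) on each node we decide to add
    for request in sorted(pending_cpu_millis, reverse=True):
        if request > node_cpu_millis:
            raise ValueError("a Pod requests more CPU than one node provides")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request  # fits on a node we already added
                break
        else:
            free.append(node_cpu_millis - request)  # needs a fresh node
    return len(free)
```

With pending requests of 1500, 1500, 1000, 500, and 500 millicores and nodes that offer 2000 millicores each, `nodes_needed([1500, 1500, 1000, 500, 500], 2000)` returns 3.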


You now know how Kubernetes can scale an application as well as the cluster itself. In the next section, I will discuss how scaling can be implemented.

Implementing Scaling

What is the key success factor for autoscaling? Implementation! It's not just about flipping a switch. Gaming companies, in particular, navigate unique challenges due to demanding workloads and stringent SLAs. Let's dive into the decisions we had to make for a cluster that hosted our online game.

Choosing the Right Scaling Strategy

We first outlined the guiding principles that would help us decide on the scaling solution that would best fit our needs:

  • The game's services and cluster need to scale autonomously.
  • The solution has to be easy to maintain and understand.
  • Availability of the game's services trumps everything.

We then assessed the different scaling methods and decided how we would implement them. Our first decision was how to implement Pod Scaling, and our second was how to implement Cluster Scaling.

Decision #1 - Pod Scaling Implementation

In Kubernetes, each of our "game servers" runs as an application Pod. If you recall from earlier, two objects in Kubernetes allow you to scale your application Pods: Horizontal Pod Autoscalers and Vertical Pod Autoscalers. This gave us three choices for implementing Pod Autoscaling:

Implement Horizontal Pod Autoscaling

With this approach, we could have our Pod replicas dynamically increase based on key metrics such as users in the game or CPU usage of the Pod. However, the amount of resources (memory/CPU) used by the individual Pods would need to be defined in our manifests. This approach would allow us to scale the number of game servers we had available, but each server's resource allocation would be fixed, which could lead to either over-provisioning or under-provisioning the cluster.

Implement Vertical Pod Autoscaling

With this approach, we would manually set how many "game servers" (Pod replicas) would be available in the cluster, but their resources would be sized automatically by the VPA. This would allow our game servers to increase their resource allocations based on the number of players hosted by the server, but the number of game servers would remain static. The downside to this approach is that there is only so much memory and CPU we can provide our game server Pods before they lose resource efficiency, so the ceiling on how far we could "scale up" would be limited.

Hybrid approach: Combine Horizontal Pod Autoscaling with Vertical Pod Autoscaling

We could also combine HPAs and VPAs in a hybrid solution and have the best of both worlds: the HPA would scale the number of game servers we had running at any time, and the VPA would dynamically set how many resources those servers used. This approach got our Engineers very excited. There were just two problems: complexity and turnaround time. Implementing VPAs requires a lot of configuration as well as testing. Implementing them alongside HPAs requires even more testing and fine-tuning.


We outlined the positives and negatives and found that implementing Horizontal Pod Autoscalers would be the most pragmatic solution. Although there are many benefits to adding VPAs, the timing wasn't right for us - we needed scalability as soon as possible, and if it had to come at the cost of running less optimized workloads, we were okay with it. We tabled VPAs as something we could implement later down the road.

Decision #2 - Cluster Scaling Implementation

Now that we had Pod Autoscaling, we needed to decide how to implement Cluster Autoscaling. This decision is highly dependent on the cloud provider you are using. As we are on AWS, we were faced with two predominant solutions: AWS Auto Scaling groups with Cluster Autoscaler, or Karpenter, an open-source solution for scaling Kubernetes. Yes, that first solution is just called "Cluster Autoscaler" - finally a project named after what it actually does!

From a high level, both "Cluster Autoscaler" and Karpenter work similarly. A controller is installed into the Kubernetes cluster, where it watches for Pods that cannot be scheduled because the cluster lacks capacity and for nodes that are under-utilized. When the controller determines that the cluster needs more (or less) capacity, it talks to the AWS API to add or remove servers. The way that it provisions new servers is the key difference.

"Cluster Autoscaler" controls the number of servers via the "desired capacity" of an AWS Auto Scaling group, increasing the desired capacity to add servers and decreasing it to remove underutilized servers from the cluster.


Karpenter works a little differently. Instead of using an Auto Scaling group, Karpenter is directly responsible for launching and terminating the cluster's instances.


Although on the surface both solutions seem to accomplish the same task, Karpenter's method has several distinct advantages, mainly in the scaling logic itself. Since Karpenter is given more control, it can decide which individual servers should be added or removed. It can also decide on the sizing of those instances at the time of scaling, allowing it to choose the most cost-effective servers. "Cluster Autoscaler" with Auto Scaling groups, on the other hand, is limited to cycling nodes based on how the Auto Scaling group is set up in AWS, which usually defaults to the creation date of the server.

We found that the benefits that something like Karpenter provides were exactly what we needed, and our autoscaling strategy took shape: Horizontal Pod Autoscalers would scale our game server Pods, and Karpenter would add and remove the nodes those Pods run on.

We started with a basic configuration: HPAs targeting 50% Pod CPU utilization, and Karpenter targeting 80% node utilization.
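For the HPA half of that configuration, here is a hedged sketch of what such an object could look like, built as a Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The Deployment name `game-server` and the replica bounds of 3 and 50 are hypothetical; only the 50% CPU target comes from our setup. Note that CPU utilization in an HPA is measured relative to the Pod's CPU request, which is why the requests must be defined in the manifests.

```python
import json


def cpu_hpa_manifest(deployment, min_replicas, max_replicas, target_cpu_percent=50):
    """Build an autoscaling/v2 HPA that targets average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},
        "spec": {
            # The workload whose replica count the HPA manages.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Percentage of the Pods' CPU *requests*, averaged over Pods.
                    "target": {"type": "Utilization",
                               "averageUtilization": target_cpu_percent},
                },
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names and bounds, for illustration only.
    print(json.dumps(cpu_hpa_manifest("game-server", 3, 50), indent=2))
```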

Scaling up - The Easy Part

Our simple implementation proved effective for scaling up. As more users connected to our services, the CPU utilization of our Pods increased. When the average pod CPU utilization reached 50%, a new pod was automatically created. Consequently, as the number of Pods in the cluster increased, Karpenter saw server utilization increase and it dynamically provisioned new servers. Our simple solution successfully addressed our scaling needs, or so we thought.

Scaling down - Where we ran into Trouble

Scaling up to meet the demand during peak times was working, but we needed to scale down our infrastructure during non-peak hours to reduce the significant costs we were facing from running such a highly provisioned EKS cluster.

We thought our initial configurations would be sufficient, but we ran into many issues that most organizations will face when scaling down a Kubernetes cluster.

Issue #1 - Unconsolidatable Nodes

Our first issue was with Karpenter's consolidation policy. Originally we had it set to WhenEmpty, which means Karpenter would only remove nodes that had no Pods running on them. We thought this would be the least disruptive option. However, because of the way our workloads are distributed throughout the cluster, there was never a node that didn't have at least one or two Pods running on it. Our solution to this problem was to change our consolidation policy from WhenEmpty to WhenUnderutilized.
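The difference between the two policies can be illustrated with a toy model. The sketch below is my own simplification, not Karpenter's real consolidation logic (which also weighs instance cost and replacement nodes): a node is an "empty" candidate only if it has no Pods, while the underutilized-style check asks whether a node's Pods would fit into the free CPU of the remaining nodes. The node names, Pod sizes, and node capacity are made up for the example.

```python
def when_empty_candidates(nodes):
    """Nodes with no Pods at all: the only ones a WhenEmpty policy removes."""
    return [name for name, pods in nodes.items() if not pods]


def drainable_candidates(nodes, node_cpu_millis):
    """Toy stand-in for WhenUnderutilized: nodes whose Pods fit elsewhere."""
    candidates = []
    for name, pods in nodes.items():
        # Free CPU on every *other* node if this one were removed.
        free = {other: node_cpu_millis - sum(p)
                for other, p in nodes.items() if other != name}
        if all(_place(request, free) for request in sorted(pods, reverse=True)):
            candidates.append(name)
    return candidates


def _place(request, free):
    """First-fit: put one Pod on the first node with enough free CPU."""
    for other, remaining in free.items():
        if request <= remaining:
            free[other] = remaining - request
            return True
    return False


# Hypothetical cluster: CPU requests (millicores) of the Pods on each node.
cluster = {"node-a": [3000, 1000], "node-b": [500], "node-c": [3000, 500]}
```

With 4000-millicore nodes, `when_empty_candidates(cluster)` returns an empty list, mirroring what we saw in production, while `drainable_candidates(cluster, 4000)` returns `["node-b", "node-c"]`.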


WhenUnderutilized proved to be the better configuration. However, it led to another problem that we had to solve...

Issue #2 - Service Disruption

When a node is removed from a Kubernetes cluster, all the Pods running on that node are evicted using the Kubernetes Eviction API. This is usually fine, as there are other nodes available in the cluster for those Pods to move to. However, if too many Pods of a single service are evicted at the same time, end users of the service can experience issues. To solve this, we implemented Pod Disruption Budgets, which allow you to configure what percentage of a service's Pods can be evicted at the same time. In our case, we set it so only 10% of our game-servers Pods could be evicted at one time.
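Here is a minimal sketch of such a Pod Disruption Budget as a Python dictionary, along with a helper that shows how many Pods the 10% setting permits for a given replica count. The `app: game-server` label and the object name are hypothetical. The helper rounds up because the Kubernetes documentation states that a percentage `maxUnavailable` is rounded up to a whole number of Pods.

```python
def game_server_pdb(app_label="game-server", max_unavailable="10%"):
    """Build a policy/v1 PodDisruptionBudget for Pods with the given label."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app_label}-pdb"},
        "spec": {
            "maxUnavailable": max_unavailable,
            # Hypothetical label; it must match the Pods being protected.
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def max_evictions(replicas, percent):
    """Pods that may be disrupted at once; percentages round up."""
    return (replicas * percent + 99) // 100
```

With 40 replicas, `max_evictions(40, 10)` returns 4. With 25 replicas it returns 3, so on small replica counts the effective share can be higher than 10%.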


It's worth mentioning that each application should have logic built in for graceful termination. In our case, our game services gracefully transition players to a new "server" (in the case of Kubernetes, a new Pod) when they determine the current one is being shut down. This logic takes two minutes to complete, so we configured our Pods with a terminationGracePeriodSeconds of 120 seconds. To learn more, check out the Kubernetes documentation on Container Lifecycle Hooks.
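To give a feel for what graceful termination looks like inside the application, here is a small hypothetical sketch in Python (our real game services are not shown here, and the class and method names are invented). When a Pod is being shut down, Kubernetes sends the container SIGTERM and only sends SIGKILL after terminationGracePeriodSeconds has elapsed, so the handler's job is simply to flag the shutdown and let the server move its players elsewhere within that window.

```python
import signal
import threading


class GameServer:
    """Hypothetical game server that hands players off before exiting."""

    def __init__(self, players):
        self.players = list(players)
        self.migrated = []
        self.shutting_down = threading.Event()

    def handle_sigterm(self, signum=None, frame=None):
        # Kubernetes sends SIGTERM first; SIGKILL follows only after
        # terminationGracePeriodSeconds, so we just flag the shutdown here.
        self.shutting_down.set()

    def install_handler(self):
        # Must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def drain(self):
        """Move every connected player to another server (stubbed out)."""
        while self.players:
            self.migrated.append(self.players.pop())
        return len(self.migrated)
```

A real process would call `install_handler()` at startup, block on `shutting_down.wait()` in its main loop, and then call `drain()` before exiting, all within the grace period.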

Node Consolidation Budgets

The final item we implemented is a new feature of Karpenter: Node Consolidation Budgets. The goal of Node Consolidation Budgets is to ensure that not too many nodes are removed at once during a scale-down period.

In our complete configuration, we remove nodes when they are underutilized, but we never remove more than 20% of nodes or 5 nodes (whichever is smaller) at once. We also have a schedule set so that no nodes can be brought offline during our peak hours, which are 3 PM - 10 PM daily.
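The rules above can be sketched as follows. The dictionary is my reconstruction of the `disruption` block under the assumption of Karpenter's v1beta1 NodePool schema (where these are called disruption budgets and schedules are cron expressions); check it against the Karpenter version you run. The helper function is a simplified model of how the most restrictive budget wins. Its round-up of fractional percentages and its use of a plain hour-of-day are assumptions made for the example.

```python
# Assumed shape of spec.disruption in a Karpenter v1beta1 NodePool.
NODEPOOL_DISRUPTION = {
    "consolidationPolicy": "WhenUnderutilized",
    "budgets": [
        {"nodes": "20%"},  # never more than 20% of nodes at once
        {"nodes": "5"},    # and never more than 5 nodes at once
        # Block all removals for 7 hours starting at 15:00 (peak hours).
        {"nodes": "0", "schedule": "0 15 * * *", "duration": "7h"},
    ],
}


def allowed_node_removals(total_nodes, hour, peak_start=15, peak_end=22):
    """Simplified model: the most restrictive active budget applies."""
    if peak_start <= hour < peak_end:
        return 0  # peak hours: no nodes may be taken offline
    # Rounding fractional results up is an assumption of this sketch.
    twenty_percent = (total_nodes * 20 + 99) // 100
    return min(twenty_percent, 5)
```

For a 10-node cluster at 2 AM, `allowed_node_removals(10, 2)` returns 2. For a 100-node cluster it returns 5, and at 4 PM it returns 0 regardless of size.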


Modern applications require elasticity to scale to users' demands. Kubernetes has several features that allow you to achieve this: Horizontal Pod Autoscalers, Vertical Pod Autoscalers, and Cluster Autoscalers.

Implementation of your Autoscaling solution should be pragmatic and easy for teams to implement and maintain. A popular choice is Horizontal Pod Autoscalers that scale Pods based on CPU consumption and a Cluster Autoscaler to dynamically add new nodes to the cluster.

It's much easier to add more Pods and more Nodes to a Kubernetes cluster (scaling up/out) than it is to remove Pods and nodes (scaling down/in). Removing nodes from a cluster is a disruptive activity, but when configured appropriately, it should go unnoticed by the end-users.

Some special considerations to take into account when scaling down applications and clusters are Pod Disruption Budgets, Node Disruption Budgets, and Container Lifecycle Hooks.