Cutting down Kubernetes Costs: Cast.ai vs. Karpenter
Despite all its advantages and growing popularity, Kubernetes has several drawbacks that cannot be overlooked.
Chief among these is the cost and complexity related to managing dynamically provisioned infrastructure with extremely dynamic workloads.
While many FinOps practices focus on monitoring resource use and optimizing cost by manually selecting “what works for us at an acceptable cost”, this approach breaks down when confronted with the dynamic behavior of an active Kubernetes cluster. And of course, it gets even worse when you manage many such clusters.
In the last year, we have seen the entry of a new breed of cost optimizers: dynamic, fully automated modules that take control of the cluster scaling functions and optimize continuously in real time.
First among these was the open-source product Karpenter, backed by AWS (and geared towards EKS).
Our team has implemented Karpenter for several customers and has been happy with both the approach and the results.
But recently we found a competing commercial tool, and while we LOVE open source, we could not ignore the commercial tool's advantages.
So in this post, I want to list the reasons we shifted from open-source Karpenter to the commercial tool Cast.ai and now use it for all our Kubernetes projects.
Karpenter, The Hidden Costs
Free does not mean there are no costs…
Work: Karpenter is not a “fire and forget” type of product. It requires non-trivial work and deep understanding to configure correctly per cluster and will require additional tweaking as your product and workloads evolve.
Visibility: Once you have it running, you will see savings, but where do they come from? And are they the best you can achieve? Karpenter does not give you any cost visibility, so you need to put effort into analysis to demonstrate your savings and measure them going forward.
Instance matching: Karpenter will optimize the instance type within family boundaries, but because of the dynamics of the spot instance market, it will miss the substantial savings that are possible when moving between families.
Degradation: Karpenter optimizes hardware use by bin packing workloads, moving them between nodes to keep the hardware utilized continuously. This is great, but its effect is limited over the cluster lifecycle, as you cannot aggressively pack bins that constantly change. In the long term, this caps the level of optimization you can achieve.
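To make the degradation point concrete, here is a minimal sketch of what happens when a cluster that has drifted over time is packed again from scratch. It uses the textbook first-fit-decreasing heuristic, which is our stand-in for illustration and not the algorithm Karpenter or Cast.ai actually runs. The node size, the pod CPU requests, and the function name are all hypothetical.

```python
from typing import List


def first_fit_decreasing(pod_cpus: List[float], node_capacity: float) -> List[List[float]]:
    """Pack pod CPU requests onto equal-sized nodes, largest pods first.

    A textbook heuristic used here for illustration only; it is an
    assumption, not the packing logic of any real autoscaler.
    """
    nodes: List[List[float]] = []
    for cpu in sorted(pod_cpus, reverse=True):
        for node in nodes:
            # Place the pod on the first node that still has room for it.
            if sum(node) + cpu <= node_capacity:
                node.append(cpu)
                break
        else:
            # No existing node fits this pod, so open a new one.
            nodes.append([cpu])
    return nodes


# Hypothetical state after some churn: four 4-vCPU nodes, each partly empty.
current_nodes = [[2.0, 1.0], [1.0, 0.5], [2.0], [1.0, 0.5]]

# Repacking the same pods from scratch needs fewer nodes.
pods = [cpu for node in current_nodes for cpu in node]
repacked = first_fit_decreasing(pods, node_capacity=4.0)

print(f"nodes before: {len(current_nodes)}, after repack: {len(repacked)}")
```

In this toy example the same eight vCPUs of requests fit on two nodes instead of four. In a live cluster the pods keep changing, so any packing achieved at one moment drifts apart again unless it is redone.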
Cast.ai Advantages
The bottom line for Cast.ai, in our experience, is 20–25% better savings, a fraction of the work, and consistent optimization over the cluster lifecycle. How is that achieved?
Free matching: Cast.ai spins up new nodes from the full range of available offerings. It is not limited to node groups or pre-selected families (unless you set such limits) and will always select the best cost-saving fit. For example, we regularly see it spinning up Inf1 spot instances in AWS, even though they are essentially augmented C5 instances with hardware beyond our needs. The reason: they are currently in lower demand and cost a fraction of the price.
Faster scale-down: One constant frustration of Kubernetes administrators is the huge waste of hardware as a cluster slowly spins nodes down after a peak. We regularly see clusters at 5 times their required size because of a peak that happened several hours ago. Not so with Cast.ai: it spins nodes down extremely fast, based on the patterns it has identified over time.
Re-balancing: This nice little feature allows us to replace all cluster nodes periodically without downtime, which makes fully optimized bin packing possible. Doing this once a week (on top of the ongoing continuous optimization) keeps the cluster tightly packed.
Visibility: Cast.ai comes with an excellent cost analysis dashboard out of the box, allowing us to monitor, slice and dice all expenses by workloads, tags, and namespaces. This part of Cast.ai is free, so we install it regardless and give the cluster owner a full understanding of costs effortlessly.
Right-sizing: A nice feature recently released by Cast.ai analyzes workload use of CPU and memory and gives you the insights needed to calibrate your resource limits and get even more out of the product without risking pod stability.
Image Security: Cast.ai also includes an image scanner and reports on all CVEs found in your cluster, along with remediation suggestions. This is not a replacement for a full-fledged security suite (e.g. Aqua Security’s excellent product), but in places where such a tool is not in place yet this is a very important stopgap.
Last but not least, Cast.ai works on the three major public clouds and even on OpenShift.
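The free-matching idea described above can be sketched in a few lines: pick the cheapest offer that satisfies the workload, first restricted to one family and then across all families. This is only an illustration of the principle, not Cast.ai's actual selection logic. The Offer class, the cheapest_fit function, and every spot price below are invented for the example, and the instance sizes are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Offer:
    instance_type: str
    family: str
    vcpus: int
    memory_gib: int
    spot_price_per_hour: float  # hypothetical price, not real market data


def cheapest_fit(
    offers: Iterable[Offer],
    vcpus: int,
    memory_gib: int,
    allowed_families: Optional[Set[str]] = None,
) -> Optional[Offer]:
    """Return the cheapest offer that meets the CPU and memory requirement.

    allowed_families=None means every family may be considered.
    """
    candidates = [
        o
        for o in offers
        if o.vcpus >= vcpus
        and o.memory_gib >= memory_gib
        and (allowed_families is None or o.family in allowed_families)
    ]
    return min(candidates, key=lambda o: o.spot_price_per_hour, default=None)


# Invented spot market snapshot for the example.
OFFERS = [
    Offer("c5.xlarge", "c5", 4, 8, 0.07),
    Offer("c5.2xlarge", "c5", 8, 16, 0.14),
    Offer("m5.2xlarge", "m5", 8, 32, 0.16),
    Offer("inf1.2xlarge", "inf1", 8, 16, 0.09),
]

# Workload needs 6 vCPUs and 12 GiB of memory.
restricted = cheapest_fit(OFFERS, 6, 12, allowed_families={"c5"})
unrestricted = cheapest_fit(OFFERS, 6, 12)

print("within c5 only:", restricted.instance_type)
print("across all families:", unrestricted.instance_type)
```

With these made-up prices, staying inside the C5 family selects c5.2xlarge, while searching every family selects the cheaper inf1.2xlarge, even though its extra hardware goes unused.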
Going Forward
Talking with the Cast.ai team, we have learned that their roadmap also includes network cost optimization (i.e. co-locating those chatty micro-services that overload our egress) and storage optimization. Both of these are sorely needed, and when they do arrive, they will be additional reasons to stick with Cast.ai.
So: we love open source, and we like Karpenter, BUT for us the choice is currently a no-brainer. Cast.ai is superior in every way.