Over-provisioned requests, idle nodes, and misconfigured autoscaling add 40–60% to cluster spend. VPA recommendations are typically 30–50% lower than hand-tuned requests. Spot instances cut compute cost 60–80% for fault-tolerant workloads. Phased approach: measure → right-size → scale dynamically → clean up → commit. Run optimized 2–3 months before purchasing reserved capacity.
Kubernetes doesn't enforce resource requests; when added, engineers tend to over-buffer. Cluster Autoscaler and node pools require explicit tuning; many clusters run fixed node counts by default.
How we got here
We run production workloads on Kubernetes. While reviewing our cloud spend, we noticed that our compute costs were significantly higher than we expected given actual utilization. Digging in, we found a few patterns: resource requests were often 2–3x actual usage, node pools were sized for peak load and never scaled down, and batch jobs were running on on-demand instances during business hours when spot availability was low.
The discrepancy came down to defaults. Kubernetes doesn't enforce resource requests—pods can run without them—but when you add them, the natural tendency is to add buffer. "Just in case" multiplied across hundreds of pods adds up. Similarly, Cluster Autoscaler and node pool configuration require explicit tuning; out of the box, many clusters run with fixed node counts.
This raised a larger question: which optimizations actually move the needle, and which are marginal?
We ran a before/after comparison across several clusters, applying optimizations incrementally and measuring spend. The biggest wins came from right-sizing requests and scaling node count dynamically. Smaller but meaningful gains came from shifting batch workloads off-peak and cleaning up idle resources. Committed use discounts helped at the end, but only after we'd reduced the baseline.
Right-sizing resource requests
Over-provisioned requests are the largest waste source. VPA recommendations are typically 30–50% lower than hand-tuned requests for stable workloads. Apply gradually; monitor for OOMKills.
We use Vertical Pod Autoscaler (VPA) in recommendation mode first—it analyzes actual usage and suggests request values. Running it in "Off" mode lets you review before applying:
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Recommendation only
```
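Once VPA has observed the workload for a while, its recommendations appear under `status.recommendation`, viewable with `kubectl describe vpa my-app-vpa` or `kubectl get vpa my-app-vpa -o yaml`. A sketch of what the status section looks like (the field names are VPA's; the values are illustrative):

```yaml
# Illustrative excerpt of: kubectl get vpa my-app-vpa -o yaml
status:
  recommendation:
    containerRecommendations:
    - containerName: my-app
      lowerBound:        # minimum before performance likely degrades
        cpu: 50m
        memory: 128Mi
      target:            # VPA's suggested request values
        cpu: 100m
        memory: 256Mi
      upperBound:        # ceiling based on observed peaks
        cpu: 500m
        memory: 1Gi
```

The `target` values are what you'd apply as requests; comparing them against your current requests is where the 30–50% gap typically shows up.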
VPA uses historical data; for new or spiky workloads, recommendations can be too aggressive. Apply gradually; monitor for OOMKills and CPU throttling.
Cluster Autoscaler and spot instances
Spot instances cut compute cost 60–80% versus on-demand for fault-tolerant workloads. Use for batch jobs, CI, non-critical services; keep on-demand for stateful or latency-sensitive workloads.
Cluster Autoscaler adjusts node count based on pending pod demand. Fixed node pools mean paying for idle capacity. Spot/preemptible nodes can be reclaimed with little notice (e.g., a two-minute warning on AWS, 30 seconds on GCP).
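Steering fault-tolerant work onto spot nodes is usually done with a taint on the spot pool plus a matching toleration and node selector on the workload. A sketch, assuming a hypothetical `workload-class=spot` label and taint on the spot pool (the exact keys vary by provider — GKE, EKS, and AKS each apply their own spot labels):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report        # hypothetical batch job
spec:
  backoffLimit: 4             # retries absorb spot preemptions
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        workload-class: spot  # assumed label on the spot node pool
      tolerations:
      - key: workload-class   # assumed taint keeping other pods off spot
        operator: Equal
        value: spot
        effect: NoSchedule
      containers:
      - name: report
        image: my-org/report:latest   # hypothetical image
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
```

The taint is what keeps latency-sensitive pods off the spot pool; the toleration is what lets batch work opt in.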
One failure mode we've seen: Cluster Autoscaler can be slow to scale down. By default, it waits to ensure nodes are truly idle before removing them. If your workloads have periodic spikes, you may need to tune scale-down delay and utilization thresholds. We've also seen issues when pod disruption budgets are too strict—PDBs can prevent scale-down by blocking evictions.
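On self-managed clusters these knobs are Cluster Autoscaler flags (managed offerings expose similar settings). A sketch of the relevant flags — names are from upstream Cluster Autoscaler, values are illustrative — plus a pod disruption budget loose enough to let evictions proceed:

```yaml
# Cluster Autoscaler container args (illustrative values)
- --scale-down-utilization-threshold=0.5   # node is removable below 50% utilization
- --scale-down-unneeded-time=10m           # how long a node must stay idle before removal
- --scale-down-delay-after-add=10m         # cooldown after a scale-up event
---
# A PDB that still permits scale-down: with maxUnavailable: 1, the
# autoscaler can evict one pod at a time instead of being blocked outright.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

A PDB of `maxUnavailable: 0` (or `minAvailable` equal to the replica count) is the too-strict case described above: every eviction is denied and the node can never drain.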
Scheduling off-peak and cleaning up idle workloads
Batch jobs, data pipelines, and CI builds don't need to run during peak hours. Shifting them to off-peak periods improves spot availability (fewer competing workloads) and often reduces costs. We use CronJobs and pod priority classes to schedule non-critical work during nights and weekends.
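A minimal sketch of that pattern: a low PriorityClass for deferrable batch work, and a CronJob scheduled overnight (the names, image, and schedule are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low               # hypothetical low-priority class
value: -100                     # below the default (0), so batch yields under pressure
preemptionPolicy: Never         # batch pods never preempt other workloads
globalDefault: false
description: "Non-critical batch work; safe to defer or evict."
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl             # hypothetical pipeline
spec:
  schedule: "0 2 * * *"         # 02:00 daily, off-peak
  concurrencyPolicy: Forbid     # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: batch-low
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: my-org/etl:latest   # hypothetical image
```

The negative priority value plus `preemptionPolicy: Never` means batch pods queue behind everything else rather than displacing it.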
We also audit clusters quarterly. Staging environments left running overnight, forgotten test deployments, and abandoned preview apps accumulate costs silently. Implementing namespace expiry policies for ephemeral environments—e.g., auto-delete namespaces older than 7 days for preview deployments—reduces drift.
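Kubernetes has no built-in namespace TTL, so expiry is typically a small controller or a scheduled cleanup job. A rough sketch using a CronJob that deletes preview namespaces older than 7 days — the `env=preview` label, service account, and kubectl image are assumptions, and a real version needs RBAC granting namespace list/delete:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-namespace-expiry
  namespace: kube-system
spec:
  schedule: "0 4 * * *"                     # daily at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: namespace-reaper  # assumed SA with delete rights
          restartPolicy: OnFailure
          containers:
          - name: reaper
            image: bitnami/kubectl:latest       # assumed image with kubectl + GNU date
            command:
            - /bin/sh
            - -c
            # Emit "name creationTimestamp" pairs for preview namespaces,
            # then delete any created before the 7-day cutoff.
            - |
              cutoff=$(date -d '7 days ago' +%s)
              kubectl get ns -l env=preview \
                -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.creationTimestamp}{"\n"}{end}' \
              | while read name ts; do
                  [ "$(date -d "$ts" +%s)" -lt "$cutoff" ] && kubectl delete ns "$name"
                done
```

Deleting a namespace removes everything in it, so scoping the label selector tightly (and excluding anything long-lived) matters more than the schedule.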
Committed use discounts
1-year commitments for baseline + spot for burst: 50–70% savings versus pure on-demand. Commit only after 2–3 months of stable baseline—right-sizing shifts the baseline.
Reserved Instances, Savings Plans, or GCP Committed Use lock in savings for baseline capacity. Don't commit before right-sizing; inflated baseline locks in waste.
The Meterra approach
We've landed on a phased approach: measure first, then right-size, then scale dynamically, then clean up, then commit. Skipping straight to committed use discounts before right-sizing locks in waste. Similarly, turning on Cluster Autoscaler without cleaning up over-provisioned requests can lead to more nodes than you need—autoscaling scales to satisfy requests, and inflated requests mean inflated node count.
The key is treating each step as a prerequisite for the next. Get visibility into actual usage, then adjust requests, then enable scale-to-zero or spot where possible, then remove idle workloads, then commit what's left.
What we recommend
Given how Kubernetes cost leakage actually behaves, we recommend:
1. Run VPA in recommendation mode first.
Review suggestions before applying. For stable workloads, apply gradually. For new or spiky workloads, use recommendations as a starting point and validate.
2. Enable Cluster Autoscaler with appropriate scale-down settings.
Tune utilization thresholds and scale-down delay for your workload pattern. Avoid over-restrictive pod disruption budgets that block scale-down.
3. Use spot for fault-tolerant workloads.
Batch jobs, CI, and non-critical services are good candidates. Keep on-demand for stateful or latency-sensitive workloads.
4. Schedule batch work off-peak.
CronJobs and priority classes can shift load to times when spot availability is higher and costs are lower.
5. Audit and automate cleanup.
Quarterly audits for idle namespaces and workloads. Namespace expiry for ephemeral environments. Document what "idle" means for your org.
6. Commit only after stabilizing.
Run optimized for 2–3 months before purchasing reserved capacity. Your baseline should be stable.
At the margins, the difference between "we've optimized our Kubernetes spend" and "we're still overpaying" often comes down to whether you've addressed the biggest levers first. Right-sizing and dynamic scaling typically deliver the largest gains; committed use discounts amplify those gains but don't replace them.