Orphaned and Overprovisioned Resources in EKS Clusters
Yisrael Gross
Service Category
Compute
Cloud Provider
AWS
Service Name
AWS EKS
Inefficiency Type
Inefficient Configuration
Explanation

In EKS environments, cluster sprawl can occur when workloads are removed but their underlying resources remain. Common examples include persistent volumes no longer mounted by any pod, Services still backed by ELBs despite receiving no traffic, and nodes provisioned for workloads that no longer exist.

Node overprovisioning has several common causes: inflated CPU/memory requests (and, to a lesser extent, limits), DaemonSets that run on every node, restrictive Pod Disruption Budgets, anti-affinity rules, uneven Availability Zone (AZ) distribution, and slow scale-down timers. Preventative measures include improving bin-packing efficiency, enabling Karpenter consolidation, and right-sizing node instance types and counts. Dev/test namespaces and short-lived environments also tend to accumulate without clear teardown processes, leading to ongoing idle costs.

These remnants contribute to excess infrastructure cost and control plane noise. Since AWS bills independently for each resource (e.g., EBS, ELB, EC2), inefficiencies can add up quickly. Without structured governance or cleanup tooling, clusters gradually fill with orphaned objects and unused capacity.

Relevant Billing Model

EKS clusters incur compute charges through EC2 nodes or Fargate profiles, storage charges from attached EBS volumes, and networking charges from load balancers created via Kubernetes Services. Additional network-level costs can include NAT Gateway hourly and per-GB data processing fees, cross-Availability Zone (AZ) data transfer, and other inter-VPC or internet-bound traffic. Orphaned resources such as unused PVCs (backed by EBS), idle Services (backed by Elastic Load Balancers), and underutilized nodes can generate persistent charges even when no active workloads are running.
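Because each of these resources is billed independently, a rough monthly estimate for orphaned capacity is straightforward to sketch. The prices below are placeholder assumptions for illustration only, not current AWS rates; consult the pricing pages for your region before acting on any figure.

```python
# Illustrative estimate of recurring charges from orphaned EKS resources.
# Both rates are assumed placeholder values, NOT actual AWS pricing.

EBS_GP3_PER_GB_MONTH = 0.08   # assumed $/GB-month for gp3 volumes
ELB_PER_HOUR = 0.025          # assumed $/hour for an idle load balancer
HOURS_PER_MONTH = 730

def monthly_orphan_cost(orphaned_ebs_gb: float, idle_elb_count: int) -> float:
    """Rough monthly cost of unattached EBS capacity plus idle ELBs."""
    ebs_cost = orphaned_ebs_gb * EBS_GP3_PER_GB_MONTH
    elb_cost = idle_elb_count * ELB_PER_HOUR * HOURS_PER_MONTH
    return round(ebs_cost + elb_cost, 2)

# Example: 500 GB of orphaned volumes and 3 idle load balancers.
print(monthly_orphan_cost(500, 3))
```

Even at modest scale, the hourly ELB charge tends to dominate, which is why idle Services are usually the first target for cleanup.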

Detection
  • Identify namespaces that no longer contain active Deployments or Pods
  • Review Services still provisioned with ELBs but lacking backend endpoints
  • Check for PVCs that are not mounted by any current workloads
  • Analyze node utilization to detect consistently underused EC2 nodes, referencing median CPU and memory utilization per node group (targeting 60–70%+ in production)
  • Audit test or sandbox namespaces that have not been updated recently
  • Verify whether terminated Helm releases or jobs left residual resources
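The first two checks above can be sketched as plain set logic over cluster state, as it might be parsed from `kubectl get pods,svc,pvc,endpoints -o json`. The field names and sample data below are simplified for illustration; this is a sketch of the matching logic, not a complete audit tool.

```python
# Sketch: find PVCs mounted by no pod, and LoadBalancer Services whose
# Endpoints carry no ready addresses. Input dicts are simplified stand-ins
# for objects parsed from the Kubernetes API.

def orphaned_pvcs(pods, pvcs):
    """PVCs not referenced by any pod's volumes in the same namespace."""
    mounted = {
        (pod["namespace"], vol["claimName"])
        for pod in pods
        for vol in pod.get("volumes", [])
        if "claimName" in vol
    }
    return [p for p in pvcs if (p["namespace"], p["name"]) not in mounted]

def idle_lb_services(services, endpoints):
    """LoadBalancer Services whose Endpoints object has no ready addresses."""
    ready = {
        (ep["namespace"], ep["name"]) for ep in endpoints if ep.get("addresses")
    }
    return [
        s for s in services
        if s["type"] == "LoadBalancer" and (s["namespace"], s["name"]) not in ready
    ]

pods = [{"namespace": "app", "volumes": [{"claimName": "data-0"}]}]
pvcs = [
    {"namespace": "app", "name": "data-0"},
    {"namespace": "app", "name": "data-1"},   # no longer mounted by any pod
]
services = [{"namespace": "app", "name": "web", "type": "LoadBalancer"}]
endpoints = [{"namespace": "app", "name": "web", "addresses": []}]

print([p["name"] for p in orphaned_pvcs(pods, pvcs)])              # ['data-1']
print([s["name"] for s in idle_lb_services(services, endpoints)])  # ['web']
```

In a real audit, a PVC flagged here should also be checked against CronJobs and suspended workloads before deletion, since "not currently mounted" is not always the same as "unused."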
Remediation
  • Remove unused PVCs to deprovision backing EBS volumes
  • Delete idle Services to release associated ELBs and IP addresses
  • Clean up inactive namespaces and workloads
  • Resize or scale down overprovisioned EC2 nodes, incorporating proactive prevention strategies such as Karpenter consolidation, bin packing optimization, and right-sizing of node instance types and counts
  • Implement environment lifecycle policies for dev/test clusters
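The right-sizing bullet above reduces to a back-of-the-envelope calculation: given total CPU and memory requests and one node type's allocatable capacity, how many nodes land near the 60–70% utilization target cited in the Detection section? The capacity figures below are illustrative defaults (roughly a 4 vCPU / 16 GiB instance after system reservations), not a recommendation for any specific instance type.

```python
# Sketch: minimum node count so that total requests sit at or below a
# target utilization on a single node type. Capacity defaults are
# illustrative assumptions, not real allocatable values.

import math

def nodes_needed(total_cpu_req, total_mem_req_gib,
                 node_cpu=4.0, node_mem_gib=15.0, target_util=0.65):
    """Nodes required so the binding dimension (CPU or memory) stays
    at or below target_util of allocatable capacity."""
    by_cpu = total_cpu_req / (node_cpu * target_util)
    by_mem = total_mem_req_gib / (node_mem_gib * target_util)
    return math.ceil(max(by_cpu, by_mem))

# Example: 18 vCPU and 60 GiB of requests across the cluster.
print(nodes_needed(18, 60))
```

This ignores bin-packing fragmentation, DaemonSet overhead, and AZ spread, so treat the result as a floor; tools like Karpenter consolidation handle the packing details continuously rather than as a one-off calculation.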