Zombie resources: the silent budget drain most teams ignore

Detached EBS volumes. Load balancers pointing at nothing. Stopped instances from a feature branch nobody remembers. This is where cloud waste hides — and why it keeps growing.

[Figure: visualization of idle cloud resources accumulating cost]

What a zombie resource is — and is not

A zombie resource is a cloud resource that is still provisioned, still accruing charges, and no longer serving any useful function. It is not a resource that has been intentionally kept for disaster recovery or standby capacity. It is not a resource in a test environment that gets used periodically. It is a resource that was created for a specific purpose (a load test, a feature branch, a one-off migration) and was never cleaned up when that purpose ended.

The distinction matters because zombie remediation strategies that are too aggressive decommission things that are still needed. The goal is not zero idle resources. The goal is zero resources that no engineer would consciously choose to keep running if asked.

The most common zombie resource types in AWS

Detached EBS volumes accumulate quietly. When an EC2 instance is terminated, the instance goes away, but the EBS volume remains if its Delete on Termination flag was not set. A volume that was attached to an instance running six months ago, holding a copy of application data that has since been migrated to S3, is still accruing $0.10 per GB-month (the gp2 rate in us-east-1). A 500 GB volume that has been sitting detached for eight months has cost approximately $400 with no benefit.
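The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a full detector: the sample records are hypothetical and only mirror the shape of the "Volumes" list in an EC2 DescribeVolumes response, and the flat rate is the figure quoted in the text rather than a live price.

```python
GP2_USD_PER_GB_MONTH = 0.10  # rate quoted in the text; check current pricing for your region


def detached_volumes(volumes):
    """Keep volumes that are not attached to any instance.

    `volumes` mirrors the "Volumes" list in an EC2 DescribeVolumes response,
    where a detached volume reports State == "available".
    """
    return [v for v in volumes if v["State"] == "available"]


def monthly_cost_usd(size_gb, rate=GP2_USD_PER_GB_MONTH):
    """Storage charge for one month at a flat per-GB rate."""
    return size_gb * rate


# Hypothetical sample data standing in for a real API response.
sample_volumes = [
    {"VolumeId": "vol-0aaa", "Size": 500, "State": "available", "Attachments": []},
    {"VolumeId": "vol-0bbb", "Size": 100, "State": "in-use",
     "Attachments": [{"InstanceId": "i-0123"}]},
]

for vol in detached_volumes(sample_volumes):
    monthly = monthly_cost_usd(vol["Size"])
    print(f'{vol["VolumeId"]}: ${monthly:.2f}/month, ${monthly * 8:.2f} over 8 months')
```

DescribeVolumes does not report when a volume was detached, so the eight-month figure has to come from another source such as CloudTrail or AWS Config history.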

Unattached Elastic IP addresses cost $0.005 per hour when not associated with a running instance, approximately $3.65 per month per EIP. That is small per address, but the charges add up: 30 unattached EIPs left over from decommissioned services cost about $110 per month, and those charges rarely stand out in a service-level cost breakdown.
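As a rough sketch of the per-address math, assuming 730 hours in an average month and records shaped like the "Addresses" list from EC2 DescribeAddresses (where an unassociated address has no AssociationId field); the sample addresses are made up for illustration:

```python
EIP_USD_PER_HOUR = 0.005  # rate quoted in the text
HOURS_PER_MONTH = 730     # assumption: 8760 hours per year / 12


def unassociated_addresses(addresses):
    """Addresses with no AssociationId are not attached to anything."""
    return [a for a in addresses if "AssociationId" not in a]


def idle_eip_monthly_usd(addresses):
    """Monthly charge for all unassociated addresses in the list."""
    return len(unassociated_addresses(addresses)) * EIP_USD_PER_HOUR * HOURS_PER_MONTH


# Hypothetical sample: 30 leftover addresses plus one that is in use.
sample_addresses = [{"AllocationId": f"eipalloc-{i:04d}"} for i in range(30)]
sample_addresses.append({"AllocationId": "eipalloc-used", "AssociationId": "eipassoc-0001"})

print(f"${idle_eip_monthly_usd(sample_addresses):.2f}/month")
```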

Application Load Balancers with zero healthy targets are still charged $0.0225 per hour regardless of traffic, plus $0.008 per LCU-hour for whatever traffic they do handle. An ALB configured for a service that was moved behind an API gateway, with its targets deregistered but the load balancer left running, costs about $16 per month (roughly $200 per year) in fixed hourly charges alone, and it is essentially invisible unless someone looks at the target group registration count.
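One way to express the "zero healthy targets" check, as a sketch only. The input is assumed to be a mapping from target group ARN to the "TargetHealthDescriptions" list returned by the ELBv2 DescribeTargetHealth call; the ARNs and the helper names are hypothetical.

```python
ALB_USD_PER_HOUR = 0.0225  # fixed hourly rate quoted in the text
HOURS_PER_MONTH = 730      # assumption: 8760 hours per year / 12


def healthy_target_count(descriptions):
    """Count targets whose reported health state is "healthy"."""
    return sum(1 for d in descriptions if d["TargetHealth"]["State"] == "healthy")


def alb_is_zombie_candidate(target_groups):
    """True when no target group behind the load balancer has a healthy target.

    An ALB with no target groups at all also counts, because all() over an
    empty collection is True.
    """
    return all(healthy_target_count(d) == 0 for d in target_groups.values())


def alb_fixed_monthly_usd():
    """Hourly charge only; LCU charges depend on traffic and are excluded."""
    return ALB_USD_PER_HOUR * HOURS_PER_MONTH


# Hypothetical sample: one group with an unused target, one empty group.
abandoned = {"tg-arn-1": [{"TargetHealth": {"State": "unused"}}], "tg-arn-2": []}
serving = {"tg-arn-3": [{"TargetHealth": {"State": "healthy"}}]}

print(alb_is_zombie_candidate(abandoned), alb_is_zombie_candidate(serving))
print(f"${alb_fixed_monthly_usd():.2f}/month fixed")
```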

Stopped EC2 instances continue to incur EBS storage charges for their root volumes and any attached data volumes. They also keep their security groups, any attached Elastic IPs (which are billed while the instance is stopped), and, in some configurations, reserved capacity. A stopped m5.4xlarge that was part of a performance testing environment might cost $80–120/month in attached storage charges even without running a single compute hour.

Idle RDS instances with less than 1% CPU utilization over 14 days are a higher-cost zombie category. An idle db.r6g.xlarge Multi-AZ costs approximately $480/month. Teams with development or staging databases that were provisioned for a project that ended often find multiple instances in this state.
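The 14-day, 1% rule can be sketched as a small predicate. The assumptions here are mine, not the article's: datapoints are shaped like the "Datapoints" list from CloudWatch GetMetricStatistics for CPUUtilization, there is one datapoint per day, and the daily Maximum is used so that a short burst of real work is not averaged away.

```python
from datetime import datetime, timedelta, timezone


def is_idle(datapoints, now, window_days=14, cpu_threshold_pct=1.0):
    """True when every daily CPU maximum in the window is below the threshold.

    Returns False when the window is not fully covered, because missing data
    is not evidence that an instance is idle.
    """
    cutoff = now - timedelta(days=window_days)
    recent = [p for p in datapoints if cutoff <= p["Timestamp"] <= now]
    if len(recent) < window_days:
        return False
    return max(p["Maximum"] for p in recent) < cpu_threshold_pct


# Hypothetical sample data: 14 quiet days, then the same with one busy day.
now = datetime(2024, 5, 20, tzinfo=timezone.utc)
quiet = [{"Timestamp": now - timedelta(days=i), "Maximum": 0.4} for i in range(14)]
bursty = quiet[:-1] + [{"Timestamp": now - timedelta(days=13), "Maximum": 35.0}]

print(is_idle(quiet, now))   # every day under 1%
print(is_idle(bursty, now))  # one busy day disqualifies it
```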

Why zombies accumulate despite cleanup intentions

Engineers know they are supposed to clean up after themselves. The problem is that cleanup is deferred until after the immediate task (the migration, the test, the experiment) is marked done. Once the Jira ticket is closed, the cost of the resource drops out of the engineer's mental model. The resource continues to exist. Nobody notices it.

Terraform-managed resources are somewhat better because a terraform destroy removes the full resource graph. But not all resources are Terraform-managed. Resources created through the console, through automated testing pipelines, or through one-off CLI commands frequently have no IaC record. They are not in any state file. They have no owner tag. They appear in billing data as charges on a service line item, impossible to attribute without manual investigation.

Tag coverage is usually the compounding factor. If a resource was created without an owner or team tag — which is common for resources created outside of IaC pipelines — there is no mechanism to route a cleanup notification to the responsible engineer. The resource sits in the account, accumulating charges, until someone runs a manual audit. Manual audits happen quarterly at best, and they are time-consuming enough that they often focus on larger instances and miss the sub-$100/month line items.
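A quick way to see how large the routing gap is, as a sketch. The tag keys "owner" and "team" and the "ResourceId" field are hypothetical; the "Tags" list follows the Key/Value shape that most AWS describe calls return.

```python
ROUTING_TAG_KEYS = {"owner", "team"}  # hypothetical keys; use your own tagging standard


def has_routing_tag(resource, keys=ROUTING_TAG_KEYS):
    """True when the resource carries a non-empty owner or team tag."""
    tags = resource.get("Tags") or []
    return any(t["Key"].lower() in keys and t["Value"].strip() for t in tags)


def routing_coverage(resources):
    """Return (fraction of routable resources, ids of unroutable ones)."""
    unroutable = [r["ResourceId"] for r in resources if not has_routing_tag(r)]
    covered = len(resources) - len(unroutable)
    fraction = covered / len(resources) if resources else 1.0
    return fraction, unroutable


# Hypothetical inventory mixing IaC-created and console-created resources.
inventory = [
    {"ResourceId": "vol-0aaa", "Tags": [{"Key": "Owner", "Value": "alice"}]},
    {"ResourceId": "i-0bbb", "Tags": [{"Key": "team", "Value": "payments"}]},
    {"ResourceId": "eipalloc-0ccc", "Tags": [{"Key": "Name", "Value": "tmp"}]},
]

fraction, unroutable = routing_coverage(inventory)
print(f"{fraction:.0%} routable, no owner for: {unroutable}")
```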

What detection actually requires

Accurate zombie detection requires two data sources joined together: billing data that shows current resource charges, and a resource state API that shows attachment status, traffic metrics, and recent activity. A detached EBS volume that had snapshot activity yesterday has a different risk profile from one that has had no activity for 180 days. An ALB with zero healthy targets but an active DNS record and recent request logs is a different situation from an ALB with no DNS records and no traffic in 60 days.

This is why Cost Explorer alone is insufficient for zombie detection. Cost Explorer shows cost by service and resource ID. It does not show attachment state. It does not show traffic metrics. The billing data must be enriched with resource-state data from AWS Config, CloudWatch metrics, and EC2/RDS describe API calls to produce a detection signal that is specific enough to act on without false positives.
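The join described here can be sketched as follows. Everything about the record shapes is an assumption for illustration: `cost_rows` stands in for per-resource billing data, `state_by_id` stands in for state gathered from AWS Config, CloudWatch, and describe calls, and the 60-day threshold is a placeholder.

```python
def find_zombies(cost_rows, state_by_id, min_idle_days=60):
    """Join per-resource cost with resource state and flag likely zombies.

    A resource is flagged only when state data exists, it is unattached,
    and it has been inactive for at least `min_idle_days`.
    """
    findings = []
    for row in cost_rows:
        state = state_by_id.get(row["resource_id"])
        if state is None:
            continue  # cost without state: cannot judge, so do not flag
        if not state["attached"] and state["days_since_activity"] >= min_idle_days:
            findings.append({**row, **state})
    # Most expensive first, so reviewers see the biggest savings at the top.
    return sorted(findings, key=lambda f: f["monthly_usd"], reverse=True)


# Hypothetical inputs.
cost_rows = [
    {"resource_id": "vol-0aaa", "monthly_usd": 50.0},
    {"resource_id": "vol-0bbb", "monthly_usd": 20.0},
    {"resource_id": "alb-0ccc", "monthly_usd": 16.43},
]
state_by_id = {
    "vol-0aaa": {"attached": False, "days_since_activity": 180},
    "vol-0bbb": {"attached": False, "days_since_activity": 1},  # snapshot yesterday
}

for finding in find_zombies(cost_rows, state_by_id):
    print(finding)
```

Skipping resources with no state data is a deliberate choice in this sketch: a missing signal is treated as "unknown" rather than "idle", which trades some recall for fewer false positives.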

False positives in zombie detection have real consequences. A cleanup recommendation that fires on an RDS instance that is in active use for a monthly batch job — one that does not show activity in a 14-day window but runs on the 28th of every month — will cause engineers to stop trusting zombie alerts entirely. Getting the detection model right is the prerequisite for getting engineers to act on the output.
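The monthly batch job case is easy to reproduce. In this sketch the run dates are invented, and the 35-day alternative is only one possible choice of window that covers a monthly cadence:

```python
from datetime import date, timedelta


def active_in_window(activity_dates, today, window_days):
    """True when any recorded activity falls inside the lookback window."""
    cutoff = today - timedelta(days=window_days)
    return any(cutoff <= d <= today for d in activity_dates)


# Hypothetical batch job that runs on the 28th of each month.
batch_runs = [date(2024, 3, 28), date(2024, 4, 28)]
today = date(2024, 5, 20)

print(active_in_window(batch_runs, today, window_days=14))  # misses the April run
print(active_in_window(batch_runs, today, window_days=35))  # sees the April run
```

With a 14-day window the instance looks idle on May 20 even though it ran three weeks earlier; a window longer than the longest expected job cadence avoids that particular false positive.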

The detection-not-deletion model

Automated cleanup, meaning deleting resources without engineer confirmation, is the wrong model for most teams. The reasons are practical: a false positive that deletes a volume holding production backups is not recoverable. Releasing an EIP that a service was still routing traffic through causes an outage. The cost of a zombie resource is a slow accumulation of waste. The cost of a false-positive auto-deletion is immediate and potentially severe.

The right model is detection routed to the resource owner with enough context to make a cleanup decision in under two minutes. That means: resource type, resource ID, estimated monthly cost, attachment status, last-active timestamp, and the team attribution signal (tag, Terraform module, or CODEOWNERS). An engineer who receives that information in their team's Slack channel can approve or dismiss the cleanup in a single message, without needing to log into the AWS console and cross-reference billing data manually.
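The context fields listed above map naturally onto a small record. The sketch below is illustrative only: the class name, field names, and message layout are hypothetical, and delivering the text to Slack is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ZombieFinding:
    resource_type: str
    resource_id: str
    monthly_usd: float
    attachment_status: str
    last_active: datetime
    owner_signal: str  # e.g. a tag, Terraform module, or CODEOWNERS entry


def format_message(finding):
    """Render one finding as plain text that can be read in under a minute."""
    return "\n".join([
        f"Possible zombie: {finding.resource_type} {finding.resource_id}",
        f"Estimated cost: ${finding.monthly_usd:.2f}/month",
        f"Attachment status: {finding.attachment_status}",
        f"Last active: {finding.last_active:%Y-%m-%d}",
        f"Attributed via: {finding.owner_signal}",
        "Reply 'approve' to schedule cleanup or 'dismiss' to keep it.",
    ])


# Hypothetical finding for the detached volume example.
example = ZombieFinding(
    resource_type="EBS volume",
    resource_id="vol-0aaa",
    monthly_usd=50.0,
    attachment_status="detached",
    last_active=datetime(2023, 9, 12, tzinfo=timezone.utc),
    owner_signal="tag:team=payments",
)
print(format_message(example))
```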
