💸 When Tests Cost More Than Bugs: A Costly Lesson from AWS
Not all bugs show up in your test cases; some hide in your cloud bill.
What started as a normal day took a sharp turn when we noticed a huge spike in our AWS billing dashboard. It wasn’t your usual fluctuation. I opened up the AWS Cost Explorer, downloaded the CSV data, and rolled up my sleeves for some deep-dive investigation.
🕵️‍♂️ Tracing the Cost Monster
Hours went by as I combed through event history logs, trying to pinpoint what exactly triggered the surge. Finally, one event caught my eye.
But before we go there, let me give you some context.
⚙️ How Our Test Infrastructure Works
When we trigger a test, our system spins up temporary resources. Once the tests are done, a cleanup process kicks in to destroy the instances. We use auto-scaling groups (ASGs) with powerful c7i.xlarge instances to process complex data. These are not cheap machines, and our ASG is configured to scale up to 9 instances max during high-load testing.
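For illustration, here's a minimal boto3 sketch of what a throwaway test ASG like this could look like, paired with the cleanup call that's supposed to follow. The launch template name, subnets, and tag values are placeholders, not our actual configuration.

```python
import boto3

asg = boto3.client("autoscaling")

# Hypothetical spin-up of a temporary test ASG (names and subnets are placeholders).
asg.create_auto_scaling_group(
    AutoScalingGroupName="perf-test-run-42",
    LaunchTemplate={"LaunchTemplateName": "qa-c7i-xlarge", "Version": "$Latest"},
    MinSize=0,
    MaxSize=9,  # scales up to 9 instances during high-load testing
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
    Tags=[{"Key": "Environment", "Value": "Test", "PropagateAtLaunch": True}],
)

# ...tests run here...

# The cleanup that must always happen afterwards; ForceDelete also terminates
# any instances still attached to the group.
asg.delete_auto_scaling_group(
    AutoScalingGroupName="perf-test-run-42",
    ForceDelete=True,
)
```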
🔥 The Oversight That Burned a Hole
On May 12, an ASG was spun up for a test run. But here’s the kicker:
👉 The cleanup process failed :|
It just sat there running. EVERY. SINGLE. DAY.
The daily cost graph? Flatlined from May 12 to June 20—steady burn. I checked the event history for June 20, and finally saw that a teammate had manually deleted the rogue ASG. The cost graph immediately dropped after that.
In total, those idle c7i.xlarge instances ran for 1,944 hours. With zero free tier, we paid the full price 💥
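For a rough sense of scale: c7i.xlarge on-demand pricing is on the order of $0.18 per hour (it varies by region and changes over time), so 1,944 instance-hours works out to roughly $350, before EBS volumes and data transfer are counted.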
🧠 What Could We Have Done Differently?
This incident wasn’t just a costly mistake. It was a wake-up call. Here’s what we’re taking away from it:
✅ 1. Implement Cleanup Validation
Automation is great—when it works. But we should validate that cleanup actually happened after test execution.
🔹 Suggestion: Add a step to your CI/CD pipeline or monitoring tool that verifies no resources are left behind post-test.
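Here's a minimal sketch of such a check, assuming an Environment: Test tag convention (the tag names are examples, not anything AWS enforces). Run it as the last pipeline step and let the non-zero exit code fail the job if anything survived teardown:

```python
import sys
import boto3

def assert_no_test_asgs(region="us-east-1"):
    """Fail loudly if any ASG tagged Environment=Test is still alive post-test."""
    asg = boto3.client("autoscaling", region_name=region)
    leftovers = []
    for page in asg.get_paginator("describe_auto_scaling_groups").paginate():
        for group in page["AutoScalingGroups"]:
            tags = {t["Key"]: t["Value"] for t in group.get("Tags", [])}
            if tags.get("Environment") == "Test":
                leftovers.append(group["AutoScalingGroupName"])
    if leftovers:
        print(f"Cleanup failed, ASGs still running: {leftovers}")
        sys.exit(1)  # non-zero exit marks the CI job as failed
    print("No test ASGs left behind.")

if __name__ == "__main__":
    assert_no_test_asgs()
```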
✅ 2. Monitor ASG Lifecycle Events
AWS provides detailed event logs. We just need to actually use them proactively.
🔹 Suggestion: Set up CloudWatch Alarms or EventBridge rules to alert if an ASG exists longer than expected, say 6 hours.
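One possible shape for this, sketched as a Lambda invoked by an hourly EventBridge rate rule; the 6-hour limit and the SNS topic ARN are assumptions to adapt:

```python
import boto3
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=6)  # assumed upper bound for a test ASG's lifetime

def lambda_handler(event, context):
    """Invoked on a schedule (e.g. rate(1 hour)); alerts on long-lived ASGs."""
    asg = boto3.client("autoscaling")
    now = datetime.now(timezone.utc)
    stale = []
    for page in asg.get_paginator("describe_auto_scaling_groups").paginate():
        for group in page["AutoScalingGroups"]:
            if now - group["CreatedTime"] > MAX_AGE:
                stale.append(group["AutoScalingGroupName"])
    if stale:
        # Placeholder topic ARN; subscribe email or Slack to it.
        boto3.client("sns").publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:infra-alerts",
            Subject="ASGs running longer than expected",
            Message="Older than 6 hours: " + ", ".join(stale),
        )
    return {"stale": stale}
```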
✅ 3. Enforce TTL (Time-To-Live) on Test Resources
If your infra is temporary, treat it like milk: add an expiry date.
🔹 Suggestion: Use AWS Lambda with scheduled checks to terminate aged ASGs or EC2 instances tagged as “test”.
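As a rough sketch of that reaper, here's a scheduled Lambda that terminates running EC2 instances tagged Environment: Test once they pass their TTL (the tag convention and the 6-hour TTL are assumptions). Note the caveat in the comment: instances owned by an ASG get replaced, so the group itself also needs to be scaled down or deleted.

```python
import boto3
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=6)  # assumed expiry for anything tagged as test infra

def lambda_handler(event, context):
    """Scheduled reaper: terminate test-tagged EC2 instances older than TTL."""
    ec2 = boto3.client("ec2")
    now = datetime.now(timezone.utc)
    expired = []
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["Test"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                if now - inst["LaunchTime"] > TTL:
                    expired.append(inst["InstanceId"])
    if expired:
        # Caveat: if an ASG owns these instances, it will launch replacements
        # unless the group is scaled to zero or deleted as well.
        ec2.terminate_instances(InstanceIds=expired)
    return {"terminated": expired}
```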
✅ 4. Use Cost Anomaly Detection (Seriously)
AWS has this feature for a reason. It doesn’t cost extra and could’ve saved us a painful invoice.
🔹 Suggestion: Enable Cost Anomaly Detection for services like EC2, EBS, and S3 used in test environments.
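Setup can be scripted too. A rough boto3 sketch of a service-level monitor plus a daily email subscription; the email address and the $50 threshold are placeholders:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer API

# A dimensional monitor that tracks spend per AWS service (covers EC2, EBS, S3, ...).
monitor = ce.create_anomaly_monitor(
    AnomalyMonitor={
        "MonitorName": "service-spend-monitor",
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }
)

# Daily email digest for anomalies above roughly $50 of cost impact.
ce.create_anomaly_subscription(
    AnomalySubscription={
        "SubscriptionName": "daily-anomaly-alerts",
        "MonitorArnList": [monitor["MonitorArn"]],
        "Subscribers": [{"Type": "EMAIL", "Address": "qa-team@example.com"}],
        "Frequency": "DAILY",
        "Threshold": 50.0,
    }
)
```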
✅ 5. Tag Everything
If it doesn’t have a tag, it doesn’t exist, at least as far as cost tracking is concerned.
🔹 Suggestion: Use mandatory tagging (e.g., Environment: Test, Owner: QA) and periodically audit resources by tag.
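A quick audit sketch that lists running EC2 instances missing those mandatory tag keys (the keys mirror the example tags above and are not enforced by AWS):

```python
import boto3

REQUIRED_TAGS = {"Environment", "Owner"}  # example mandatory keys

def untagged_instances(region="us-east-1"):
    """Return IDs of running instances missing any mandatory tag key."""
    ec2 = boto3.client("ec2", region_name=region)
    missing = []
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                keys = {t["Key"] for t in inst.get("Tags", [])}
                if not REQUIRED_TAGS <= keys:
                    missing.append(inst["InstanceId"])
    return missing

if __name__ == "__main__":
    print(untagged_instances())
```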
✅ 6. Review Billing Weekly
Don’t wait till the end of the month.
🔹 Suggestion: Set a recurring calendar reminder for weekly cost reviews using AWS Cost Explorer or Budgets.
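The review itself doesn't have to mean clicking through the console. A small Cost Explorer pull (sketch, using the boto3 ce client) shows the last seven days of spend grouped by service:

```python
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
end = date.today()
start = end - timedelta(days=7)

resp = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for day in resp["ResultsByTime"]:
    print(day["TimePeriod"]["Start"])
    for group in day["Groups"]:
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount >= 1:  # skip sub-dollar noise
            print(f"  {group['Keys'][0]}: ${amount:,.2f}")
```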
This incident was a classic case of “It’s just test infra, what could go wrong?”
Turns out, a lot. Running c7i.xlarge instances for almost 2,000 hours without any actual workload is an expensive reminder that testing doesn’t stop at test cases; it includes infrastructure hygiene too.
To every QA and DevOps engineer reading this:
💡 Add cost-awareness to your testing mindset.
It’s not just about what the code does. It’s also about what the cloud keeps running behind the scenes.
