Why Engineering Teams Skip Resource Right-Sizing
And the Policy-Driven Framework That Actually Fixes It


Here's something that comes up in almost every cloud cost conversation: non-production environments (dev, staging, and testing) are eating a huge chunk of your cloud bill. In large enterprises, they often account for more than 40% of total cloud spend. Yet they're almost always the last thing anyone actually governs.

This post isn't about blaming teams or pointing out obvious waste. Engineers know the dev server ran all weekend. They know staging spun up on Thursday and nobody shut it down. The problem is that the manual process to fix it is more painful than just ignoring it. That's what we need to change. We'll walk through why manual governance always breaks down, why the common fixes miss the point, and what a practical, policy-driven approach actually looks like, with real numbers from a production implementation.

The Real Cost of Doing Nothing

Let's start with what happens when you don't govern non-production environments. The patterns are almost identical across every organization we've seen:

  1. Dev environments run nights and weekends, even though no one is using them.

  2. Shutdown and startup depend on people remembering, or on a spreadsheet someone updates inconsistently.

  3. As teams grow, the whole thing falls apart because it was never automated to begin with.

Here's a real example.

A large consumer goods enterprise (think global FMCG) had a big cloud estate: SAP HANA in production, plus lots of non-production infrastructure for dev, testing, upgrades, and continuous validation. Teams owned their own AWS accounts. Everyone knew what they were responsible for.

But runtime enforcement was manual. Someone had to remember to shut things down. And in practice? They didn't. Not consistently.

The math was blunt: 302 non-production resources across 4 AWS accounts. These environments were actively used for maybe 4–10 hours a day on weekdays, but they were running 24/7.

302 non-prod resources governed

81 automated schedules running daily

24% reduction in non-prod cloud costs

< 4 weeks to measurable results

After setting up automated scheduling across all 302 resources, 81 schedules ran every day, shutting things down outside business hours, bringing them back up when needed. The result was a 24% reduction in non-production cloud costs. No code changes. No re-architecture. Just enforcement.

Why Manual Governance Always Breaks Down

Most articles about cloud costs say the problem is that engineers don't realize how much they're wasting. That's wrong. Engineers know exactly what's happening. They just can't fix it manually without creating more problems than they solve. Here's what "fixing it manually" actually looks like at any company with more than a few teams:

  1. Someone sends a Slack message every Friday: "Hey everyone, please shut down your dev instances."

  2. Six out of twelve people actually do it.

  3. The other six were in back-to-back meetings, or had an integration test running that "should be done by Monday."

  4. Monday morning: three pages of Slack messages asking why staging is down.

  5. The person who remembered to shut things down now has to explain why they shouldn't have.

This cycle doesn't get better with stronger reminders. It only breaks when you change the mechanism, moving enforcement from people to the system.

The root cause isn't laziness or ignorance. It's that the manual process has too much friction, too many edge cases, and no real enforcement. You fix that by automating the enforcement, not by nagging people harder.


Why Traditional Right-Sizing Approaches Miss the Point

There are three standard tools people reach for when someone raises the cloud cost alarm. Each one falls short for non-production environments. Here's why:

  1. Manual Shutdown Checklists: Every team has tried this. It doesn't scale. People are in different time zones. Someone always has a test running. The checklist becomes "optional" within a month. The larger the team, the faster this falls apart.

  2. Instance Right-Sizing (Picking Smaller Machines): This is the recommendation you'll get from most FinOps platforms: profile your workload, find a smaller instance type, test it, and resize it. This is the right approach for production. It makes total sense in a context where you need to understand workload patterns, run performance benchmarks, and validate carefully.

For non-production? The math doesn't work. You spend two weeks profiling a dev environment that costs $800/month. You save $150/month by switching to a smaller instance. Meanwhile, that same environment runs 16 hours a day when it's only used for 4. You're optimizing at the wrong level. Runtime is the lever, not instance size.

  3. Budget Alerts: Alerts tell you what's already happened. They don't stop it. If there's no automated response attached to the alert, the behavior doesn't change; you just get an email confirming that the thing that was already happening is still happening, but louder.
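To put rough numbers on the right-sizing example, here's a back-of-the-envelope sketch. It leans on one big assumption that isn't from the case study: that the whole $800/month is hourly compute that stops billing while the environment is off. Storage and other always-on charges would shrink the scheduling figure, so treat the output as an upper bound, not a forecast.

```python
# Back-of-the-envelope comparison: resize the instance vs. schedule its runtime.
# Assumption (not from the case study): the full monthly cost is hourly compute
# that stops billing while the environment is shut down.

MONTHLY_COST = 800.0         # current cost of the dev environment, USD/month
RIGHT_SIZING_SAVING = 150.0  # saving from a smaller instance, USD/month
HOURS_RUNNING_PER_DAY = 16   # how long the environment runs today
HOURS_USED_PER_DAY = 4       # how long it is actually used


def scheduling_saving(monthly_cost: float, running: float, used: float) -> float:
    """Monthly saving if runtime is cut from `running` to `used` hours per day."""
    if running <= 0 or used < 0 or used > running:
        raise ValueError("expected running > 0 and 0 <= used <= running")
    return monthly_cost * (1 - used / running)


schedule_saving = scheduling_saving(
    MONTHLY_COST, HOURS_RUNNING_PER_DAY, HOURS_USED_PER_DAY
)
print(f"Right-sizing saves ${RIGHT_SIZING_SAVING:.0f}/month")
print(f"Scheduling saves   ${schedule_saving:.0f}/month")
```

Under that assumption, cutting runtime to the hours actually used saves several times what the smaller instance does, and it needs no profiling.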

The Policy-Driven Framework: Three Layers

What we implemented with the FMCG enterprise is a three-layer structure. Each layer works independently, but together they create something sustainable, not a one-time cleanup that drifts back in 3 months.

Layer 1: Discover Resources and Define Schedules

Resources get grouped by environment, team, or project using tags. No manual catalogues. No spreadsheets. If a resource gets the right tag, it gets picked up automatically.

Each group gets a schedule that matches how it's actually used:

  1. Dev environments: On during working hours (e.g., 07:00–18:00 UTC on weekdays). Straightforward for environments with predictable daily usage.

  2. Staging and integration environments: Turn on when the CI/CD pipeline fires, shut down 30 minutes after it finishes. No one has to think about it.

  3. QA and UAT environments: Active only during structured testing windows, often just two weeks per month during sprint cycles.

Because it's tag-based, new resources provisioned by anyone on the team get automatically governed from day one. You don't have to retroactively add them.
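Here's a minimal sketch of what "tag in, schedule out" can look like for the fixed-window dev case. Everything in it is hypothetical: the `Schedule` class, the `SCHEDULES` table, and the `environment` tag key are illustrative names, not the platform's API. Pipeline-triggered staging schedules aren't covered.

```python
from dataclasses import dataclass
from datetime import datetime, time, timezone


@dataclass(frozen=True)
class Schedule:
    """A weekly on-window in UTC. Hypothetical type, for illustration only."""
    start: time               # inclusive
    end: time                 # exclusive
    weekdays: frozenset       # 0 = Monday ... 6 = Sunday

    def is_on(self, now: datetime) -> bool:
        # `now` must be timezone-aware; it is normalized to UTC before checking.
        now = now.astimezone(timezone.utc)
        return now.weekday() in self.weekdays and self.start <= now.time() < self.end


# Assumed tag convention: the "environment" tag value selects the schedule.
SCHEDULES = {
    "dev": Schedule(time(7, 0), time(18, 0), frozenset(range(5))),
}


def should_run(tags: dict, now: datetime) -> bool:
    """Decide whether a resource should be running at `now`.

    Resources without a matching tag are left alone (treated as always on).
    """
    schedule = SCHEDULES.get(tags.get("environment", ""))
    return True if schedule is None else schedule.is_on(now)
```

A dev-tagged resource checked on a Wednesday at 09:00 UTC comes back as "should run"; the same resource on a Saturday does not. Leaving untagged resources alone is a design choice in this sketch, and a stricter setup might flag them instead.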

Layer 2: Automated Enforcement (No Human Required)

Once schedules are set, the platform runs them every day without exception. No checklist. No Slack reminder. No reliance on anyone's memory. In the FMCG case, this was 81 schedules running across 302 resources, completely hands-off after the initial setup.

This is the core of why the model works. Enforcement becomes structural: it's part of how the infrastructure operates, not a process that depends on people.
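We can't show the platform's internals here, but the general shape of hands-off enforcement is a reconciliation loop: compare what is running with what the schedule says should be running, then act on the difference. The sketch below is a generic illustration under that assumption. `Resource` and `plan_actions` are hypothetical names, and the observed state is hard-coded rather than read from AWS.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Resource(NamedTuple):
    resource_id: str
    running: bool      # observed state
    should_run: bool   # desired state according to the schedule


def plan_actions(resources: Iterable[Resource]) -> List[Tuple[str, str]]:
    """Return ("start" | "stop", resource_id) for every resource out of line."""
    actions = []
    for r in resources:
        if r.running and not r.should_run:
            actions.append(("stop", r.resource_id))
        elif not r.running and r.should_run:
            actions.append(("start", r.resource_id))
    return actions


# Illustrative fleet; the IDs are made up.
fleet = [
    Resource("dev-app-1", running=True, should_run=False),
    Resource("dev-db-1", running=False, should_run=True),
    Resource("qa-app-1", running=True, should_run=True),
]
print(plan_actions(fleet))
```

Run on a timer, a loop like this is idempotent: resources already in the right state produce no action, so a missed or repeated run does no harm. In a real deployment the planned actions would be handed to the cloud provider's start and stop APIs.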

Layer 3: Safe Overrides (So It Doesn't Break Developer Flow)

This is the part most governance systems miss, and it's why engineers resist automation in the first place. If a schedule shuts down an environment in the middle of an important release, people will rightly reject the entire system.

Overrides exist so teams can extend runtime when they actually need to. But they're designed with guardrails:

  1. Time-bounded: You can extend runtime, but not indefinitely. Every override has a maximum duration.

  2. Auto-expiry: When the override window ends, the schedule picks back up automatically. No one has to "remember to re-enable governance."

  3. Audit logged: Every override is recorded: who requested it, why, and for how long. This creates accountability without bureaucracy.

  4. Self-service: For standard dev environments, overrides don't need manager approval. If every extension requires a ticket or sign-off, engineers will just work around the system entirely.
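The four guardrails above fit in a few lines of code. This is a sketch under stated assumptions, not the platform's implementation: the 8-hour cap, the `Override` type, and the in-memory `AUDIT_LOG` are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List

MAX_OVERRIDE = timedelta(hours=8)   # assumed cap, not a figure from the case study
AUDIT_LOG: List[Dict] = []          # stand-in for a real audit store


@dataclass(frozen=True)
class Override:
    resource_id: str
    requested_by: str
    reason: str
    starts_at: datetime
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        # Auto-expiry: once `expires_at` passes, the normal schedule applies again.
        return self.starts_at <= now < self.expires_at


def request_override(resource_id: str, requested_by: str, reason: str,
                     duration: timedelta, now: datetime) -> Override:
    """Grant a self-service override, clamped to MAX_OVERRIDE, and log it."""
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    granted = min(duration, MAX_OVERRIDE)   # time-bounded
    override = Override(resource_id, requested_by, reason, now, now + granted)
    AUDIT_LOG.append({                      # audit logged
        "resource_id": resource_id,
        "requested_by": requested_by,
        "reason": reason,
        "granted_hours": granted.total_seconds() / 3600,
        "expires_at": override.expires_at.isoformat(),
    })
    return override
```

No approval step appears anywhere in `request_override`, which is the self-service part. The scheduler would keep a resource running while `is_active` is true and fall back to the schedule afterwards, so nobody has to switch governance back on.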


How It All Works End to End

Here's how the FMCG implementation actually played out:

  1. Four AWS accounts were connected to the governance platform

  2. Tag-based discovery found all 302 non-production resources automatically, with no manual input

  3. Engineering team leads spent about 30 minutes in a workshop defining their usage patterns

  4. Schedules were first deployed to a small subset for a one-week validation period

  5. After that, they rolled out to all 302 resources

  6. 81 automated schedules were running daily by the end of week three

The platform used AWS-native role-based access control (RBAC), which in AWS means IAM roles and policies, to map onto the existing team ownership structure. Teams only governed their own resources. There was no extra admin burden on the central team, and no learning curve for engineers.

What Changed: Before vs. After

The 24% reduction in non-production costs showed up in the first billing cycle. The mechanism is simple: dev environments that previously ran 168 hours a week are now running roughly 55 hours a week. That's a 67% reduction in runtime for the largest environment category, with zero changes to any application.
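The runtime figures are easy to check. The sketch below assumes the dev schedule is the 07:00–18:00 weekday window given as an example in Layer 1, which comes out at exactly 55 hours a week. The schedules actually deployed may have differed.

```python
HOURS_PER_WEEK = 24 * 7            # 168 hours of always-on runtime

# Assumed schedule: 07:00-18:00 on five weekdays (the example dev window).
scheduled_hours = (18 - 7) * 5     # 55 hours per week

reduction = 1 - scheduled_hours / HOURS_PER_WEEK
print(f"{scheduled_hours} of {HOURS_PER_WEEK} hours -> {reduction:.0%} less runtime")
# prints "55 of 168 hours -> 67% less runtime"
```

The 67% applies to runtime in that one environment category, while the 24% applies to total non-production cost, so the two figures measure different things.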

These numbers aren't impressive because the approach is complicated. They're achievable because the approach is simple. It works because it addresses how non-production infrastructure is actually used, not how we wish it were used.


Making This Stick: Turning a Fix into an Operating Model

Most cloud cost optimization projects follow a predictable arc. A FinOps team finds waste, shuts things down, reports savings, and declares victory. Three months later, the costs creep back because nothing actually changed about how teams provision and run infrastructure.

Policy-driven scheduling is different because the savings are structural. Once a schedule is set, it runs every day without anyone deciding to run it. But to keep that working as your organization grows, you need to integrate governance at four points in your standard process:

  1. Account onboarding: Every new AWS account should be enrolled in scheduling governance during standard provisioning, not added later as a cleanup task. Day one, governance is in place.

  2. Resource tagging standards: Consistent environment and team tags let resources be grouped automatically. Enforcing tags at creation (via AWS Service Control Policies) eliminates the need to manually onboard resources.

  3. Ownership mapping via RBAC: Access controls should mirror how your teams are actually organized. Teams govern their own resources within the framework — less admin burden on the central team, clearer ownership for everyone.

  4. Quarterly schedule reviews: Utilization data should be reviewed every quarter. Teams change their hours, sprint cadences shift, and projects end. Reviewing schedules proactively catches misalignment before it turns into waste.
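For the tagging standard, here's a hedged sketch of what enforcing tags at creation with a Service Control Policy can look like. It builds the policy document in Python so the required tag keys are easy to change. The tag keys `environment` and `team` are assumptions, and the policy only covers EC2 instance launches; other resource types would need their own statements.

```python
import json


def require_tags_scp(tag_keys):
    """Build an SCP that denies ec2:RunInstances when a required tag is missing."""
    statements = []
    for key in tag_keys:
        # Sid values must be alphanumeric, so strip anything else from the key.
        sid_part = "".join(ch for ch in key.title() if ch.isalnum())
        statements.append({
            "Sid": f"DenyRunInstancesWithout{sid_part}Tag",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # The Null condition is true when the tag is absent from the request.
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


# Assumed tag keys; substitute whatever your tagging standard defines.
print(json.dumps(require_tags_scp(["environment", "team"]), indent=2))
```

Attached at the organization or OU level, a policy like this rejects untagged instance launches, so every new instance arrives with the tags that discovery relies on. It checks only that the tag is present, not that its value is valid.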

The Takeaway

Engineering teams aren't skipping resource right-sizing because they don't understand the opportunity. They're skipping it because the traditional approaches introduce friction, risk, and overhead that don't feel worth it.

The answer isn't trying harder. It's changing the mechanism.

When you replace the manual process with policy-driven, autonomous scheduling, right-sizing stops being a task someone has to do and starts being something the system enforces. Resources run when needed, shut down when not. Savings repeat every month without anyone revisiting the decision.

The mechanism is proven. The path is straightforward: connect accounts, define schedules, enforce automatically, measure, and adjust. The only real question is when you start.
