Case Studies – Cloudico

01

Cloud Infrastructure · Migration sprint

Series B FinTech · ~80 engineers · payment infrastructure for ~$1.2B annual GMV · based in North America

Duration: 9 weeks

Format: Fixed-price sprint

Year: 2025

From seven AWS accounts nobody owned to an Organizations structure their team can actually run.

Account sprawl had become a security review blocker. Three SOC 2 auditors had flagged it in successive years. The fix wasn’t technical. It was political and operational, and that’s where most of the engagement actually went.

Accounts consolidated: 7 → 4 (under a clean Organizations structure with SCPs)

Drift incidents (prior 12 mo): 38 (manual changes that broke staging or prod)

Drift incidents (post): 2 (in the 6 months following handover)

SOC 2 audit findings cleared: 11 (including all three multi-year carryovers)

Architecture migration · before, during, after · 9-week sprint

Prod: prod-account (mixed) → audit + freeze → move & isolate → prod-payments (clean)

Staging: eng-staging-1 + eng-staging-2 → merge inventory → re-IaC → staging (Terraform-only)

Data: analytics + bi-sandbox → classify PII → SCP guardrails → data-prod (least-priv)

Sandbox: eng-sandbox + 1 forgotten → archive forgotten → redirect dev → sandbox (dev only)

What was actually broken.

Seven AWS accounts had accumulated over four years of incidental decisions. No clear owner per account. Cross-account access was a tangle of manual IAM roles, half of them with *:* trust policies. The team couldn’t answer the auditor’s simplest question: “which account does production payments traffic terminate in?” Because the honest answer was: two of them, and they didn’t agree.

This isn’t a unique story. It’s how most Series B cloud estates look when growth outpaces operational discipline. The problem wasn’t the technology: the AWS primitives were fine. The problem was that nobody owned the structure, and every engineer who’d touched it had moved on or moved teams.

What we shipped, in order.

WK 1-2

Read-only audit and inventory

Catalogued every resource across all seven accounts. Identified 312 resources without tags, 41 IAM roles unused for 90+ days, 18 security groups with 0.0.0.0/0 inbound. Built the resource ownership map by interviewing 11 engineers individually — the political work, not the technical work.
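A minimal sketch of two of the audit checks described above, written as pure functions so they can run against exported inventory data. The input shapes mirror what boto3 returns from `describe_security_groups` and IAM `get_role`; the sample records, function names, and the choice to treat never-used roles as stale are illustrative assumptions, not the tooling used on the engagement.

```python
from datetime import datetime, timedelta, timezone


def open_inbound_groups(security_groups):
    """Return IDs of security groups with any inbound rule open to 0.0.0.0/0."""
    flagged = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
            if "0.0.0.0/0" in cidrs:
                flagged.append(group["GroupId"])
                break  # one open rule is enough to flag the group
    return flagged


def stale_roles(roles, now, max_idle_days=90):
    """Return names of roles unused for max_idle_days or more.

    Assumption: a role with no recorded last-used date counts as stale.
    """
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for role in roles:
        last_used = role.get("RoleLastUsed", {}).get("LastUsedDate")
        if last_used is None or last_used <= cutoff:
            stale.append(role["RoleName"])
    return stale


# Hypothetical sample inventory, for illustration only.
SAMPLE_GROUPS = [
    {"GroupId": "sg-1", "IpPermissions": [{"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-2", "IpPermissions": [{"IpRanges": [{"CidrIp": "10.0.0.0/16"}]}]},
]
NOW = datetime(2025, 1, 1, tzinfo=timezone.utc)
SAMPLE_ROLES = [
    {"RoleName": "old-deploy",
     "RoleLastUsed": {"LastUsedDate": datetime(2024, 6, 1, tzinfo=timezone.utc)}},
    {"RoleName": "ci-runner",
     "RoleLastUsed": {"LastUsedDate": datetime(2024, 12, 15, tzinfo=timezone.utc)}},
    {"RoleName": "never-used", "RoleLastUsed": {}},
]
```

Keeping the checks free of API calls means the same functions work on a live pull or on a saved snapshot, which suits a read-only audit phase.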

WK 3-4

Target architecture proposal · signed off internally

Four-account model: prod-payments, staging, data-prod, sandbox. Each with named owner, SCPs limiting blast radius, IAM Identity Center for human access. Walked the architecture through three review meetings with their security, infrastructure, and engineering leads before any changes shipped.
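To make "SCPs limiting blast radius" concrete, here is a hedged sketch of one common guardrail shape: deny activity outside an approved region list and deny leaving the organization. The region list, statement IDs, and exempted global services are assumptions for illustration; the actual policies from this engagement are not reproduced here.

```python
import json


def region_guardrail_scp(allowed_regions):
    """Build an SCP document as a dict (illustrative, not the client's policy)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",  # hypothetical Sid
                "Effect": "Deny",
                # Global services are exempted so the region condition
                # does not block them (assumed exemption list).
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)}
                },
            },
            {
                "Sid": "DenyLeavingOrganization",  # hypothetical Sid
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
        ],
    }


# Hypothetical region choice for a North America based team.
POLICY = region_guardrail_scp(["us-east-1", "us-west-2"])
POLICY_JSON = json.dumps(POLICY, indent=2)
```

Generating the document from a function keeps per-account variations (for example a tighter region list for prod-payments) in code review rather than in console edits.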

WK 5-7

Migration · non-prod first, prod last

Started with sandbox and staging consolidation. Built the Transit Gateway with their network team in the room. Migrated data-prod with a 48-hour cutover window and full rollback plan. Prod-payments moved over a weekend with their incident team on call — zero customer-facing impact.

WK 8-9

Handover and SOC 2 prep

Wrote 14 runbooks covering account creation, IAM lifecycle, SCP exception process, and the drift detection workflow. Sat in on the SOC 2 readiness meeting with their auditors and walked through the new structure. All 11 prior findings closed.

What didn’t work

Our original plan called for closing two of the seven accounts. We had to keep one of them open because a third-party vendor integration had hard-coded its account ID into webhook payloads, and we couldn’t coordinate the change with the vendor in time. We left it in the structure as “legacy-webhook-only” with tight SCPs preventing anything else from running there. It’s probably still there.

What changed operationally.

The numbers in the stat row are real, but the deeper change is harder to chart: their team can now answer the auditor’s question. Every account has an owner. Every IAM role has a justification document. Every SCP exception goes through a written process that lives in their wiki, not in someone’s head.

Six months after handover, the SRE lead emailed: “We onboarded two new engineers last week and they were able to ship a change to staging on day three. That used to take a month.” That’s the engagement’s real result — not the migration itself, but the operational ground that the migration cleared.

The technical migration was the easy half. The political work — getting eleven engineers to agree on who owned what, before we changed anything — was where the engagement actually lived. We spent more time in conference rooms than in Terraform that month.

Hassan Ali

Founder · Lead engineer on this engagement

02

Cost Optimization · Audit + implementation

Series C B2B SaaS · ~140 engineers · collaboration platform · $94k/month AWS spend at engagement start · EU-based

Duration: 3 wk audit + 8 wk impl

Format: $8k audit → $22k sprint

Year: 2024

$94k a month, down to $31k — without touching a single product feature.

Their CFO had circled the AWS line item three months running. The infra team knew they were over-provisioned but couldn’t justify which workloads to cut without a methodical audit. We ran the audit, then implemented the top 11 of 18 recommendations.

Monthly spend (before): $94k (baseline averaged over 90 days pre-engagement)

Monthly spend (after): $31k (after 8 weeks of implementation work)

Reduction: −67% (sustained at 6 months post-handover)

Annual run-rate saved: $756k (after our $30k total engagement cost)

Monthly spend · the cuts, in order · $94k → $31k

Starting baseline: $94,000

Right-sized EC2 fleet: $77,200

Savings Plans + RI strategy: $60,150

S3 lifecycle + Glacier moves: $43,800

+ idle resource cleanup & egress fix: $31,000

What the $8k audit found.

Three categories of waste, in order of dollar impact:

Over-provisioned compute ($16.8k/mo recoverable). They were running m5.4xlarge instances for workloads that fit easily on m5.large. Standard story: a senior engineer had specced the original cluster two years prior under load assumptions that never materialized. Nobody re-checked. Right-sizing on its own returned 18% of the monthly bill.

No Savings Plans coverage ($17k/mo recoverable). They were paying full on-demand for ~80% of compute that was running 24/7. A 3-year compute Savings Plan at the right commitment level would have covered it — but nobody had modeled what “the right commitment level” was, so the finance team had vetoed buying any. We built the model.

S3 sat at standard tier ($14.5k/mo recoverable). Log buckets containing 80TB of pre-2023 data that hadn’t been accessed in 18 months. Standard tier. Paying $0.023 per GB-month for cold logs. Lifecycle rules moved most of it to Glacier Instant Retrieval at a fraction of the cost; the buckets nobody had touched in 12 months went to Glacier Deep Archive.
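The lifecycle move in the third category can be sketched as a configuration document. The dict below follows the shape boto3 accepts as `LifecycleConfiguration` in `put_bucket_lifecycle_configuration`; the rule ID, prefix, and day thresholds are hypothetical. One simplification to note: S3 lifecycle transitions key off object age, not last access, so this is an approximation of the "not touched in N months" rule described above.

```python
def log_lifecycle_rules(prefix, instant_after_days, deep_after_days):
    """Build a tiering rule: Glacier Instant Retrieval first, then Deep Archive.

    All thresholds are caller-supplied assumptions, based on object age.
    """
    if deep_after_days <= instant_after_days:
        raise ValueError("Deep Archive transition must come after Instant Retrieval")
    return {
        "Rules": [
            {
                "ID": "cold-logs-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": instant_after_days, "StorageClass": "GLACIER_IR"},
                    {"Days": deep_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


# Hypothetical values: logs under "logs/" tier down at 90 and 365 days.
LIFECYCLE = log_lifecycle_rules("logs/", 90, 365)
```

Applying it would be a single `put_bucket_lifecycle_configuration` call per bucket, which is why this kind of saving tends to be cheap to ship once the access patterns are known.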

What we did not recommend.

The audit identified 18 cost-saving opportunities. We implemented the top 11. The other 7 we explicitly recommended against, or deferred. Examples of what we didn’t cut:

SKIP

Spot instances for batch jobs

Theoretical savings: ~$3k/mo. Real cost: their batch jobs were on the critical path for two customer-facing features. The team didn’t have the operational maturity to handle spot interruptions cleanly. We recommended revisiting in 6 months after they’d built more graceful retry semantics.

SKIP

Multi-region degradation to single-region

Theoretical savings: $5k/mo. Real cost: their enterprise contracts required multi-region availability for DR. Cutting that to save 5% of the bill would have been a $40M+ contract risk for a $60k/yr savings. Not even close to worth it.

DEFER

Graviton migration for ARM-compatible workloads

Theoretical savings: ~$4k/mo. Real cost: significant engineering work to validate compatibility across their service mesh. Recommended as a future engagement once cost-per-engineering-hour math improved.

What didn’t work the first time

Our initial Savings Plans recommendation called for a 3-year all-upfront commitment of $36k/mo — the math was perfect for their steady-state. But the CFO refused to commit 3 years at once. We re-modeled around two 1-year commits stacked, which cost slightly more but matched their financial appetite. A 7% optimization is real if it ships; a 9% one nobody approves saves nothing.
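Modeling "the right commitment level" can be sketched as a small search: for each candidate hourly commitment, add up what the commitment costs plus any overflow billed on demand, then pick the cheapest. This is an illustrative model under simplifying assumptions (a single flat discount rate, usage expressed in on-demand dollars per hour); the discount and usage figures are hypothetical, and it is not the model built for this client.

```python
def hourly_cost(commit, on_demand_usage, discount):
    """Cost of one hour under a commitment.

    The commitment is paid whether used or not. It covers
    commit / (1 - discount) dollars of on-demand-equivalent usage;
    anything above that is billed at on-demand rates.
    """
    covered = commit / (1.0 - discount)
    overflow = max(0.0, on_demand_usage - covered)
    return commit + overflow


def best_commitment(usage_series, discount, step=1.0):
    """Grid-search the hourly commitment that minimizes total cost."""
    max_commit = max(usage_series) * (1.0 - discount)
    best = (0.0, float(sum(usage_series)))  # no commitment: all on demand
    commit = 0.0
    while commit <= max_commit + 1e-9:
        total = sum(hourly_cost(commit, u, discount) for u in usage_series)
        if total < best[1]:
            best = (commit, total)
        commit += step
    return best


# Hypothetical usage: flat load versus one spiky hour, 50% assumed discount.
FLAT = best_commitment([100, 100, 100, 100], discount=0.5, step=10)
SPIKY = best_commitment([100, 20, 20, 20], discount=0.5, step=10)
```

The spiky case shows why over-committing hurts: the cheapest commitment covers the steady baseline and leaves the peak on demand. Comparing a 3-year rate against stacked 1-year rates is then a matter of running the same search with each discount.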

What changed operationally.

Their CFO stopped circling the AWS line. The infra team got a quarterly cost review process and a monthly Slack-bot anomaly alert (anything >10% week-over-week pings them). The cost curve now tracks their service growth linearly and predictably, instead of being a quarterly mystery.
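The anomaly rule reduces to a few lines. A sketch, assuming weekly spend totals are already available and that only increases should page anyone; the function name and sample figures are hypothetical.

```python
def week_over_week_anomalies(weekly_spend, threshold=0.10):
    """Return (week_index, fractional_change) for weeks whose spend rose
    more than `threshold` over the previous week.

    Assumption: decreases never alert, and weeks following a zero or
    negative total are skipped to avoid dividing by zero.
    """
    alerts = []
    for i in range(1, len(weekly_spend)):
        prev, cur = weekly_spend[i - 1], weekly_spend[i]
        if prev <= 0:
            continue
        change = (cur - prev) / prev
        if change > threshold:
            alerts.append((i, change))
    return alerts


# Hypothetical weekly totals in dollars.
ALERTS = week_over_week_anomalies([7000, 7350, 8400, 8260])
```

Here only the third week trips the rule (7,350 to 8,400 is about a 14% rise); the 5% rise before it and the small drop after it stay quiet.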

The $756k annual savings works out to roughly $2,070 a day, which paid back our $30k engagement cost in about 15 days. The remainder is shareholder value or new engineering hires, depending on how their CFO chooses to count.

The hardest part wasn’t finding the savings. It was telling the CFO that seven of the recommendations were wrong to ship. A $5k/mo theoretical save isn’t worth a $40M contract risk. Most cost audits we’ve seen would have happily included those numbers in the report.

Hassan Ali

Founder · Lead engineer on this engagement

03

Reliability Engineering · Sprint + 6-month embedded retainer

Series B marketplace · ~110 engineers · consumer-facing platform with weekend traffic spikes · APAC-based

Duration: 7 wk sprint + 6 mo retainer

Format: $22k + $12k/mo

Year: 2024–2025

From 3.2 incidents/week to one every two weeks — without burning out the on-call rotation.

Their on-call rotation was bleeding senior engineers. Two had quit citing burnout. The CEO had asked us to make it stop — ideally without spending another six months hiring SREs they couldn’t find. The fix was 70% process, 30% tooling, and almost none of it was glamorous.

Incident rate (prior): 3.2/wk (P1 + P2 averaged over Q3 of prior year)

Incident rate (post): 0.5/wk (P1 + P2 averaged over 6 months post-handover)

MTTR improvement: −72% (from 47 min median to 13 min)

On-call hours/eng/month: 52 → 14 (hours actually spent paging / responding)

Incidents per week · 12 weeks before vs 12 weeks after · P1 + P2 only

Before (Q3 2024 · 12 weeks, WK 1 to WK 12): total P1+P2 incidents 38 · median MTTR 47 min · weekend pages 14

After (post-handover · 12 weeks, WK 1 to WK 12): total P1+P2 incidents 6 · median MTTR 13 min · weekend pages 1

What was actually broken.

The on-call rotation was running on heroics. Three engineers carried 80% of the pages because they were the only ones who knew how the legacy services failed. Alert fatigue was real: their PagerDuty had 1,200+ active alert rules, and the median engineer ignored ~60% of pages because most of them were noise. When a real incident hit, it took 15+ minutes just to figure out which signal was the actual signal.

The CEO wanted to “hire two more SREs.” That was probably the wrong move — it would have just added two more people to the same broken process. We pushed back: fix the alerts first, then decide if you still need to hire.

What we shipped, in order.

WK 1-2

SLO design with the product team in the room

Most reliability work starts with engineering setting SLOs in isolation. We refused. Spent the first two weeks running SLO workshops with product, customer support, and engineering together. Result: 14 SLOs that everyone agreed represented “what good looks like for the customer,” not “what’s easy to measure.”

WK 3-4

Alert rule audit · 1,200 → 89

Deleted 1,111 alert rules. That’s not a typo. Most were copy-paste duplicates, threshold-based alerts on noisy metrics, or alerts on conditions that nobody had owned in years. The 89 remaining alerts each had: a named owner, a linked runbook, an SLO connection, and a paging severity. If it couldn’t pass all four checks, it didn’t survive.
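The four-check filter is simple enough to express directly. A sketch, assuming alert rules have been exported as dicts; the field names (`owner`, `runbook_url`, `slo`, `severity`) and sample rules are hypothetical, not the schema of any particular paging tool.

```python
# The four checks from the audit, as hypothetical field names.
REQUIRED_FIELDS = ("owner", "runbook_url", "slo", "severity")


def missing_checks(rule):
    """List which of the four required fields a rule lacks or leaves empty."""
    return [field for field in REQUIRED_FIELDS if not rule.get(field)]


def surviving_alerts(rules):
    """Keep only rules that pass all four checks."""
    return [rule for rule in rules if not missing_checks(rule)]


# Hypothetical exported rules.
RULES = [
    {"name": "checkout-latency", "owner": "payments", "runbook_url": "wiki/checkout",
     "slo": "checkout-p99", "severity": "P1"},
    {"name": "disk-80-percent", "owner": "", "runbook_url": None,
     "slo": None, "severity": "P3"},
    {"name": "queue-depth", "owner": "platform", "runbook_url": "wiki/queue",
     "severity": "P2"},
]
KEPT = surviving_alerts(RULES)
```

Reporting `missing_checks` per rule, not just the survivors, gives owners a concrete list of what to fix if they want an alert back.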

WK 5-7

Runbook templating + incident review cadence

Built runbook templates for the top 15 historical incident types. Set up weekly incident review at a fixed time (Wednesdays 4pm), facilitated the first six personally. Drove postmortem quality up by requiring every postmortem to identify exactly one process change, not three or zero.

MO 2-7

Embedded retainer · on-call backup + monthly review

For 6 months after the sprint, Maryam was on a $12k/mo retainer covering: (a) backup on-call rotation on weekends, (b) monthly SLO review with their product team, (c) facilitating their incident reviews. Worked herself out of the role by month 6. Their team owns it now.

What didn’t work / what we got wrong

We initially proposed eliminating the “follow-the-sun” rotation entirely — we thought their geography didn’t justify it. We were wrong. Two months in, a real APAC-hours incident caught us with no on-call coverage in the right time zone, and we had to walk that decision back. The on-call structure now includes a small APAC presence we’d initially proposed removing. Worth saying out loud.

What changed operationally.

The 6× incident reduction is the headline. But the more important change was qualitative: their senior engineers stopped quitting. The two who’d resigned during the prior burnout period both said in their exit interviews that on-call burden was the #1 reason. Six months in, retention conversations stopped including “on-call” as a complaint at all.

They didn’t hire the two SREs the CEO originally wanted. They didn’t need to. The headcount budget went to a senior platform engineer instead.

We deleted more than ninety percent of their alert rules. Everyone assumed something would break catastrophically. Nothing did. The signal was always there — it was just buried under a thousand alerts no one had owned in years. The cleanup was the work.

Maryam Khan

Senior SRE · Lead engineer on this engagement

04

AI Infrastructure · Migration sprint + retainer

Series A AI-native SaaS · 60 engineers · AI-powered document workflow · $87k/month Bedrock spend · US-based

Duration: 14 wk sprint + 3 mo retainer

Format: $28k + $14k/mo

Year: 2025

Bedrock at $87k/month, replaced with self-hosted vLLM at $19k/month — same model, better latency.

Their Bedrock dependency had grown 22× in nine months. The board was asking pointed questions about gross margin. Self-hosted inference was the answer — but doing it without breaking customer experience required a careful migration, not a flag day.

Monthly inference cost (before): $87k (Bedrock + small Vertex AI dependency)

Monthly inference cost (after): $19k (self-hosted vLLM on H100 SXM cluster)

p99 latency change: −38% (from 1.8s to 1.12s on the streaming endpoint)

Annual savings vs Bedrock: $816k (after $70k total engagement cost)

Inference economics · before vs after migration · Llama-3 70B equivalent workload

Before · Bedrock: $4.20 per 1M tokens · managed inference with frontier-model pricing applied to a 70B-class workload

9-month spend growth: 22×

After · vLLM self-hosted: $0.92 per 1M tokens · tuned vLLM with PagedAttention on a 4×H100 SXM cluster, including amortized cluster cost

Cost curve post-migration: flat · linear in tokens

What was actually broken.

Bedrock had been the right call when they were prototyping. It was the wrong call by month four, when their daily token volume crossed 80M and the cost compounding became visible. The team knew this. They’d been afraid to migrate because (a) nobody on staff had run production inference at scale before, (b) their eval data was inadequate to prove model parity post-migration, and (c) the engineering CEO had personally championed Bedrock at the start and walking that back was politically expensive.

By the time we showed up, the technical problem was solved (vLLM on H100s would obviously work). The problem was the migration confidence loop: how do you cut over a production workload when you can’t prove the new system is as good as the old one?

What we shipped, in order.

WK 1-3

Eval harness first, before any infrastructure

Most teams build the new inference stack first and figure out eval later. We refused. Spent the first three weeks pulling 5,000 real customer prompts from their Bedrock logs (with consent), running them through both Bedrock and a candidate vLLM setup, and scoring outputs with both LLM-as-judge and human review. Without this eval set, the rest of the engagement would have been guessing.
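The comparison step of such a harness can be sketched as follows, assuming each prompt has already been scored on a 0 to 1 scale for both backends (by a judge model, a human reviewer, or both). The tolerance, the score scale, and the sample numbers are assumptions for illustration.

```python
from statistics import mean


def compare_backends(scores, tolerance=0.02):
    """Summarize paired per-prompt scores.

    `scores` is a list of (baseline_score, candidate_score) pairs.
    A regression is flagged when the candidate's mean score trails
    the baseline by more than `tolerance` (an assumed threshold).
    """
    if not scores:
        raise ValueError("need at least one scored prompt")
    deltas = [candidate - baseline for baseline, candidate in scores]
    mean_delta = mean(deltas)
    return {
        "mean_delta": mean_delta,
        "worse_fraction": sum(1 for d in deltas if d < 0) / len(deltas),
        "regression": mean_delta < -tolerance,
    }


# Hypothetical scored pairs: (Bedrock score, vLLM score).
PARITY = compare_backends([(0.8, 0.8), (0.9, 0.85), (0.7, 0.75)])
DRIFT = compare_backends([(0.9, 0.7), (0.8, 0.7)])
```

Pairing scores per prompt, instead of comparing two overall averages, also makes it easy to pull up the specific prompts where the candidate did worse.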

WK 4-8

vLLM cluster build + tuning

4× H100 SXM 80GB nodes, vLLM with PagedAttention and continuous batching, chunked prefill enabled. Tuned --max-num-seqs, --gpu-memory-utilization, and the KV cache parameters for their specific token distribution. GPU utilization landed at 87% in steady state — well above the 40-50% we usually see on naive serving setups.
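For orientation, the flags named above map onto a `vllm serve` invocation roughly like the one this sketch assembles. The model ID, tensor-parallel size, and every numeric value are placeholders chosen for illustration, since the tuned values depend on the workload's token distribution and are not given here. The command is only built, not executed.

```python
import shlex


def vllm_serve_args(model, tensor_parallel_size, max_num_seqs, gpu_memory_utilization):
    """Assemble a `vllm serve` command line as an argument list."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
    ]


# Assumed placeholders: a Llama-3 70B class model sharded across 4 GPUs.
MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"
ARGS = vllm_serve_args(MODEL, tensor_parallel_size=4,
                       max_num_seqs=256, gpu_memory_utilization=0.90)
COMMAND = shlex.join(ARGS)
```

In broad terms, `--max-num-seqs` caps how many sequences are batched at once and `--gpu-memory-utilization` sets how much GPU memory vLLM may claim, which in turn bounds the KV cache. Those two knobs are usually tuned together against real traffic.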

WK 9-11

Shadow traffic + gradual cutover

Built a router that split traffic between Bedrock and vLLM at configurable percentages. Started at 1%, then 10%, then 25%, then 50%, then 100%. Each step had a 48-hour bake-in with eval-set regression detection. One regression caught at the 25% step required us to tune the temperature parameter on the vLLM side to match Bedrock’s default behavior. Caught it cleanly because of the harness from weeks 1-3.
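A percentage-based split can be sketched with a stable hash, so a given request ID lands on the same backend at a given step and stays on vLLM as the percentage rises. The production router was written in Go; this Python version is an illustration only, and the hash-bucketing approach and names are assumptions, not a description of that router.

```python
import hashlib

# Cutover steps from the rollout described above.
STEPS = (1, 10, 25, 50, 100)


def route(request_id, vllm_percent):
    """Pick a backend for a request.

    The request ID is hashed into one of 100 buckets; buckets below
    `vllm_percent` go to vLLM, the rest to Bedrock. Because the bucket
    is stable, raising the percentage only ever moves traffic one way.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "vllm" if bucket < vllm_percent else "bedrock"


def vllm_share(request_ids, vllm_percent):
    """Fraction of the given requests that would be routed to vLLM."""
    hits = sum(1 for rid in request_ids if route(rid, vllm_percent) == "vllm")
    return hits / len(request_ids)
```

Holding each step for a bake-in period then amounts to changing one number and watching the eval-set regression checks before moving to the next entry in `STEPS`.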

WK 12-14

Cost guardrails + handover

Per-tenant token budgets. Anomaly detection on cost-per-1M-tokens. Monthly cost review template. Recorded walkthrough of every component for their two ML platform engineers. Handed off with a 3-month retainer for backstop support during the bedding-in period.
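A per-tenant token budget can be as small as the sketch below. The class name, the in-memory storage, and the choice to reject unknown tenants are assumptions for illustration; a production version would need persistence and a monthly reset.

```python
class TokenBudget:
    """Track token usage per tenant against a fixed monthly limit."""

    def __init__(self, monthly_limits):
        self.limits = dict(monthly_limits)
        self.used = {}

    def charge(self, tenant, tokens):
        """Record usage and return True, or return False without recording
        anything if the request would exceed the tenant's budget.

        Assumption: tenants with no configured limit are rejected.
        """
        limit = self.limits.get(tenant)
        if limit is None:
            return False
        spent = self.used.get(tenant, 0)
        if spent + tokens > limit:
            return False
        self.used[tenant] = spent + tokens
        return True

    def remaining(self, tenant):
        """Tokens left this month for a configured tenant."""
        return self.limits[tenant] - self.used.get(tenant, 0)


# Hypothetical tenant and limit.
BUDGET = TokenBudget({"acme": 1000})
FIRST = BUDGET.charge("acme", 600)
SECOND = BUDGET.charge("acme", 500)
```

Checking the budget before the request reaches the inference cluster is what keeps a single runaway tenant from bending the cost curve.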

What didn’t work / what surprised us

Our initial vLLM tuning produced p99 latency that was worse than Bedrock for the first two weeks — about 2.1s vs Bedrock’s 1.8s. We almost abandoned the chunked-prefill setting before realizing the issue was our router’s connection pooling, not vLLM itself. The fix was 30 lines of Go in the router. We lost about 8 days chasing the wrong root cause. Worth mentioning because the case study would be misleading if we pretended this part was clean.

What changed operationally.

$816k in annual savings is the headline. The deeper change: their team owns the inference stack now. Their CEO no longer has to defend the Bedrock dependency in board meetings, their CFO has a linear-in-tokens cost forecast model, and their ML engineers can swap models in days, not quarters — we ship-tested a Mistral swap during the retainer period to demonstrate this, even though they didn’t use it in production.

The $70k engagement cost paid back in 31 days. That total includes the $42k 3-month retainer, and they renewed for one quarter past that on their own initiative.

We built the eval harness before the inference cluster. Everyone wants to see the new shiny vLLM setup first — but without 5,000 real prompts scored both ways, you’re cutting over on faith. We spent three weeks on harness work that buys you the confidence to actually press the button.

David Reyes

ML Platform Engineer · Lead on this engagement

Real work, fully written up.

The other 43 engagements are private.