4 engagements, fully written up

Real work, fully written up.

Four detailed case studies — one for each practice area. Numbers, method, named engineers, and the parts that didn’t work. The remaining 43 engagements are under NDA; we’ll share names on a call when relevant.

47
Engagements completed · Since 2022, across all four practice areas
4
Written up here · One per practice area, client-approved
43
Under NDA · Names shared privately on the first call
01
Cloud Infrastructure · Migration sprint
Series B FinTech · ~80 engineers · payment infrastructure for ~$1.2B annual GMV · based in North America
Duration: 9 weeks
Format: Fixed-price sprint
Year: 2025

From seven AWS accounts nobody owned to an Organizations structure their team can actually run.

Account sprawl had become a security review blocker. Three SOC 2 auditors had flagged it in successive years. The fix wasn’t technical; it was political and operational, and that’s where most of the engagement actually went.
Accounts consolidated
7 → 4
Under a clean Organizations structure with SCPs
Drift incidents (prior 12 mo)
38
Manual changes that broke staging or prod
Drift incidents (post)
2
In the 6 months following handover
SOC 2 audit findings cleared
11
Including all three multi-year carryovers
Architecture migration · before, during, after (9-week sprint)
Prod: prod-account (mixed) → audit + freeze → move & isolate → prod-payments (clean)
Staging: eng-staging-1 + eng-staging-2 → merge inventory → re-IaC → staging (Terraform-only)
Data: analytics + bi-sandbox → classify PII → SCP guardrails → data-prod (least-priv)
Sandbox: eng-sandbox + 1 forgotten → archive forgotten → redirect dev → sandbox (dev only)

What was actually broken.

Seven AWS accounts had accumulated over four years of incidental decisions. No clear owner per account. Cross-account access was a tangle of manual IAM roles, half of them with *:* trust policies. The team couldn’t answer the auditor’s simplest question: “which account does production payments traffic terminate in?” Because the honest answer was: two of them, and they didn’t agree.

This isn’t a unique story. It’s how most Series B cloud estates look when growth outpaces operational discipline. The problem wasn’t the technology; the AWS primitives were fine. The problem was that nobody owned the structure, and every engineer who’d touched it had moved on or moved teams.

What we shipped, in order.

WK 1-2
Read-only audit and inventory

Catalogued every resource across all seven accounts. Identified 312 resources without tags, 41 IAM roles unused for 90+ days, 18 security groups with 0.0.0.0/0 inbound. Built the resource ownership map by interviewing 11 engineers individually — the political work, not the technical work.
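A minimal sketch of the three inventory checks named above, written as pure functions so they run without AWS credentials. The input shapes are assumptions: security groups shaped like EC2 DescribeSecurityGroups output, roles shaped like IAM GetRole output (which carries `RoleLastUsed`), and a hypothetical generic inventory record with `Id` and `Tags`. It illustrates the idea and is not the tooling used on the engagement.

```python
from datetime import datetime, timedelta, timezone


def open_ingress_groups(security_groups):
    """Return IDs of groups with at least one inbound rule open to 0.0.0.0/0."""
    flagged = []
    for sg in security_groups:
        for perm in sg.get("IpPermissions", []):
            ranges = perm.get("IpRanges", [])
            if any(r.get("CidrIp") == "0.0.0.0/0" for r in ranges):
                flagged.append(sg["GroupId"])
                break  # one open rule is enough to flag the group
    return flagged


def stale_roles(roles, now, max_idle_days=90):
    """Return names of roles unused for max_idle_days or never used at all."""
    cutoff = now - timedelta(days=max_idle_days)
    stale = []
    for role in roles:
        last_used = role.get("RoleLastUsed", {}).get("LastUsedDate")
        # Caveat: a brand-new role that has never been used is flagged too.
        if last_used is None or last_used < cutoff:
            stale.append(role["RoleName"])
    return stale


def untagged(resources):
    """Return IDs of inventory records with no tags (hypothetical record shape)."""
    return [r["Id"] for r in resources if not r.get("Tags")]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    groups = [{"GroupId": "sg-example",
               "IpPermissions": [{"IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]}]
    print(open_ingress_groups(groups))
```

In practice the inputs would come from paginated API calls in every account; the checks themselves stay this small.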

WK 3-4
Target architecture proposal · signed off internally

Four-account model: prod-payments, staging, data-prod, sandbox. Each with named owner, SCPs limiting blast radius, IAM Identity Center for human access. Walked the architecture through three review meetings with their security, infrastructure, and engineering leads before any changes shipped.
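To make "SCPs limiting blast radius" concrete, here is a sketch of one common guardrail pattern: deny leaving the organization, and deny activity outside an approved region list. The region list and the exempted global services are assumptions chosen for illustration; the case study does not say which SCPs were actually written.

```python
import json

# Hypothetical region allow-list; the real one depends on where workloads run.
APPROVED_REGIONS = ["us-east-1", "us-west-2"]


def guardrail_scp(approved_regions):
    """Build a service control policy document as a plain dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLeavingOrganization",
                "Effect": "Deny",
                "Action": "organizations:LeaveOrganization",
                "Resource": "*",
            },
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                # Global services are exempted so the region condition
                # does not break them (illustrative, not exhaustive).
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": approved_regions}
                },
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(guardrail_scp(APPROVED_REGIONS), indent=2))
```

SCPs only restrict; they never grant access. Each account still needs its own IAM policies on top of a guardrail like this.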

WK 5-7
Migration · non-prod first, prod last

Started with sandbox and staging consolidation. Built the Transit Gateway with their network team in the room. Migrated data-prod with a 48-hour cutover window and full rollback plan. Prod-payments moved over a weekend with their incident team on call — zero customer-facing impact.

WK 8-9
Handover and SOC 2 prep

Wrote 14 runbooks covering account creation, IAM lifecycle, SCP exception process, and the drift detection workflow. Sat in on the SOC 2 readiness meeting with their auditors and walked through the new structure. All 11 prior findings closed.

What didn’t work

Our original plan called for closing two of the seven accounts. We had to keep one of them open because a third-party vendor integration had hard-coded its account ID into webhook payloads, and we couldn’t coordinate the change with the vendor in time. We left it in the structure as “legacy-webhook-only” with tight SCPs preventing anything else from running there. It’s probably still there.

What changed operationally.

The numbers in the stat row are real, but the deeper change is harder to chart: their team can now answer the auditor’s question. Every account has an owner. Every IAM role has a justification document. Every SCP exception goes through a written process that lives in their wiki, not in someone’s head.

Six months after handover, the SRE lead emailed: “We onboarded two new engineers last week and they were able to ship a change to staging on day three. That used to take a month.” That’s the engagement’s real result — not the migration itself, but the operational ground that the migration cleared.

The technical migration was the easy half. The political work — getting eleven engineers to agree on who owned what, before we changed anything — was where the engagement actually lived. We spent more time in conference rooms than in Terraform that month.
Hassan Ali
Founder · Lead engineer on this engagement
02
Cost Optimization · Audit + implementation
Series C B2B SaaS · ~140 engineers · collaboration platform · $94k/month AWS spend at engagement start · EU-based
Duration: 3 wk audit + 8 wk impl
Format: $8k audit → $22k sprint
Year: 2024

$94k a month, down to $31k — without touching a single product feature.

Their CFO had circled the AWS line item three months running. The infra team knew they were over-provisioned but couldn’t justify which workloads to cut without a methodical audit. We ran the audit, then implemented the top 11 of 18 recommendations.
Monthly spend (before)
$94k
Baseline averaged over 90 days pre-engagement
Monthly spend (after)
$31k
After 8 weeks of implementation work
Reduction
−67%
Sustained at 6 months post-handover
Annual run-rate saved
$756k
Gross, before our $30k total engagement cost
Monthly spend · the cuts, in order ($94k → $31k)
Starting baseline: $94,000
After right-sizing the EC2 fleet: $77,200
After Savings Plans + RI strategy: $60,150
After S3 lifecycle + Glacier moves: $43,800
After idle resource cleanup & egress fix: $31,000
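The arithmetic behind the chart can be checked in a few lines. The sketch below uses only the figures shown above; the per-step deltas are realized reductions read off the chart, which need not match the audit's "recoverable" estimates exactly.

```python
# Monthly spend remaining after each cut, as shown in the chart.
STEPS = [
    ("Starting baseline", 94_000),
    ("Right-sized EC2 fleet", 77_200),
    ("Savings Plans + RI strategy", 60_150),
    ("S3 lifecycle + Glacier moves", 43_800),
    ("Idle resource cleanup & egress fix", 31_000),
]


def step_savings(steps):
    """Return (step name, monthly dollars removed by that step)."""
    return [(name, prev - cur)
            for (_, prev), (name, cur) in zip(steps, steps[1:])]


monthly_saved = STEPS[0][1] - STEPS[-1][1]        # 63,000
reduction_pct = 100 * monthly_saved / STEPS[0][1]  # about 67%
annual_run_rate_saved = monthly_saved * 12         # 756,000

if __name__ == "__main__":
    for name, saved in step_savings(STEPS):
        print(f"{name}: -${saved:,}/mo")
    print(f"Total: -${monthly_saved:,}/mo ({reduction_pct:.0f}%), "
          f"${annual_run_rate_saved:,}/yr")
```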

What the $8k audit found.

Three categories of waste, in order of dollar impact:

Over-provisioned compute ($16.8k/mo recoverable). They were running m5.4xlarge instances for workloads that fit easily on m5.large. Standard story: a senior engineer had specced the original cluster two years prior under load assumptions that never materialized. Nobody re-checked. Right-sizing on its own returned 18% of the monthly bill.

No Savings Plans coverage ($17k/mo recoverable). They were paying full on-demand for ~80% of compute that was running 24/7. A 3-year compute Savings Plan at the right commitment level would have covered it — but nobody had modeled what “the right commitment level” was, so the finance team had vetoed buying any. We built the model.

S3 data left in the Standard tier ($14.5k/mo recoverable). Log buckets held 80TB of pre-2023 data that hadn’t been accessed in 18 months, all of it in S3 Standard at $0.023 per GB-month. Lifecycle rules moved most of it to Glacier Instant Retrieval at a fraction of the cost; the buckets nobody had touched in 12 months went to Glacier Deep Archive.
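As an illustration of the lifecycle rules described in the third item, the sketch below builds rule documents in the shape S3 expects for a bucket lifecycle configuration. The rule IDs, prefixes, and day thresholds are hypothetical. One caveat worth knowing: lifecycle transitions key off object age (days since creation), not last access, so "not accessed in N months" has to be established separately, for example from access logs, before choosing thresholds.

```python
def lifecycle_rule(rule_id, prefix, days, storage_class):
    """One transition rule: move objects under `prefix` after `days` of age."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


# Hypothetical config for a log bucket that is still read occasionally.
cold_logs_config = {
    "Rules": [lifecycle_rule("cold-logs-to-glacier-ir", "logs/", 90, "GLACIER_IR")]
}

# Hypothetical config for a bucket nobody reads at all.
untouched_config = {
    "Rules": [lifecycle_rule("untouched-to-deep-archive", "", 365, "DEEP_ARCHIVE")]
}

if __name__ == "__main__":
    print(cold_logs_config)
    print(untouched_config)
```

With boto3, a document like this would be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration`; that call is omitted here so the sketch runs without credentials.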

What we did not recommend.

The audit identified 18 cost-saving opportunities. We implemented the top 11. The other 7 we explicitly recommended against, or deferred. Examples of what we didn’t cut:

SKIP
Spot instances for batch jobs

Theoretical savings: ~$3k/mo. Real cost: their batch jobs were on the critical path for two customer-facing features. The team didn’t have the operational maturity to handle spot interruptions cleanly. We recommended revisiting in 6 months after they’d built more graceful retry semantics.

SKIP
Multi-region degradation to single-region

Theoretical savings: $5k/mo. Real cost: their enterprise contracts required multi-region availability for DR. Cutting that to save 5% of the bill would have been a $40M+ contract risk for a $60k/yr savings. Not even close to worth it.

DEFER
Graviton migration for ARM-compatible workloads

Theoretical savings: ~$4k/mo. Real cost: significant engineering work to validate compatibility across their service mesh. Recommended as a future engagement once cost-per-engineering-hour math improved.

What didn’t work the first time

Our initial Savings Plans recommendation called for a 3-year all-upfront commitment of $36k/mo — the math was perfect for their steady-state. But the CFO refused to commit 3 years at once. We re-modeled around two 1-year commits stacked, which cost slightly more but matched their financial appetite. A 7% optimization is real if it ships; a 9% one nobody approves saves nothing.
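Modeling "the right commitment level" can be sketched as a small optimization. The simplified model below assumes a single blended discount rate and an hourly series of on-demand-equivalent compute spend; both are assumptions, and real Savings Plans rates vary by instance family, term, and payment option. Each hour you pay the commitment whether or not you use it, plus on-demand rates for anything the commitment does not cover.

```python
def total_cost(hourly_on_demand, commitment, discount):
    """Total cost for an hourly commitment, in dollars at Savings Plan rates.

    hourly_on_demand: spend per hour if everything ran on-demand
    discount: blended Savings Plan discount as a fraction (assumed uniform)
    """
    covered = commitment / (1.0 - discount)  # on-demand-equivalent usage covered
    return sum(commitment + max(0.0, usage - covered)
               for usage in hourly_on_demand)


def best_commitment(hourly_on_demand, discount):
    """Pick the hourly commitment that minimizes total cost.

    Cost is convex and piecewise linear in the commitment, so the minimum
    sits at zero or at a level that exactly covers some observed hour.
    Ties go to the smaller commitment (less lock-in).
    """
    candidates = {0.0} | {u * (1.0 - discount) for u in hourly_on_demand}
    return min(candidates,
               key=lambda c: (total_cost(hourly_on_demand, c, discount), c))


if __name__ == "__main__":
    usage = [10.0, 10.0, 10.0, 2.0]  # hypothetical hourly on-demand spend
    c = best_commitment(usage, discount=0.5)
    print(c, total_cost(usage, c, 0.5), total_cost(usage, 0.0, 0.5))
```

A model like this turns the term-length argument into numbers: rerun it with the lower discount that shorter terms carry and compare the totals.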

What changed operationally.

Their CFO stopped circling the AWS line. The infra team got a quarterly cost review process and a monthly Slack-bot anomaly alert (anything >10% week-over-week pings them). The cost curve now tracks service growth linearly and predictably instead of being a quarterly mystery.
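The anomaly rule is simple enough to state as code. This sketch covers only the threshold logic; the function names are hypothetical, and fetching spend data and posting to Slack are left out.

```python
def week_over_week_change(previous, current):
    """Fractional change in spend between two consecutive weeks."""
    if previous <= 0:
        raise ValueError("previous week's spend must be positive")
    return (current - previous) / previous


def is_anomaly(previous, current, threshold=0.10):
    """True when spend grew by more than the threshold week over week."""
    return week_over_week_change(previous, current) > threshold


def alert_message(service, previous, current):
    """Format the text a bot could post for a flagged service."""
    change = week_over_week_change(previous, current)
    return (f"{service}: weekly spend ${previous:,.0f} -> ${current:,.0f} "
            f"({change:+.1%})")


if __name__ == "__main__":
    if is_anomaly(7_000, 8_400):
        print(alert_message("example-service", 7_000, 8_400))
```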

The $756k annual savings paid back our $30k engagement cost in about two weeks. The remainder is shareholder value or new engineering hires, depending on how their CFO chooses to count.

The hardest part wasn’t finding the savings. It was telling the CFO that seven of the recommendations were wrong to ship. A $5k/mo theoretical save isn’t worth a $40M contract risk. Most cost audits we’ve seen would have happily included those numbers in the report.
Hassan Ali
Founder · Lead engineer on this engagement
03
Reliability Engineering · Sprint + 6-month embedded retainer
Series B marketplace · ~110 engineers · consumer-facing platform with weekend traffic spikes · APAC-based
Duration: 7 wk sprint + 6 mo retainer
Format: $22k + $12k/mo
Year: 2024–2025

From 3.2 incidents/week to one every two weeks — without burning out the on-call rotation.

Their on-call rotation was bleeding senior engineers. Two had quit citing burnout. The CEO had asked us to make it stop — ideally without spending another six months hiring SREs they couldn’t find. The fix was 70% process, 30% tooling, and almost none of it was glamorous.
Incident rate (prior)
3.2/wk
P1 + P2 averaged over Q3 of prior year
Incident rate (post)
0.5/wk
P1 + P2 averaged over 6 months post-handover
MTTR improvement
−72%
From 47 min median to 13 min
On-call hours/eng/month
52 → 14
Hours actually spent paging / responding
Incidents per week · 12 weeks before vs 12 weeks after (P1 + P2 only)
Before (Q3 2024 · 12 weeks): total P1+P2 incidents 38 · median MTTR 47 min · weekend pages 14
After (post-handover · 12 weeks): total P1+P2 incidents 6 · median MTTR 13 min · weekend pages 1

What was actually broken.

The on-call rotation was running on heroics. Three engineers carried 80% of the pages because they were the only ones who knew how the legacy services failed. Alert fatigue was real: their PagerDuty had 1,200+ active alert rules, and the median engineer ignored ~60% of pages because most of them were noise. When a real incident hit, it took 15+ minutes just to figure out which signal was the actual signal.

The CEO wanted to “hire two more SREs.” That was probably the wrong move — it would have just added two more people to the same broken process. We pushed back: fix the alerts first, then decide if you still need to hire.

What we shipped, in order.

WK 1-2
SLO design with the product team in the room

Most reliability work starts with engineering setting SLOs in isolation. We refused. Spent the first two weeks running SLO workshops with product, customer support, and engineering together. Result: 14 SLOs that everyone agreed represented “what good looks like for the customer,” not “what’s easy to measure.”

WK 3-4
Alert rule audit · 1,200 → 89

Deleted 1,111 alert rules. That’s not a typo. Most were copy-paste duplicates, threshold-based alerts on noisy metrics, or alerts on conditions that nobody had owned in years. The 89 remaining alerts each had: a named owner, a linked runbook, an SLO connection, and a paging severity. If it couldn’t pass all four checks, it didn’t survive.
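The four-check rule lends itself to a mechanical first pass. The sketch below assumes alert rules have been exported as dicts with hypothetical field names (`owner`, `runbook_url`, `slo`, `severity`); a real export from a paging tool would need mapping into this shape first.

```python
# Field names are assumptions standing in for the four checks named above.
REQUIRED_FIELDS = ("owner", "runbook_url", "slo", "severity")


def missing_checks(rule):
    """Return the checks this alert rule fails (empty list means it survives)."""
    return [field for field in REQUIRED_FIELDS if not rule.get(field)]


def triage(rules):
    """Split rules into keepers and deletion candidates with reasons."""
    keep, delete = [], []
    for rule in rules:
        failed = missing_checks(rule)
        if failed:
            delete.append((rule["name"], failed))
        else:
            keep.append(rule["name"])
    return keep, delete


if __name__ == "__main__":
    rules = [
        {"name": "checkout-latency-slo-burn", "owner": "payments-team",
         "runbook_url": "https://wiki.example/runbooks/checkout",
         "slo": "checkout-p99", "severity": "P1"},
        {"name": "cpu-over-80", "owner": "", "severity": "P2"},
    ]
    keep, delete = triage(rules)
    print(keep)
    print(delete)
```

The output is a candidate list for human review, not an automatic delete; deciding who owns a rule is still the conversation that matters.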

WK 5-7
Runbook templating + incident review cadence

Built runbook templates for the top 15 historical incident types. Set up weekly incident review at a fixed time (Wednesdays 4pm), facilitated the first six personally. Drove postmortem quality up by requiring every postmortem to identify exactly one process change, not three or zero.

MO 2-7
Embedded retainer · on-call backup + monthly review

For 6 months after the sprint, Maryam was on a $12k/mo retainer covering: (a) backup on-call rotation on weekends, (b) monthly SLO review with their product team, (c) facilitating their incident reviews. Worked herself out of the role by month 6. Their team owns it now.

What didn’t work / what we got wrong

We initially proposed eliminating the “follow-the-sun” rotation entirely — we thought their geography didn’t justify it. We were wrong. Two months in, a real APAC-hours incident caught us with no on-call coverage in the right time zone, and we had to walk that decision back. The on-call structure now includes a small APAC presence we’d initially proposed removing. Worth saying out loud.

What changed operationally.

The 6× incident reduction is the headline. But the more important change was qualitative: their senior engineers stopped quitting. The two who’d resigned during the prior burnout period both said in their exit interviews that on-call burden was the #1 reason. Six months in, retention conversations stopped including “on-call” as a complaint at all.

They didn’t hire the two SREs the CEO originally wanted. They didn’t need to. The headcount budget went to a senior platform engineer instead.

We deleted more than ninety percent of their alert rules. Everyone assumed something would break catastrophically. Nothing did. The signal was always there — it was just buried under a thousand alerts no one had owned in years. The cleanup was the work.
Maryam Khan
Senior SRE · Lead engineer on this engagement
04
AI Infrastructure · Migration sprint + retainer
Series A AI-native SaaS · 60 engineers · AI-powered document workflow · $87k/month Bedrock spend · US-based
Duration: 14 wk sprint + 3 mo retainer
Format: $28k + $14k/mo
Year: 2025

Bedrock at $87k/month, replaced with self-hosted vLLM at $19k/month — same model, better latency.

Their Bedrock dependency had grown 22× in nine months. The board was asking pointed questions about gross margin. Self-hosted inference was the answer — but doing it without breaking customer experience required a careful migration, not a flag day.
Monthly inference cost (before)
$87k
Bedrock + small Vertex AI dependency
Monthly inference cost (after)
$19k
Self-hosted vLLM on H100 SXM cluster
p99 latency change
−38%
From 1.8s to 1.12s on the streaming endpoint
Annual savings vs Bedrock
$816k
Gross, before $70k total engagement cost
Inference economics · before vs after migration (Llama-3 70B equivalent workload)
Before · Bedrock: $4.20 per 1M tokens · managed inference with frontier-model pricing applied to a 70B-class workload
Spend growth: 22× in 9 months
After · vLLM self-hosted: $0.92 per 1M tokens · tuned vLLM with PagedAttention on a 4×H100 SXM cluster, including amortized cluster cost
Cost curve post-migration: flat · linear in tokens

What was actually broken.

Bedrock had been the right call when they were prototyping. It was the wrong call by month four, when their daily token volume crossed 80M and the cost compounding became visible. The team knew this. They’d been afraid to migrate because (a) nobody on staff had run production inference at scale before, (b) their eval data was inadequate to prove model parity post-migration, and (c) the engineering CEO had personally championed Bedrock at the start and walking that back was politically expensive.

By the time we showed up, the technical problem was solved (vLLM on H100s would obviously work). The problem was the migration confidence loop: how do you cut over a production workload when you can’t prove the new system is as good as the old one?

What we shipped, in order.

WK 1-3
Eval harness first, before any infrastructure

Most teams build the new inference stack first and figure out eval later. We refused. Spent the first three weeks pulling 5,000 real customer prompts from their Bedrock logs (with consent), running them through both Bedrock and a candidate vLLM setup, and scoring outputs with both LLM-as-judge and human review. Without this eval set, the rest of the engagement would have been guessing.
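The core of a parity check like this is a paired comparison: the same prompt scored on both backends. The sketch below assumes each prompt already has a numeric quality score per backend (from a judge model or a human reviewer); the tolerance and pass thresholds are hypothetical and would need calibrating against reviewer disagreement.

```python
from statistics import mean


def compare_scores(baseline, candidate, tolerance=0.5):
    """Compare per-prompt scores keyed by prompt ID.

    A prompt counts as "worse" when the candidate scores more than
    `tolerance` below the baseline on that same prompt.
    """
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no prompts in common")
    deltas = {pid: candidate[pid] - baseline[pid] for pid in shared}
    worse = [pid for pid in shared if deltas[pid] < -tolerance]
    return {
        "prompts": len(shared),
        "mean_delta": mean(deltas.values()),
        "worse_fraction": len(worse) / len(shared),
        "worse_ids": worse,
    }


def is_regression(report, max_worse_fraction=0.02, min_mean_delta=-0.1):
    """Flag a regression on either the tail or the average (assumed limits)."""
    return (report["worse_fraction"] > max_worse_fraction
            or report["mean_delta"] < min_mean_delta)


if __name__ == "__main__":
    bedrock = {"p1": 4, "p2": 5, "p3": 3}  # hypothetical judge scores
    vllm = {"p1": 4, "p2": 3, "p3": 4}
    report = compare_scores(bedrock, vllm)
    print(report, is_regression(report))
```

Keeping the IDs of the worse prompts matters more than the aggregate: those are the examples a reviewer reads before anyone approves a cutover step.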

WK 4-8
vLLM cluster build + tuning

4× H100 SXM 80GB nodes, vLLM with PagedAttention and continuous batching, chunked prefill enabled. Tuned --max-num-seqs, --gpu-memory-utilization, and the KV cache parameters for their specific token distribution. GPU utilization landed at 87% in steady state — well above the 40-50% we usually see on naive serving setups.
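For readers who have not launched vLLM before, this sketch shows where the flags named above go. It only assembles a `vllm serve` command line and does not run it. The model ID is a placeholder, the values are illustrative rather than the tuned ones, and mapping "4× H100" to a tensor-parallel size of 4 is an assumption.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size, max_num_seqs,
                       gpu_memory_utilization):
    """Assemble an argv list for `vllm serve` with the tuning flags set."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Upper bound on sequences batched together per scheduler step.
        "--max-num-seqs", str(max_num_seqs),
        # Fraction of GPU memory vLLM may claim (weights plus KV cache).
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--enable-chunked-prefill",
    ]


if __name__ == "__main__":
    cmd = vllm_serve_command(
        model="your-org/your-70b-model",  # placeholder, not from the case study
        tensor_parallel_size=4,           # assumption: one shard per GPU
        max_num_seqs=256,                 # illustrative value
        gpu_memory_utilization=0.90,      # illustrative value
    )
    print(shlex.join(cmd))
```

The right values depend on the prompt and output length distribution, which is why the tuning in this step was done against the client's own token distribution.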

WK 9-11
Shadow traffic + gradual cutover

Built a router that split traffic between Bedrock and vLLM at configurable percentages. Started at 1%, then 10%, then 25%, then 50%, then 100%. Each step had a 48-hour bake-in with eval-set regression detection. One regression caught at the 25% step required us to tune the temperature parameter on the vLLM side to match Bedrock’s default behavior. Caught it cleanly because of the harness from weeks 1-3.
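A percentage split like this is often done with a stable hash, so that raising the percentage only ever moves traffic in one direction. The sketch below is one way to do it, not a description of the actual router; keying on a request or tenant ID is an assumption.

```python
import hashlib

ROLLOUT_STEPS = (1, 10, 25, 50, 100)  # the percentages from the cutover plan


def bucket(key):
    """Map a key to a stable bucket in [0, 10000)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def choose_backend(key, vllm_percent):
    """Route to vLLM when the key's bucket falls under the current percentage."""
    if not 0 <= vllm_percent <= 100:
        raise ValueError("vllm_percent must be between 0 and 100")
    return "vllm" if bucket(key) < vllm_percent * 100 else "bedrock"


if __name__ == "__main__":
    keys = [f"tenant-{i}" for i in range(10_000)]  # hypothetical keys
    for pct in ROLLOUT_STEPS:
        share = sum(choose_backend(k, pct) == "vllm" for k in keys) / len(keys)
        print(f"{pct:>3}% configured -> {share:.1%} observed")
```

Because the bucket is fixed per key, anything routed to vLLM at 10% stays on vLLM at 25% and beyond, which keeps per-step regression comparisons clean.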

WK 12-14
Cost guardrails + handover

Per-tenant token budgets. Anomaly detection on cost-per-1M-tokens. Monthly cost review template. Recorded walkthrough of every component for their two ML platform engineers. Handed off with a 3-month retainer for backstop support during the bedding-in period.
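A minimal in-memory sketch of the first two guardrails: a per-tenant token budget and the cost-per-1M-tokens figure an anomaly check would watch. Class and function names are hypothetical; a production version would persist counters and reset them each billing period.

```python
class TokenBudget:
    """Track token usage against a monthly limit per tenant."""

    def __init__(self, monthly_limits):
        self.monthly_limits = dict(monthly_limits)
        self.used = {tenant: 0 for tenant in self.monthly_limits}

    def remaining(self, tenant):
        return self.monthly_limits[tenant] - self.used[tenant]

    def try_consume(self, tenant, tokens):
        """Record usage and return True, or return False if over budget."""
        if tokens > self.remaining(tenant):
            return False
        self.used[tenant] += tokens
        return True


def cost_per_million(total_cost_usd, total_tokens):
    """Blended cost per 1M tokens for a period."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return total_cost_usd / (total_tokens / 1_000_000)


if __name__ == "__main__":
    budget = TokenBudget({"tenant-a": 1_000_000})  # hypothetical limit
    print(budget.try_consume("tenant-a", 600_000))
    print(budget.try_consume("tenant-a", 500_000))
    print(budget.remaining("tenant-a"))
```

Whether an over-budget request is rejected, queued, or only flagged is a product decision; the sketch just reports the outcome.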

What didn’t work / what surprised us

Our initial vLLM tuning produced p99 latency that was worse than Bedrock for the first two weeks — about 2.1s vs Bedrock’s 1.8s. We almost abandoned the chunked-prefill setting before realizing the issue was our router’s connection pooling, not vLLM itself. The fix was 30 lines of Go in the router. We lost about 8 days chasing the wrong root cause. Worth mentioning because the case study would be misleading if we pretended this part was clean.

What changed operationally.

$816k in annual savings is the headline. The deeper change: their team owns the inference stack now. Their CEO no longer has to defend the Bedrock dependency in board meetings, their CFO has a linear-in-tokens cost forecast model, and their ML engineers can swap models in days, not quarters — we ship-tested a Mistral swap during the retainer period to demonstrate this, even though they didn’t use it in production.

The $70k total engagement cost ($28k sprint plus $42k for the 3-month retainer) paid back in about 31 days of savings. They then renewed the retainer for one more quarter on their own initiative.

We built the eval harness before the inference cluster. Everyone wants to see the new shiny vLLM setup first — but without 5,000 real prompts scored both ways, you’re cutting over on faith. We spent three weeks on harness work that buys you the confidence to actually press the button.
David Reyes
ML Platform Engineer · Lead on this engagement

The other 43 engagements are private.

Most of our work is under NDA — either contractually or by client preference. We’d rather have your trust than your logo on our website. If you want to hear about a specific engagement that mirrors your situation, ask on the first call. We’ll share names, share numbers, and connect you with the relevant past client where they’ve agreed to act as a reference.

43
Engagements under NDA · Across all four practice areas · Private
11
Past clients on the reference list · Pre-approved to take a call from a serious prospect · Referenceable
31
Repeat clients · Came back for a second or third engagement · Returned
If one of these sounds like your situation,
we’d like to look at it with you.