Case study · ongoing · 2025 to present
APEX.
A Claude-based multi-agent system that resolves recurring datacenter operational tickets end to end, without human intervention.
~50%
manual resolution effort cut
25+
reusable AI skills built
900+
facilities downstream
How it works
APEX assembles a fresh team of specialized Claude sub-agents for every ticket. A triage agent reads the ticket and decides which specialist agents are needed. Those agents have access to a library of 25+ skills (diagnostics, queries, runbook executions, system probes) and hand work off to each other until the ticket is resolved or escalated. A supervisor agent watches the conversation, enforces policy, and summarizes the resolution for the on-call engineer to review.
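The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not APEX's actual code: plain functions stand in for the Claude sub-agents, and every name here (`Ticket`, `Resolution`, `SKILLS`, `ROUTES`, `MAX_STEPS`, `triage`, `handle`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Ticket:
    ticket_id: str
    category: str


@dataclass
class Resolution:
    ticket_id: str
    status: str  # "resolved" or "escalated"
    steps: list[str] = field(default_factory=list)


# Hypothetical skill library: each skill takes a ticket and returns a finding.
SKILLS: dict[str, Callable[[Ticket], str]] = {
    "run_diagnostic": lambda t: f"diagnostic passed for {t.ticket_id}",
    "execute_runbook": lambda t: f"runbook applied for {t.ticket_id}",
}

# Hypothetical routing table: ticket category -> ordered skills to run.
ROUTES: dict[str, list[str]] = {
    "disk_alert": ["run_diagnostic", "execute_runbook"],
}

MAX_STEPS = 5  # stand-in for a policy limit a supervisor might enforce


def triage(ticket: Ticket) -> list[str]:
    """Decide which skills the ticket needs; empty means nothing matches."""
    return ROUTES.get(ticket.category, [])


def handle(ticket: Ticket) -> Resolution:
    """Run the triaged plan, or escalate when no plan passes the policy check."""
    plan = triage(ticket)
    if not plan or len(plan) > MAX_STEPS:
        return Resolution(ticket.ticket_id, "escalated", ["no safe plan found"])
    # Collect each skill's finding in order so the summary shows every step.
    steps = [SKILLS[name](ticket) for name in plan]
    return Resolution(ticket.ticket_id, "resolved", steps)
```

Calling `handle(Ticket("T-1", "disk_alert"))` returns a resolved result with one step per skill, while an unrecognized category escalates instead of guessing. Keeping the routing decision separate from skill execution mirrors the triage and specialist split described above.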
What made it actually work
- 01
Context, not capability
On every ticket category we tackled, the difference between a working agent and a broken one came down to whether the agent had the right context at the right step, not whether it was clever enough. Skill decomposition and handoff design mattered more than prompt tuning.
- 02
A real audit trail
We log every intermediate reasoning step, tool call, and confidence score. When something fails (and at scale, things fail in distributions, not in single events), we can debug the probability surface, not just a stack trace.
- 03
Built for the operator
The eventual users are not researchers. They want a reason for every action, an obvious way to override, and a boring, stable interface. Demos optimize for “wow.” Production agents optimize for “I trust this thing on my on-call rotation.”
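As an illustration of the audit trail in item 02, the sketch below shows one way a per-step record and a cross-ticket failure summary might look. It assumes a simple JSON-lines log; the `AuditEvent` fields and both helper functions are hypothetical and do not reflect APEX's real schema.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class AuditEvent:
    ticket_id: str
    agent: str
    kind: str          # "reasoning" or "tool_call"
    detail: str
    confidence: float  # assumed to be in the range 0.0 to 1.0
    succeeded: bool


def to_log_line(event: AuditEvent) -> str:
    """Serialize one event as a JSON line for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


def failure_rate_by_agent(events: list[AuditEvent]) -> dict[str, float]:
    """Aggregate failures per agent so patterns across many tickets are visible."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event.agent] += 1
        if not event.succeeded:
            failures[event.agent] += 1
    return {agent: failures[agent] / totals[agent] for agent in totals}
```

Aggregating over many events, rather than reading a single trace, is what lets failures be examined as a distribution: a rate per agent (or per skill, or per confidence bucket) points at where the system is weak even when no individual run looks unusual.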
Stack
Most of the detail above is intentionally generalized, since APEX runs against AWS internal systems and the specifics aren't mine to publish. Happy to discuss architecture in interviews and 1:1 conversations.