Case study · ongoing · 2025 to present

APEX.

A Claude-based multi-agent system that resolves recurring datacenter operational tickets end to end, without human intervention.

~50%

manual resolution effort cut

25+

reusable AI skills built

900+

facilities downstream

How it works

APEX assembles a fresh team of specialized Claude sub-agents for every ticket. A triage agent reads the ticket and decides which specialist agents are needed. Those agents have access to a library of 25+ skills (diagnostics, queries, runbook executions, system probes) and hand work off to each other until the ticket is resolved or escalated. A supervisor agent watches the conversation, enforces policy, and summarizes the resolution for the on-call engineer to review.
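The flow above can be sketched in a few lines. This is a simplified illustration, not the production code: every name in it (`Ticket`, `SKILLS`, `triage`, `run_ticket`) is hypothetical, the triage step is a keyword match standing in for a model call, and policy enforcement is left out.

```python
from dataclasses import dataclass, field
from typing import Callable

# All names here are hypothetical, invented for illustration only.

@dataclass
class Ticket:
    ticket_id: str
    text: str
    log: list[str] = field(default_factory=list)  # shared handoff record
    resolved: bool = False

# A "skill" is modeled as a function that works on the ticket and reports
# whether it resolved it. Real skills would run diagnostics, queries, etc.
Skill = Callable[[Ticket], bool]

def check_disk(ticket: Ticket) -> bool:
    ticket.log.append("check_disk: usage above threshold")
    return False  # diagnosis only; hands off to the next specialist

def rotate_logs(ticket: Ticket) -> bool:
    ticket.log.append("rotate_logs: freed space")
    return True

SKILLS: dict[str, list[Skill]] = {
    "disk": [check_disk, rotate_logs],
}

def triage(ticket: Ticket) -> list[Skill]:
    """Stand-in for the triage agent: keyword match instead of a model call."""
    for keyword, skills in SKILLS.items():
        if keyword in ticket.text.lower():
            return skills
    return []  # no matching specialists

def run_ticket(ticket: Ticket) -> str:
    """Run specialists in order, then return a supervisor-style summary."""
    for skill in triage(ticket):
        if skill(ticket):
            ticket.resolved = True
            break
    status = "resolved" if ticket.resolved else "escalated"
    return f"{ticket.ticket_id}: {status} after {len(ticket.log)} step(s)"

print(run_ticket(Ticket("T-1", "Disk full on host")))  # T-1: resolved after 2 step(s)
print(run_ticket(Ticket("T-2", "Unknown alarm")))      # T-2: escalated after 0 step(s)
```

The point of the shape is that the ticket carries its own handoff log, so each specialist sees what the previous one did, and anything triage cannot route falls through to escalation.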

What made it actually work

01
Context, not capability
On every ticket category we tackled, the difference between a working agent and a broken one came down to whether the agent had the right context at the right step, not whether it was clever enough. Skill decomposition and handoff design mattered more than prompt tuning.
02
A real audit trail
We log every intermediate reasoning step, tool call, and confidence score. When something fails (and at scale, things fail in distributions, not in single events), we can debug the probability surface, not just a stack trace.
03
Built for the operator
The eventual users are not researchers. They want a reason for every action, an obvious way to override, and a boring, stable interface. Demos optimize for “wow.” Production agents optimize for “I trust this thing on my on-call rotation.”
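To make the audit trail in point 02 concrete, here is a rough sketch of what one logged event could look like. The record shape and field names are assumptions made for illustration, not the actual APEX schema.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record shape; field names are illustrative assumptions.
@dataclass
class AuditEvent:
    ticket_id: str
    agent: str
    kind: str          # "reasoning" or "tool_call"
    detail: str
    confidence: float  # 0.0 to 1.0, as reported by the agent

def to_log_line(event: AuditEvent) -> str:
    """Serialize one event as a JSON line so events can be appended and replayed."""
    return json.dumps(asdict(event), sort_keys=True)

def low_confidence(events: list[AuditEvent], threshold: float) -> list[AuditEvent]:
    """Return the events whose confidence falls below the threshold."""
    return [e for e in events if e.confidence < threshold]

events = [
    AuditEvent("T-1", "triage", "reasoning", "looks like a disk ticket", 0.92),
    AuditEvent("T-1", "disk-specialist", "tool_call", "rotate_logs", 0.55),
]
for e in events:
    print(to_log_line(e))
print([e.agent for e in low_confidence(events, 0.6)])  # ['disk-specialist']
```

Keeping confidence on every event is what lets failures be examined across many tickets at once, and it gives the operator a reason attached to each action.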

Stack

Claude · Multi-agent orchestration · Sub-agent delegation · AWS Lambda · DynamoDB · TypeScript · Python · Custom skill library · Internal web app for observability

Most of the detail above is intentionally generalized, since APEX runs against AWS internal systems and the specifics aren't mine to publish. Happy to discuss architecture in interviews and 1:1 conversations.