A multi-agent system runs most of a one-person company. These are the measured results, failures included.
A client asked for an app. A multi-agent system built a throwaway proof of concept, then rebuilt it into a hardened product, deployed it, and ran it at a live event. The measured economics: ~$1,350 of model usage and ~12–33 hours of operator attention, against a rough ~160-hour guess for a solo developer.
86 controlled container escape trials with Claude Opus and Sonnet. Default Docker held, but the agent queried its own API for zero-days, spawned remote agents from isolation, and attacked the measurement harness. Independent data complementing Anthropic’s Mythos System Card.
Security audit of a multi-server production fleet: SSH brute-force analysis, web threat landscape, AI crawler patterns, honeypot findings, and hardening assessment. 27 months of data.