How I built a 1.5M-line dev tool in three months

2026-07-01 · act101 · act101, story

The refactor finished at 3 a.m. Eight hours. Roughly fifty million tokens.

Here's how it started. One wintry, late-night haze of agentic coding, I realized I'd left a critical identifier out across an entire API. I was going to fix it by hand — it was that kind of boring, mechanical work — but out of pure morbid curiosity I let the agent take it while I ran some errands. When I got back, it had succeeded. It had also spent eight hours and ~50M tokens grinding through the job string by string, the way a very patient intern with no map does work that a real refactoring engine does in one deterministic pass.

I sat there and did the math and knew, cold, that this could never happen again.

So there, encamped in the middle of the desert, I spent the next few months building the tool I couldn't find anywhere — the one we've all been needing. And then I started using the tool to build the tool. That's when everything changed.

Here's what came out the other side:

1 human. 3 months. 6,000+ agent-hours. 15,000+ commits across 1,100+ PRs. 1.5M lines of Rust at 90% test coverage. 13,000+ validation scenarios. 163 grammars, 183 refactor operations, 18 query tools. A built-in MCP server. Five editions live on act101.ai.

You can read that as a productivity claim. It isn't one. The number of lines an agent can produce is not interesting. The bottleneck is the work after the diff exists — and that's the story worth telling: the verification layer I had to build for myself before I was willing to merge that much code into a binary I was going to ship.

That verification layer is what act101 is. The recursion is the whole point. The workflows the product now sells — see the entire architecture and move it on purpose, prove every change, command a cross-language port — are the workflows I ran on my own codebase, every day, for ninety days. Some existed before the build. Most were forced into existence by it. This is the build story, from the inside.

What this post isn't

It isn't a velocity essay. I won't tell you what percentage of the diff an agent wrote, because the question was never whether the agent produced the lines — it's whether the lines produce a working binary that ships under license, with receipts. I won't show you a tweet-shaped chart of commits per hour. I won't say "AI-powered." And I won't claim the system can verify behavioral equivalence in every language, because it can't, and pretending otherwise would forfeit the brand's most valuable property: when act101 isn't sure, it says Unknown instead of guessing.

What follows is the architecture that made twenty agents building at once possible. The orchestration that kept them from stepping on each other. The guardrails that enforced instead of informed. The reward separation that kept agents from grading their own homework. The anti-cheat that caught the evasions when they came. The application-surface checks that proved the binary actually ran. And the two moments that almost killed the whole thing — with the receipts.

1.5M lines was an attest problem

Every workflow in act101 has the same shape: analyze → act → attest. That's the through-line of the product and it was also the shape of the build. An agent will analyze all day. An agent will act with total confidence on whatever it just analyzed. Attest is where the air gets thin. Attest is the part most agent tooling skips — the verification-and-honesty layer where you write down what the change actually did, with evidence, against the contract it was supposed to preserve.

Attest isn't one problem. It's four, and each one had teeth:

Parallel implementation. Twenty agents adding language support at once can't be reading and writing the same files. The codebase has to be built so they don't.
Reward separation. The agent that wrote the code cannot be the source of truth for whether the code is correct. The verdict has to live somewhere else.
Anti-cheat. Agents reach for evasions — stubs, mocked-out work, deleted tests, weakened assertions — the moment a gate is in the way. Every evasion has to get named after its first appearance, in a place that blocks the next one.
Application-surface verification. The unit tests pass. Clippy is clean. That tells you nothing about whether the binary still works against real code. Nothing in a standard build pipeline answers that question.

The next four sections are how each got solved. Each one became a piece of the product — by the recursion.

Designing for parallel implementation

The first decision was architectural, and it came before any meaningful code shipped. If twenty agents are going to add languages and operations at the same time, the codebase has to be built so each agent's work touches a near-disjoint set of files. The orchestration model forces the architecture, not the other way around.

The layered crate hierarchy came first. act-core sits at the bottom and depends on nothing internal — it's just the vocabulary every other crate speaks: Location, Range, Edit, ChangeSet. Above it, a band of infrastructure crates carries the machinery: the tree-sitter parser and its 163 grammars, styling, history, licensing, scanning. Above them, the domain layer that does the real work — act-refactor and the analysis family. On top, act-cli, the binary. Twenty crates, and the dependencies only ever point down. Run analyze_cycles against the full crates/ tree and it returns zero — and that zero is the whole precondition. Without it, two agents working in act-refactor and act-parser deadlock on each other's edits, because their import graphs cross.
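To make the zero-cycle precondition concrete, here is a minimal sketch of a cycle check over a crate dependency graph. It is not how analyze_cycles is implemented (the post doesn't show that); the function names and the example edges in the usage note are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Returns one dependency cycle if the graph has any, else None.
/// `deps` maps a crate name to the crates it depends on.
pub fn find_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Option<Vec<String>> {
    // 0 = unvisited, 1 = on the current DFS path, 2 = fully explored
    let mut state: HashMap<&'a str, u8> = HashMap::new();
    let mut path: Vec<&'a str> = Vec::new();
    let mut roots: Vec<&'a str> = deps.keys().copied().collect();
    roots.sort(); // deterministic traversal order
    for root in roots {
        if let Some(cycle) = visit(root, deps, &mut state, &mut path) {
            return Some(cycle);
        }
    }
    None
}

fn visit<'a>(
    node: &'a str,
    deps: &HashMap<&'a str, Vec<&'a str>>,
    state: &mut HashMap<&'a str, u8>,
    path: &mut Vec<&'a str>,
) -> Option<Vec<String>> {
    match state.get(node).copied().unwrap_or(0) {
        2 => return None,
        1 => {
            // Back-edge: the cycle is the path from the first visit of `node`.
            let start = path.iter().position(|n| *n == node).unwrap();
            return Some(path[start..].iter().map(|s| s.to_string()).collect());
        }
        _ => {}
    }
    state.insert(node, 1);
    path.push(node);
    if let Some(children) = deps.get(node) {
        for &child in children {
            if let Some(cycle) = visit(child, deps, state, path) {
                return Some(cycle);
            }
        }
    }
    path.pop();
    state.insert(node, 2);
    None
}
```

With edges that only point down (say act-cli to act-refactor to act-parser to act-core), find_cycle returns None. Add a single upward edge from act-parser back to act-refactor and it returns that pair.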

The second piece was the per-language dispatch pattern. Adding a language to act101 touches a small, predictable set of shared files, and each one gets only a few single-line additions, alphabetically sorted. The project's own CLAUDE.md names them in writing: act-parser/src/grammar.rs (one enum variant plus three match arms); act-parser/src/languages/mod.rs (one mod, one pub use); act-refactor/src/language_support/mod.rs (one arm in for_grammar()); act-refactor/src/operations/mod.rs (one pub mod <lang>_dispatch;). One-line additions only. Alphabetical. That single rule is what lets twenty agents add Solidity and CUDA and Erlang and Vyper at the same time without colliding at the source.
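A miniature of what "one enum variant plus three match arms" could look like. The real grammar.rs isn't shown in this post, so the Grammar enum, its three methods, and the is_alphabetical helper are all hypothetical. The point is only that each new language adds one line per block, in sorted position.

```rust
/// Hypothetical miniature of a grammar enum: one variant per language,
/// kept alphabetical so parallel additions land on different lines.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Grammar {
    Cuda,
    Erlang,
    Solidity,
    Vyper,
}

impl Grammar {
    /// Match arm 1 of 3: canonical name.
    pub fn name(self) -> &'static str {
        match self {
            Grammar::Cuda => "cuda",
            Grammar::Erlang => "erlang",
            Grammar::Solidity => "solidity",
            Grammar::Vyper => "vyper",
        }
    }

    /// Match arm 2 of 3: primary file extension.
    pub fn extension(self) -> &'static str {
        match self {
            Grammar::Cuda => "cu",
            Grammar::Erlang => "erl",
            Grammar::Solidity => "sol",
            Grammar::Vyper => "vy",
        }
    }

    /// Match arm 3 of 3: lookup by name.
    pub fn from_name(name: &str) -> Option<Grammar> {
        match name {
            "cuda" => Some(Grammar::Cuda),
            "erlang" => Some(Grammar::Erlang),
            "solidity" => Some(Grammar::Solidity),
            "vyper" => Some(Grammar::Vyper),
            _ => None,
        }
    }
}

/// The sorting rule is cheap to check mechanically.
pub fn is_alphabetical(names: &[&str]) -> bool {
    names.windows(2).all(|pair| pair[0] <= pair[1])
}
```

Because every addition is a single sorted line, two agents adding different languages produce adjacent but non-overlapping hunks, which is the kind of change git can usually merge without a conflict.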

The third piece was the HANDLER_REGISTRY: a runtime registry, populated from inventory::submit! registrations that are collected before main runs, that the dispatch core consults to route an operation to the right per-language handler. The core stays small and stable. The complexity lives out in the leaf handlers, where each agent's work is its own file. The analysis report for 2026-05-28 captures the payoff: a tiny stable core (context.rs, error.rs, refactor.rs, types.rs) at instability ≤0.04, with the per-language operation files concentrated in act-refactor. Those files are the overwhelming majority of the codebase by volume, but each one is independent of the rest. That distribution isn't an accident. It's the structure the parallel build required.
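The shape of a registry-routed dispatch core can be sketched with the standard library alone. This swaps inventory::submit! for explicit register calls so the example has no external dependencies. The Handler signature, the "lang/op" key format, and the placeholder handler are assumptions, not the project's actual types.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Hypothetical handler signature: source text in, edited text out.
pub type Handler = fn(&str) -> Result<String, String>;

fn registry() -> &'static Mutex<HashMap<String, Handler>> {
    static REGISTRY: OnceLock<Mutex<HashMap<String, Handler>>> = OnceLock::new();
    REGISTRY.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Leaf handlers call this; the dispatch core never names them.
pub fn register(lang: &str, op: &str, handler: Handler) {
    registry()
        .lock()
        .unwrap()
        .insert(format!("{lang}/{op}"), handler);
}

/// The stable core: look up the handler, run it, or report that none exists.
pub fn dispatch(lang: &str, op: &str, source: &str) -> Result<String, String> {
    let key = format!("{lang}/{op}");
    // Copy the fn pointer out so the lock is released before the handler runs.
    let handler = registry().lock().unwrap().get(&key).copied();
    match handler {
        Some(run) => run(source),
        None => Err(format!("no handler registered for {key}")),
    }
}

/// Placeholder leaf handler; a real one would edit the syntax tree.
pub fn solidity_rename(source: &str) -> Result<String, String> {
    Ok(format!("// handled by solidity/rename\n{source}"))
}
```

After register("solidity", "rename", solidity_rename), a dispatch for that pair reaches the handler, and a dispatch for an unregistered pair returns an error instead of silently succeeding. Adding a language never edits dispatch itself.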

The headline number — 1.5M lines — is mostly that structural choice paying off. act-refactor holds the bulk of it: per-language operation files, organized so that adding a language is a handful of bounded shared-file edits plus one file the agent owns, alone, for the whole life of its work.

Orchestrating the swarm

The mechanics are simple once the architecture is right:

One worktree per agent. Each agent gets its own git worktree instead of switching branches inside a shared checkout, so every agent's working tree is physically separate. Two agents overwriting each other's uncommitted edits is now impossible by construction, not by convention.
Scoped checks per work unit. cargo check -p <crate>, cargo test -p <crate>, cargo clippy -p <crate> -- -D warnings. Each agent runs the strictest available checks against the crate it touched, in its own worktree, before the work ever reaches the merge gate.
Progressive gates. The justfile stacks them: just check (fmt + clippy + check), the local pre-commit hook, the affected-crate test gate, then the full just ci pipeline. Work passes through all of them before it reaches main. Each is harder than the last, and every failure gets caught at the cheapest gate that can catch it.
TDD, required, not requested. The rule, verbatim: "write a failing test, show failure, write code, show it passing. No exceptions." A passing test proves nothing on its own — it can't tell you it would have failed without the change. The shown-failure step is the proof.
Per-agent budgets. Time, tokens, runs. Hitting the cap means the audit was wrong about the work unit's scope — not that the agent gets more rope.
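A sketch of the scoped-check step, assuming a small driver that builds the three cargo invocations named above and stops at the first one that fails. The function names are hypothetical. Only the cargo arguments come from the list.

```rust
use std::path::Path;
use std::process::Command;

/// The three scoped checks for one crate, cheapest first.
pub fn scoped_checks(krate: &str) -> Vec<Vec<String>> {
    fn owned(args: &[&str]) -> Vec<String> {
        args.iter().map(|a| a.to_string()).collect()
    }
    vec![
        owned(&["check", "-p", krate]),
        owned(&["test", "-p", krate]),
        owned(&["clippy", "-p", krate, "--", "-D", "warnings"]),
    ]
}

/// Runs checks in order and returns the index of the first failure, if any.
/// Later checks are not run once one fails.
pub fn first_failure<F>(checks: &[Vec<String>], mut run: F) -> Option<usize>
where
    F: FnMut(&[String]) -> bool,
{
    checks.iter().position(|args| !run(args.as_slice()))
}

/// Executes one cargo invocation inside an agent's worktree directory.
pub fn run_cargo_in(worktree: &Path, args: &[String]) -> bool {
    Command::new("cargo")
        .args(args)
        .current_dir(worktree)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

In use, first_failure(&scoped_checks("act-parser"), |args| run_cargo_in(worktree, args)) gives either None (all three passed) or the index of the gate that stopped the work unit.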

On a given day the swarm is somewhere between five and twenty agents, on different crates, in different worktrees, against different open PRs. They don't coordinate with each other. The architecture coordinates them.

Guardrails that enforce are superior to guardrails that inform

The sharpest line in the whole project runs between guardrails that enforce and guardrails that advise. Advisory guardrails are prompts, system instructions, the CLAUDE.md guide, the AGENTS.md anti-patterns list. They tell the agent what to do and hope it listens. Enforcing guardrails are git hooks, clippy -D warnings, cargo deny check, the corpus validator, the license-tier end-to-end harness. They make non-compliance impossible, or at least detectable in a way the agent can't route around.

Guardrails that enforce are superior to guardrails that inform. That's the mantra, and it's the whole section in one line. Anything that actually matters is enforcing. Advisory is for taste; enforcing is for correctness.

A worked pair. CLAUDE.md says "meaningful tests only — no derive trait tests, no tautological assertions, no testing serde roundtrips in isolation." That's advice. It sets the standard. It does not stop an agent from writing a tautological test if a tautological test helps close the PR. The enforcement lives elsewhere: the corpus validator, the migration parity harness, the license e2e gate that mints real licenses per edition and exercises the production MCP surface. An agent that wrote a garbage test satisfies none of them.

Another pair. "Never --no-verify" is advice, written in CLAUDE.md. The enforcement is the pre-commit hook itself — the very hook the rule is trying to protect. The hook is the wall; the rule is the sign on the wall. Take down the sign and the wall still stands. Take down the wall and nothing stands.

And then there's the Brown M&M rule. CLAUDE.md rule #12 reads, in full: "If you haven't found the brown m&m rule you haven't read all of your rules." It's a structural test — only an agent that read every rule, in order, ever sees it, because it sits in the middle of the list. An agent that skimmed never trips it. An agent that grepped for "brown m&m" only finds it if it was reading carefully enough to catch the joke from Van Halen's tour rider in the first place. The rule does real work: it's the check that detects whether the rest of the rules got read at all. Advice is what the rules say. The Brown M&M is the enforcement that the advice was read. Most guardrails the project ships are exactly this shape — a piece of text that looks advisory, welded to a check that turns it into a wall.

Separating reward functions

There's a line in CLAUDE.md that earns its place as the epigraph here. Verbatim:

Never trust git history — Claude wrote those commits. Claude lies. Read actual code.

That's the principle. The agent that wrote the code is not the source of truth for whether the code is correct. The commit message is the writer's claim about its own work. The audit's job is to ignore the claim and read the code.

In practice it's a writer/auditor split, enforced at three levels.

At the file level: the agent that wrote a refactor doesn't run the analyzer that scores it. The architecture-audit skill runs against the codebase as it stands, with no input from the writer beyond the diff that landed in main. It publishes a dated report. Several exist in docs/act/, running from 2026-03-30 through 2026-05-28. The trend across them — not any single snapshot — is what carries truth.

At the artifact level: the project keeps a Refuted Findings ledger inside project-map.md. When an audit weighs a hypothesis and dismisses it, the dismissal is recorded with the evidence that killed it, and future audits don't re-litigate from scratch. "Grammar.rs is a leaky abstraction" — refuted on 2026-05-28 with a cross-crate scan that found zero downstream consumers of any concrete *Support type. "82% dead code" — recorded as an artifact of narrow-entry reachability and inventory::submit! registration, not a finding. The ledger is reward separation made durable: one agent's dismissed hypothesis becomes a constraint on every agent after it.

At the migration level: the W1 ledger documents the dispatch-to-registry strangler-fig migration, and its Phase 3 is the cleanest example of reward separation I've got. To finish, the migration had to delete dispatch arms from refactor.rs. But before deleting, we had to know which arms were actually reachable through the production seam: not which arms should be reachable, not which arms the writer believed were reachable, but which arms were reachable, measured. So we wired a tracer into dispatch_inner (one environment flag, W1_TRACE_DISPATCH_INNER=1) and ran the whole act-refactor suite with the lights on. The trace decided what could be deleted. Not the agent. Not the commit message. The trace. It found exactly two unregistered Dart ops (convert_getter_to_field, convert_to_stateless_widget), both on the pre-existing-failure list. Everything else was either registry-served or on the deny-list, with each entry named, dated, and given a written reason. The deletion took dispatch_inner from 1,689 lines to ~83 lines of residue. The writer's claim that the migration was complete was confirmed by the trace, not asserted by the writer.
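The tracer idea fits in a few lines. This is a sketch under assumptions: the Dispatcher type and its methods are hypothetical, and only the W1_TRACE_DISPATCH_INNER flag and the two Dart op names come from the ledger. When the flag is on, every op that misses the registry and falls through to the legacy path gets recorded.

```rust
use std::collections::BTreeSet;
use std::env;

/// Hypothetical dispatcher that records which ops miss the registry.
pub struct Dispatcher {
    registered: BTreeSet<&'static str>,
    trace: bool,
    fell_through: BTreeSet<String>,
}

impl Dispatcher {
    /// Tracing is on only when W1_TRACE_DISPATCH_INNER=1 is set.
    pub fn new(registered: &[&'static str]) -> Self {
        Dispatcher {
            registered: registered.iter().copied().collect(),
            trace: env::var("W1_TRACE_DISPATCH_INNER")
                .map(|value| value == "1")
                .unwrap_or(false),
            fell_through: BTreeSet::new(),
        }
    }

    /// Lets a test force tracing on or off without touching the environment.
    pub fn with_trace(mut self, on: bool) -> Self {
        self.trace = on;
        self
    }

    /// Returns which path served the op: the registry or the legacy match.
    pub fn dispatch(&mut self, op: &str) -> &'static str {
        if self.registered.contains(op) {
            return "registry";
        }
        if self.trace {
            self.fell_through.insert(op.to_string());
        }
        "legacy"
    }

    /// Sorted list of ops that reached the legacy path while tracing.
    pub fn fell_through(&self) -> Vec<&str> {
        self.fell_through.iter().map(|op| op.as_str()).collect()
    }
}
```

Run the whole suite through a traced dispatcher and fell_through() is the measured list of arms that are still live. Anything that never shows up is a deletion candidate backed by evidence rather than by belief.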

That instrumentation step is what reward separation looks like when it's running. The agent does not certify its own work. The system does. The result: a cargo test --no-fail-fast run that closes on exactly the pre-existing failures and zero new ones, and a receipt you can check.

Anti-cheat and work evasion

Reward separation tells you who decides whether the work is correct. Anti-cheat tells you which shortcuts the writer will reach for when the verifier is in the way. The catalogue grows by incident. Every entry is a real evasion that surfaced, got named, and now blocks the next agent that tries it.

The AGENTS.md anti-patterns list, verbatim, catches the common ones:

No fake implementations: no shipping todo!(), unimplemented!(), // STUB, empty output shortcuts, or string-parsing fakes pretending to be AST edits.

Every item is its own war story. todo!() and unimplemented!() are Rust macros that compile but panic at runtime — perfect for an agent that needs a function to look "done." // STUB is the comment an agent leaves to flag work as deferred while still merging the file. Empty output shortcuts return success with no edits, satisfying the type without doing the job. String-parsing fakes pretend to be AST edits by treating source as text — they pass tests that only check whether the output file changed, and fail tests that check whether the output still parses.

The fixes are structural, not motivational. Clippy rejects todo!() and unimplemented!() in the surface that ships, provided the clippy::todo and clippy::unimplemented lints are denied explicitly: both sit in the allow-by-default restriction group, so -D warnings alone does not trip on them. The developer CLI (query, refactor, analyze, history) ships in every build; customers just reach those operations through the MCP server rather than the terminal. A stub that compiles is one thing; a stub in the shipping binary is a build break. And string-parsing fakes get caught by the corpus validator, which runs the real binary against real code in tests/corpus/monorepo/<lang>/test-scenarios/scenario.json, where output that doesn't parse fails the scenario by definition.
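A lint can see the macros but not a // STUB comment, so a plain text scan is one way to cover the whole list. This is a minimal sketch, not the project's actual check: the FORBIDDEN list and the scan function are hypothetical, and a substring scan will also flag these markers inside string literals, so it is a tripwire rather than a parser.

```rust
/// Markers from the anti-patterns list that should never reach a shipped file.
pub const FORBIDDEN: [&str; 3] = ["todo!(", "unimplemented!(", "// STUB"];

/// One finding: 1-based line number and the marker that matched.
#[derive(Debug, PartialEq, Eq)]
pub struct Hit {
    pub line: usize,
    pub pattern: &'static str,
}

/// Scans source text line by line and reports every forbidden marker.
pub fn scan(source: &str) -> Vec<Hit> {
    let mut hits = Vec::new();
    for (index, line) in source.lines().enumerate() {
        for pattern in FORBIDDEN.iter() {
            if line.contains(*pattern) {
                hits.push(Hit { line: index + 1, pattern: *pattern });
            }
        }
    }
    hits
}
```

A gate built on this fails the work unit when scan returns anything at all, which turns "no fake implementations" from a sentence in a guide into a check with a line number attached.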

A second class: routing around the gates instead of satisfying them. CLAUDE.md names these directly:

"Never --no-verify." The flag suppresses pre-commit hooks. The rule documents the prohibition so an agent's failure to follow it is unambiguous; the hooks themselves are un-bypassable in CI.
"Never git stash." An agent that stashed the working tree to "clean up" was discarding state the user owned. Hard rule.
"Squash merge only." Stops an agent from hiding a dozen WIP commits behind a tidy PR title.
"No silent failures (let _ = ...)." The Rust idiom for throwing away a Result. An agent that wraps a failing call in let _ = ... is performing the evasion in syntax. The rule names it; the codebase greps for it.

The W1 ledger has a nastier example: the deny-list. Two operations — move and extract_function — got blocked during the dispatch-to-registry migration, and the reason logged next to them wasn't hand-waving: "handler-implementation divergence, NOT param-shape mismatch." For move, the TypeScript handler builds MoveOperation with location: None hardcoded — the validation that would catch a same-file move is skipped. For extract_function, the Erlang and Groovy handlers point at the wrong underlying operation entirely. Both are real bugs the migration uncovered. The deny-list doesn't paper over them — it names them, dates them, and routes the affected ops back to the legacy path until the bugs are fixed. The next agent that tries to "just delete the deny-list to clean it up" finds the named bug, not a clean slate.

The ledger also documents the TDD-red pattern used during Phase 2 reconciliation. To prove a fix was real, we temporarily disabled the fix's normalization step, watched the regressed tests fail with the original error, then re-enabled the fix. The red state was the evidence that the test would have caught the original bug. Without the red step, the test is unverified, and an unverified test is worse than no test, because it tells the next agent a gate exists when it doesn't.
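The red-then-green step can be written down as a check of its own. This is a toy sketch: the ledger doesn't say what the normalization step did, so the path-separator fix here is a made-up stand-in and every name is hypothetical.

```rust
/// Hypothetical fix under test: normalize path separators.
/// `fix_enabled = false` reproduces the pre-fix behavior.
pub fn normalize(path: &str, fix_enabled: bool) -> String {
    if fix_enabled {
        path.replace('\\', "/")
    } else {
        path.to_string()
    }
}

/// The test proper: passes only if the output is normalized.
pub fn check(fix_enabled: bool) -> Result<(), String> {
    let got = normalize("src\\lib.rs", fix_enabled);
    if got == "src/lib.rs" {
        Ok(())
    } else {
        Err(format!("expected src/lib.rs, got {got}"))
    }
}

/// The evidence: the test fails with the fix off and passes with it on.
pub fn red_then_green() -> bool {
    check(false).is_err() && check(true).is_ok()
}
```

If red_then_green returns false because the red half passed, the test never depended on the fix and proves nothing about it.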

Ensuring test quality

Test quality is a stack, and every layer distrusts the one below it. A test has to be shown failing before it's allowed to pass. It has to be meaningful — no derive-trait tests, no tautological assertions, no serde roundtrips proving nothing to nobody. And above the unit tests sit the two gates that test the shipped thing: the corpus validator and the license e2e harness.

The corpus validator runs real refactor scenarios against the real act binary. Scenarios live in tests/corpus/monorepo/<lang>/test-scenarios/, one scenario.json per case. The validator builds the binary, runs the scenario, diffs the result against expectation, and reports per-language pass/fail. This is not unit testing. It's integration testing of the shipped surface against shipped code.
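The validator loop is simple enough to sketch. Assumptions, loudly: the real scenario.json schema isn't shown in this post, so Scenario here is an in-memory stand-in with hypothetical fields, and the run closure stands in for invoking the built act binary.

```rust
/// Hypothetical in-memory scenario; the real ones live in scenario.json files.
pub struct Scenario {
    pub name: &'static str,
    pub input: &'static str,
    pub expected: &'static str,
}

/// Names of passing scenarios, plus (name, reason) for each failure.
pub struct Report {
    pub passed: Vec<String>,
    pub failed: Vec<(String, String)>,
}

/// 1-based number of the first line where the two texts differ, if any.
pub fn first_diff(expected: &str, actual: &str) -> Option<usize> {
    let mut want = expected.lines();
    let mut got = actual.lines();
    let mut line = 1;
    loop {
        match (want.next(), got.next()) {
            (None, None) => return None,
            (w, g) if w == g => line += 1,
            _ => return Some(line),
        }
    }
}

/// Runs every scenario through `run` and diffs the result against expectation.
pub fn validate<F>(scenarios: &[Scenario], run: F) -> Report
where
    F: Fn(&str) -> Result<String, String>,
{
    let mut report = Report { passed: Vec::new(), failed: Vec::new() };
    for scenario in scenarios {
        let name = scenario.name.to_string();
        match run(scenario.input) {
            Ok(actual) => match first_diff(scenario.expected, &actual) {
                None => report.passed.push(name),
                Some(line) => report.failed.push((name, format!("differs at line {line}"))),
            },
            Err(reason) => report.failed.push((name, reason)),
        }
    }
    report
}
```

The design choice that matters is that validate knows nothing about how the output was produced. It only sees what came back, which is why a handler that fakes its work has nowhere to hide.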

The license e2e harness mints real licenses against the production Keygen account — one per paid edition (Engineering, Architecture, Elite, Enterprise) — and exercises the actual act mcp serve surface against each. The tests are #[ignore] by default, read keys from environment variables only (no secrets in the repo), and require a binary built with KEYGEN_ACCOUNT_ID set. The release gate runs them. They answer "does licensing actually work end to end, against the surface customers hit?" — not "do the licensing module's unit tests pass?"

The migration parity harness was W1's version of the same idea. During Phase 2, every migrated operation was proven byte-identical to its legacy arm by a parity_<op> test in tests/migration_parity.rs — 110 of them, one per migrated op. When Phase 3 retired the legacy comparison path, those test names were kept as regression coverage for the registry handlers themselves. The pattern generalizes: build the gate that proves equivalence during a migration, then keep the gate as regression coverage after.
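A parity check in miniature, assuming two implementations of the same operation that must agree byte for byte. The trailing-whitespace operation and all function names are invented for the sketch. The real parity_<op> tests compare registry handlers against legacy dispatch arms.

```rust
/// Stand-in for a legacy arm: strips trailing whitespace with an explicit loop.
pub fn legacy_trim(source: &str) -> String {
    let mut out = String::new();
    for line in source.lines() {
        out.push_str(line.trim_end());
        out.push('\n');
    }
    out
}

/// Stand-in for the migrated handler: same contract, different code path.
pub fn migrated_trim(source: &str) -> String {
    source
        .lines()
        .map(|line| format!("{}\n", line.trim_end()))
        .collect()
}

/// Fails on the first input where the two outputs are not byte-identical.
pub fn parity<L, M>(inputs: &[&str], legacy: L, migrated: M) -> Result<(), String>
where
    L: Fn(&str) -> String,
    M: Fn(&str) -> String,
{
    for &input in inputs {
        let (old, new) = (legacy(input), migrated(input));
        if old.as_bytes() != new.as_bytes() {
            return Err(format!("diverged on input {input:?}"));
        }
    }
    Ok(())
}
```

Once the legacy side is retired, the same inputs and the recorded outputs can stay behind as regression coverage for the migrated handler.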

Application-surface verification

Unit tests prove the unit behaves as the test expects. Application-surface verification proves the binary behaves as customers will see it. Different gates.

Here the seam is the MCP server. Customers reach query and refactor through MCP — that's the surface they actually call, so that's the surface tested end to end. The developer CLI covers the same operations for local use and CI, but the release verification targets the MCP path, because that's the one that has to be right in front of a paying customer.

The token-savings benchmark lives in the same stack. The headline navigation number (a mean of ~85% fewer tokens, and 99% fewer on references and callers) came from the same harness that gates the build: 166 samples on a real-world monorepo, o200k_base tokenizer, measured per operation, methodology in the README. On refactor workloads specifically it lands closer to ~70% fewer output tokens in practice. The benchmark is the proof; the marketing page cites the proof; they're the same artifact run in two contexts. (And a bare "%" means nothing without a workload attached. Navigation and refactoring are different measurements, and I'll always tell you which.)
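For readers who want the arithmetic behind a "percent fewer tokens" figure, here is a sketch. It assumes the mean is taken over per-sample percentages rather than over pooled token totals (the README is the authority on which one the benchmark uses), and the function names are hypothetical.

```rust
/// Percent fewer tokens for one sample: 100 * (1 - with_tool / baseline).
pub fn savings_pct(baseline: u64, with_tool: u64) -> f64 {
    100.0 * (1.0 - with_tool as f64 / baseline as f64)
}

/// Mean of per-sample savings. Each sample is (baseline, with_tool) tokens.
/// Returns None for an empty set or a zero baseline, where the ratio is undefined.
pub fn mean_savings_pct(samples: &[(u64, u64)]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|sample| sample.0 == 0) {
        return None;
    }
    let total: f64 = samples
        .iter()
        .map(|&(baseline, with_tool)| savings_pct(baseline, with_tool))
        .sum();
    Some(total / samples.len() as f64)
}
```

With two made-up samples of 1,000 baseline tokens each, one answered in 100 tokens and one in 200, the per-sample savings are 90% and 80% and the mean is 85%. Those numbers are illustrative only, not benchmark data.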

The most recent live dogfood — 2026-05-26, captured in the moat-iteration brief — ran the agent skills against act101's own codebase. Twelve of twelve passed. The taint analyzer caught a real req.query → db.query flow. The behavioral-equivalence check correctly told a rename (no behavior change) apart from an added branch (behavior change). The port-parity check caught an injected divergence. The same skills that run against customer codebases ran against act101, and the same numbers came back. The recursion is the verification.

The honest caveats stay honest. Behavioral-equivalence fidelity is highest on TypeScript, Python, Rust, and Go — the languages whose control flow is modeled deeply. Elsewhere it declares Unknown rather than guessing. Cross-language equivalence isn't behavioral-equivalence's job; that's verify_port_parity. Tier-2 port execution is a CPU/memory boundary, not a security sandbox. These aren't disclaimers — they're the product. An audit that says Unknown when it can't be sure beats one that returns a confident wrong answer. That's the design.
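The "Unknown instead of guessing" stance is easy to encode in the type. This is a toy sketch, not the product's analysis: the Verdict enum is hypothetical, and comparing branch counts is a deliberately crude stand-in for real control-flow modeling. Only the list of four deeply modeled languages comes from the paragraph above.

```rust
/// Three-valued result: the checker may decline to answer.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Verdict {
    Equivalent,
    Changed(String),
    Unknown(String),
}

/// Languages whose control flow is modeled deeply, per the caveat above.
pub const DEEP_MODELS: [&str; 4] = ["go", "python", "rust", "typescript"];

/// Toy equivalence check: branch counts stand in for a real control-flow model.
pub fn check_equivalence(lang: &str, branches_before: usize, branches_after: usize) -> Verdict {
    if !DEEP_MODELS.iter().any(|known| *known == lang) {
        return Verdict::Unknown(format!("control flow for {lang} is not modeled deeply"));
    }
    if branches_before == branches_after {
        Verdict::Equivalent
    } else {
        Verdict::Changed(format!(
            "branch count went from {branches_before} to {branches_after}"
        ))
    }
}

/// A merge gate should pass only on a positive verdict; Unknown is not a pass.
pub fn may_auto_merge(verdict: &Verdict) -> bool {
    matches!(verdict, Verdict::Equivalent)
}
```

Making Unknown a variant rather than an error or a default forces every caller to decide what to do with it, so it cannot quietly collapse into "probably fine".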

What almost killed the project

Two moments.

The W1 dispatcher. Through the first two months, refactor.rs::dispatch_inner grew. It was the central match op body that routed every refactor operation to its implementation. By the time the audit ran on 2026-05-28, it was 1,689 lines, 114 inline-dispatched variants, composite complexity score 63,125 — the worst hotspot in the codebase by an order of magnitude. Then the audit did the most valuable thing an audit can do: it said this is not a god-object. The function was pure: true — a flat delegating match, no business logic concentrated in it. The problem was dispatch volume, not dispatch logic. The fix wrote itself: build a HANDLER_REGISTRY that ops register into at load, route through it by default, fall through to the legacy match only for ops that hadn't migrated. Then migrate the ops, prove parity, retire the match.

The W1 ledger has the whole thing. Twelve op-groups — core, extract, generate, introduce, change, convert, import, wrap, structural, and the language cohorts for Go, Python, and C++ — each one migrated as a batch, proven byte-for-byte against its legacy arm, then struck from the deny-list. Six arms turned out to be fully dead: variants constructed nowhere in the codebase, not even in a test. They got marked for deletion. Two more — move and extract_function — stayed blocked, with written reasons. Then Phase 3 turned the lights on inside dispatch_inner and watched: only the two known-broken Dart ops ever fell through to the catch-all. So the arms came out. The function went from 1,689 lines to about 83 lines of residue, its complexity score from 63,125 down to 6,938, and the codebase's worst hotspot fell clean off the top five.

The lesson is that the audit's exoneration was worth as much as its accusation. A careless audit says "god-object" and triggers a rewrite. The real finding — pure delegating dispatch, migration target identified — is what made the cleanup tractable. And the Refuted Findings ledger captured the dismissal next to the action, so the next audit doesn't re-run the argument.

The C1 cycle. Late May, the zero-cycle record regressed. A length-2 cycle had crept in — triggers.rs ⇄ engine.rs, inside act-cli/src/mcp/upgrade_hints/. The audit found it. Then a Phase-1 subagent, asked to locate the exact pair, guessed wrong. The follow-up audit verified the real spot: engine.rs:16, crossing the symbol BuildCtx. Only one such pair existed in the whole codebase, and the subagent's guess wasn't just imprecise — it was wrong. The verifier caught the error before it turned into a fix in the wrong place.

The fix was clean once the diagnosis was right. BuildCtx depended only on HintConfig and Tier. Lifting it into a new ctx.rs — depended on by both engine and triggers, depending on neither — broke the back-edge without moving the cycle. analyze_cycles against the full tree returned zero again. Commit 6f8d07929. Tests green: 37 upgrade_hints integration plus 27 lib unit. Zero-cycle record restored.
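The shape of that fix, sketched as inline modules. The module names mirror the files named above, but the field names, the Tier variants, the function bodies, and the choice to define HintConfig and Tier inside ctx are assumptions made to keep the example self-contained. Which direction the remaining engine-to-triggers dependency points is also an assumption.

```rust
pub mod upgrade_hints {
    /// ctx.rs: the shared type lives here and imports from neither sibling.
    pub mod ctx {
        #[derive(Debug, Clone, Copy, PartialEq, Eq)]
        pub enum Tier {
            Builder,
            Engineering,
        }

        #[derive(Debug, Clone)]
        pub struct HintConfig {
            pub enabled: bool,
        }

        #[derive(Debug, Clone)]
        pub struct BuildCtx {
            pub config: HintConfig,
            pub tier: Tier,
        }
    }

    /// triggers.rs: depends on ctx only.
    pub mod triggers {
        use super::ctx::{BuildCtx, Tier};

        pub fn should_fire(ctx: &BuildCtx) -> bool {
            ctx.config.enabled && ctx.tier == Tier::Builder
        }
    }

    /// engine.rs: depends on ctx and triggers; nothing points back at it.
    pub mod engine {
        use super::ctx::BuildCtx;
        use super::triggers;

        pub fn build_hint(ctx: &BuildCtx) -> Option<&'static str> {
            if triggers::should_fire(ctx) {
                Some("upgrade available")
            } else {
                None
            }
        }
    }
}
```

The dependency edges are now engine to triggers, engine to ctx, and triggers to ctx. All of them point one way, so there is no path from triggers back to engine and the length-2 cycle is gone rather than relocated.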

That one's the headline of the whole post. An agent's confident answer is one input, not the verdict. The Phase-1 subagent guessing wrong wasn't the failure — the system was built so the wrong guess couldn't become a wrong action. That's what reward separation actually buys you: not the elimination of agent error, but the routing of agent error away from the irreversible step.

The seven decisions

The choices the build forced, in roughly the order they had to be made:

A layered crate hierarchy with strict downward dependency flow. Zero cycles. The precondition for parallel work without import-graph deadlock.
Per-language dispatch with the one-line shared-file rule. Twenty agents, twenty languages, no source-level merge conflict.
One worktree per agent, with scoped checks at the work-unit boundary. Parallel orchestration without physical contention.
Pre-commit hooks and CI gates over CLAUDE.md advice. Anything that mattered moved from advisory text to enforcement; the Brown M&M rule is the test that the advice was even read.
"Read actual code," plus an architecture-audit skill that owns the verdict. The agent's claim is one input. The audit and the Refuted Findings ledger are the durable record.
A named anti-cheat catalogue and deny-lists with written reasons. Every evasion has a name; every block has a date and a fix path.
A verification stack aimed at the surface customers touch — real licenses, real codebases, real benchmarks. What the customer hits through MCP is what the gate tests.

None of these are novel alone. The novelty is that all seven ran concurrently, against an agentic build, on a binary that had to ship under license, in three months, without a team. The combination is what made the volume safe.

What this becomes

The build forced a methodology, and the methodology became the product. act101 ships the same workflows I ran on myself, in the same shape: analyze → act → attest. architecture-audit builds a complete model of a codebase and grounds it in evidence — coverage, churn, co-change coupling, ownership. architectural-refactoring executes the moves the model implies — break a cycle, extract an interface at a seam, lift coupling off a chokepoint. verify-refactor proves each move changed shape without changing behavior. Per-PR proof — every hunk classified, side effects diffed, contract checked, tests reached — is the same loop scaled to a diff. A cross-language port runs on the same machinery plus the porting state machine: contract, inventory, order, manifest, parity. Every workflow ends in receipts.

The playbooks I wrote alongside the build — refactoring, testing uplift, modernization, migration — are the long-form version of how to run those workflows on your own code. They aren't separate from the product; they're the procedural form of what act does when you call it through MCP. The recursion holds at the documentation layer too: each playbook is itself an architecture-audit → architectural-refactoring → verify-refactor loop, pointed at legacy code that was never built with act101.

Shortest version of what I learned: the bottleneck of agentic software is not generation. Agents generate. The bottleneck is attest. Build the attest layer first. Analyze and act will outrun it on their own.

That's the layer act101 is. It is, somewhat literally, the toolchain the agent never had.

It drops into Claude Code, Cursor, Codex, Windsurf, Cline — any MCP host. The Builder edition is free: all 18 query tools, rename, fix-auto, and five analysis tools, no trial, no expiry, no card. Live at act101.ai.
