Refactoring with Agents: A Playbook

act101 · refactoring, agents

Shipping a refactor used to mean a careful person reading a class for an afternoon, sketching the cut on paper, writing the tests, and pushing the diff. Today it means asking an agent to "clean this up" and hoping the diff doesn't break prod. The first style was slow and reliable. The second is fast and unreliable. The whole point of this piece is that you don't have to pick: the same engineering discipline that made refactoring trustworthy in the IDE era still works; you just have to wire it around the agent instead of the human.

If you want the verdict in one breath: an agent refactor is a deploy — version the audit it worked from, gate the diff on the verify step, never merge a refactor whose behavior you can't prove preserved, and roll back without ceremony when the verify step disagrees with you. Skip any of those and you don't have agentic refactoring; you have a very confident git push.

Why refactoring with agents needs its own discipline

Classic refactoring rested on two quiet assumptions: the person doing the work understood the code they were changing, and a green test suite meant behavior was preserved. Agents break both at once, and most teams haven't noticed yet because the diffs still look reasonable on a first read.

The agent's "understanding" is a snapshot, not a model. It read what fit in context. The class it didn't open is the one that breaks at runtime.
Tests pass for the wrong reason. An agent that rewrites a function will happily also rewrite the test that pinned its behavior, and now both are wrong in agreement.
The diff is bigger than it looks. A "small" rename touches forty files. A "small" extraction silently changes call ordering. Review fatigue sets in by file twelve and the rest gets a nod.
Behavior changes hide inside refactors. The agent inlined a default, dropped a nil guard it didn't see called, swapped a stable sort for an unstable one. None of it shows up as a logic change in the PR title.
The work isn't reproducible. A second agent run produces a different diff for the same prompt. Without a recorded audit and plan, "redo it the same way" is a wish.

The bottom line: classic refactoring measured whether the code got cleaner. Agentic refactoring has to measure whether behavior was preserved while the code got cleaner — and those are two different gates, run by two different stages, in that order. The whole playbook is built around closing the second one.

The maintainability loop, end to end

     audit  ->  refactor  ->  verify
       ^                         |
       |_________________________|
          (audit gets re-scored after merge)

Three stages, one direction, and a back-edge so the audit you started with is the audit you re-score against. The order matters: the audit decides what the refactor is allowed to touch, the refactor produces a diff bounded by the audit, and the verify stage decides whether the diff is allowed to merge. Skip the audit and the agent invents its own scope. Skip verify and you're shipping vibes.

This loop is the canonical pattern for agentic refactoring — it predates any one tool, and any team can run it with off-the-shelf parts. We'll walk each stage in turn. Every stage has a uniquely agentic twist — the audit isn't a Confluence page, the refactor isn't a one-shot prompt, and the verify isn't npm test.
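The back-edge is easiest to picture as a comparison of two audit snapshots. A minimal sketch, assuming each snapshot can be reduced to a set of smell IDs; the `rescore` function and its return shape are illustrative, not part of any tool named in this piece:

```python
def rescore(before: set[str], after: set[str], targeted: set[str]) -> dict:
    """Compare smell IDs from the original audit with a fresh audit of the merged tree."""
    still_present = sorted(targeted & after)  # targeted smells the refactor failed to remove
    introduced = sorted(after - before)       # smells that did not exist before the refactor
    return {
        "ok": not still_present and not introduced,
        "still_present": still_present,
        "introduced": introduced,
    }


# Smell IDs borrowed from the audit example later in this piece.
before = {"coupling.payment-inventory", "layering.controller-db"}
after = {"layering.controller-db"}
result = rescore(before, after, targeted={"coupling.payment-inventory"})
print(result)
```

A real re-score would compare numeric scores as well as IDs; the set comparison only shows the two conditions the loop cares about: the target is gone and nothing new arrived.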

Stage 1 · The audit

The single most common reason an agentic refactor goes sideways isn't the model — it's that nobody decided what the refactor was for before they started. The audit stage is unglamorous, takes a couple of hours per service, and saves you a week of churn per refactor that follows. Do four things:

Pick a target boundary. Module, package, bounded context, hot file. Not "the whole repo." An audit without a boundary is a tour.
Score what's actually wrong. Coupling, layering violations, cyclomatic complexity, dead code, hot files (high churn × high blame count), inconsistent abstractions, type-safety gaps. Numbers, not adjectives — and they come from structural analyzers run against the tree, not an engineer's memory: a coupling matrix, a dependency-cycle list, complexity hotspots, dead-code and layering reports, type-completeness gaps.
Name the smells you will not touch this round. A good audit is as much about scope exclusions as inclusions. The agent will refactor anything you don't fence off.
Write the audit to disk. A versioned artifact, in git, with a SHA. This is the contract every downstream stage works from.
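As one illustration of "numbers, not adjectives", here is a rough sketch of the hot-file score. It assumes "blame count" means the number of distinct authors per file, and that both counts were already pulled from version-control history; the sample figures are made up for the example.

```python
def rank_hot_files(churn: dict[str, int], authors: dict[str, int], top: int = 3) -> list[tuple[str, int]]:
    """Score each file as churn x author count and return the highest scorers."""
    scores = {path: commits * authors.get(path, 1) for path, commits in churn.items()}
    # Sort by score descending, then by path, so ties come out in the same order every run.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]


# Hypothetical counts: commits touching each file, and distinct authors per file.
churn = {"payment/refund.go": 42, "order/cart.go": 17, "inventory/sku.go": 30}
authors = {"payment/refund.go": 9, "order/cart.go": 2, "inventory/sku.go": 4}

hot = rank_hot_files(churn, authors, top=2)
print(hot)
```

The product is deliberately crude. Its job is to give the audit a ranked list that someone can argue with, which is hard to do with "that file feels messy".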

Borrow from architecture review, not code review. Code review treats the diff as the asset; architecture review treats the structure as the asset. The audit's job is to make the structure legible to the next stage so the diff it produces is bounded, not heroic.

# audits/checkout-service.2026-05-30.yaml — one file, in git, one SHA.
audit:
  target: services/checkout
  scored_on: 2026-05-30
  scope:
    include: [order, payment, inventory]
    exclude: [legacy_promo, experimental_tax]   # known landmines
  smells:
    - id: coupling.payment-inventory
      kind: inappropriate-intimacy
      evidence: payment/refund.go reaches into inventory.SKU internals
      severity: high
    - id: layering.controller-db
      kind: violated-layering
      evidence: 14 controllers query the DB directly
      severity: high
    - id: dead.legacy-coupon-paths
      kind: dead-code
      evidence: no callers since 2024-11; covered by 0 tests
      severity: medium
  out_of_scope:
    - rename anything in the public API
    - touch the migration runner
  budgets:
    max_files_changed: 60
    max_loc_changed: 2500
    max_runs: 3

The biggest single thing most teams skip: writing the audit down at all. A refactor without an audit is the agent's audit, run silently in its head, with no SHA to roll back to. Make it a file. The file's SHA is what the next two stages cite.
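Once the audit is a file, it can be linted like one. The sketch below assumes the `audit:` mapping has already been parsed into a Python dict (by whatever YAML loader the project uses) and checks a few properties this playbook asks for. The function name and the cap of ten smells are assumptions for illustration.

```python
def validate_audit(audit: dict, max_smells: int = 10) -> list[str]:
    """Return a list of problems; an empty list means the audit is usable as a contract."""
    problems = []
    if not audit.get("target"):
        problems.append("no target boundary")
    scope = audit.get("scope", {})
    overlap = set(scope.get("include", [])) & set(scope.get("exclude", []))
    if overlap:
        problems.append(f"scope both includes and excludes: {sorted(overlap)}")
    ids = [smell.get("id") for smell in audit.get("smells", [])]
    if len(ids) != len(set(ids)):
        problems.append("duplicate smell ids")
    if len(ids) > max_smells:
        problems.append(f"{len(ids)} smells exceeds the cap of {max_smells}")
    for key in ("max_files_changed", "max_loc_changed", "max_runs"):
        if key not in audit.get("budgets", {}):
            problems.append(f"missing budget: {key}")
    return problems


# A trimmed copy of the checkout audit above, as a parsed dict.
audit = {
    "target": "services/checkout",
    "scope": {"include": ["order", "payment", "inventory"],
              "exclude": ["legacy_promo", "experimental_tax"]},
    "smells": [{"id": "coupling.payment-inventory"}, {"id": "layering.controller-db"}],
    "budgets": {"max_files_changed": 60, "max_loc_changed": 2500, "max_runs": 3},
}
print(validate_audit(audit))
```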

Stage 2 · The refactor

This is the stage everyone thinks is the whole job, and it's actually the easy one once the audit is real. The refactor stage doesn't decide what to change; it executes the audit. If the agent proposes a change that isn't in the audit, that's a planning failure upstream — fix the audit, don't fudge the refactor.

The move that removes the risk is to make edits through deterministic, AST-aware operations — rename, extract, inline, move, generate — applied across the whole project, each with a receipt and a single-step undo, rather than freehand text edits. That is what turns "the agent improvised and silently changed behavior" from the default failure mode into a non-event: the same operation on the same input produces the same diff on every run, so the non-determinism the discipline exists to fence off is gone at the source rather than caught downstream.
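To make "deterministic, with a receipt and an undo" concrete, here is a toy rename built on Python's standard `ast` module. It is a sketch under loud assumptions: it ignores scoping, attributes, and parameters, and `ast.unparse` discards comments and formatting, which a production tool would preserve with a lossless syntax tree. The receipt layout is invented for the example.

```python
import ast
import hashlib


class _Rename(ast.NodeTransformer):
    """Rename one identifier wherever it appears as a name or a function definition."""

    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        if node.name == self.old:
            node.name = self.new
        self.generic_visit(node)  # keep walking into the function body
        return node


def _digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def rename(source: str, old: str, new: str) -> tuple[str, dict]:
    """Apply the rename and return the new source plus a receipt that can undo it."""
    result = ast.unparse(_Rename(old, new).visit(ast.parse(source)))
    receipt = {"op": "rename", "old": old, "new": new,
               "before": _digest(source), "after": _digest(result),
               "undo": source}  # single-step undo: restore the recorded original
    return result, receipt


src = "def total(xs):\n    return sum(xs)\n\nprint(total([1, 2]))\n"
first, receipt = rename(src, "total", "order_total")
second, _ = rename(src, "total", "order_total")
print(first)
print(first == second)
```

The property that matters is the last line: two runs over the same input agree, which is something a freehand text edit cannot promise.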

The shape that works:

One smell, one PR. Bundling two refactors triples the review cost and quadruples the verify cost. A 400-LOC PR that fixes one smell is shippable. A 2,000-LOC PR that fixes five is a re-review forever.
Plan, then patch. The agent emits a written plan first — files touched, public-API impact, ordering of edits, behavior the patch must not change. The plan is reviewed before any diff is generated. Most bad refactors die here, cheaply.
Edits respect a budget. Files changed, LOC changed, runs allowed. The budget lives in the audit. Hitting it doesn't mean "ask for more"; it means the audit was wrong about the scope, so go back and fix the audit.
Public API moves are gated. A rename of an exported symbol is not the same risk as a rename of a private one. The agent flags every public-surface change for explicit human sign-off before the patch lands.
The diff cites the audit. Each hunk references which smell it addresses. Hunks that don't cite anything get cut before review.

One refactor = one declarative artifact. Audit SHA, smell IDs being addressed, plan, diff, run metadata. All in the PR. The PR's content is the artifact; nothing about how the agent reached it should live only in chat history.

# refactors/2026-05-30-decouple-payment-inventory.yaml
refactor:
  audit: audits/checkout-service.2026-05-30.yaml@a1b2c3d
  addresses: [coupling.payment-inventory]
  plan:
    - extract InventoryReservation interface in inventory/
    - inject into payment/refund.go via constructor
    - delete payment's direct calls into inventory.SKU
  public_api_changes: []        # if non-empty, requires human approval
  budgets_used:
    files_changed: 11
    loc_changed: 340
    runs: 1
  verify:
    plan: verify/2026-05-30-decouple-payment-inventory.yaml
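The refactor artifact is only useful if something checks it against the audit. A minimal sketch of that check, assuming both files have been parsed into dicts and that each diff hunk carries at most one cited smell ID; `check_refactor` and the `hunk_citations` list are hypothetical names:

```python
def check_refactor(refactor: dict, audit: dict, hunk_citations: list) -> list[str]:
    """Return every way the refactor steps outside the audit it cites."""
    problems = []
    known = {smell["id"] for smell in audit["smells"]}
    for smell_id in refactor["addresses"]:
        if smell_id not in known:
            problems.append(f"addresses unknown smell {smell_id!r}")
    caps, used = audit["budgets"], refactor["budgets_used"]
    for used_key, cap_key in (("files_changed", "max_files_changed"),
                              ("loc_changed", "max_loc_changed"),
                              ("runs", "max_runs")):
        if used[used_key] > caps[cap_key]:
            problems.append(f"{used_key}={used[used_key]} exceeds {cap_key}={caps[cap_key]}")
    # Every hunk must cite a smell this refactor claims to address.
    for index, cited in enumerate(hunk_citations):
        if cited not in refactor["addresses"]:
            problems.append(f"hunk {index} cites {cited!r}, which this refactor does not address")
    return problems


audit = {"smells": [{"id": "coupling.payment-inventory"}, {"id": "layering.controller-db"}],
         "budgets": {"max_files_changed": 60, "max_loc_changed": 2500, "max_runs": 3}}
refactor = {"addresses": ["coupling.payment-inventory"],
            "budgets_used": {"files_changed": 11, "loc_changed": 340, "runs": 1}}

# Hypothetical citations: one valid hunk, one with no citation, one for a different smell.
print(check_refactor(refactor, audit, ["coupling.payment-inventory", None, "layering.controller-db"]))
```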

Stage 3 · Verify behavior preservation

The verify stage is the single highest-leverage thing in this whole loop. Without it, a refactor is a deploy with your fingers crossed and a passing test suite. With it, you catch the regressions that hide inside diffs everybody assumed were behavior-preserving.

Verify has to do more than run the existing tests. The existing tests are part of what the agent might have edited. Verify needs at least four kinds of behavior signal plus a structural re-score, and the agentic ones aren't optional:

Characterization tests. Pin the current behavior of the code being refactored, generated before the refactor runs. The agent writes them against the pre-refactor binary; the post-refactor binary has to pass them unchanged. This is the closest thing to a behavior-diff you can automate.
Test diff review. Every test the agent edited gets its own gate — was the edit a legitimate rename, or did the agent quietly weaken an assertion? A test going from assertEquals to assertNotNull is a red flag; the gate catches it.
Property checks where they fit. Hashing, parsing, serialization, idempotency — anything with an invariant gets a property-based test bolted on, not just an example test. A clean refactor preserves invariants the example tests never thought to express.
Behavior trace diff. A representative trace from the pre-refactor build, replayed against the post-refactor build, compared at the IO boundary. If the output differs, the refactor isn't behavior-preserving, no matter what the unit tests say.
Structural re-score. Re-run the same structural analyzers that produced the audit against the post-refactor tree. The targeted smell has to be measurably gone, and no new cycle, coupling spike, or layering break may have appeared in its place. This is the architecture-side gate — distinct from the behavior-preservation signals above — and it's what closes the loop's back-edge mechanically instead of by re-auditing from scratch.
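The first signal, characterization, amounts to record-then-replay. The sketch below assumes the code under refactor can be called as a plain function with JSON-serializable arguments and results; `record`, `replay`, and the two `refund_*` functions are invented for the example.

```python
import json


def record(fn, inputs) -> str:
    """Run fn over inputs and capture either the result or the exception type."""
    snapshot = []
    for args in inputs:
        try:
            outcome = {"ok": fn(*args)}
        except Exception as exc:
            outcome = {"raises": type(exc).__name__}
        snapshot.append({"args": list(args), "outcome": outcome})
    return json.dumps(snapshot, sort_keys=True)


def replay(fn, golden: str) -> list:
    """Return the argument lists whose outcome no longer matches the golden snapshot."""
    expected = json.loads(golden)
    fresh = json.loads(record(fn, [tuple(case["args"]) for case in expected]))
    return [old["args"] for old, new in zip(expected, fresh) if old != new]


def refund_before(price, qty):
    if qty is None:  # the nil guard
        return 0
    return price * qty


def refund_after(price, qty):  # looks cleaner, but the guard is gone
    return price * qty


cases = [(10, 2), (5, 0), (10, None)]
golden = record(refund_before, cases)  # pinned before the first edit
changed = replay(refund_after, golden)
print(changed)
```

The golden snapshot is taken from the pre-refactor code and committed, so the agent has nothing to "fix" on the test side when the replay disagrees.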

Don't ship a single number. Verify reports per dimension: characterization pass rate, test-diff integrity score (1.0 means no assertion was weakened), property-check pass rate, trace-diff match rate. Each is normalized so that higher is better and 1.0 is clean. "Tests pass" hides the case where the agent quietly changed three tests to make them pass. Score per dimension, with a threshold each.

# verify/run.py — a tiny but real verify gate.
import sys

# The scorers below are assumed to live in this repo's own verify/ package.
from verify import characterize, test_diff, properties, trace_diff

target = "services/checkout"
audit  = "audits/checkout-service.2026-05-30.yaml"

scores = {
    "characterization":  characterize.run(target, baseline="main"),
    "test_diff":         test_diff.score(against="main"),
    "properties":        properties.run(target),
    "trace_diff":        trace_diff.run(target, traces="fixtures/checkout/*.jsonl"),
}

THRESHOLDS = {
    "characterization": 1.00,   # behavior preservation is binary
    "test_diff":        0.90,   # weakened assertions drop this fast
    "properties":       1.00,
    "trace_diff":       1.00,   # any difference at the IO boundary is a behavior change
}
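# Any dimension below its threshold fails the gate; there is no overall score.
failed = {k: scores[k] for k, t in THRESHOLDS.items() if scores[k] < t}
if failed:
    print(f"VERIFY GATE FAILED for {audit}:", failed)
    sys.exit(1)
print(f"VERIFY GATE PASSED for {audit}:", scores)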
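The gate script treats the test-diff scorer as a black box. One hypothetical way to fill it in is to rank assertions by how much behavior they pin and flag any edit that moves down the ranking. The strength table below is an assumption, and unlike the script's `test_diff.score(against="main")` this sketch takes the edited line pairs directly instead of reading them from git.

```python
import re

# Assumed ranking: a higher number means the assertion pins more behavior.
STRENGTH = {"assertEquals": 3, "assertEqual": 3, "assertTrue": 2,
            "assertIsNotNone": 1, "assertNotNull": 1}
ASSERT_CALL = re.compile(r"\b(assert\w+)\s*\(")


def strength(line: str) -> int:
    """Strength of the first assertion on the line; 0 if the line asserts nothing."""
    match = ASSERT_CALL.search(line)
    return STRENGTH.get(match.group(1), 2) if match else 0


def score(edits: list[tuple[str, str]]) -> float:
    """Each edit pairs a removed test line with its replacement; 1.0 means nothing weakened."""
    if not edits:
        return 1.0
    kept = sum(1 for old, new in edits if strength(new) >= strength(old))
    return kept / len(edits)


edits = [
    ("assertEquals(42, total(cart))", "assertEquals(42, order_total(cart))"),  # a rename
    ("assertEquals(42, total(cart))", "assertNotNull(total(cart))"),           # weakened
]
print(score(edits))
```

A deleted assertion pairs with an empty replacement, scores zero strength, and counts as weakened, which is the behavior the gate wants.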

The merge gate

The gate is what stops a refactor PR from touching main. The minimum that earns its keep:

Audit SHA present and unmodified. The PR cites a real audit at a real revision. The audit hasn't been edited inside the same PR (that's a planning leak).
Plan reviewed before diff generated. Trivially enforced — the plan file lands in a commit before any code diff. If it doesn't, the PR is closed and re-opened the right way.
Verify gate green on every dimension. No "overall passing." A single dimension red is a red PR.
Public-API changes signed off. A human owner of the affected surface has approved, not just an LLM-as-reviewer.
Budget respected. Files and LOC under the audit's cap. Going over means re-audit, not "approved with note."

A refactor PR with all five is mergeable. A refactor PR with four of five is interesting — and goes back to whichever stage owns the missing one.
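The five conditions fit in one function that reports each check by name, so a red PR says which stage owns the failure. A sketch only: the `RefactorPR` fields are hypothetical, and a real gate would read them from the PR, the audit file, and the verify output.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RefactorPR:
    audit_sha_cited: bool
    audit_modified_in_pr: bool
    plan_committed_before_diff: bool
    verify_scores: dict
    verify_thresholds: dict
    files_changed: int
    loc_changed: int
    max_files_changed: int
    max_loc_changed: int
    public_api_changes: list = field(default_factory=list)
    public_api_approved_by: Optional[str] = None


def merge_gate(pr: RefactorPR) -> dict:
    """Evaluate the five checks separately; a PR merges only if every value is True."""
    return {
        "audit": pr.audit_sha_cited and not pr.audit_modified_in_pr,
        "plan": pr.plan_committed_before_diff,
        "verify": all(pr.verify_scores.get(name, 0.0) >= floor
                      for name, floor in pr.verify_thresholds.items()),
        "public_api": not pr.public_api_changes or pr.public_api_approved_by is not None,
        "budget": (pr.files_changed <= pr.max_files_changed
                   and pr.loc_changed <= pr.max_loc_changed),
    }


pr = RefactorPR(
    audit_sha_cited=True, audit_modified_in_pr=False, plan_committed_before_diff=True,
    verify_scores={"characterization": 1.0, "test_diff": 0.8},
    verify_thresholds={"characterization": 1.0, "test_diff": 0.9},
    files_changed=11, loc_changed=340, max_files_changed=60, max_loc_changed=2500,
)
failing = [name for name, ok in merge_gate(pr).items() if not ok]
print(failing)
```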

Best practices, in plain English

One refactor = one audit smell. The PR title is the smell ID. If you can't name a single smell, you don't have a refactor; you have a rewrite.
The audit is the spec. When the agent and the audit disagree, the audit wins. Update the audit or update the refactor — never silently both.
Characterize before you cut. Pin behavior before the first edit. A characterization test written after the refactor pins the refactor, not the behavior.
Pin the agent and the model. Same model version, same agent definition, recorded in the refactor artifact. "It worked last time" isn't reproducible without it.
Trace every run. Plan, tool calls, files read, files written, tokens used. A refactor whose run trace you can't replay is a refactor you can't audit.
Public API moves get human owners. Not "the team," not "Slack" — a name in a CODEOWNERS file.
Treat dead code carefully. "No callers" doesn't mean "no callers in production." Verify with a week of trace data before the agent deletes anything.
Refactor and rewrite are different jobs. A rewrite needs an evals-style gate (does the new system behave like the old one in expectation?). A refactor needs verify. Mixing them sinks both.
Make the agent show its work. Plan in markdown, edits in atomic commits, reasoning in the PR body. "Here's the diff, trust me" is the failure mode.
The on-call runbook is part of the loop. When a refactor breaks something in prod, the post-mortem updates the audit's smell catalogue, not just the test suite.
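"Pin the agent and the model" and "trace every run" both come down to writing a small manifest next to the diff. A sketch, assuming the agent definition and the plan are available as text; the field names and the model string are placeholders, not a real format.

```python
import hashlib
import json


def _sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_manifest(model: str, agent_definition: str, plan: str, tool_calls: list) -> dict:
    """Build a record of one agent run; the ID changes if any recorded input changes."""
    record = {
        "model": model,
        "agent_definition_sha256": _sha(agent_definition),
        "plan_sha256": _sha(plan),
        "tool_calls": tool_calls,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the ID stable across machines.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return {**record, "manifest_id": _sha(canonical)[:16]}


manifest = run_manifest(
    model="example-model-2026-05-01",  # placeholder, not a real model name
    agent_definition="You refactor one smell per PR.",
    plan="extract InventoryReservation interface in inventory/",
    tool_calls=[{"tool": "read_file", "path": "payment/refund.go"}],
)
print(json.dumps(manifest, indent=2))
```

Attached to the PR, the manifest is what lets "how did the agent produce this diff" be answered six weeks later without anyone's chat history.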

Failure modes & gotchas

These have actually taken teams down. Every one has a one-week fix that nobody had time for.

Behavior-preserving on paper, not in trace. Unit tests green, characterization green, but the production trace diff shows a 0.3% change in ordering. Fix: trace diff on real fixtures, not just unit tests.
The agent rewrote the tests. Tests now pass because they assert less. Fix: test-diff scoring gate; assertions that weaken automatically block the PR.
Audit drift. The audit was written in March; it's now May; the smells listed don't match the current code. Fix: an audit has a TTL; expired audits are re-run, not extended.
Public-API rename creeps in. The agent thought a public symbol was internal and renamed it. Fix: API-surface diff in the gate; any public-surface delta needs a human owner sign-off.
Refactor-on-refactor. Two open PRs each refactor the same module; the second to merge silently undoes parts of the first. Fix: audit ownership is exclusive while a refactor PR is open against it.
Untraceable run. The PR landed; six weeks later something broke; nobody can reproduce how the agent produced the diff. Fix: model version, agent definition, plan, tool trace — all attached to the PR as artifacts.
The audit became a wishlist. Forty smells, all P1, none scoped. Fix: the audit caps itself — a service-level audit ships with at most ten smells, ranked, with budgets each.
One-shot prompting. A single "refactor this module" prompt produces a 4,000-LOC diff the team can't review. Fix: plan-then-patch as a hard step, not a convention.

The gotcha behind half of them: the refactor's behavior depends on something that wasn't in git. An audit edited in chat, a model that auto-upgraded, a characterization test generated and discarded. Make every behavior-defining piece an artifact in the PR and most of this list goes away.
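The audit-drift fix is the simplest of these to automate. A sketch of the TTL check, assuming the gate can read the audit's `scored_on` date; the 30-day default is an arbitrary choice for the example, not a recommendation from this playbook.

```python
from datetime import date, timedelta


def audit_expired(scored_on: date, today: date, ttl_days: int = 30) -> bool:
    """An audit older than its TTL must be re-run, not extended."""
    return today > scored_on + timedelta(days=ttl_days)


# The March-to-May drift described in the list above, with hypothetical dates.
print(audit_expired(date(2026, 3, 10), today=date(2026, 5, 12)))
print(audit_expired(date(2026, 5, 30), today=date(2026, 6, 5)))
```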

Cost, blast radius, and the agent budget

Cost isn't a finance problem — it's a blast-radius problem in disguise. An agent allowed to refactor without bounds will produce a diff nobody can review, which means it will merge half-reviewed, which means the first regression you see is the one users find. Bound the work or the work bounds you.

Budget files changed and LOC changed, per refactor. Set in the audit; enforced by the gate. Hitting the cap means re-audit, not "approve with caveats."
Cap runs per refactor. Three runs is usually plenty. Ten is the agent thrashing.
Cap context spend. Tokens per refactor, per agent. A refactor that took 600k tokens didn't understand the code; it summarized it badly five times.
Cap reviewer load. Diffs over N hunks get auto-split into staged PRs, smallest first. A reviewer who can't hold the diff in their head is a reviewer who waves it through.
Cheap models for the boring cases. A small model finds dead code and renames; the strong model does the cross-module restructures. Routing pays for itself in a week.
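The reviewer-load cap can be sketched as a greedy split. This assumes hunks can be grouped by file and that files can merge independently, which is often false for a real refactor, so treat it as a starting point; `stage_prs` and the sample counts are hypothetical.

```python
def stage_prs(hunks_per_file: dict[str, int], max_hunks: int) -> list[list[str]]:
    """Group files into staged PRs of at most max_hunks hunks, smallest files first."""
    stages, current, used = [], [], 0
    ordered = sorted(hunks_per_file.items(), key=lambda item: (item[1], item[0]))
    for path, hunks in ordered:
        if current and used + hunks > max_hunks:
            stages.append(current)  # close the stage before it goes over the cap
            current, used = [], 0
        current.append(path)
        used += hunks
    if current:
        stages.append(current)
    return stages


# Hypothetical hunk counts per file, with a cap of 8 hunks per PR.
stages = stage_prs({"a.go": 2, "b.go": 5, "c.go": 3, "d.go": 9}, max_hunks=8)
print(stages)
```

A single file over the cap still gets its own stage, as `d.go` does here. That is a signal to go back to the audit, since one file with that many hunks is rarely one smell.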

How act101 fits

Everything above is tool-agnostic on purpose — the loop predates any one product, and you can wire it from a drawer of single-purpose libraries. This is the one place the playbook names a tool, because act101 was built to run exactly this loop in the agent's own protocol (MCP), and it maps one-to-one onto act101's analyze → act → attest shape:

The audit is analyze. act101's analyze_* suite emits the coupling matrix, dependency cycles, chokepoints, complexity hotspots, dead code, layer violations, and type-completeness gaps as structured, scored output — the exact numbers the audit YAML cites, AST-aware across 163 languages. Static analysis always collected far more structure than it ever showed a human; here the whole of it goes to the agent, which is precisely the reader that can use it.
The refactor is act. Deterministic operations — rename, extract, inline, move, generate — produce the diff AST-aware across the project, each with receipts and a single-step undo. A named operation, not a freehand rewrite, so the same input yields the same diff. That is what makes the refactor stage reproducible instead of improvised.
Verify is attest. attest is act101's behavior-preservation gate, not just a structural re-score. verify_contract_preserved and verify_behavioral_equivalence check the post-refactor function against the contract it was supposed to hold (signature, side effects, observable behavior), so a refactor that quietly dropped a nil guard or swapped a stable sort fails at the gate instead of surfacing later in a production trace. The same analyze_* suite re-scores the tree in the same pass, so the targeted smell is measurably gone and no new cycle or coupling spike took its place. act gate --receipts writes the whole verdict out as a content-addressed receipt. Your project-specific characterization, property, and trace tests compose on top: attest gives you the contract and equivalence checks most agent tooling skips, and your suite pins the behavior only you can describe.
The surface is MCP + CLI. Every operation is callable as an MCP tool from inside Claude Code, Cursor, or any MCP host, so the agent runs the loop in-band while it works — and as the act CLI for the verify gate in CI.

The pattern is tool-agnostic; the implementation is not. You could assemble a version of this loop from a dozen single-purpose libraries and a custom scorer — and the four prior generations of refactoring tooling are why the pattern is sound. But wiring structural analysis, deterministic transformation, and re-scoring into one thing the agent can actually drive is the work act101 has already done. The maturity gain is in running the loop; act101 is what makes running it the default instead of a project.

The maturity ladder

Most teams don't sit at one tier — they're advanced on audits and primitive on verify, or the reverse. Tick what you actually do today, not what you mean to do.

[ ] Refactor target boundaries are decided before the agent starts
[ ] The audit is a versioned artifact in git, not a chat message
[ ] The agent emits a plan that is reviewed before any diff is generated
[ ] Each refactor PR addresses exactly one named smell
[ ] Characterization tests are written against the pre-refactor build
[ ] A test-diff scoring gate blocks weakened assertions
[ ] Property checks run on invariant-bearing code
[ ] Trace diff runs on representative production fixtures
[ ] Public-API changes require a named human owner's sign-off
[ ] Rollback is one click and verifies behavior before declaring success

Zero to three: cowboy refactoring. Four to six: serious but porous — regressions slip through the test layer. Seven to nine: the loop is doing the work. Ten: the loop improves itself; your job becomes watching the smells trend down.

A reasonable 30 / 60 / 90-day plan

Days 1–30 — get to honest. Pick one service, one module. Write an audit by hand, commit it. Run one agentic refactor against it with no gates beyond the existing CI. You're not improving anything yet; you're making the current process visible.
Days 31–60 — build the verify gate. Characterization tests on the targeted module, test-diff scoring on every refactor PR, plan-before-diff as a hard step. You can now ship agentic refactors safely, even if slowly.
Days 61–90 — close the loop. Trace-diff on production fixtures, public-API sign-off, audit TTLs, post-merge re-scoring. The audit's smell trend becomes a chart you watch. The loop improves itself.

What the IDE era got right (and where the loop goes from here)

The IDE-era refactoring tradition (Fowler's Refactoring, the catalogs inside IntelliJ and ReSharper, the structural search that grew out of the Smalltalk and Eiffel communities) was obsessed with one idea: behavior preservation by construction. A rename happened atomically across the project, an extract method was provably equivalent to the original, a move-class respected references. The tools were narrow on purpose, because narrow is what made them safe.

Agentic refactoring is broader and looser. The same operations the IDE could prove safe, the agent can only argue are safe. The whole maintainability loop exists to recover the IDE's guarantee from the outside: deterministic, AST-aware operations are the agent-era descendant of ReSharper's provably-safe rename and extract, applied through the AST rather than by hand. Audits constrain what the agent attempts, deterministic operations constrain how it attempts it, and re-scoring proves the result. Read Fowler alongside this playbook; the overlap is the part that's actually load-bearing.

The shortest possible summary: write the audit down. Bound the refactor to the audit. Verify behavior preservation as a separate stage with its own gate. Trace every run. Roll back without ceremony. That loop, run boringly for six months, is the whole game.
