Refactoring with Agents: A Playbook

Shipping a refactor used to mean a careful person reading a class for an afternoon, sketching the cut on paper, writing the tests, and pushing the diff. Today it means asking an agent to "clean this up" and hoping the diff doesn't break prod. The first style was slow and reliable. The second is fast and unreliable. The whole point of this piece is that you don't have to pick — the same engineering discipline that made refactoring trustworthy in the IDE era still works, you just have to wire it around the agent instead of the human.

If you want the verdict in one breath: an agent refactor is a deploy — version the audit it worked from, gate the diff on the verify step, never merge a refactor whose behavior you can't prove preserved, and roll back without ceremony when the verify step disagrees with you. Skip any of those and you don't have agentic refactoring; you have a very confident git push.

Why refactoring with agents needs its own discipline

Classic refactoring rested on two quiet assumptions: the person doing the work understood the code they were changing, and a green test suite meant behavior was preserved. Agents break both at once, and most teams haven't noticed yet because the diffs still look reasonable on a first read.

The bottom line: classic refactoring measured whether the code got cleaner. Agentic refactoring has to measure whether behavior was preserved while the code got cleaner — and those are two different gates, run by two different stages, in that order. The whole playbook is built around closing the second one.

The maintainability loop, end to end

     audit  ->  refactor  ->  verify
       ^                         |
       |_________________________|
          (audit gets re-scored after merge)

Three stages, one direction, and a back-edge so the audit you started with is the audit you re-score against. The order matters: the audit decides what the refactor is allowed to touch, the refactor produces a diff bounded by the audit, and the verify stage decides whether the diff is allowed to merge. Skip the audit and the agent invents its own scope. Skip verify and you're shipping vibes.

This loop is the canonical pattern for agentic refactoring — it predates any one tool, and any team can run it with off-the-shelf parts. We'll walk each stage in turn. Every stage has a uniquely agentic twist — the audit isn't a Confluence page, the refactor isn't a one-shot prompt, and the verify isn't npm test.

Stage 1 · The audit

The single most common reason an agentic refactor goes sideways isn't the model — it's that nobody decided what the refactor was for before they started. The audit stage is unglamorous, takes a couple of hours per service, and saves you a week of churn per refactor that follows. Do four things:

  1. Pick a target boundary. Module, package, bounded context, hot file. Not "the whole repo." An audit without a boundary is a tour.
  2. Score what's actually wrong. Coupling, layering violations, cyclomatic complexity, dead code, hot files (high churn × high blame count), inconsistent abstractions, type-safety gaps. Numbers, not adjectives — and they come from structural analyzers run against the tree, not an engineer's memory: a coupling matrix, a dependency-cycle list, complexity hotspots, dead-code and layering reports, type-completeness gaps.
  3. Name the smells you will not touch this round. A good audit is as much about scope exclusions as inclusions. The agent will refactor anything you don't fence off.
  4. Write the audit to disk. A versioned artifact, in git, with a SHA. This is the contract every downstream stage works from.
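To make step 2 concrete, here is a minimal sketch of the "high churn × high blame count" hot-file score. It assumes blame count means the number of distinct authors in the audit window, and that the commit and author counts were already collected (for example from `git log`); the `FileStats` type, the `hot_file_scores` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    commits: int  # churn: commits touching the file in the audit window
    authors: int  # distinct authors in that window (assumed proxy for blame count)


def hot_file_scores(stats: list[FileStats]) -> list[tuple[str, int]]:
    """Rank files by churn x author count, highest first; ties break by path."""
    scored = [(s.path, s.commits * s.authors) for s in stats]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Illustrative numbers only; a real audit would derive them from history.
sample = [
    FileStats("order/cart.go", commits=5, authors=2),
    FileStats("payment/refund.go", commits=42, authors=7),
    FileStats("inventory/sku.go", commits=30, authors=3),
]
ranking = hot_file_scores(sample)
```

The point of the sketch is only that the score is a number computed from the tree's history, so two people running the audit get the same ranking.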

Borrow from architecture review, not code review. Code review treats the diff as the asset; architecture review treats the structure as the asset. The audit's job is to make the structure legible to the next stage so the diff it produces is bounded, not heroic.

# audits/checkout-service.2026-05-30.yaml — one file, in git, one SHA.
audit:
  target: services/checkout
  scored_on: 2026-05-30
  scope:
    include: [order, payment, inventory]
    exclude: [legacy_promo, experimental_tax]   # known landmines
  smells:
    - id: coupling.payment-inventory
      kind: inappropriate-intimacy
      evidence: payment/refund.go reaches into inventory.SKU internals
      severity: high
    - id: layering.controller-db
      kind: violated-layering
      evidence: 14 controllers query the DB directly
      severity: high
    - id: dead.legacy-coupon-paths
      kind: dead-code
      evidence: no callers since 2024-11; covered by 0 tests
      severity: medium
  out_of_scope:
    - rename anything in the public API
    - touch the migration runner
  budgets:
    max_files_changed: 60
    max_loc_changed: 2500
    max_runs: 3

The biggest single thing most teams skip: writing the audit down at all. A refactor without an audit is the agent's audit, run silently in its head, with no SHA to roll back to. Make it a file. The file's SHA is what the next two stages cite.
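One way to make "the file's SHA" checkable without shelling out to git is to hash the audit's bytes the way git hashes a blob. The sketch below assumes a repository using git's default SHA-1 object format and pins the audit by blob hash rather than commit hash; `git_blob_sha` and `audit_unchanged` are hypothetical helper names.

```python
import hashlib


def git_blob_sha(content: bytes) -> str:
    """Return the SHA-1 git assigns to a blob with exactly this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def audit_unchanged(pinned_sha: str, current_content: bytes) -> bool:
    """True when the audit on disk still hashes to the SHA a refactor cited."""
    return git_blob_sha(current_content) == pinned_sha


# Pin the audit when the refactor starts, then re-check before merge.
original = b"audit:\n  target: services/checkout\n"
pinned = git_blob_sha(original)
still_pinned = audit_unchanged(pinned, original)
edited_in_flight = not audit_unchanged(pinned, original + b"  # widened scope\n")
```

Any edit to the audit, including a widened scope slipped in mid-refactor, changes the hash and shows up as a mismatch.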

Stage 2 · The refactor

This is the stage everyone thinks is the whole job, and it's actually the easy one once the audit is real. The refactor stage doesn't decide what to change; it executes the audit. If the agent proposes a change that isn't in the audit, that's a planning failure upstream — fix the audit, don't fudge the refactor.

The move that removes the risk is to make edits through deterministic, AST-aware operations — rename, extract, inline, move, generate — applied across the whole project, each with a receipt and a single-step undo, rather than freehand text edits. That is what turns "the agent improvised and silently changed behavior" from the default failure mode into a non-event: the same operation on the same input produces the same diff on every run, so the non-determinism the discipline exists to fence off is gone at the source rather than caught downstream.
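As a toy illustration of what "AST-aware" buys you, the sketch below renames an identifier by rewriting the syntax tree with Python's standard `ast` module instead of doing a text search-and-replace. It is deliberately naive: it does no scope analysis and only touches bare name references, so treat it as a picture of the idea rather than a safe rename tool. The `RenameName` class and `rename` function are hypothetical.

```python
import ast


class RenameName(ast.NodeTransformer):
    """Rename every bare-name reference `old` to `new` (no scope analysis)."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def rename(source: str, old: str, new: str) -> str:
    """Parse, rewrite the tree, and regenerate source (needs Python 3.9+)."""
    tree = RenameName(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


before = 'total = 1\nprint("total", total)\n'
after = rename(before, "total", "subtotal")
```

The variable is renamed in both places, while the string literal "total" is left alone because it is a constant in the tree, not a name. Running `rename` again on the same input yields the same output, which is the determinism the paragraph above is asking for.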

The shape that works:

  1. One smell, one PR. Bundling refactors raises review and verify cost faster than it raises the amount of work, because every interaction between the changes has to be checked as well. A 400-LOC PR that fixes one smell is shippable. A 2,000-LOC PR that fixes five gets re-reviewed forever.
  2. Plan, then patch. The agent emits a written plan first — files touched, public-API impact, ordering of edits, behavior the patch must not change. The plan is reviewed before any diff is generated. Most bad refactors die here, cheaply.
  3. Edits respect a budget. Files changed, LOC changed, runs allowed. The budget lives in the audit. Hitting it doesn't mean "ask for more"; it means the audit was wrong about the scope, so go back and fix the audit.
  4. Public API moves are gated. A rename of an exported symbol is not the same risk as a rename of a private one. The agent flags every public-surface change for explicit human sign-off before the patch lands.
  5. The diff cites the audit. Each hunk references which smell it addresses. Hunks that don't cite anything get cut before review.
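The budget in item 3 can be checked mechanically from `git diff --numstat`, which prints one tab-separated `added`, `deleted`, `path` row per file and uses `-` for the counts of binary files. A minimal sketch, assuming "LOC changed" means added plus deleted lines; `parse_numstat` and `over_budget` are hypothetical names.

```python
def parse_numstat(numstat: str) -> tuple[int, int]:
    """Return (files_changed, loc_changed) from `git diff --numstat` output."""
    files = 0
    loc = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, _path = line.split("\t", 2)
        files += 1
        # Binary files report "-" for both counts: count the file, not lines.
        if added != "-":
            loc += int(added) + int(deleted)
    return files, loc


def over_budget(numstat: str, max_files: int, max_loc: int) -> list[str]:
    """List every budget the diff exceeds; an empty list means within budget."""
    files, loc = parse_numstat(numstat)
    problems = []
    if files > max_files:
        problems.append(f"files_changed {files} > {max_files}")
    if loc > max_loc:
        problems.append(f"loc_changed {loc} > {max_loc}")
    return problems


sample = "10\t5\tpayment/refund.go\n-\t-\tdocs/diagram.png\n3\t0\tinventory/reservation.go\n"
within = over_budget(sample, max_files=60, max_loc=2500)
```

A non-empty result is the signal to go back to the audit, not to raise the cap in the same PR.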

One refactor = one declarative artifact. Audit SHA, smell IDs being addressed, plan, diff, run metadata. All in the PR. The PR's content is the artifact; nothing about how the agent reached it should live only in chat history.

# refactors/2026-05-30-decouple-payment-inventory.yaml
refactor:
  audit: audits/checkout-service.2026-05-30.yaml@a1b2c3d
  addresses: [coupling.payment-inventory]
  plan:
    - extract InventoryReservation interface in inventory/
    - inject into payment/refund.go via constructor
    - delete payment's direct calls into inventory.SKU
  public_api_changes: []        # if non-empty, requires human approval
  budgets_used:
    files_changed: 11
    loc_changed: 340
    runs: 1
  verify:
    plan: verify/2026-05-30-decouple-payment-inventory.yaml
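Because both artifacts are plain data, their consistency can be checked before anyone reads the diff. The sketch below assumes the two YAML files have already been loaded into dictionaries with the same keys as the examples above; `validate_refactor` is a hypothetical helper, and the specific checks are one reasonable reading of the rules in this section.

```python
def validate_refactor(refactor: dict, audit: dict, api_signed_off: bool = False) -> list[str]:
    """Return violations; an empty list means the artifact is consistent."""
    errors = []

    known_smells = {smell["id"] for smell in audit["smells"]}
    for smell_id in refactor["addresses"]:
        if smell_id not in known_smells:
            errors.append(f"addresses unknown smell: {smell_id}")
    if len(refactor["addresses"]) != 1:
        errors.append("one smell, one PR: expected exactly one smell id")

    caps = audit["budgets"]
    used = refactor["budgets_used"]
    for used_key, cap_key in (
        ("files_changed", "max_files_changed"),
        ("loc_changed", "max_loc_changed"),
        ("runs", "max_runs"),
    ):
        if used[used_key] > caps[cap_key]:
            errors.append(f"{used_key} {used[used_key]} exceeds {cap_key} {caps[cap_key]}")

    if refactor["public_api_changes"] and not api_signed_off:
        errors.append("public API changes require human sign-off")
    return errors


# Values copied from the two YAML examples, trimmed to the fields checked.
audit = {
    "smells": [
        {"id": "coupling.payment-inventory"},
        {"id": "layering.controller-db"},
        {"id": "dead.legacy-coupon-paths"},
    ],
    "budgets": {"max_files_changed": 60, "max_loc_changed": 2500, "max_runs": 3},
}
refactor = {
    "addresses": ["coupling.payment-inventory"],
    "public_api_changes": [],
    "budgets_used": {"files_changed": 11, "loc_changed": 340, "runs": 1},
}
violations = validate_refactor(refactor, audit)
```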

Stage 3 · Verify behavior preservation

The verify stage is the single highest-leverage thing in this whole loop. Without it, a refactor is a deploy with your fingers crossed and a passing test suite. With it, you catch the regressions that hide inside diffs everybody assumed were behavior-preserving.

Verify has to do more than run the existing tests. The existing tests are part of what the agent might have edited. Verify needs at least four kinds of signal (characterization tests, test-diff scoring, property checks, and trace diffs), and the agentic ones aren't optional.

Don't ship a single number. Verify reports per dimension: characterization pass rate, test-diff suspicion score, property-check pass rate, trace-diff delta. "Tests pass" hides the case where the agent quietly changed three tests to make them pass. Score per dimension, with a threshold each.
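The test-diff suspicion score is the least familiar of the four, so here is one possible definition, offered as an assumption rather than a standard: the share of baseline assertions that the PR left untouched. Any assertion line removed or rewritten in a test file counts against the score, since a weakened assertion appears in a unified diff as a removed line. The regex assumes Python-style tests and is the part to swap for other languages; `assertion_survival_score` is a hypothetical name.

```python
import re

# Matches bare `assert ...` and unittest-style `self.assertX(...)` lines.
ASSERTION = re.compile(r"^\s*(assert\b|self\.assert\w*\()")


def assertion_survival_score(test_diff: str, baseline_assertions: int) -> float:
    """Fraction of baseline assertions not removed or rewritten by the diff."""
    if baseline_assertions <= 0:
        return 1.0  # nothing to weaken
    touched = 0
    for line in test_diff.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not content
        if line.startswith("-") and ASSERTION.match(line[1:]):
            touched += 1
    return max(0.0, 1.0 - touched / baseline_assertions)


diff = "\n".join([
    "--- a/tests/test_refund.py",
    "+++ b/tests/test_refund.py",
    "@@ -10,3 +10,3 @@",
    " def test_refund():",
    "-    assert refund(order) == 500",
    "+    assert refund(order) is not None",
])
score = assertion_survival_score(diff, baseline_assertions=10)
```

One rewritten assertion out of ten puts the score at the 0.90 boundary, which is why a few quiet edits to tests are enough to turn this dimension red.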

# verify/run.py: a tiny but real verify gate.
# Assumes a project-local `verify` package that exposes the four scorers
# used below, each returning a float in [0, 1].
import sys

from verify import characterize, test_diff, properties, trace_diff

target = "services/checkout"
audit  = "audits/checkout-service.2026-05-30.yaml"   # cited in the gate output

scores = {
    "characterization":  characterize.run(target, baseline="main"),
    "test_diff":         test_diff.score(against="main"),
    "properties":        properties.run(target),
    "trace_diff":        trace_diff.run(target, traces="fixtures/checkout/*.jsonl"),
}

# Minimum acceptable score per dimension; a single miss fails the gate.
THRESHOLDS = {
    "characterization": 1.00,   # behavior preservation is binary
    "test_diff":        0.90,   # weakened assertions drop this fast
    "properties":       1.00,
    "trace_diff":       0.99,
}

failed = {k: scores[k] for k, t in THRESHOLDS.items() if scores[k] < t}
if failed:
    print("VERIFY GATE FAILED:", failed, "audit:", audit)
    sys.exit(1)
print("VERIFY GATE PASSED:", scores, "audit:", audit)
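The gate above treats the characterization scorer as a black box. A minimal sketch of what such a scorer might compute, assuming the old and new implementations can both be called in-process: run each on the same inputs and count the cases where the observable outcome (return value, or exception type) matches. The function names and the sample pricing functions are hypothetical.

```python
from typing import Any, Callable, Iterable


def _outcome(fn: Callable[..., Any], args: tuple) -> tuple:
    """Capture what a caller would observe: a value or an exception type."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def characterization_pass_rate(
    old: Callable[..., Any], new: Callable[..., Any], inputs: Iterable[tuple]
) -> float:
    """Fraction of inputs on which `new` behaves exactly like `old`."""
    cases = list(inputs)
    if not cases:
        raise ValueError("characterization needs at least one input")
    passed = sum(1 for args in cases if _outcome(old, args) == _outcome(new, args))
    return passed / len(cases)


def legacy_total(price_cents: int, qty: int) -> int:
    if qty < 0:
        raise ValueError("negative quantity")
    return price_cents * qty


def refactored_total(price_cents: int, qty: int) -> int:
    if qty < 0:
        raise ValueError("qty must be >= 0")
    return qty * price_cents


cases = [(100, 2), (0, 5), (250, 0), (100, -1)]
rate = characterization_pass_rate(legacy_total, refactored_total, cases)
```

Comparing exception types rather than messages is a design choice: it lets a refactor reword an error without failing the gate, while still catching a refactor that stops raising.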

The merge gate

The gate is what stops a refactor PR from touching main. The minimum that earns its keep:

  1. Audit SHA present and unmodified. The PR cites a real audit at a real revision. The audit hasn't been edited inside the same PR (that's a planning leak).
  2. Plan reviewed before diff generated. Trivially enforced — the plan file lands in a commit before any code diff. If it doesn't, the PR is closed and re-opened the right way.
  3. Verify gate green on every dimension. No "overall passing." A single dimension red is a red PR.
  4. Public-API changes signed off. A human owner of the affected surface has approved, not just an LLM-as-reviewer.
  5. Budget respected. Files and LOC under the audit's cap. Going over means re-audit, not "approved with note."

A refactor PR with all five is mergeable. A refactor PR with four of five is interesting — and goes back to whichever stage owns the missing one.
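The five checks compose into a single function that names the failing stage instead of returning a bare pass or fail. This is a sketch under assumptions: the PR facts are presumed to be collected elsewhere into a dictionary, and every key name here is hypothetical.

```python
def merge_gate(pr: dict) -> list[str]:
    """Return the names of failed checks; an empty list means mergeable."""
    failures = []

    # 1. Audit SHA present and unmodified, and not edited inside this PR.
    if pr["audit_sha_cited"] != pr["audit_sha_now"] or pr["audit_edited_in_pr"]:
        failures.append("audit")

    # 2. The plan commit must come before the first code commit.
    if pr["plan_commit_index"] >= pr["first_code_commit_index"]:
        failures.append("plan")

    # 3. Every verify dimension must meet its own threshold.
    red = [k for k, t in pr["thresholds"].items() if pr["scores"].get(k, 0.0) < t]
    if red:
        failures.append("verify:" + ",".join(sorted(red)))

    # 4. Public-API changes need a human owner's approval.
    if pr["public_api_changes"] and not pr["human_signoff"]:
        failures.append("public_api")

    # 5. Files and LOC must stay under the audit's caps.
    if pr["files_changed"] > pr["max_files_changed"] or pr["loc_changed"] > pr["max_loc_changed"]:
        failures.append("budget")

    return failures


pr = {
    "audit_sha_cited": "a1b2c3d", "audit_sha_now": "a1b2c3d", "audit_edited_in_pr": False,
    "plan_commit_index": 0, "first_code_commit_index": 1,
    "thresholds": {"characterization": 1.00, "test_diff": 0.90, "properties": 1.00, "trace_diff": 0.99},
    "scores": {"characterization": 1.0, "test_diff": 0.97, "properties": 1.0, "trace_diff": 0.995},
    "public_api_changes": [], "human_signoff": False,
    "files_changed": 11, "loc_changed": 340,
    "max_files_changed": 60, "max_loc_changed": 2500,
}
result = merge_gate(pr)
```

Returning the failed check names makes "goes back to whichever stage owns the missing one" a lookup rather than a discussion.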

Best practices, in plain English

Failure modes & gotchas

These have actually taken teams down. Every one has a one-week fix that nobody had time for.

The gotcha behind half of them: the refactor's behavior depends on something that wasn't in git. An audit edited in chat, a model that auto-upgraded, a characterization test generated and discarded. Make every behavior-defining piece an artifact in the PR and most of this list goes away.

Cost, blast radius, and the agent budget

Cost isn't a finance problem — it's a blast-radius problem in disguise. An agent allowed to refactor without bounds will produce a diff nobody can review, which means it will merge half-reviewed, which means the first regression you see is the one users find. Bound the work or the work bounds you.

How act101 fits

Everything above is tool-agnostic on purpose: the loop predates any one product, and you can wire it from a drawer of single-purpose libraries. This is the one place the playbook names a tool, because act101 was built to run exactly this loop in the agent's own protocol (MCP), and it maps one-to-one onto act101's analyze → act → attest shape: audit is analyze, refactor is act, and verify is attest.

The pattern is tool-agnostic; the implementation is not. You could assemble a version of this loop from a dozen single-purpose libraries and a custom scorer — and the four prior generations of refactoring tooling are why the pattern is sound. But wiring structural analysis, deterministic transformation, and re-scoring into one thing the agent can actually drive is the work act101 has already done. The maturity gain is in running the loop; act101 is what makes running it the default instead of a project.

The maturity ladder

Most teams don't sit at one tier — they're advanced on audits and primitive on verify, or the reverse. Tick what you actually do today, not what you mean to do.

Zero to three: cowboy refactoring. Four to six: serious but porous — regressions slip through the test layer. Seven to nine: the loop is doing the work. Ten: the loop improves itself; your job becomes watching the smells trend down.

A reasonable 30 / 60 / 90-day plan

  1. Days 1–30 — get to honest. Pick one service, one module. Write an audit by hand, commit it. Run one agentic refactor against it with no gates beyond the existing CI. You're not improving anything yet; you're making the current process visible.
  2. Days 31–60 — build the verify gate. Characterization tests on the targeted module, test-diff scoring on every refactor PR, plan-before-diff as a hard step. You can now ship agentic refactors safely, even if slowly.
  3. Days 61–90 — close the loop. Trace-diff on production fixtures, public-API sign-off, audit TTLs, post-merge re-scoring. The audit's smell trend becomes a chart you watch. The loop improves itself.

What the IDE era got right (and where the loop goes from here)

The IDE-era refactoring tradition (Fowler's Refactoring, the catalogs inside IntelliJ and ReSharper, the structural tooling that grew out of the Smalltalk and Eiffel communities) was obsessed with one idea: behavior preservation by construction. A rename happened atomically across the project, an extract method was provably equivalent to the original, a move-class respected references. The tools were narrow on purpose, because narrow is what made them safe.

Agentic refactoring is broader and looser. The same operations the IDE could prove safe, the agent can only argue are safe. The whole maintainability loop exists to recover the IDE's guarantee from the outside: deterministic, AST-aware operations are the agent-era descendant of ReSharper's provably-safe rename and extract, applied through the AST rather than by hand. Audits constrain what the agent attempts, deterministic operations constrain how it attempts it, and re-scoring proves the result. Read Fowler alongside this playbook; the overlap is the part that's actually load-bearing.

The shortest possible summary: write the audit down. Bound the refactor to the audit. Verify behavior preservation as a separate stage with its own gate. Trace every run. Roll back without ceremony. That loop, run boringly for six months, is the whole game.