Testing Uplift with Agents: A Playbook

A modernization without tests is a rewrite blindfolded. Most teams know this and ship the modernization anyway, because writing characterization tests for a decade-old service is the unglamorous, expensive, person-month of work nobody volunteers for. Agents change the economics — but only if you don't let them produce the kind of tests that pass for the wrong reasons. This piece is about how to use agents to lift a real test suite into existence on top of legacy code, in a way that earns the trust the modernization is about to lean on.

If you want the verdict in one breath: coverage is not the metric, mutation score is — pin existing behavior, prove the pins fail when the code breaks, and never let the agent ship a test it can't justify with a behavior it preserved. Skip any of that and you don't have a test suite; you have a wall of green that will be just as green the day the modernization regresses prod.

Why testing uplift with agents needs its own discipline

Classic test writing assumed two things: a human understood the behavior they were pinning, and a passing test meant something useful was checked. Agents break both at once, and the failure is harder to spot than in refactoring — because a refactor produces a diff and a test produces a green check.

The bottom line: classic test uplift measured whether coverage went up. Agentic test uplift has to measure whether the suite's ability to catch regressions went up. Those are two different numbers, and the second is the one you actually want.

The uplift loop, end to end

test-audit  ->  test-generation  ->  verify-tests
        ^                                  |
        |__________________________________|
                  (audit gets re-scored after merge)

Three stages, one direction, a back-edge so the audit you started with is the audit you re-score against. The audit decides where the suite is least informative, the generation stage produces tests bounded by the audit, and verify decides whether the new tests are real tests. Skip the audit and the agent invents its own definition of "needs tests." Skip verify and you ship a wall of green with no diagnostic value.

This loop is the canonical pattern for agentic test uplift: it predates any one tool, and any team can run it. It rests on one honest division of labor. The structural half (where the suite is weak, and what the code under test actually does) is best answered by AST-aware structural analysis, not a coverage badge. The quality half (whether a test actually catches bugs) is mutation testing. Those are two different tools measuring two different things, and you run them together. Sourcing the audit's ranked hole list from structural analysis (instead of a coverage percentage), and characterizing behavior from the AST (instead of the agent re-reading whole files), is what keeps a 200-test uplift both faithful and inside a context budget. One measured note on how act101 fills the structural half comes in its own section below; everything else here is tool-agnostic.

Each stage has a uniquely agentic twist. The audit isn't a coverage report. The generation isn't a single prompt. The verify isn't npm test.

Stage 1 · test-audit

The most common reason an agentic testing pass fails isn't the model — it's that nobody told the agent where the suite was actually weakest. Coverage reports lie by construction: they show executed lines, not asserted behaviors. The audit's job is to make the weakness legible. Do four things:

  1. Pick a target boundary. Module, service, bounded context. Not "the whole repo." An audit without a boundary produces tests everywhere and confidence nowhere.
  2. Score what's wrong with the existing suite, not just what's missing. Mutation score by file, snapshot density, mock-to-assertion ratio, tests that import the SUT but never call it, tests with assertion-strength below a threshold. Numbers, not adjectives.
  3. Map the high-blast-radius holes. Uncovered files with high churn × high blame × prod-incident history. The audit ranks holes; it doesn't list them. Coupling and chokepoint analysis is what turns "uncovered" into "uncovered and depended on by half the service" — the ranking signal, not just the coverage gap.
  4. Write the audit to disk. A versioned artifact, in git, with a SHA. The generation stage cites it. The verify stage rescores against it.
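To make step 3 concrete, here is one way the ranking could be sketched. The `Hole` fields, the multiplicative score, and every number except the refund file's six incidents are assumptions made for illustration, not a formula from any tool. The only point is that holes come out ordered, not listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hole:
    """One uncovered file plus the risk signals the audit collected for it."""
    path: str
    commits_last_year: int   # churn
    distinct_authors: int    # blame spread
    prod_incidents: int      # incident history
    dependents: int          # fan-in from coupling analysis


def blast_radius(hole: Hole) -> int:
    # Multiplying (rather than adding) means a file has to be risky on several
    # axes at once to reach the top. The +1 keeps a single zero from erasing
    # the other signals. Both choices are illustrative assumptions.
    return (
        (hole.commits_last_year + 1)
        * (hole.distinct_authors + 1)
        * (hole.prod_incidents + 1)
        * (hole.dependents + 1)
    )


def rank_holes(holes: list[Hole]) -> list[Hole]:
    """Return holes ordered from highest to lowest blast radius."""
    return sorted(holes, key=blast_radius, reverse=True)


# Sample data: only the refund file's incident count comes from the audit
# example; the rest is hypothetical.
holes = [
    Hole("pricing/format.go", commits_last_year=3, distinct_authors=1,
         prod_incidents=0, dependents=2),
    Hole("payment/refund.go", commits_last_year=40, distinct_authors=6,
         prod_incidents=6, dependents=12),
    Hole("inventory/sku.go", commits_last_year=15, distinct_authors=4,
         prod_incidents=1, dependents=5),
]

for hole in rank_holes(holes):
    print(f"{blast_radius(hole):>6}  {hole.path}")
```

Whatever formula you settle on, commit it next to the audit so the ranking is reproducible at the same SHA.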

Coverage is a side-effect metric. Track it, don't optimize for it. The metric the audit cares about is mutation score by module — the percentage of synthetic bugs the current suite would catch. Everything below 50% on a critical path is a hole, no matter what the coverage badge says.
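The arithmetic behind that number is small: killed mutants over total mutants, per module. A minimal sketch, assuming a hypothetical list of per-mutant results (real mutation tools each have their own report format that would need mapping into this shape) and a hand-picked critical-path set:

```python
from collections import defaultdict

# Hypothetical per-mutant results: (module, killed_by_suite).
results = [
    ("payment", True), ("payment", False), ("payment", False),
    ("payment", False), ("payment", False),
    ("pricing", True), ("pricing", True), ("pricing", True), ("pricing", False),
]

CRITICAL_PATH = {"payment"}   # assumption: chosen by the team, not computed
HOLE_THRESHOLD = 0.50         # the 50% line from the paragraph above


def mutation_score_by_module(results):
    """Fraction of mutants killed by the suite, per module."""
    killed = defaultdict(int)
    total = defaultdict(int)
    for module, was_killed in results:
        total[module] += 1
        killed[module] += int(was_killed)
    return {module: killed[module] / total[module] for module in total}


scores = mutation_score_by_module(results)
critical_holes = sorted(
    module for module, score in scores.items()
    if module in CRITICAL_PATH and score < HOLE_THRESHOLD
)
print(scores)
print("holes:", critical_holes)
```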

# audits/checkout-suite.2026-05-30.yaml — one file, in git, one SHA.
audit:
  target: services/checkout
  scored_on: 2026-05-30
  signal:
    line_coverage: 0.42
    branch_coverage: 0.31
    mutation_score: 0.18         # the real number
    assertion_strength: 0.55
  holes:
    - id: payment.refund-path
      kind: zero-coverage-high-risk
      evidence: payment/refund.go, 240 LOC, 0 tests, 6 prod incidents in 12mo
      severity: high
    - id: inventory.snapshot-wallpaper
      kind: snapshots-without-assertions
      evidence: 18 snapshot tests that import SKU but never exercise it
      severity: medium
    - id: order.mock-saturated
      kind: implementation-pinned
      evidence: order_test.go — 14 mocks, asserts on mock calls not output
      severity: medium
    - id: pricing.assertion-strength
      kind: weak-assertions
      evidence: 22 tests using assertNotNull / assertTruthy on rich objects
      severity: low
  out_of_scope:
    - rewriting integration tests that hit the network
    - replacing the e2e harness
  budgets:
    max_tests_added: 200
    max_runtime_added_ms: 30000
    max_runs: 3

The single biggest thing most teams skip: scoring the quality of the existing suite, not just its size. A 90%-coverage suite with a 12% mutation score is a worse starting point than a 40%-coverage suite with a 65% mutation score, because the first one has trained the team to trust a green CI run that means almost nothing. The audit has to surface that gap before the generation stage gets the floor.

Stage 2 · test-generation

This is the stage everyone thinks is the whole job, and it's actually the easiest one once the audit is real. Generation doesn't decide what to test; it executes the audit. If the agent proposes a test that doesn't address a named hole, that's a planning failure — fix the audit, don't ship the test.

Characterize through the AST rather than by re-reading files. Reference, caller, and type queries return exactly what the code under test touches and what touches it, so the pinned behavior is the real contract and not the agent's summary of a file it skimmed — and the surgical queries, instead of brute-force reads, are also what keep a 200-test uplift from drowning the context window. Faithful characterization and token discipline are the same move here.
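To give a feel for what a caller query returns, here is a small sketch that uses Python's standard `ast` module as a stand-in for whatever structural engine you run. The sample source is hypothetical, and the query only sees direct calls by plain name within one file; a real engine would resolve methods, imports, and types across the codebase.

```python
import ast

# Hypothetical code under test.
SOURCE = '''
def round_cents(amount):
    return int(amount * 100 + 0.5) / 100

def refund(order):
    return round_cents(order["paid"] - order["shipped"])

def report(orders):
    return [refund(o) for o in orders]

def unrelated():
    return 42
'''


def callers_of(source: str, target: str) -> list[str]:
    """Names of top-level functions that directly call `target`."""
    found = []
    for func in ast.parse(source).body:
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id == target):
                found.append(func.name)
                break
    return found


def callees_of(source: str, name: str) -> list[str]:
    """Plain names called directly inside the function called `name`."""
    for func in ast.parse(source).body:
        if isinstance(func, ast.FunctionDef) and func.name == name:
            return sorted({
                n.func.id for n in ast.walk(func)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            })
    return []


print("callers of refund:", callers_of(SOURCE, "refund"))
print("callees of refund:", callees_of(SOURCE, "refund"))
```

The agent gets two short lists back and never has to load `unrelated` into context at all, which is the token-discipline half of the argument.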

The shape that works:

  1. One hole, one PR. A PR that closes one named hole is reviewable. A PR that closes seven is a test-suite drop, not a code change. Reviewers wave drops through.
  2. Default to characterization. The agent writes tests against the current behavior of the legacy code, with the explicit understanding that the current behavior may include bugs. Specification tests come later, after a human decides which behaviors are intended.
  3. Pick the right shape for each hole. Characterization for legacy behavior. Property checks for invariants (parsing, hashing, idempotency, ordering). Integration for module seams. Example-based unit tests last, not first.
  4. No new snapshot tests by default. Snapshots are allowed only against a behavior the agent can name in one sentence in the PR. "Pins the JSON shape of /api/v2/orders/:id" passes. "Pins the output of formatPrice" doesn't.
  5. Mocks are budgeted, not free. The audit gives the PR a mock budget. Over budget means the agent is pinning implementation; revise the plan.
  6. The test cites the hole. Each test references the audit hole ID it addresses. Tests that don't cite anything get cut before merge.
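Rules 2 and 6 together might look like the sketch below. `legacy_refund` is a hypothetical stand-in for the code under test, and the `addresses` decorator is one invented way to attach a hole ID to a test; neither comes from a real library.

```python
import unittest
from decimal import Decimal


def legacy_refund(paid: Decimal, shipped: Decimal, already_refunded: bool) -> Decimal:
    """Hypothetical stand-in for the legacy code under test."""
    if already_refunded:
        return Decimal("0.00")
    # Quirk: a negative balance is returned as-is rather than clamped to zero.
    return (paid - shipped).quantize(Decimal("0.01"))


def addresses(hole_id: str):
    """Attach the audit hole ID to a test so a gate can check for it later."""
    def mark(test_func):
        test_func.hole_id = hole_id
        return test_func
    return mark


class RefundCharacterization(unittest.TestCase):
    @addresses("payment.refund-path")
    def test_partially_shipped_order_refunds_the_difference(self):
        self.assertEqual(
            legacy_refund(Decimal("100.00"), Decimal("40.00"), False),
            Decimal("60.00"),
        )

    @addresses("payment.refund-path")
    def test_negative_balance_is_not_clamped(self):
        # Pins CURRENT behavior; whether it is intended is a human decision.
        self.assertEqual(
            legacy_refund(Decimal("10.00"), Decimal("25.00"), False),
            Decimal("-15.00"),
        )

    @addresses("payment.refund-path")
    def test_refunding_a_refunded_order_is_a_no_op(self):
        self.assertEqual(
            legacy_refund(Decimal("100.00"), Decimal("40.00"), True),
            Decimal("0.00"),
        )


def uncited_tests(case: type) -> list[str]:
    """Test methods on `case` that carry no hole ID."""
    return [
        name for name in unittest.TestLoader().getTestCaseNames(case)
        if not hasattr(getattr(case, name), "hole_id")
    ]


suite = unittest.TestLoader().loadTestsFromTestCase(RefundCharacterization)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("uncited:", uncited_tests(RefundCharacterization))
```

Note the second test: it pins a behavior that may well be a bug. That is the characterization contract. The test name says what the code does today, and the decision about what it should do stays with a human.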

One uplift = one declarative artifact. Audit SHA, hole IDs being addressed, approach (characterization / property / integration), pinned behaviors, mock budget used. All in the PR. Reviewers can read the artifact in a minute and know whether the tests are real.

# uplifts/2026-05-30-payment-refund-characterization.yaml
uplift:
  audit: audits/checkout-suite.2026-05-30.yaml@a1b2c3d
  addresses: [payment.refund-path]
  approach: characterization
  pins:
    - behavior: refund of partially-shipped order
    - behavior: refund with negative balance
    - behavior: refund of refunded order is idempotent
    - behavior: refund precision matches storage precision (no float drift)
  mocks_used: 0                # this one's a pure characterization pass
  budgets_used:
    tests_added: 34
    runtime_added_ms: 4200
    runs: 1
  verify:
    plan: verify/2026-05-30-payment-refund-path.yaml
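Because the artifact is declarative, a reviewer's first pass can be automated. The sketch below checks that an uplift links up with its audit. Both artifacts are shown as already-parsed dicts to stay dependency-free, and the `sha` key, the `ALLOWED_APPROACHES` set, and the function name are assumptions for illustration.

```python
# Already-parsed stand-ins for the two YAML files above.
audit = {
    "sha": "a1b2c3d",
    "holes": {
        "payment.refund-path",
        "inventory.snapshot-wallpaper",
        "order.mock-saturated",
        "pricing.assertion-strength",
    },
}

uplift = {
    "audit": "audits/checkout-suite.2026-05-30.yaml@a1b2c3d",
    "addresses": ["payment.refund-path"],
    "approach": "characterization",
    "pins": [
        {"behavior": "refund of partially-shipped order"},
        {"behavior": "refund with negative balance"},
    ],
}

ALLOWED_APPROACHES = {"characterization", "property", "integration"}


def artifact_problems(uplift: dict, audit: dict) -> list[str]:
    """Return readable problems; an empty list means the artifact links up."""
    problems = []
    _path, _, cited_sha = uplift["audit"].partition("@")
    if not cited_sha:
        problems.append("audit reference has no SHA")
    elif cited_sha != audit["sha"]:
        problems.append(f"cites {cited_sha}, audit is at {audit['sha']}")
    for hole_id in uplift["addresses"]:
        if hole_id not in audit["holes"]:
            problems.append(f"unknown hole: {hole_id}")
    if uplift["approach"] not in ALLOWED_APPROACHES:
        problems.append(f"unknown approach: {uplift['approach']}")
    for pin in uplift["pins"]:
        if not pin.get("behavior", "").strip():
            problems.append("pin without a named behavior")
    return problems


print(artifact_problems(uplift, audit))
```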

Stage 3 · verify-tests

Verify is the single highest-leverage stage in the loop, and the one teams skip first because it looks redundant — "we wrote tests, didn't we?" Yes, and the question verify answers is whether those tests are tests. Without it, a generation pass is a wall of green nobody trusts. With it, you catch the four ways agentic tests pretend to be useful.

Verify has to do more than run the new tests against the current code (that always passes, because the agent wrote them to). It needs four signals, and the agentic ones aren't optional.

Don't ship a single number. Verify reports per dimension: mutation score, assertion strength, duplicate fraction, runtime delta. "Tests pass" hides the case where 200 new tests caught 4% more mutations. Score per dimension, with a threshold each.

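# verify/run.py: a tiny but real verify gate.
# Assumes a project-local `verify` package exposing the four scorers below;
# run it from the repo root (e.g. `python -m verify.run`) so the import resolves.
import sys

from verify import assertions, dedup, mutation, runtime

TARGET = "services/checkout"

# Every score is normalized to [0, 1] so that higher is always better.
scores = {
    "mutation_score":      mutation.run(TARGET, against="HEAD"),
    "assertion_strength":  assertions.score(TARGET),
    "non_duplicates":      1 - dedup.fraction(TARGET),
    "runtime_headroom":    1 - runtime.delta_fraction(TARGET),
}

# One threshold per dimension; there is deliberately no aggregate score.
THRESHOLDS = {
    "mutation_score":      0.70,
    "assertion_strength":  0.80,
    "non_duplicates":      0.95,
    "runtime_headroom":    0.80,   # at most 20% slower
}

# Collect every dimension that fell below its own threshold.
failed = {k: scores[k] for k, t in THRESHOLDS.items() if scores[k] < t}
if failed:
    print("VERIFY GATE FAILED:", failed)
    sys.exit(1)
print("VERIFY GATE PASSED:", scores)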

The merge gate

The gate is what stops a generation PR from inflating the suite without raising its informativeness. The minimum that earns its keep:

  1. Audit SHA present and unmodified. The PR cites a real audit at a real revision. The audit hasn't been edited in the same PR.
  2. One PR, one hole. PRs addressing more than one hole are auto-split, smallest first.
  3. Every new test cites a hole. Tests with no audit reference are cut before merge.
  4. Verify gate green on every dimension. Mutation, assertion strength, duplicates, runtime — all individually green. No "overall pass."
  5. Snapshot tests carry a one-sentence justification. Or they're not snapshots; they're walls.

A PR with all five is mergeable. A PR with four of five is interesting and goes back to whichever stage owns the missing one.
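The five checks are mechanical enough to encode. A minimal sketch follows, assuming hypothetical `UpliftPR` and `NewTest` records that your CI would populate from the PR and the verify report. The field names and the dimension keys are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

DIMENSIONS = ("mutation", "assertion_strength", "duplicates", "runtime")


@dataclass
class NewTest:
    name: str
    hole_id: Optional[str] = None
    is_snapshot: bool = False
    justification: str = ""


@dataclass
class UpliftPR:
    audit_sha: Optional[str]
    touches_audit_file: bool
    holes_addressed: list[str]
    tests: list[NewTest]
    verify: dict[str, bool] = field(default_factory=dict)  # dimension -> green?


def gate_failures(pr: UpliftPR) -> list[str]:
    """Return the numbered checks this PR fails; empty means mergeable."""
    failures = []
    if not pr.audit_sha or pr.touches_audit_file:
        failures.append("1: audit SHA missing or audit edited in this PR")
    if len(pr.holes_addressed) != 1:
        failures.append("2: PR must address exactly one hole")
    if any(t.hole_id is None for t in pr.tests):
        failures.append("3: test without a hole citation")
    # A missing dimension counts as red: there is no "overall pass".
    red = [d for d in DIMENSIONS if not pr.verify.get(d, False)]
    if red:
        failures.append("4: verify not green on " + ", ".join(red))
    if any(t.is_snapshot and not t.justification.strip() for t in pr.tests):
        failures.append("5: snapshot without a one-sentence justification")
    return failures


pr = UpliftPR(
    audit_sha="a1b2c3d",
    touches_audit_file=False,
    holes_addressed=["payment.refund-path"],
    tests=[
        NewTest("test_refund_partial", hole_id="payment.refund-path"),
        NewTest("test_orders_json", hole_id="payment.refund-path",
                is_snapshot=True),
    ],
    verify={d: True for d in DIMENSIONS},
)
print(gate_failures(pr))
```

The sample PR is the "four of five" case: everything is green except an unjustified snapshot, so it goes back to the generation stage with a specific check number attached.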

Best practices, in plain English

Failure modes & gotchas

These have actually taken teams down. Every one has a one-week fix that nobody had time for.

The gotcha behind half of them: the new tests have no recorded intent. An audit-hole reference plus a one-sentence pinned behavior in the artifact eliminates most of this list before it ships.

Cost, test-suite weight, and the agent budget

Cost isn't a finance problem — it's a CI-tax problem in disguise. An agent allowed to generate without bounds will inflate the suite by an order of magnitude, slow CI to a crawl, and produce a wall of green that hides the few new tests that actually catch things. Bound the work or the work bounds you.
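One way to bound the work is to treat the audit's `budgets` as a shared ceiling that every uplift PR draws down. That reading is an assumption (the budgets could equally be applied per PR), and the second PR's numbers below are hypothetical.

```python
# Ceilings from the audit artifact's `budgets` block.
BUDGETS = {"max_tests_added": 200, "max_runtime_added_ms": 30000, "max_runs": 3}


def over_budget(budgets: dict, used_per_pr: list[dict]) -> list[str]:
    """Budget keys whose cumulative spend exceeds the audit's ceiling."""
    exceeded = []
    for key in ("tests_added", "runtime_added_ms", "runs"):
        spent = sum(used.get(key, 0) for used in used_per_pr)
        if spent > budgets[f"max_{key}"]:
            exceeded.append(key)
    return exceeded


first_pr = {"tests_added": 34, "runtime_added_ms": 4200, "runs": 1}    # from the uplift artifact
second_pr = {"tests_added": 180, "runtime_added_ms": 9000, "runs": 1}  # hypothetical

print(over_budget(BUDGETS, [first_pr]))
print(over_budget(BUDGETS, [first_pr, second_pr]))
```

When a key shows up in the result, the next move is to revisit the plan, not to raise the ceiling in the same PR.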

How act101 implements the loop

The loop is the canon; act101 is the implementation — with one division of labor stated plainly. act101 owns the structural half: where the suite is weak and what the code under test actually does. The quality half — does a test catch bugs — is mutation testing, which act101 does not do and does not pretend to. You run them together.

Mutation testing is the irreplaceable quality gate, and it's external; act101 is the irreplaceable structural map and surgical surface that makes the agent's tests faithful in the first place. The pattern is tool-agnostic, but its structural half is a single engine the agent already speaks to rather than a pile of language-specific scripts. The maturity gain is in running the loop — act101 is what makes that affordable.

The maturity ladder

Most teams don't sit at one tier — they're advanced on coverage and primitive on mutation, or have a great property library and no audit. Tick what you actually do today, not what you mean to do.

Zero to three: coverage theatre. Four to six: a real suite is forming but porous — implementation-pinned tests are slowing every refactor down. Seven to nine: the suite is informative; modernization is now safe to start. Ten: the loop is improving itself; mutation score trends up while suite runtime stays flat.

A reasonable 30 / 60 / 90-day plan

  1. Days 1–30 — get to honest. Pick one service. Run mutation testing on it; publish the score. Stop reporting line coverage to leadership. Write one audit by hand and commit it. You're not adding tests yet; you're making the suite's current weakness visible.
  2. Days 31–60 — build the verify gate. Mutation score and assertion strength as merge gates on the targeted module. Snapshot justification policy. Mock budgets. You can now ship agentic uplift PRs safely, even if slowly.
  3. Days 61–90 — close the loop. Audit TTLs and post-merge re-scoring. Runtime delta gate. Suite weight ceilings per module. Mutation score becomes a chart that trends; modernization on this service is now a credible next step.
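The two mechanisms in the last step, audit TTLs and post-merge re-scoring, are both a few lines of arithmetic. A sketch under stated assumptions: the 30-day TTL is an arbitrary placeholder, and the post-merge scores are hypothetical; only the "before" signal block comes from the audit example.

```python
from datetime import date, timedelta

AUDIT_TTL = timedelta(days=30)   # assumption: tune this to your release cadence


def audit_is_stale(scored_on: date, today: date,
                   ttl: timedelta = AUDIT_TTL) -> bool:
    """An audit older than its TTL should be re-scored before it is cited."""
    return today - scored_on > ttl


def rescore_delta(before: dict, after: dict) -> dict:
    """Per-signal change between the audit's scores and a post-merge re-score."""
    return {k: round(after[k] - before[k], 4) for k in before if k in after}


before = {"line_coverage": 0.42, "branch_coverage": 0.31,
          "mutation_score": 0.18, "assertion_strength": 0.55}
after = {"line_coverage": 0.47, "branch_coverage": 0.36,     # hypothetical
         "mutation_score": 0.41, "assertion_strength": 0.70}

print(audit_is_stale(date(2026, 5, 30), today=date(2026, 6, 15)))
print(audit_is_stale(date(2026, 5, 30), today=date(2026, 7, 15)))
print(rescore_delta(before, after))
```

Plotting the `mutation_score` delta after every merge is what turns the number into the trend chart the plan calls for.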

What Feathers got right (and the handoff to modernization)

Michael Feathers' Working Effectively with Legacy Code defined legacy code, simply, as code without tests. The book's whole method is: write characterization tests first, then change anything. The technique is twenty years old, and the reason it hasn't been universally adopted isn't that anyone disagrees; it's that writing characterization tests by hand on a million-line system is a person-year of work nobody funds.

Agentic uplift is what makes Feathers' technique tractable at the scale the modernization actually needs: the audit picks the boundary from structural analysis, the generation stage pins behavior through AST queries rather than the agent's recollection, and the verify stage proves the pins are real with mutation testing alongside. Read Feathers alongside this playbook; the overlap is the part that's actually load-bearing — a surgical, language-agnostic structural surface is what turns a person-year of characterization into a loop you can run boringly.

The handoff to modernization is direct: the characterization suite produced by this loop is the contract that a modernization is measured against. Modernization, properly run, is "change the implementation arbitrarily, keep this suite green." Without the suite, modernization is a rewrite. With it, modernization is a refactor at a different altitude. That's the topic of the next playbook.

The shortest possible summary: stop watching coverage; start watching mutation score. Audit the suite, not just the code. Pin behavior first; specify intent later. Verify is mutation testing, not npm test. Trace every run. Roll back flakes without ceremony. That loop, run boringly for six months, is the step zero modernization has been waiting for.