Testing Uplift with Agents: A Playbook
A modernization without tests is a rewrite blindfolded. Most teams know this and ship the modernization anyway, because writing characterization tests for a decade-old service is the unglamorous, expensive, person-month of work nobody volunteers for. Agents change the economics — but only if you don't let them produce the kind of tests that pass for the wrong reasons. This piece is about how to use agents to lift a real test suite into existence on top of legacy code, in a way that earns the trust the modernization is about to lean on.
If you want the verdict in one breath: coverage is not the metric, mutation score is — pin existing behavior, prove the pins fail when the code breaks, and never let the agent ship a test it can't justify with a behavior it preserved. Skip any of that and you don't have a test suite; you have a wall of green that will be just as green the day the modernization regresses prod.
Why testing uplift with agents needs its own discipline
Classic test writing assumed two things: a human understood the behavior they were pinning, and a passing test meant something useful was checked. Agents break both at once, and the failure is harder to spot than in refactoring — because a refactor produces a diff and a test produces a green check.
- Agents generate tests that pass by construction. The agent reads the code, writes a test that calls the code, asserts what the code returned. Of course it passes. That's not a test; it's a snapshot of today's bug.
- Coverage goes up; informativeness doesn't. A thousand new tests can move line coverage twenty points without catching a single regression that wasn't already caught. The number on the dashboard rises while the suite's diagnostic power stays flat.
- Snapshot tests look like assertions. A toMatchSnapshot() next to forty bytes of JSON is indistinguishable, from the dashboard, from a real invariant check. It locks in current output; it tells you nothing about correctness.
- Mock-saturated tests pin implementation, not behavior. An agent that mocks the dependency, then asserts the mock was called, has written a test that breaks the day you refactor the call site and passes the day the call site silently returns the wrong answer.
- Generated tests outlive their context. The agent emitted them in one session, against one version of the code, with a plan that lives only in chat history. Six months later, when a test starts failing, nobody knows whether it's pinning a real invariant or an accident of the day it was generated.
The bottom line: classic test uplift measured whether coverage went up. Agentic test uplift has to measure whether the suite's ability to catch regressions went up — and those are two completely different numbers, with the second one being the one you actually want.
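To make the gap between "passes" and "catches regressions" concrete, here is a minimal sketch. The refund function, its mutant, and both tests are hypothetical stand-ins, not code from any real service; the mutant is the kind of synthetic bug a mutation tool would inject.

```python
# Hypothetical code under test: refund whatever was paid for but not shipped.
def refund_amount(paid_cents: int, shipped_cents: int) -> int:
    return max(paid_cents - shipped_cents, 0)


# A synthetic bug of the kind a mutation tool injects: '-' flipped to '+'.
def refund_amount_mutant(paid_cents: int, shipped_cents: int) -> int:
    return max(paid_cents + shipped_cents, 0)


def weak_test(fn) -> bool:
    """Pass-by-construction style: only checks that something came back."""
    return fn(1000, 400) is not None


def strong_test(fn) -> bool:
    """Pins two named behaviors: partial shipment and over-shipment."""
    return fn(1000, 400) == 600 and fn(400, 1000) == 0


def kills(test, mutant) -> bool:
    """A test kills a mutant when it fails on the mutated code."""
    return not test(mutant)
```

Both tests are green on the real function and both add the same line coverage. Only the second one fails when the code breaks, which is the number the rest of this playbook gates on.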
The uplift loop, end to end
test-audit -> test-generation -> verify-tests
    ^                                  |
    |__________________________________|
      (audit gets re-scored after merge)
Three stages, one direction, a back-edge so the audit you started with is the audit you re-score against. The audit decides where the suite is least informative, the generation stage produces tests bounded by the audit, and verify decides whether the new tests are real tests. Skip the audit and the agent invents its own definition of "needs tests." Skip verify and you ship a wall of greenness with no diagnostic value.
This loop is the canonical pattern for agentic test uplift: it predates any one tool, and any team can run it. It rests on one honest division of labor. The structural half (where the suite is weak, and what the code under test actually does) is best answered by AST-aware structural analysis, not a coverage badge. The quality half (whether a test actually catches bugs) is mutation testing. Those are two different tools measuring two different things, and you run them together. Sourcing the audit's ranked hole list from structural analysis (instead of a coverage percentage), and characterizing behavior from the AST (instead of the agent re-reading whole files), is what keeps a 200-test uplift both faithful and inside a context budget. One measured note on how act101 fills the structural half comes in its own section below; everything else here is tool-agnostic.
Each stage has a uniquely agentic twist. The audit isn't a coverage report. The generation isn't a single prompt. The verify isn't npm test.
Stage 1 · test-audit
The most common reason an agentic testing pass fails isn't the model — it's that nobody told the agent where the suite was actually weakest. Coverage reports lie by construction: they show executed lines, not asserted behaviors. The audit's job is to make the weakness legible. Do four things:
- Pick a target boundary. Module, service, bounded context. Not "the whole repo." An audit without a boundary produces tests everywhere and confidence nowhere.
- Score what's wrong with the existing suite, not just what's missing. Mutation score by file, snapshot density, mock-to-assertion ratio, tests that import the SUT but never call it, tests with assertion-strength below a threshold. Numbers, not adjectives.
- Map the high-blast-radius holes. Uncovered files with high churn × high blame × prod-incident history. The audit ranks holes; it doesn't list them. Coupling and chokepoint analysis is what turns "uncovered" into "uncovered and depended on by half the service" — the ranking signal, not just the coverage gap.
- Write the audit to disk. A versioned artifact, in git, with a SHA. The generation stage cites it. The verify stage rescores against it.
Coverage is a side-effect metric. Track it, don't optimize for it. The metric the audit cares about is mutation score by module — the percentage of synthetic bugs the current suite would catch. Everything below 50% on a critical path is a hole, no matter what the coverage badge says.
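The ranking step can be sketched in a few lines. This is an illustration under stated assumptions: the Module fields, the use of distinct-author count as a stand-in for "blame", and the multiplicative blast-radius formula are all hypothetical choices. Only the 50% critical-path threshold and the ten-hole cap come from this playbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    path: str
    mutation_score: float  # fraction of mutants killed, 0.0 to 1.0
    churn: int             # commits touching the file in the window
    authors: int           # distinct authors, an assumed stand-in for "blame"
    incidents: int         # prod incidents attributed to the file
    critical_path: bool


HOLE_THRESHOLD = 0.50  # below 50% on a critical path counts as a hole


def blast_radius(m: Module) -> int:
    # +1 on each factor so a zero in one signal doesn't zero out the product.
    return (m.churn + 1) * (m.authors + 1) * (m.incidents + 1)


def ranked_holes(modules, limit: int = 10):
    """Return critical-path modules under the threshold, worst first."""
    holes = [
        m for m in modules
        if m.critical_path and m.mutation_score < HOLE_THRESHOLD
    ]
    return sorted(holes, key=blast_radius, reverse=True)[:limit]
```

The output is a ranked, capped list rather than an inventory, which is what lets the generation stage take one hole at a time.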
# audits/checkout-suite.2026-05-30.yaml (one file, in git, one SHA)
audit:
  target: services/checkout
  scored_on: 2026-05-30
  signal:
    line_coverage: 0.42
    branch_coverage: 0.31
    mutation_score: 0.18  # the real number
    assertion_strength: 0.55
  holes:
    - id: payment.refund-path
      kind: zero-coverage-high-risk
      evidence: payment/refund.go, 240 LOC, 0 tests, 6 prod incidents in 12mo
      severity: high
    - id: inventory.snapshot-wallpaper
      kind: snapshots-without-assertions
      evidence: 18 snapshot tests that import SKU but never exercise it
      severity: medium
    - id: order.mock-saturated
      kind: implementation-pinned
      evidence: order_test.go, 14 mocks, asserts on mock calls not output
      severity: medium
    - id: pricing.assertion-strength
      kind: weak-assertions
      evidence: 22 tests using assertNotNull / assertTruthy on rich objects
      severity: low
  out_of_scope:
    - rewriting integration tests that hit the network
    - replacing the e2e harness
  budgets:
    max_tests_added: 200
    max_runtime_added_ms: 30000
    max_runs: 3
The single biggest thing most teams skip: scoring the quality of the existing suite, not just its size. A 90%-coverage suite with a 12% mutation score is a worse starting point than a 40%-coverage suite at 65%, because the first one has trained the team to trust a green CI run that means almost nothing. The audit has to surface that gap before the generation stage gets the floor.
Stage 2 · test-generation
This is the stage everyone thinks is the whole job, and it's actually the easiest one once the audit is real. Generation doesn't decide what to test; it executes the audit. If the agent proposes a test that doesn't address a named hole, that's a planning failure — fix the audit, don't ship the test.
Characterize through the AST rather than by re-reading files. Reference, caller, and type queries return exactly what the code under test touches and what touches it, so the pinned behavior is the real contract and not the agent's summary of a file it skimmed. Those same targeted queries, used in place of brute-force file reads, are what keep a 200-test uplift from drowning the context window. Faithful characterization and token discipline are the same move here.
The shape that works:
- One hole, one PR. A PR that closes one named hole is reviewable. A PR that closes seven is a test-suite drop, not a code change. Reviewers wave drops through.
- Default to characterization. The agent writes tests against the current behavior of the legacy code, with the explicit understanding that the current behavior may include bugs. Specification tests come later, after a human decides which behaviors are intended.
- Pick the right shape for each hole. Characterization for legacy behavior. Property checks for invariants (parsing, hashing, idempotency, ordering). Integration for module seams. Example-based unit tests last, not first.
- No new snapshot tests by default. Snapshots are allowed only against a behavior the agent can name in one sentence in the PR. "Pins the JSON shape of /api/v2/orders/:id" passes. "Pins the output of formatPrice" doesn't.
- Mocks are budgeted, not free. The audit gives the PR a mock budget. Over budget means the agent is pinning implementation; revise the plan.
- The test cites the hole. Each test references the audit hole ID it addresses. Tests that don't cite anything get cut before merge.
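A minimal sketch of what two of these shapes can look like side by side: one characterization test and one property check, each citing its hole in the docstring. The legacy_refund function is a hypothetical stand-in for the code under test, and the docstring citation format is an assumption, not an established convention.

```python
import random
from decimal import Decimal


# Hypothetical legacy function standing in for the code under test.
def legacy_refund(order_total: Decimal, already_refunded: Decimal) -> Decimal:
    remaining = order_total - already_refunded
    return remaining if remaining > 0 else Decimal("0.00")


def test_refund_of_refunded_order_is_zero():
    """hole: payment.refund-path | pins: refund of refunded order is idempotent"""
    total = Decimal("49.90")
    first = legacy_refund(total, Decimal("0.00"))
    second = legacy_refund(total, first)
    # Characterization: pin what the code does today, exact values included.
    assert first == Decimal("49.90")
    assert second == Decimal("0.00")


def test_refund_stays_between_zero_and_total():
    """hole: payment.refund-path | property: 0 <= refund <= order total"""
    rng = random.Random(0)  # fixed seed keeps the test deterministic
    for _ in range(500):
        total = Decimal(rng.randint(0, 100_000)) / 100
        refunded = Decimal(rng.randint(0, 120_000)) / 100
        refund = legacy_refund(total, refunded)
        assert Decimal("0") <= refund <= total
```

The property check uses a seeded generator from the standard library to stay dependency-free; a dedicated property-testing library would add shrinking and better input coverage.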
One uplift = one declarative artifact. Audit SHA, hole IDs being addressed, approach (characterization / property / integration), pinned behaviors, mock budget used. All in the PR. Reviewers can read the artifact in a minute and know whether the tests are real.
# uplifts/2026-05-30-payment-refund-characterization.yaml
uplift:
  audit: audits/checkout-suite.2026-05-30.yaml@a1b2c3d
  addresses: [payment.refund-path]
  approach: characterization
  pins:
    - behavior: refund of partially-shipped order
    - behavior: refund with negative balance
    - behavior: refund of refunded order is idempotent
    - behavior: refund precision matches storage precision (no float drift)
  mocks_used: 0  # this one's a pure characterization pass
  budgets_used:
    tests_added: 34
    runtime_added_ms: 4200
    runs: 1
  verify:
    plan: verify/2026-05-30-payment-refund-path.yaml
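An artifact like this is only useful if something checks it. The sketch below validates an uplift record against the audit's hole IDs. It assumes the YAML has already been loaded into plain dicts by whatever parser the team uses; the function name and the problem messages are hypothetical.

```python
APPROACHES = {"characterization", "property", "integration"}


def validate_uplift(uplift: dict, audit_holes: set) -> list:
    """Return a list of problems; an empty list means the artifact is well-formed."""
    problems = []

    # The audit reference must carry both a path and a revision.
    path, _, sha = uplift.get("audit", "").partition("@")
    if not path or not sha:
        problems.append("audit reference must be <path>@<sha>")

    addresses = uplift.get("addresses", [])
    if len(addresses) != 1:
        problems.append("an uplift addresses exactly one hole")
    unknown = [h for h in addresses if h not in audit_holes]
    if unknown:
        problems.append(f"holes not in the audit: {unknown}")

    if uplift.get("approach") not in APPROACHES:
        problems.append("approach must be characterization, property, or integration")
    if not uplift.get("pins"):
        problems.append("no pinned behaviors listed")
    return problems


# Sample data mirroring the two artifacts shown in this playbook.
AUDIT_HOLES = {
    "payment.refund-path",
    "inventory.snapshot-wallpaper",
    "order.mock-saturated",
    "pricing.assertion-strength",
}
UPLIFT = {
    "audit": "audits/checkout-suite.2026-05-30.yaml@a1b2c3d",
    "addresses": ["payment.refund-path"],
    "approach": "characterization",
    "pins": [{"behavior": "refund of partially-shipped order"}],
}
```

Run as a pre-review check, this turns "reviewers can read the artifact in a minute" into something a reviewer never has to do for a malformed one.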
Stage 3 · verify-tests
Verify is the single highest-leverage stage in the loop, and the one teams skip first because it looks redundant — "we wrote tests, didn't we?" Yes, and the question verify answers is whether those tests are tests. Without it, a generation pass is a wall of green nobody trusts. With it, you catch the four ways agentic tests pretend to be useful.
Verify has to do more than run the new tests against the current code (that always passes — the agent wrote them to). It needs four signals, and the agentic ones aren't optional:
- Mutation testing. Inject synthetic bugs into the code under test; rerun the suite. A real test fails on most mutations. A snapshot or a mock-only test fails on almost none. This is the metric. Anything below 70% on the targeted module fails the gate.
- Assertion-strength scoring. A linter scores every new test's assertions. assertEqual on a rich value is strong. assertNotNull, assertTruthy, and assertContains with one expected key are weak. A PR whose new tests skew weak goes back.
- Duplicate detection. AST-similarity over the suite catches agents that generated four tests for the same path with different names. Duplicates inflate the count without raising informativeness, and structural similarity, not test-name matching, is the only thing that sees them.
- Runtime delta. New tests get a runtime budget. A 12-minute test suite that becomes a 22-minute suite over a quarter is a CI tax nobody approved. The audit caps this; verify enforces it.
Don't ship a single number. Verify reports per dimension: mutation score, assertion strength, duplicate fraction, runtime delta. "Tests pass" hides the case where 200 new tests caught 4% more mutations. Score per dimension, with a threshold each.
# verify/run.py: a tiny but real verify gate.
# Assumes a local `verify` package exposing the four scorers imported below;
# run it as `python -m verify.run` from the repo root so the package resolves.
import sys

from verify import mutation, assertions, dedup, runtime

target = "services/checkout"

# Every score is normalized so that higher is better.
scores = {
    "mutation_score": mutation.run(target, against="HEAD"),
    "assertion_strength": assertions.score(target),
    "non_duplicates": 1 - dedup.fraction(target),
    "runtime_overhead": 1 - runtime.delta_fraction(target),
}

THRESHOLDS = {
    "mutation_score": 0.70,
    "assertion_strength": 0.80,
    "non_duplicates": 0.95,
    "runtime_overhead": 0.80,  # at most 20% slower
}

# Report each failing dimension on its own; there is no overall pass.
failed = {k: scores[k] for k, t in THRESHOLDS.items() if scores[k] < t}
if failed:
    print("VERIFY GATE FAILED:", failed)
    sys.exit(1)
print("VERIFY GATE PASSED:", scores)
The merge gate
The gate is what stops a generation PR from inflating the suite without raising its informativeness. The minimum that earns its keep:
- Audit SHA present and unmodified. The PR cites a real audit at a real revision. The audit hasn't been edited in the same PR.
- One PR, one hole. PRs addressing more than one hole are auto-split, smallest first.
- Every new test cites a hole. Tests with no audit reference are cut before merge.
- Verify gate green on every dimension. Mutation, assertion strength, duplicates, runtime — all individually green. No "overall pass."
- Snapshot tests carry a one-sentence justification. Or they're not snapshots; they're walls.
A PR with all five is mergeable. A PR with four of five is interesting and goes back to whichever stage owns the missing one.
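The five conditions translate directly into a check that names what failed instead of returning a single boolean. The UpliftPR record and its field names are hypothetical; in practice the values would come from the PR metadata and the verify gate's output.

```python
from dataclasses import dataclass, field


@dataclass
class UpliftPR:
    audit_sha: str                      # SHA the PR cites, "" if none
    audit_changed_in_pr: bool           # True if the audit file is in the diff
    holes_addressed: list = field(default_factory=list)
    uncited_tests: int = 0              # new tests with no hole reference
    failed_dimensions: list = field(default_factory=list)  # from the verify gate
    unjustified_snapshots: int = 0


def gate_failures(pr: UpliftPR) -> list:
    """Return the gate conditions this PR fails; an empty list means mergeable."""
    failures = []
    if not pr.audit_sha or pr.audit_changed_in_pr:
        failures.append("audit SHA missing, or audit edited in this PR")
    if len(pr.holes_addressed) != 1:
        failures.append("PR must address exactly one hole")
    if pr.uncited_tests:
        failures.append(f"{pr.uncited_tests} new test(s) cite no hole")
    if pr.failed_dimensions:
        failures.append(
            "verify dimensions below threshold: " + ", ".join(pr.failed_dimensions)
        )
    if pr.unjustified_snapshots:
        failures.append(f"{pr.unjustified_snapshots} snapshot(s) lack a justification")
    return failures
```

Returning the list of failed conditions is what lets a four-of-five PR be routed back to the stage that owns the missing one.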
Best practices, in plain English
- Stop reporting line coverage. The number is honest but useless. Report mutation score by module, weekly. Watch the trend, not the badge.
- The audit is the spec. The agent's "this looks like it needs tests" is folklore. The audit's ranked hole list is the spec.
- Characterization, not specification, on legacy code. You don't know which behaviors are intended yet. Pin first, decide later, edit the tests when intent is decided.
- No snapshots by default. Allow them only with a named, one-sentence behavior they pin. Default-deny beats default-allow every time.
- Mocks are a budget. Per-test and per-PR. Over budget means the test is asserting on the wrong layer.
- Property checks where invariants exist. Hashing, parsing, serialization, idempotency, ordering. A property check is the test you wish you'd written every time something silently broke.
- Pin the model and the agent. Same model version, same agent definition, recorded in the uplift artifact. Reproducibility belongs to the suite, not the chat.
- Trace every run. Plan, tool calls, files read, files written, tokens used. A test you can't explain six months later is a test you should delete.
- The flaky-test budget is zero. Generated flaky tests are the worst kind of debt — they train the team to ignore failures. A flake from generation is a rollback, not a retry.
- Treat the test suite as a first-class artifact. It has a runtime budget, a maintenance owner, a deprecation policy. "More tests" is not the same as "better tests."
Failure modes & gotchas
These have actually taken teams down. Every one has a one-week fix that nobody had time for.
- Coverage moved, mutation didn't. Line coverage went from 42% to 71%; mutation score went from 18% to 21%. The suite is bigger and exactly as informative. Fix: gate on mutation score, not coverage.
- Snapshot wallpaper. Three hundred snapshot tests, all green, none assert a named behavior. Fix: require a one-sentence justification per snapshot; auto-deny unjustified ones.
- Mock-on-mock assertions. The test mocks the dependency, then asserts the mock was called with the right argument. The day you change the call site, the test breaks; the day the call site silently returns the wrong value, it doesn't. Fix: assertion-strength scoring; cap mocks per test.
- The agent pinned the bug. A characterization test pins behavior that turns out to be a known bug. Six weeks later the bug is fixed; the test fails; nobody knows whether to update the test or revert the fix. Fix: characterization tests tag their pinned behavior with kind: legacy-bug | legacy-feature | unknown; bug-tagged tests block on intent review.
- The audit became a wishlist. 80 holes, all P1, none scoped. Fix: the audit caps itself. A service-level audit ships with at most ten ranked holes per round.
- CI tax. Three months of generation passes added eight minutes to CI. Devs start avoiding PRs. Fix: runtime delta gate, enforced per PR.
- Test churn ate review. Every refactor now breaks a dozen tests because they pinned implementation. Fix: assertion-strength gate plus mock budget — the conditions that make tests implementation-pinned are exactly what those gates catch.
- The pinned behavior was nondeterministic. The agent characterized a function that returns time-dependent output; the test is flaky from minute one. Fix: a determinism check in verify — if the same test produces different outputs on three back-to-back runs, the PR is closed.
The gotcha behind half of them: the new tests have no recorded intent. An audit-hole reference plus a one-sentence pinned behavior in the artifact eliminates most of this list before it ships.
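The determinism check from the last failure mode is small enough to sketch in full. It calls the code under test three times back to back and compares the outputs. Both sample functions are hypothetical, and the counter stands in for a clock or any other hidden state.

```python
import itertools


def is_deterministic(fn, *args, runs: int = 3) -> bool:
    """True if fn returns the same value on every back-to-back run."""
    # repr() lets unhashable results be compared; objects whose repr embeds
    # a memory address would need a custom comparison instead.
    outputs = {repr(fn(*args)) for _ in range(runs)}
    return len(outputs) == 1


# Hypothetical stable function: safe to characterize.
def format_total(cents: int) -> str:
    return f"{cents // 100}.{cents % 100:02d}"


# Hypothetical unstable function: the counter stands in for time-dependent state.
_ticket = itertools.count()


def stamped_total(cents: int) -> str:
    return f"{next(_ticket)}:{format_total(cents)}"
```

Running this before the characterization test is written is cheaper than discovering the flake in CI: an unstable function needs its clock or state injected before any behavior gets pinned.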
Cost, test-suite weight, and the agent budget
Cost isn't a finance problem — it's a CI-tax problem in disguise. An agent allowed to generate without bounds will inflate the suite by an order of magnitude, slow CI to a crawl, and produce a wall of green that hides the few new tests that actually catch things. Bound the work or the work bounds you.
- Budget tests added and runtime added, per PR. Set in the audit, enforced by the gate. Hitting the cap means re-audit, not "approve with caveats."
- Cap runs per uplift. Three runs is usually plenty. Five is the agent failing to converge.
- Cap context spend. Tokens per uplift. A pass that cost 800k tokens didn't understand the code; it summarized it badly five times.
- Cap suite weight. A module gets a tests-per-LOC ceiling. Over the ceiling, the next PR has to retire equal-or-greater weight in low-informativeness tests it's replacing.
- Cheap models for the boring cases. A small model does property scaffolding and rename-style fixes; the strong model writes characterization tests for high-risk paths. Routing pays for itself in a week.
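The caps and the routing rule can live in one small module. In this sketch the test, runtime, and run limits echo the audit example earlier in the playbook; the token cap, the field names, and the two model labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpliftBudget:
    tests_added: int = 200          # max_tests_added in the audit example
    runtime_added_ms: int = 30_000  # max_runtime_added_ms in the audit example
    runs: int = 3                   # max_runs in the audit example
    tokens: int = 800_000           # assumed cap, not a recommended value


def over_budget(budget: UpliftBudget, **used: int) -> list:
    """Return the names of every dimension where usage exceeds its cap."""
    return [name for name, value in used.items() if value > getattr(budget, name)]


def pick_model(severity: str, approach: str) -> str:
    """Route high-risk characterization to the strong model, the rest to a small one."""
    if severity == "high" and approach == "characterization":
        return "strong"
    return "small"
```

For example, over_budget(UpliftBudget(), tests_added=34, runtime_added_ms=4200, runs=5, tokens=120_000) flags only runs, which per the rule above means a re-audit rather than another attempt.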
How act101 implements the loop
The loop is the canon; act101 is the implementation — with one division of labor stated plainly. act101 owns the structural half: where the suite is weak and what the code under test actually does. The quality half — does a test catch bugs — is mutation testing, which act101 does not do and does not pretend to. You run them together.
- The audit: act101's analyze_* suite ranks the holes. Test-gap analysis finds uncovered surface; coupling, chokepoint, and complexity scoring turns "uncovered" into "uncovered and high-blast-radius," which is the ranking the audit needs. It is AST-aware across 163 languages, so it reads the same structure in any language the suite is written in.
- Generation: act101's AST-aware queries are how the agent characterizes faithfully. references, callers, skeleton, and type queries return the exact slice of the code under test, so the agent pins real behavior instead of its summary of a file it skimmed, and does it without burning the context window on brute-force reads. Test-harness generation scaffolds the test shells.
- Verify, structural half: act101's AST-similarity is the duplicate-detection gate, catching the four-tests-for-one-path inflation that test-name matching misses. The mutation-score and assertion-strength gates that decide whether the tests are real are external (mutmut / pitest / stryker; a small assertion-strength linter). Pair them with act101; they are the gate act101 deliberately does not try to be.
- The surface: all of it is callable as MCP tools from inside Cursor or Claude Code, so the agent runs the loop in-band while it works, and as the act CLI for the verify gate in CI.
Mutation testing is the irreplaceable quality gate, and it's external; act101 is the irreplaceable structural map and surgical surface that makes the agent's tests faithful in the first place. The pattern is tool-agnostic, but its structural half is a single engine the agent already speaks to rather than a pile of language-specific scripts. The maturity gain is in running the loop — act101 is what makes that affordable.
The maturity ladder
Most teams don't sit at one tier — they're advanced on coverage and primitive on mutation, or have a great property library and no audit. Tick what you actually do today, not what you mean to do.
- [ ] The team reports mutation score by module, not coverage percentage
- [ ] An audit is a versioned artifact in git, not a chat message
- [ ] Each generation PR addresses exactly one named hole
- [ ] Characterization is the default approach on legacy code
- [ ] Mock budgets are enforced per PR
- [ ] Snapshot tests require a named, one-sentence behavior they pin
- [ ] Assertion-strength scoring is a merge gate
- [ ] Mutation score is a merge gate, threshold per module
- [ ] Runtime delta is a merge gate, capped per PR
- [ ] Flaky generated tests are rolled back, not retried
Zero to three: coverage theatre. Four to six: a real suite is forming but porous — implementation-pinned tests are slowing every refactor down. Seven to nine: the suite is informative; modernization is now safe to start. Ten: the loop is improving itself; mutation score trends up while suite runtime stays flat.
A reasonable 30 / 60 / 90-day plan
- Days 1–30 — get to honest. Pick one service. Run mutation testing on it; publish the score. Stop reporting line coverage to leadership. Write one audit by hand and commit it. You're not adding tests yet; you're making the suite's current weakness visible.
- Days 31–60 — build the verify gate. Mutation score and assertion strength as merge gates on the targeted module. Snapshot justification policy. Mock budgets. You can now ship agentic uplift PRs safely, even if slowly.
- Days 61–90 — close the loop. Audit TTLs and post-merge re-scoring. Runtime delta gate. Suite weight ceilings per module. Mutation score becomes a chart that trends; modernization on this service is now a credible next step.
What Feathers got right (and the handoff to modernization)
Michael Feathers' Working Effectively with Legacy Code defined legacy code, simply, as code without tests. The book's whole method is: write characterization tests first, then change anything. The technique is twenty years old, and the reason it hasn't been universally adopted isn't that anyone disagrees; it's that writing characterization tests by hand on a million-line system is a person-year of work nobody funds.
Agentic uplift is what makes Feathers' technique tractable at the scale the modernization actually needs: the audit picks the boundary from structural analysis, the generation stage pins behavior through AST queries rather than the agent's recollection, and the verify stage proves the pins are real with mutation testing alongside. Read Feathers alongside this playbook; the overlap is the part that's actually load-bearing — a surgical, language-agnostic structural surface is what turns a person-year of characterization into a loop you can run boringly.
The handoff to modernization is direct: the characterization suite produced by this loop is the contract that a modernization is measured against. Modernization, properly run, is "change the implementation arbitrarily, keep this suite green." Without the suite, modernization is a rewrite. With it, modernization is a refactor at a different altitude. That's the topic of the next playbook.
The shortest possible summary: stop watching coverage; start watching mutation score. Audit the suite, not just the code. Pin behavior first; specify intent later. Verify is mutation testing, not npm test. Trace every run. Roll back flakes without ceremony. That loop, run boringly for six months, is the step zero modernization has been waiting for.