Continuous Quality with Agents: A Playbook
Continuous integration won the argument that the build should be green on every commit, not once a quarter before release. Continuous delivery won the next one: that a green build should be shippable on demand, through a pipeline, without a war room. Both became practices, then products — a CI server, a deployment pipeline — that sit in the path of every change and answer one question automatically, on every push, forever. Neither of them ever answered whether the code was any good. For thirty years that gap didn't matter much, because a human wrote the code with intent, an architect's eye, and a sense of what they were degrading. Agents removed all three, at machine speed, and the gap became the hole everything falls through.
Continuous Quality (CQ) is the third pillar. If you want the verdict in one breath: CQ is a pipeline stage that gates on the health of the code, not the behavior of the code — it scores security and architecture on every push the way CI scores tests, fails the merge on a regression you can prove, hands the findings back to the agent that wrote them as something the agent can act on, and never re-litigates a finding the team already dismissed. Skip the gate and quality is a dashboard nobody opens. Skip the feedback path and you have a scanner, not a loop. Skip the ratchet and the gate trains your team to ignore it.
Why continuous quality with agents needs its own discipline
CI and CD assumed the author understood the change and a green suite meant the change was safe. Agents break both, and the failure is structural, not occasional — it happens on a fraction of every commit, which is exactly the cadence CI was built to police and CQ now has to.
- CI proves behavior; it is blind to health. A suite can be green on top of a hardcoded credential, a CORS *, a god file, and a fresh dependency cycle. Tests pin what the code does, not what the code is.
- Quality debt now accrues per commit, not per quarter. A human degraded a codebase slowly enough that an annual architecture review caught it. An agent degrades it on Tuesday afternoon. The only review cadence that keeps up is the one that runs on push.
- The agent that wrote the smell cannot see it. Its "understanding" was a context snapshot. It didn't open the file the new coupling reaches into, so it can't know it just tangled two modules.
- Security regressions look like features in a diff. verify=False, alg:none, a NEXT_PUBLIC_ secret, a .cursorrules with an injected directive, an .mcp.json that grants RCE: none of these read as a logic change in the PR title. SAST sometimes catches the first kind; nothing in CI catches the AI-config kind.
- Findings with no feedback path are noise. A wall of alerts a human is supposed to triage is the thing teams mute first. If the finding doesn't arrive as something the agent can act on, the loop never closes and the gate decays into a formality.
- Quality without a ratchet drifts backward. A gate that re-flags a finding the team deliberately accepted is a gate that gets switched off. Continuity — remembering what was decided — is what makes the score mean something over time.
The bottom line: CI measured whether the change still works, CD measured whether the change can ship. CQ measures whether the change made the codebase healthier or sicker — and because an agent is the author, that measurement has to run on every push, return a number you can gate on, and feed back into the thing that produced the change. That last clause is the whole game and the part a plain scanner doesn't have.
The continuous-quality loop, end to end
agent writes ─► CI (build / test) ─► CQ (scan + Health Score + gate) ─► CD (deploy)
     ▲                                          │
     └──────────── remediation bundle ──────────┘
       findings + AST targets + edit anchors feed the next agent run
Four boxes, one direction, and a back-edge that is the entire point. CI and CD are the two stages everyone already runs; CQ slots between them as a standing gate — same trigger as CI (push / PR), same authority as CD (it can stop a merge). The back-edge is what makes it continuous quality rather than occasional scanning: the findings don't terminate in a report, they return to the agent as structured, actionable context and seed the next change.
This is also the loop the other four playbooks in this set plug into. Refactoring, testing uplift, migration, and modernization are operations a team invokes deliberately, each ending in its own verify step. CQ is the always-on version of that verify step, running on every commit whether or not anyone invoked an operation. Run those four against a repo and CQ is the gate that proves the work landed clean and stays clean.
CI→CQ→CD is the canonical shape, and it predates any one tool: any team can wire it from the scanners and CI primitives they already have. This playbook walks the three things you have to build to turn a scanner into a loop (the gate, the feedback path, and the ratchet), and each has a uniquely agentic twist: the gate scores structure not behavior, the feedback path targets an agent not a human, and the ratchet remembers decisions across runs. The closing section names one toolchain, act101 online, that ships all three as a single stage; the pattern itself is yours to run with whatever you already have.
Stage 1 · the quality gate
The gate is the CI-shaped half: it runs on every push and pull request, on the same runner as your tests, and it answers one deterministic question — is this commit healthier or sicker than the baseline? Four properties make it a gate and not a dashboard.
- It runs on every push, on your infrastructure. The scan executes in your Actions runner, scoped to the changed files and their blast radius. There is no compute leaving your side and no per-scan cost — the same economic shape that let CI run on every commit instead of nightly.
- It emits one deterministic number, with two halves. A 0–100 Health Score split into a Security sub-score and an Architecture sub-score — one grade, drill down to each half. The contract is the contract CI taught us to demand: reproducible from the same commit, never network-dependent. Same SHA in, same score out, or it isn't a gate.
- It scores the two things CI is blind to. Security: hardcoded credentials, taint/injection flows, unsafe surface, plus the AI-native classes nobody else gates on (.cursorrules/AI-config backdoors, .mcp.json RCE, frontend secret leaks, insecure defaults, permissive datastore rules). Architecture: the structural rot agents accrue fastest (god files, cycles, dead code, tangled coupling, broken layering).
- It speaks SARIF, so the platform does the rest. A scan that emits SARIF uploads to GitHub code scanning: findings become Security-tab alerts with inline PR annotations, dedup, and history, plus, for free on public repos, Copilot Autofix suggestions. The gate doesn't reinvent the alert UI; it feeds the one the platform already has.
Borrow from CI, not from the security scanner. A nightly SAST run that emails a 400-finding PDF is the thing teams learned to ignore. CI won because it was fast, ran on every commit, and returned a binary you could gate on. The quality gate earns the same trust the same way: per-push, deterministic, one number, blocking only on a regression.
# .github/workflows/continuous-quality.yml: the gate, wired like a CI stage.
name: continuous-quality
on: [push, pull_request]
jobs:
  health:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write   # for SARIF upload
    steps:
      - uses: actions/checkout@v4
      - uses: <your-scanner>/scan-action@v1
        with:
          scope: changed   # changed files + blast radius, not whole repo
          sarif: true      # findings → code-scanning alerts
          report: false    # opt-in LLM report; off by default (spends tokens)

# quality-gate.yaml (an illustrative gate policy): what "healthier or sicker"
# means, committed to git and resolvable from one SHA.
gate:
  security:
    floor: 90          # hard floor; security never ratchets down
    block_new: true    # any new security finding the diff introduces fails the merge
  architecture:
    baseline: main     # delta gate, not an absolute bar
    max_regression: 0  # the diff may not lower the architecture sub-score
    block_new_classes: [dependency_cycle, god_file, layering_violation]
  pre_existing: ignore # the gate blocks what THIS change adds, not the whole backlog
The single most common way teams get this wrong is treating the gate as an absolute bar on day one — turning it on at "score must be ≥ 95" against a repo that scores 60, so it red-flags every PR and gets disabled by Friday. The gate's job is to block regressions the change introduces, not to hold the whole backlog hostage. Security gets a hard floor; architecture gets a no-regression delta against the baseline. The backlog is Stage 3's problem.
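As a rough illustration of the regression-only rule, the sketch below evaluates a head commit against its baseline using the fields from the illustrative quality-gate.yaml. The Scores, Finding, and Policy types and the evaluate_gate function are hypothetical names made up for this example, not any scanner's API, and the sketch assumes the scan has already produced sub-scores and stable finding ids for both commits.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    security: int       # 0-100 security sub-score
    architecture: int   # 0-100 architecture sub-score


@dataclass
class Finding:
    id: str    # stable finding id
    half: str  # "security" or "architecture"
    cls: str   # finding class, e.g. "dependency_cycle"


@dataclass
class Policy:
    # Defaults mirror the illustrative policy file; all names are assumptions.
    security_floor: int = 90
    block_new_security: bool = True
    max_arch_regression: int = 0
    block_new_classes: frozenset = frozenset(
        {"dependency_cycle", "god_file", "layering_violation"}
    )


def evaluate_gate(policy, baseline, head, baseline_ids, head_findings):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    # pre_existing: ignore -> only findings absent from the baseline count.
    new = [f for f in head_findings if f.id not in baseline_ids]

    # Security: hard floor plus "no new security findings".
    if head.security < policy.security_floor:
        failures.append(
            f"security {head.security} is below floor {policy.security_floor}"
        )
    if policy.block_new_security:
        for f in new:
            if f.half == "security":
                failures.append(f"new security finding: {f.id}")

    # Architecture: delta against the baseline, not an absolute bar.
    drop = baseline.architecture - head.architecture
    if drop > policy.max_arch_regression:
        failures.append(f"architecture regressed by {drop}")
    for f in new:
        if f.half == "architecture" and f.cls in policy.block_new_classes:
            failures.append(f"new blocking architecture class: {f.cls}")
    return failures
```

A repo that scores 60 on architecture passes this check as long as the change does not push it lower, which is the behavior that keeps the gate switched on.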
Stage 2 · the feedback path
This is the half that has no equivalent in a classic scanner, and it's what makes the practice continuous quality instead of scheduled blame. A finding is worthless until it reaches whoever can fix it — and the author is now an agent. So the gate's output is engineered for an agent to consume, not a human to read.
- The remediation bundle is the deliverable, not the report. Every gate run emits a token-minimal, LLM-ready bundle: the findings, the AST targets they sit on, and the edit anchors a tool would need to act. A finding is pre-resolved to the exact node an agent should change, so the agent spends tokens fixing, not re-discovering.
- Shift the scan left, into the agent loop. Expose the same engine as an MCP tool and the agent calls it while building — inside Cursor or Claude Code — and gets the security+architecture read before the commit, not only in CI. The gate in CI becomes the backstop; the in-editor scan is the agent catching itself.
- On-demand, blast-radius-scoped. An on-demand comment trigger on a PR runs a scan scoped to the change's impact closure — the diff seeds an impact graph, and the scan runs that set on the user's runner. Cheaper and more correct than a dumb diff scan, and it puts a re-scan one comment away.
- The agent fixes; the platform authors; the human merges. CQ never writes the edit. The bundle (or a code-scanning alert) is handed to whatever agent the team points at it — a code-scanning "assign to Copilot" fix, or a BYO agent grounded on the bundle over MCP — which opens a PR the human reviews. The loop closes when the re-scan on that fix PR comes back clean.
A finding is a message to the next agent run, not a verdict on the last one. Classic scanning addressed a human reviewer and stopped. CQ addresses the agent, in the agent's protocol, with the AST targets it needs — so the same automation that wrote the smell can be pointed at fixing it, and the result re-enters the gate.
// remediation-bundle.json: the gate's output, shaped for an agent.
{
  "commit": "9f3c1a2",
  "health": { "score": 78, "security": 84, "architecture": 72 },
  "findings": [
    {
      "id": "sec.frontend-secret-leak.a91f",   // stable; drives dedup + the refuted ledger
      "class": "frontend_secret_leak",
      "severity": "high",
      "half": "security",
      "location": { "file": "web/src/api.ts", "line": 12, "col": 9 },
      "ast_target": "VariableDeclaration#SUPABASE_SERVICE_ROLE",
      "edit_anchor": "web/src/api.ts:12:9..12:71",
      "remediation": "service_role key is shipped to the client via VITE_; move to a server route"
    }
  ],
  "for_agent": "fix the findings above at their edit_anchors; re-run the scan to confirm"
}
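To make the "pre-resolved to the exact node" idea concrete, here is a minimal sketch of how a fix agent's harness might turn the bundle's edit_anchor strings into structured ranges. The file:line:col..line:col format is inferred from the single example above and is an assumption, as is whether columns are 1-based or the end is exclusive; EditAnchor, parse_edit_anchor, and anchors_by_file are hypothetical helper names.

```python
import re
from typing import NamedTuple


class EditAnchor(NamedTuple):
    file: str
    start_line: int
    start_col: int
    end_line: int
    end_col: int


# Assumed shape: <file>:<line>:<col>..<line>:<col>
_ANCHOR = re.compile(
    r"^(?P<file>.+):(?P<sl>\d+):(?P<sc>\d+)\.\.(?P<el>\d+):(?P<ec>\d+)$"
)


def parse_edit_anchor(anchor: str) -> EditAnchor:
    """Parse one edit_anchor string into a structured range."""
    m = _ANCHOR.match(anchor)
    if m is None:
        raise ValueError(f"unrecognized edit anchor: {anchor!r}")
    return EditAnchor(
        m["file"], int(m["sl"]), int(m["sc"]), int(m["el"]), int(m["ec"])
    )


def anchors_by_file(bundle: dict) -> dict:
    """Group (finding id, anchor) pairs by file so an agent opens each file once."""
    grouped = {}
    for finding in bundle.get("findings", []):
        anchor = parse_edit_anchor(finding["edit_anchor"])
        grouped.setdefault(anchor.file, []).append((finding["id"], anchor))
    return grouped
```

Grouping by file also gives a natural place to enforce a fix budget: an edit outside the listed files or far from the listed ranges is doing something other than the finding.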
Stage 3 · the ratchet
A gate that forgets is a gate that nags, and a gate that nags gets turned off. The ratchet is what lets the score mean something across hundreds of commits: decisions persist, the gate respects them, and the number only moves one way over time.
- The refuted-findings ledger. When the team dismisses a finding (in the GitHub Security tab, or by listing it in a refuted-findings ledger committed to the repo), it stays quiet on every future run. Because the bundle's stable finding id is carried as the SARIF partialFingerprints, GitHub's alert dedup and the ledger agree on what "the same finding" is. Dismissed once, silent forever.
- Audit continuity. If the repo carries a project-map.md or a remediation log, the gate ingests them and reports incrementally, building on prior architectural state instead of re-deriving it, so recommendations don't re-open settled questions. These artifacts live in the user's repo, committed by the team, so the architecture half gets smarter over time with nothing stored on the vendor side.
- The score is a trend, not a snapshot. The point of a per-push number is the chart. Security held at or above its floor; architecture trending up as the backlog drains. The ratchet is what turns "we scanned it" into "the codebase is measurably healthier than it was a month ago", the metric a CQ practice is ultimately judged on.
// .act/refuted.json: decisions the gate must remember, in git.
{
  "refuted": [
    {
      "id": "arch.complexity-hotspot.parser-core.7b2",
      "reason": "hot path, intentionally dense; benchmarked, owner-signed",
      "by": "akmartin",
      "on": "2026-05-28"
    },
    {
      "id": "sec.unsafe-surface.ffi-bridge.0c4",
      "reason": "audited FFI boundary; mitigations documented in SECURITY.md",
      "by": "akmartin",
      "on": "2026-05-30"
    }
  ]
}
// project-map.md (separate file) is ingested for incremental architecture scoring.
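A minimal sketch of how a gate could honor that ledger, assuming the file on disk is plain JSON (the // lines in the sample read as captions and would not parse) and that every entry must carry the four fields shown. load_ledger and apply_ledger are hypothetical names for illustration only.

```python
import json

# Fields every ledger entry is assumed to need, per the sample above.
REQUIRED_FIELDS = ("id", "reason", "by", "on")


def load_ledger(text: str) -> set:
    """Parse the ledger and return the set of refuted finding ids.

    Entries without a reason or a named owner are rejected, so a silent
    suppression cannot slip in as a bare id.
    """
    entries = json.loads(text).get("refuted", [])
    for entry in entries:
        missing = [key for key in REQUIRED_FIELDS if not entry.get(key)]
        if missing:
            raise ValueError(
                f"ledger entry {entry.get('id', '?')} is missing {missing}"
            )
    return {entry["id"] for entry in entries}


def apply_ledger(findings: list, refuted_ids: set) -> tuple:
    """Split findings into (active, silenced) by stable id."""
    active = [f for f in findings if f["id"] not in refuted_ids]
    silenced = [f for f in findings if f["id"] in refuted_ids]
    return active, silenced
```

Only the active list feeds the blocking decision; keeping the silenced list around makes it possible to report how much of the score rests on accepted risk.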
The merge gate
The merge gate is where CQ borrows CD's authority: it can stop a change from reaching main. The minimum that earns its keep:
- Security sub-score at or above its floor. A hard bar, not a delta. Security never ratchets down — a regression below the floor is a red PR, full stop.
- No new findings introduced by the diff. The change may not add a finding. Pre-existing backlog findings do not block (that's the ratchet's job to drain, not the gate's job to wall off).
- Architecture sub-score not regressed against baseline. A delta gate: the diff may not lower the architecture half versus main. New cycles, god files, and layering breaks are blocking classes.
- Refuted findings excluded. Anything in the ledger is silent. A PR red because of a dismissed finding is a broken gate, and teams are right to distrust it.
- SARIF uploaded and alerts reconciled. The findings are in the Security tab with stable fingerprints, so history and dedup hold across runs.
A PR green on all five is mergeable. A PR red on any one goes back to the stage that owns it — security floor and new-finding gates to the feedback path (point the agent at the bundle), architecture regression to whichever operation introduced it.
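The last check depends on each finding carrying the same identity into SARIF on every run. The sketch below maps one bundle finding onto a SARIF 2.1.0 result object and carries the stable id in partialFingerprints. The fingerprint key name "findingId/v1" and the severity-to-level mapping are assumptions chosen for illustration; how a given platform weighs supplied fingerprints against its own should be checked in that platform's documentation.

```python
# Assumed mapping from bundle severity to SARIF result levels.
_LEVELS = {"high": "error", "medium": "warning", "low": "note"}


def to_sarif_result(finding: dict) -> dict:
    """Convert one bundle finding into a SARIF 2.1.0 result object."""
    location = finding["location"]
    return {
        "ruleId": finding["class"],
        "level": _LEVELS.get(finding["severity"], "warning"),
        "message": {"text": finding["remediation"]},
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {"uri": location["file"]},
                    "region": {
                        "startLine": location["line"],
                        "startColumn": location["col"],
                    },
                }
            }
        ],
        # The stable id is the identity; the key name is hypothetical.
        "partialFingerprints": {"findingId/v1": finding["id"]},
    }


def to_sarif_log(findings: list, tool_name: str, tool_version: str) -> dict:
    """Wrap results in a minimal single-run SARIF log."""
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [
            {
                "tool": {"driver": {"name": tool_name, "version": tool_version}},
                "results": [to_sarif_result(f) for f in findings],
            }
        ],
    }
```

Because the fingerprint is derived from the finding id rather than from line numbers, moving the offending code within a file should not make the alert look new.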
Best practices, in plain English
- Turn the gate on in report-only mode first. Run it on every PR with block: false for two weeks. You're not gating yet; you're establishing the baseline and proving the score is stable before it can stop a merge.
- Gate regressions, drain backlog. The gate blocks what the change adds. The backlog comes down through Stage 3 and the refactoring playbook, on its own cadence, not by holding every PR hostage.
- Security is a floor; architecture is a delta. A hard bar for the urgent half, a no-regression bar for the structural half. Conflating them either lets security rot or makes the gate unusable.
- Make the finding the agent's problem. The bundle, not the PDF. If a human has to translate a finding into an edit, the loop isn't closed — it's just a slower scanner.
- Scan in the agent loop, not only in CI. The in-editor scan (over MCP) catches the smell before the commit; the CI gate is the backstop. Shift-left is cheaper than block-and-rework.
- Never let CQ author the edit. The gate reports; the agent fixes; the human merges. The moment the gate writes code, you've put an un-reviewed automated change on the critical path and lost the liability story.
- Ratchet on stable fingerprints. Dismissals and dedup both key on the same finding id. If your fingerprints drift, the ledger leaks and the gate starts re-nagging.
- Commit the audit artifacts. project-map.md, the remediation log, the refuted ledger: these live in the repo so continuity survives a runner with no memory. State that isn't in git doesn't exist next run.
- Keep the report opt-in. The human-readable LLM report spends tokens and is non-core. The deterministic score, SARIF, and bundle stand on their own; the prose is a convenience, gated behind an explicit toggle.
- Watch the trend, not the run. The health-score chart over weeks is the artifact leadership should see. A single red PR is noise; a sub-score trending the wrong way for a month is the signal.
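One simple way to turn "watch the trend" into a number is a least-squares slope over the recent per-push scores. This is a sketch of one possible metric, not a prescribed one; trend_slope is a hypothetical helper, and the window length and any alert threshold are left to the team.

```python
def trend_slope(scores: list) -> float:
    """Least-squares slope of score against run index, in points per run.

    Positive means the sub-score is improving across the window,
    negative means it is drifting the wrong way.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator
```

A single red run barely moves the slope over a month of pushes, while a steady decline shows up clearly, which matches the noise-versus-signal distinction above.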
Failure modes & gotchas
These have actually sunk CQ rollouts. Each has a fix nobody had time for until the gate got disabled.
- The absolute-bar trap. Turned on at "score ≥ 95" against a 60 repo; every PR red; gate off by Friday. Fix: regression-only gating, backlog drained separately.
- Alert fatigue from the whole backlog. SARIF dumps 400 pre-existing findings into the Security tab on first upload and the team mutes the integration. Fix: surface new findings on the PR; let the backlog sit as triageable history, not blocking noise.
- The score moved and nobody knows why. A dependency bump or a rule update shifted the number with no code change. Fix: deterministic, version-pinned rules; the score is reproducible from the SHA and the ruleset version, both recorded on the run.
- Findings with no owner. Alerts pile up because they address "the team." Fix: route the bundle to an agent automatically; the human owner reviews a fix PR, not a raw finding.
- The ratchet leaks. A finding dismissed last week comes back this week because the fingerprint changed. Fix: stable finding id as the single key for both dedup and the refuted ledger; treat a fingerprint change as a rule bug.
- CQ blocks CD on a flaky signal. A non-deterministic finding intermittently reds the gate and the team learns to re-run until green, which is the same as no gate. Fix: any finding that isn't reproducible from the SHA is quarantined out of the blocking set until it's deterministic.
- The agent "fixes" by suppressing. Pointed at a finding, an agent adds it to the refuted ledger instead of fixing it. Fix: ledger edits require a human sign-off with a reason; a refutation in a fix PR is a review red flag, the same way a weakened test assertion is.
- Security regression disguised as a feature. A PR adds rejectUnauthorized: false to "fix" a TLS error and tests stay green. Fix: the security floor with block_new: true; the gate sees the insecure-default class regardless of the green suite.
The gotcha behind most of them: the gate's authority is only as good as its determinism and its memory. A flaky score or a leaky ledger doesn't just produce a bad run — it teaches the team the gate is wrong, and a gate the team overrides on reflex is worse than no gate at all.
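The quarantine rule for flaky findings can be sketched as a set comparison across repeated scans of the same SHA. This assumes the scanner can be re-run on one commit and that each run yields a set of stable finding ids; split_reproducible is a hypothetical helper name.

```python
def split_reproducible(runs: list) -> tuple:
    """Given finding-id sets from repeated scans of one SHA, return (stable, flaky).

    stable: ids present in every run, safe to keep in the blocking set.
    flaky:  ids present in some runs but not all, to be quarantined.
    """
    if not runs:
        raise ValueError("need at least one scan run")
    stable = set.intersection(*runs)
    flaky = set.union(*runs) - stable
    return stable, flaky
```

Run during the report-only period, a check like this surfaces non-deterministic rules before they ever get the authority to stop a merge.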
Cost, blast radius, and the agent budget
CQ has to be cheaper than the failure it prevents, or it won't run on every push — and a quality gate that only runs nightly is just a slower audit. The design rule, borrowed straight from why CI could afford to run per-commit: N scans is a fixed-cost event, not a variable one.
- Scan compute runs on the user's runner. Same place the tests run; zero incremental infrastructure. This is the property that lets the gate run per-push instead of per-night.
- Scope to the blast radius, not the repo. Changed files plus their impact closure. A 12-file PR scans the symbols it actually touches, not 10,000 files, so the gate adds seconds to CI, not minutes.
- The LLM report is opt-in and on the user's tokens. The deterministic half is free to run unlimited; the prose report — the only token-spending piece — is off by default and runs on the user's own model budget when enabled.
- Entitlement is verified offline. A good gate checks its license on the runner with no network call per scan. N scans generate zero control-plane traffic; the only thing that scales is install-time provisioning, bounded by customer count, not scan count.
- Bound the fix budget, not just the scan. When an agent is pointed at the bundle, cap files-changed and runs per fix PR the same way the refactoring playbook does. A fix that sprawls past its anchors is a fix that's doing something other than the finding.
- Cheap models trim the bundle; strong models fix. A small model can group and prioritize findings; reserve the expensive agent for authoring the change. The deterministic AST targets in the bundle mean even the strong model spends tokens fixing, not exploring.
The marginal cost of one more scan is approximately the cost of not answering an optional counter ping. That is the number that makes "on every push" affordable, which is the number that makes it continuous.
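Blast-radius scoping can be approximated as a reverse-dependency closure seeded by the diff. The sketch below works at file level over an import map, which is a simplifying assumption; the text describes an impact graph over touched symbols, and a real scanner may compute the closure at finer granularity. impact_closure is a hypothetical function name.

```python
from collections import deque


def impact_closure(changed: set, imports: dict) -> set:
    """Return the changed files plus every file that transitively depends on them.

    imports maps each file to the set of files it imports.
    """
    # Invert the edges: for each file, who imports it?
    dependents = {}
    for source, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(source)

    # Breadth-first walk outward from the diff.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

The scan set grows with how widely the changed code is depended on, not with repository size, which is what keeps the per-push cost roughly proportional to the change.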
The maturity ladder
Most teams are advanced on one half and absent on the other — solid SAST in CI, nothing watching architecture; or a beautiful architecture audit that runs twice a year by hand. Tick what runs on every push today, not what you mean to wire up.
- [ ] A quality scan runs automatically on every push / PR, not on a schedule
- [ ] The scan returns one deterministic score, reproducible from the SHA
- [ ] The score has a security half and an architecture half
- [ ] The gate blocks merges on regressions the change introduces
- [ ] Security has a hard floor; architecture has a no-regression delta
- [ ] Findings ship as an agent-ready bundle, not just a human report
- [ ] The scan is callable from inside the agent loop (MCP), not only in CI
- [ ] Dismissed findings are remembered and never re-flagged
- [ ] Audit artifacts (project-map, refuted ledger) live in git
- [ ] The health-score trend is a chart someone actually watches
Zero to three: you have CI and a security scanner, and you're calling it quality. Four to six: a real gate, but the loop is open; humans still translate every finding by hand. Seven to nine: the loop is closed; the agent that writes the code is the agent that fixes it, and the gate remembers. Ten: the score trends up on its own and your job is watching the chart.
A reasonable 30 / 60 / 90-day plan
- Days 1–30: install the gate, report-only. Wire the scan into one repo's CI on every PR with blocking off. Establish the baseline Health Score, confirm it's deterministic across re-runs, and let SARIF populate the Security tab. You're measuring, not gating.
- Days 31–60: make it block, on regressions only. Turn on the security floor and the new-finding gate; add the architecture no-regression delta against main. Stand up the refuted ledger so dismissals stick. You can now stop a PR that makes the codebase sicker, without holding the backlog hostage.
- Days 61–90: close the loop. Expose the scan in the team's agent (over MCP) so quality shifts left; route the remediation bundle to a fix agent; wire an on-demand comment trigger for blast-radius scans. The agent that writes the code now fixes its own findings and re-scans, and the health-score trend becomes a number leadership watches.
How act101 online implements the loop
The loop is the canon; act101 online is one toolchain that ships all three parts as a single stage instead of a drawer of overlapping scanners you wire together yourself. It's a GitHub App plus Action, so it rides GitHub's own primitives — the Actions runner, SARIF code scanning, branch protection, the Copilot coding agent — rather than reinventing them. What it adds is the three things those primitives don't have on their own: a unified score, an agent-ready feedback path, and a ratchet.
- The gate: the Action (act101-ai/scan-action@v2) runs act scan on your runner, scoped to the diff's blast radius, and emits the deterministic Health Score (security + architecture) plus SARIF. The architecture half comes from act101's analyze_* suite (the structural sub-score no SAST tool produces), and it rides the same SARIF pipe into the Security tab, so GitHub's alert UI, dedup, and Copilot Autofix all light up for it. Entitlement is a license-key input (or GitHub OIDC), verified offline on the runner, with no network call per scan.
- The feedback path: the same engine is an MCP tool (mcp__act__scan), so the agent scans itself while building, and every CI run emits the remediation bundle (findings + AST targets + edit anchors) that an agent can act on. This is the half a plain scanner has no equivalent for.
- The ratchet: the stable finding id rides into SARIF as partialFingerprints, so GitHub's dedup and act101's refuted ledger (.act/refuted.json) agree on identity, and the architecture half ingests the repo's committed project-map.md to score incrementally. Memory lives in your repo and the platform, never on the vendor side.
- The economics: scan compute on your runner, entitlement verified offline, the LLM report opt-in on your own GitHub Models tokens. This is what lets the gate run per-push instead of nightly, the same economic shape that made CI affordable in the first place.
The pattern is platform-agnostic; the unifying score, the remediation bundle, and the ratchet are the parts you'd otherwise hand-build on top of five overlapping scanners. You already have most of the primitives this rides on — GitHub Actions for the trigger, branch protection for the authority, SARIF code scanning for the alert surface, the Copilot coding agent for the fix path. What none of them do is the architecture sub-score or the agent feedback path, which is precisely the gap act101 online fills, in one gate, on day one. The maturity gain is in running the loop on every push, not in running more scanners that each red the gate for a different reason.
What Humble and Farley got right (and where CQ goes from here)
Continuous Delivery (Humble & Farley, 2010), building on Fowler's and Booch's continuous-integration work before it, made one durable argument: the path from commit to production should be a pipeline — automated, repeatable, and the single source of truth about whether a change is safe to ship. They were obsessed with shrinking the feedback loop, because a signal you get in minutes changes behavior and a signal you get next quarter changes nothing. They built the pipeline around the signals available at the time: does it build, do the tests pass, does it deploy. Those signals all measured behavior.
Continuous Quality is the same argument applied to the signal CI and CD never had — the health of the code, which agents degrade faster than any human ever could and which no test suite reports. CQ adds a stage to the same pipeline, demands the same properties (automated, per-commit, deterministic, fast), and extends the feedback loop one hop further than Humble and Farley could: not just back to the human who can read the result, but back to the agent that produced the change, in a form it can act on. Read Continuous Delivery alongside this playbook; the pipeline discipline is the load-bearing part, and CQ is what you get when you point that discipline at structure instead of behavior.
The shortest possible summary: gate every push on a deterministic health score, block the regressions a change introduces, hand the findings back to the agent as something it can fix, and remember every decision so the number only moves one way. CI made the build trustworthy. CD made the release trustworthy. CQ makes the agent's code trustworthy — and run boringly on every commit for six months, that's the whole game.