Continuous Quality with Agents: A Playbook

Continuous integration won the argument that the build should be green on every commit, not once a quarter before release. Continuous delivery won the next one: that a green build should be shippable on demand, through a pipeline, without a war room. Both became practices, then products — a CI server, a deployment pipeline — that sit in the path of every change and answer one question automatically, on every push, forever. Neither of them ever answered whether the code was any good. For thirty years that gap didn't matter much, because a human wrote the code with intent, an architect's eye, and a sense of what they were degrading. Agents removed all three, at machine speed, and the gap became the hole everything falls through.

Continuous Quality (CQ) is the third pillar. If you want the verdict in one breath: CQ is a pipeline stage that gates on the health of the code, not the behavior of the code. It scores security and architecture on every push the way CI scores tests, fails the merge on a regression you can prove, hands the findings back to the agent that wrote the code in a form the agent can act on, and never re-litigates a finding the team already dismissed. Skip the gate and quality is a dashboard nobody opens. Skip the feedback path and you have a scanner, not a loop. Skip the ratchet and the gate trains your team to ignore it.

Why continuous quality with agents needs its own discipline

CI and CD assumed that the author understood the change and that a green suite meant the change was safe. Agents break both assumptions, and the failure is structural, not occasional: it recurs commit after commit, which is exactly the cadence CI was built to police and CQ now has to.

The bottom line: CI measured whether the change still works, CD measured whether the change can ship. CQ measures whether the change made the codebase healthier or sicker — and because an agent is the author, that measurement has to run on every push, return a number you can gate on, and feed back into the thing that produced the change. That last clause is the whole game and the part a plain scanner doesn't have.

The continuous-quality loop, end to end

agent writes ─► CI (build / test) ─► CQ (scan + Health Score + gate) ─► CD (deploy)
     ▲                                          │
     └──────────── remediation bundle ──────────┘
            findings + AST targets + edit anchors feed the next agent run

Four boxes, one direction, and a back-edge that is the entire point. CI and CD are the two stages everyone already runs; CQ slots between them as a standing gate — same trigger as CI (push / PR), same authority as CD (it can stop a merge). The back-edge is what makes it continuous quality rather than occasional scanning: the findings don't terminate in a report, they return to the agent as structured, actionable context and seed the next change.

This is also the loop the other four playbooks in this set plug into. Refactoring, testing uplift, migration, and modernization are operations a team invokes deliberately, each ending in its own verify step. CQ is the always-on version of that verify step, running on every commit whether or not anyone invoked an operation. Run those four against a repo and CQ is the gate that proves the work landed clean and stays clean.

CI→CQ→CD is the canonical shape, and it predates any one tool: any team can wire it from the scanners and CI primitives they already have. This playbook walks the three things you have to build to turn a scanner into a loop (the gate, the feedback path, and the ratchet), and each has a uniquely agentic twist: the gate scores structure not behavior, the feedback path targets an agent not a human, and the ratchet remembers decisions across runs. A later section names one toolchain, act101 online, that ships all three as a single stage; the pattern itself is yours to run with whatever you already have.

Stage 1 · the quality gate

The gate is the CI-shaped half: it runs on every push and pull request, on the same runner as your tests, and it answers one deterministic question — is this commit healthier or sicker than the baseline? Four properties make it a gate and not a dashboard.

  1. It runs on every push, on your infrastructure. The scan executes in your Actions runner, scoped to the changed files and their blast radius. There is no compute leaving your side and no per-scan cost — the same economic shape that let CI run on every commit instead of nightly.
  2. It emits one deterministic number, with two halves. A 0–100 Health Score split into a Security sub-score and an Architecture sub-score — one grade, drill down to each half. The contract is the contract CI taught us to demand: reproducible from the same commit, never network-dependent. Same SHA in, same score out, or it isn't a gate.
  3. It scores the two things CI is blind to. Security: hardcoded credentials, taint/injection flows, unsafe surface, plus the AI-native classes nobody else gates on — .cursorrules/AI-config backdoors, .mcp.json RCE, frontend secret leaks, insecure defaults, permissive datastore rules. Architecture: the structural rot agents accrue fastest — god files, cycles, dead code, tangled coupling, broken layering.
  4. It speaks SARIF, so the platform does the rest. A scan that emits SARIF uploads to GitHub code scanning: findings become Security-tab alerts with inline PR annotations, dedup, and history — and, for free on public repos, Copilot Autofix suggestions. The gate doesn't reinvent the alert UI; it feeds the one the platform already has.

Borrow from CI, not from the security scanner. A nightly SAST run that emails a 400-finding PDF is the thing teams learned to ignore. CI won because it was fast, ran on every commit, and returned a binary you could gate on. The quality gate earns the same trust the same way: per-push, deterministic, one number, blocking only on a regression.

# .github/workflows/continuous-quality.yml — the gate, wired like a CI stage.
name: continuous-quality
on: [push, pull_request]

jobs:
  health:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write     # for SARIF upload
    steps:
      - uses: actions/checkout@v4
      - uses: <your-scanner>/scan-action@v1
        with:
          scope: changed          # changed files + blast radius, not whole repo
          sarif: true             # findings → code-scanning alerts
          report: false           # opt-in LLM report; off by default (spends tokens)

# quality-gate.yaml: an illustrative gate policy for what "healthier or sicker"
# means, committed to git and resolvable from one SHA.
gate:
  security:
    floor: 90               # hard floor — security never ratchets down
    block_new: true         # any new security finding the diff introduces fails the merge
  architecture:
    baseline: main          # delta gate, not an absolute bar
    max_regression: 0       # the diff may not lower the architecture sub-score
    block_new_classes: [dependency_cycle, god_file, layering_violation]
  pre_existing: ignore      # the gate blocks what THIS change adds, not the whole backlog

The single most common way teams get this wrong is treating the gate as an absolute bar on day one — turning it on at "score must be ≥ 95" against a repo that scores 60, so it red-flags every PR and gets disabled by Friday. The gate's job is to block regressions the change introduces, not to hold the whole backlog hostage. Security gets a hard floor; architecture gets a no-regression delta against the baseline. The backlog is Stage 3's problem.
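
The regression-only rule can be sketched in a few lines. This is a minimal sketch, not any scanner's actual implementation: the `Finding` and `Scan` types and the `evaluate_gate` function are hypothetical names, and the default thresholds simply mirror the illustrative policy above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    id: str     # stable finding id, as in the remediation bundle
    half: str   # "security" or "architecture"
    cls: str    # finding class, e.g. "dependency_cycle"


@dataclass(frozen=True)
class Scan:
    security: int        # Security sub-score, 0-100
    architecture: int    # Architecture sub-score, 0-100
    findings: frozenset  # frozenset of Finding


# Mirrors block_new_classes in the illustrative policy (an assumption, not a standard).
BLOCKING_ARCH_CLASSES = frozenset({"dependency_cycle", "god_file", "layering_violation"})


def evaluate_gate(base: Scan, head: Scan, security_floor: int = 90,
                  max_regression: int = 0) -> list[str]:
    """Return the reasons the gate fails; an empty list means the change passes."""
    reasons = []
    base_ids = {f.id for f in base.findings}
    # pre_existing: ignore -> only findings absent from the baseline are considered.
    new = sorted((f for f in head.findings if f.id not in base_ids),
                 key=lambda f: f.id)

    # Security: a hard floor plus "no new security findings".
    if head.security < security_floor:
        reasons.append(f"security {head.security} is below floor {security_floor}")
    for f in new:
        if f.half == "security":
            reasons.append(f"new security finding {f.id}")
        elif f.cls in BLOCKING_ARCH_CLASSES:
            reasons.append(f"new blocking architecture class {f.cls}: {f.id}")

    # Architecture: a delta against the baseline, never an absolute bar.
    if base.architecture - head.architecture > max_regression:
        reasons.append(
            f"architecture regressed from {base.architecture} to {head.architecture}")
    return reasons
```

Calling `evaluate_gate(base, head)` with the baseline scan of main and the scan of the PR head returns an empty list when the change is mergeable. Note that a backlog finding present in both scans never appears in the reasons, which is the point of the delta design.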

Stage 2 · the feedback path

This is the half that has no equivalent in a classic scanner, and it's what makes the practice continuous quality instead of scheduled blame. A finding is worthless until it reaches whoever can fix it — and the author is now an agent. So the gate's output is engineered for an agent to consume, not a human to read.

  1. The remediation bundle is the deliverable, not the report. Every gate run emits a token-minimal, LLM-ready bundle: the findings, the AST targets they sit on, and the edit anchors a tool would need to act. A finding is pre-resolved to the exact node an agent should change, so the agent spends tokens fixing, not re-discovering.
  2. Shift the scan left, into the agent loop. Expose the same engine as an MCP tool and the agent calls it while building — inside Cursor or Claude Code — and gets the security+architecture read before the commit, not only in CI. The gate in CI becomes the backstop; the in-editor scan is the agent catching itself.
  3. On-demand, blast-radius-scoped. An on-demand comment trigger on a PR runs a scan scoped to the change's impact closure — the diff seeds an impact graph, and the scan runs that set on the user's runner. Cheaper and more correct than a dumb diff scan, and it puts a re-scan one comment away.
  4. The agent fixes; the platform authors; the human merges. CQ never writes the edit. The bundle (or a code-scanning alert) is handed to whatever agent the team points at it — a code-scanning "assign to Copilot" fix, or a BYO agent grounded on the bundle over MCP — which opens a PR the human reviews. The loop closes when the re-scan on that fix PR comes back clean.
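
The impact closure in item 3 can be illustrated with a reverse-dependency walk. This is a sketch under a stated assumption: the playbook does not define the impact graph, so here it is taken to be "every file that transitively imports a changed file". The function name `impact_closure` and the shape of the `imports` map are hypothetical.

```python
from collections import deque


def impact_closure(changed: set[str], imports: dict[str, set[str]]) -> set[str]:
    """Return the changed files plus every file that transitively imports one.

    `imports` maps each file to the set of files it imports directly.
    """
    # Invert "file -> files it imports" into "file -> files that import it".
    dependents: dict[str, set[str]] = {}
    for src, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(src)

    # Breadth-first walk outward from the diff; `seen` also guards against cycles.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

Scanning this set rather than the raw diff is what catches a finding in a caller that the diff never touched, while still staying far smaller than a whole-repo scan.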

A finding is a message to the next agent run, not a verdict on the last one. Classic scanning addressed a human reviewer and stopped. CQ addresses the agent, in the agent's protocol, with the AST targets it needs — so the same automation that wrote the smell can be pointed at fixing it, and the result re-enters the gate.

// remediation-bundle.json — the gate's output, shaped for an agent.
{
  "commit": "9f3c1a2",
  "health": { "score": 78, "security": 84, "architecture": 72 },
  "findings": [
    {
      "id": "sec.frontend-secret-leak.a91f",      // stable — drives dedup + the refuted ledger
      "class": "frontend_secret_leak",
      "severity": "high",
      "half": "security",
      "location": { "file": "web/src/api.ts", "line": 12, "col": 9 },
      "ast_target": "VariableDeclaration#SUPABASE_SERVICE_ROLE",
      "edit_anchor": "web/src/api.ts:12:9..12:71",
      "remediation": "service_role key is shipped to the client via VITE_; move to a server route"
    }
  ],
  "for_agent": "fix the findings above at their edit_anchors; re-run the scan to confirm"
}
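
One way to hand a bundle like this to an agent is to flatten it into a short task list. The sketch below assumes the bundle has the fields shown above and is stored as comment-free JSON (the `//` annotations in the sample would need to be stripped first); `render_agent_task` and the severity ordering are hypothetical, not part of any tool's API.

```python
import json

# Assumed severity ranking; unknown severities sort last.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def render_agent_task(bundle_json: str) -> str:
    """Turn a remediation bundle into a compact, ordered instruction for an agent."""
    bundle = json.loads(bundle_json)
    findings = sorted(
        bundle["findings"],
        key=lambda f: (SEVERITY_ORDER.get(f["severity"], 9), f["id"]),
    )
    lines = [f"commit {bundle['commit']}: {len(findings)} finding(s) to fix"]
    for f in findings:
        # One line per finding: where to edit, which node, and what to do.
        lines.append(
            f"- [{f['severity']}] {f['id']} at {f['edit_anchor']} "
            f"({f['ast_target']}): {f['remediation']}"
        )
    lines.append(bundle["for_agent"])
    return "\n".join(lines)
```

Sorting by severity and then by id keeps the rendered task stable across runs, so the same bundle always produces the same prompt.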

Stage 3 · the ratchet

A gate that forgets is a gate that nags, and a gate that nags gets turned off. The ratchet is what lets the score mean something across hundreds of commits: decisions persist, the gate respects them, and the number only moves one way over time.

  1. The refuted-findings ledger. When the team dismisses a finding — in the GitHub Security tab, or by listing it in a refuted-findings ledger committed to the repo — it stays quiet on every future run. Because the bundle's stable finding id is carried as the SARIF partialFingerprints, GitHub's alert dedup and the ledger agree on what "the same finding" is. Dismissed once, silent forever.
  2. Audit continuity. If the repo carries a project-map.md or a remediation log, the gate ingests them and reports incrementally — building on prior architectural state instead of re-deriving it, so recommendations don't re-open settled questions. These artifacts live in the user's repo, committed by the team, so the architecture half gets smarter over time with nothing stored on the vendor side.
  3. The score is a trend, not a snapshot. The point of a per-push number is the chart. Security held at or above its floor; architecture trending up as the backlog drains. The ratchet is what turns "we scanned it" into "the codebase is measurably healthier than it was a month ago", which is the metric a CQ practice is ultimately judged on.

// .act/refuted.json: decisions the gate must remember, in git.
{
  "refuted": [
    {
      "id": "arch.complexity-hotspot.parser-core.7b2",
      "reason": "hot path, intentionally dense; benchmarked, owner-signed",
      "by": "akmartin",
      "on": "2026-05-28"
    },
    {
      "id": "sec.unsafe-surface.ffi-bridge.0c4",
      "reason": "audited FFI boundary; mitigations documented in SECURITY.md",
      "by": "akmartin",
      "on": "2026-05-30"
    }
  ]
}
// project-map.md (separate file) is ingested for incremental architecture scoring.
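
A sketch of how the ledger and the stable id might fit together: dismissed ids are filtered out before gating, and each surviving finding carries its id into the SARIF result as a partial fingerprint. The function names are hypothetical, the ledger is assumed to be comment-free JSON, and the fingerprint key `stableFindingId/v1` is an assumption; check which fingerprint keys your code-scanning platform actually honors for dedup.

```python
import json


def load_refuted_ids(ledger_json: str) -> set[str]:
    """Read the ids of dismissed findings from a refuted-findings ledger."""
    return {entry["id"] for entry in json.loads(ledger_json)["refuted"]}


def active_findings(findings: list[dict], refuted_ids: set[str]) -> list[dict]:
    """Drop every finding the team has already dismissed."""
    return [f for f in findings if f["id"] not in refuted_ids]


def to_sarif_result(finding: dict) -> dict:
    """Map one bundle finding to a SARIF 2.1.0 result object."""
    loc = finding["location"]
    return {
        "ruleId": finding["class"],
        "message": {"text": finding["remediation"]},
        "locations": [{
            "physicalLocation": {
                "artifactLocation": {"uri": loc["file"]},
                "region": {"startLine": loc["line"], "startColumn": loc["col"]},
            }
        }],
        # Hypothetical key name; the value is the bundle's stable finding id.
        "partialFingerprints": {"stableFindingId/v1": finding["id"]},
    }
```

Because the filter and the fingerprint both key on the same id, a finding dismissed in the ledger and a finding deduplicated by the alert surface are, by construction, the same finding.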

The merge gate

The merge gate is where CQ borrows CD's authority: it can stop a change from reaching main. The minimum that earns its keep:

  1. Security sub-score at or above its floor. A hard bar, not a delta. Security never ratchets down — a regression below the floor is a red PR, full stop.
  2. No new findings introduced by the diff. The change may not add a finding. Pre-existing backlog findings do not block (that's the ratchet's job to drain, not the gate's job to wall off).
  3. Architecture sub-score not regressed against baseline. A delta gate: the diff may not lower the architecture half versus main. New cycles, god files, and layering breaks are blocking classes.
  4. Refuted findings excluded. Anything in the ledger is silent. A PR red because of a dismissed finding is a broken gate, and teams are right to distrust it.
  5. SARIF uploaded and alerts reconciled. The findings are in the Security tab with stable fingerprints, so history and dedup hold across runs.

A PR green on all five is mergeable. A PR red on any one goes back to the stage that owns it — security floor and new-finding gates to the feedback path (point the agent at the bundle), architecture regression to whichever operation introduced it.
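
The routing rule can be written down as a small lookup. This is an illustrative sketch: the check names and `route_failures` are hypothetical, and the owners for the last two checks are an assumption, since the text above only assigns owners to the security and architecture gates.

```python
# Which stage owns each merge-gate check when it goes red.
CHECK_OWNER = {
    "security_floor": "feedback path",
    "no_new_findings": "feedback path",
    "architecture_no_regression": "originating operation",
    "refuted_excluded": "gate maintainers",   # assumption: a gate bug, not a code bug
    "sarif_reconciled": "gate maintainers",   # assumption: same reasoning
}


def route_failures(results: dict[str, bool]) -> dict[str, list[str]]:
    """Group failed checks by owner; an empty dict means the PR is mergeable."""
    routed: dict[str, list[str]] = {}
    for check, owner in CHECK_OWNER.items():
        # A missing result counts as a failure so a skipped check cannot pass silently.
        if not results.get(check, False):
            routed.setdefault(owner, []).append(check)
    return routed
```

Treating an absent result as red is a deliberate choice here: a gate that passes when a check did not run is the kind of leak that erodes trust in the number.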

Best practices, in plain English

Failure modes & gotchas

These have actually sunk CQ rollouts. Each has a fix nobody had time for until the gate got disabled.

The gotcha behind most of them: the gate's authority is only as good as its determinism and its memory. A flaky score or a leaky ledger doesn't just produce a bad run — it teaches the team the gate is wrong, and a gate the team overrides on reflex is worse than no gate at all.

Cost, blast radius, and the agent budget

CQ has to be cheaper than the failure it prevents, or it won't run on every push — and a quality gate that only runs nightly is just a slower audit. The design rule, borrowed straight from why CI could afford to run per-commit: N scans is a fixed-cost event, not a variable one.

The marginal cost of one more scan is approximately the cost of not answering an optional counter ping. That is the number that makes "on every push" affordable, which is the number that makes it continuous.

The maturity ladder

Most teams are advanced on one half and absent on the other — solid SAST in CI, nothing watching architecture; or a beautiful architecture audit that runs twice a year by hand. Tick what runs on every push today, not what you mean to wire up.

Zero to three: you have CI and a security scanner, and you're calling it quality. Four to six: a real gate, but the loop is open, with humans still translating every finding by hand. Seven to nine: the loop is closed; the agent that writes the code is the agent that fixes it, and the gate remembers. Ten: the score trends up on its own and your job is watching the chart.

A reasonable 30 / 60 / 90-day plan

  1. Days 1–30 — install the gate, report-only. Wire the scan into one repo's CI on every PR with blocking off. Establish the baseline Health Score, confirm it's deterministic across re-runs, and let SARIF populate the Security tab. You're measuring, not gating.
  2. Days 31–60 — make it block, on regressions only. Turn on the security floor and the new-finding gate; add the architecture no-regression delta against main. Stand up the refuted ledger so dismissals stick. You can now stop a PR that makes the codebase sicker, without holding the backlog hostage.
  3. Days 61–90 — close the loop. Expose the scan in the team's agent (over MCP) so quality shifts left; route the remediation bundle to a fix agent; wire an on-demand comment trigger for blast-radius scans. The agent that writes the code now fixes its own findings and re-scans, and the health-score trend becomes a number leadership watches.
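
The determinism check in the first 30 days can be as simple as hashing a canonical form of each run's output and comparing. A minimal sketch, assuming each run yields a bundle shaped like the remediation bundle shown earlier; `scan_digest` and `is_deterministic` are hypothetical helpers.

```python
import hashlib
import json


def scan_digest(bundle: dict) -> str:
    """Hash the parts of a scan result that must be identical for the same commit."""
    canonical = {
        "commit": bundle["commit"],
        "health": bundle["health"],
        # Sort by stable id so output ordering alone never changes the digest.
        "findings": sorted(bundle["findings"], key=lambda f: f["id"]),
    }
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_deterministic(runs: list[dict]) -> bool:
    """True when every re-run of the same commit produced the same digest."""
    return len({scan_digest(run) for run in runs}) <= 1
```

Re-running the scan a handful of times on one SHA and asserting `is_deterministic(runs)` before blocking is switched on gives the team evidence, rather than a promise, that the score is reproducible.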

How act101 online implements the loop

The loop is the canon; act101 online is one toolchain that ships all three parts as a single stage instead of a drawer of overlapping scanners you wire together yourself. It's a GitHub App plus Action, so it rides GitHub's own primitives — the Actions runner, SARIF code scanning, branch protection, the Copilot coding agent — rather than reinventing them. What it adds is the three things those primitives don't have on their own: a unified score, an agent-ready feedback path, and a ratchet.

The pattern is platform-agnostic; the unifying score, the remediation bundle, and the ratchet are the parts you'd otherwise hand-build on top of five overlapping scanners. You already have most of the primitives this rides on — GitHub Actions for the trigger, branch protection for the authority, SARIF code scanning for the alert surface, the Copilot coding agent for the fix path. What none of them do is the architecture sub-score or the agent feedback path, which is precisely the gap act101 online fills, in one gate, on day one. The maturity gain is in running the loop on every push, not in running more scanners that each red the gate for a different reason.

What Humble and Farley got right (and where CQ goes from here)

Continuous Delivery (Humble & Farley, 2010), building on Fowler's and Booch's continuous-integration work before it, made one durable argument: the path from commit to production should be a pipeline — automated, repeatable, and the single source of truth about whether a change is safe to ship. They were obsessed with shrinking the feedback loop, because a signal you get in minutes changes behavior and a signal you get next quarter changes nothing. They built the pipeline around the signals available at the time: does it build, do the tests pass, does it deploy. Those signals all measured behavior.

Continuous Quality is the same argument applied to the signal CI and CD never had — the health of the code, which agents degrade faster than any human ever could and which no test suite reports. CQ adds a stage to the same pipeline, demands the same properties (automated, per-commit, deterministic, fast), and extends the feedback loop one hop further than Humble and Farley could: not just back to the human who can read the result, but back to the agent that produced the change, in a form it can act on. Read Continuous Delivery alongside this playbook; the pipeline discipline is the load-bearing part, and CQ is what you get when you point that discipline at structure instead of behavior.

The shortest possible summary: gate every push on a deterministic health score, block the regressions a change introduces, hand the findings back to the agent as something it can fix, and remember every decision so the number only moves one way. CI made the build trustworthy. CD made the release trustworthy. CQ makes the agent's code trustworthy. Run it boringly on every commit for six months; that is the whole game.