How we measure
A rank on the act101 leaderboard is instrumented, signed, third-party proof — not a claim. This page is the arithmetic behind every number, so a #1 finish survives a skeptical thread.
The headline is the privacy rule: act101 NEVER ships your code off-site. We ship only opt-in usage metrics. Everything below describes how those metrics are measured — never estimated from a benchmark, never ratio-guessed.
1. The standard candle: a naive full-file agent
Token savings are a measured counterfactual, not a ratio estimate. For every operation, we measure two numbers:
- Naive bytes — what a naive agent would have read or written to acquire the same information: every touched file in full, before and after a change.
- Actual bytes — what act actually consumed or emitted for the operation.
The naive full-file agent is a standard candle that equalizes across coding agents regardless of their own shortcuts. Naivete is the feature: one rule, no caps, no special cases, no per-agent negotiation about what "would have" happened.
2. The accounting, per operation kind
| Operation kind | naive_bytes | actual_bytes |
|---|---|---|
| read / analyze | Σ full size of touched files | response bytes |
| write / refactor | Σ (before size + after size) per ChangeSet-affected file | response bytes + Σ edit-text bytes |
| no-file ops (status, docs, …) | = actual_bytes | response bytes |
Honest zero. A no-file operation records savings ≡ 0 — not a rounded-down estimate, an actual zero. An operation that touches no files has nothing to save, and we say so.
Whole-workspace operations count it all: the naive agent reads every file the operation could have touched, with no cap. Within one operation, a file counts once (a set keyed by path). Across operations, repeated touches count each time — the naive agent re-reads too; each operation is measured independently.
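As a rough illustration of the table above, the per-kind accounting could be sketched as follows. The `Operation` record, its field names, and the kind labels are hypothetical and chosen only for this example; they are not act's internal types.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    """One dispatched operation (hypothetical record shape)."""
    kind: str                      # "read", "write", or "no_file"
    response_bytes: int            # bytes returned for the operation
    # path -> full file size; keying by path counts each file once per operation
    touched: dict[str, int] = field(default_factory=dict)
    # path -> (before size, after size) for each ChangeSet-affected file
    changed: dict[str, tuple[int, int]] = field(default_factory=dict)
    edit_text_bytes: int = 0       # total bytes of edit text emitted


def account(op: Operation) -> tuple[int, int]:
    """Return (naive_bytes, actual_bytes) following the table above."""
    if op.kind == "read":
        naive = sum(op.touched.values())
        actual = op.response_bytes
    elif op.kind == "write":
        naive = sum(before + after for before, after in op.changed.values())
        actual = op.response_bytes + op.edit_text_bytes
    elif op.kind == "no_file":
        actual = op.response_bytes
        naive = actual             # honest zero: savings are exactly 0
    else:
        raise ValueError(f"unknown operation kind: {op.kind}")
    return naive, actual


read_op = Operation("read", response_bytes=1_200,
                    touched={"a.py": 8_000, "b.py": 4_000})
write_op = Operation("write", response_bytes=300,
                     changed={"a.py": (8_000, 8_100)}, edit_text_bytes=150)
status_op = Operation("no_file", response_bytes=500)

for op in (read_op, write_op, status_op):
    naive, actual = account(op)
    print(op.kind, naive, actual, naive - actual)
```

Because each call to `account` sees only one operation, a file touched by two separate operations is counted twice, which matches the "each operation is measured independently" rule.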
3. Bytes → tokens
tokens_saved = (naive_bytes − actual_bytes) / 4
Bytes are the stored unit, on disk and over the wire. The / 4 is applied only when a number is shown to you or uploaded — one conversion, stated plainly, visible everywhere a token count appears.
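A minimal sketch of that single conversion step, assuming integer floor division (the text above does not say how fractions are rounded); the function name is illustrative only.

```python
BYTES_PER_TOKEN = 4  # the fixed divisor stated above


def tokens_saved(naive_bytes: int, actual_bytes: int) -> int:
    """Convert a stored byte difference to tokens at display or upload time.

    Floor division is an assumption; the rounding rule is not specified.
    """
    return (naive_bytes - actual_bytes) // BYTES_PER_TOKEN


print(tokens_saved(12_000, 1_200))  # 2700
print(tokens_saved(500, 500))       # 0
```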
4. What is measured vs. estimated
Every savings number on the leaderboard is measured: the byte counts come from the actual files the operation touched and the actual bytes it emitted, recorded at the operation dispatch layer.
Historically, act shipped a ratio-based estimator (a benchmark table of per-operation savings ratios). That estimator and its table have been deleted, not bypassed. Older runs' estimated totals are preserved in a separate legacy_estimated block — rendered as a clearly-labeled line in act stats, never summed with measured numbers. One accounting truth, no mixing.
Nothing on the leaderboard is estimated. If a number can't be measured, it isn't shown.
5. What never leaves your machine
Raw operation records — which files, which paths, which projects — stay on your machine. Only an aggregate payload leaves, and only if you opt in at act onboarding. The uploaded payload carries:
- aggregate byte totals (naive_bytes, actual_bytes) and the derived token count,
- per-operation-name and per-grammar-name counts — product vocabulary, not your data — and
- the week bucket, a nonce, and a session count.
Never repo names, file paths, project hashes, per-project breakdowns, or code. There is no path from your source to the leaderboard.
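To make the boundary concrete, here is a hedged sketch of how raw local records might be collapsed into an aggregate payload. All key names, the week format, and the record shape are assumptions for illustration; the real payload schema is not given here.

```python
import json
from collections import Counter


def build_payload(records: list[dict], week: str, nonce: str, sessions: int) -> dict:
    """Collapse raw local records into an aggregate upload payload.

    Only totals and product-vocabulary counts are copied; the "path" and
    "project" keys of each raw record are never read.
    """
    naive = sum(r["naive_bytes"] for r in records)
    actual = sum(r["actual_bytes"] for r in records)
    return {
        "naive_bytes": naive,
        "actual_bytes": actual,
        "tokens_saved": (naive - actual) // 4,  # rounding is an assumption
        "operation_counts": dict(Counter(r["operation"] for r in records)),
        "grammar_counts": dict(Counter(r["grammar"] for r in records)),
        "week": week,
        "nonce": nonce,
        "session_count": sessions,
    }


records = [
    {"operation": "read", "grammar": "python", "path": "src/app.py",
     "project": "demo", "naive_bytes": 12_000, "actual_bytes": 1_200},
    {"operation": "write", "grammar": "python", "path": "src/app.py",
     "project": "demo", "naive_bytes": 16_100, "actual_bytes": 450},
]
payload = build_payload(records, week="2025-W01", nonce="example-nonce", sessions=1)
print(json.dumps(payload, indent=2))
```

The function builds the payload from an explicit allow-list of fields rather than by deleting sensitive keys from a copy, so a new local field cannot leak by default.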
6. Signed ingestion: fabrication is closed
Every upload is HMAC-signed with an upload token minted only at CLI onboarding. Server-side acceptance:
- Signature valid + nonce unseen — kills replay and fabrication. A payload not signed by a token we minted is rejected; a replayed nonce is rejected.
- Plausibility clamp per event and per day — a ceiling derived from measured maximums. A clamped event is still counted; it just can't be absurd. A cheater still plays; they just can't be silly.
- Weekly taper applied at the midnight rollup, not at ingest — full credit up to a generous heavy-use threshold, logarithmic above. Raw events stay honest in storage; the taper is presentation math, tunable without migration.
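The signature and nonce checks in the first rule could look roughly like the sketch below. HMAC-SHA256 over canonical JSON, the in-memory nonce set, and the function names are all assumptions for illustration; the text above specifies only that uploads are HMAC-signed with a minted token and that nonces must be unseen.

```python
import hashlib
import hmac
import json


def sign(payload: dict, upload_token: bytes) -> str:
    """Client side: HMAC-SHA256 over a canonical JSON encoding (assumed scheme)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(upload_token, body, hashlib.sha256).hexdigest()


def accept(payload: dict, signature: str, upload_token: bytes, seen_nonces: set) -> bool:
    """Server side: reject bad signatures and replayed nonces."""
    expected = sign(payload, upload_token)
    if not hmac.compare_digest(expected, signature):
        return False                      # not signed by a token we minted
    if payload["nonce"] in seen_nonces:
        return False                      # replayed payload
    seen_nonces.add(payload["nonce"])
    return True


token = b"minted-at-onboarding"           # placeholder value
payload = {"naive_bytes": 28_100, "actual_bytes": 1_650, "nonce": "n-1"}
seen: set = set()

first = accept(payload, sign(payload, token), token, seen)
replay = accept(payload, sign(payload, token), token, seen)
forged = accept({**payload, "nonce": "n-2"}, sign(payload, b"other"), token, seen)
print(first, replay, forged)              # True False False
```

`hmac.compare_digest` is used instead of `==` so the comparison time does not depend on how many leading characters match.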
The leaderboard's stance is explicit: legitimacy over cheat-prevention. Once fabrication is cryptographically closed, every remaining cheat requires actually using act — real scans, real sessions. A cheater is a power user and a billboard.
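The plausibility clamp and the weekly taper from the acceptance rules can be sketched as two small functions. The specific curve `threshold * (1 + ln(x / threshold))` and the example numbers are assumptions; the rules above say only "full credit up to a threshold, logarithmic above".

```python
import math


def clamp(value: float, ceiling: float) -> float:
    """Plausibility clamp: the event still counts, capped at the ceiling."""
    return min(value, ceiling)


def taper(weekly_total: float, threshold: float) -> float:
    """Full credit up to threshold, logarithmic growth above it.

    threshold * (1 + ln(x / threshold)) is one possible curve (an assumption);
    it meets the linear part at x == threshold with no jump. Assumes
    threshold > 0.
    """
    if weekly_total <= threshold:
        return weekly_total
    return threshold * (1.0 + math.log(weekly_total / threshold))


print(clamp(5_000_000, 1_000_000))       # 1000000
print(taper(500.0, 1_000.0))             # 500.0
print(taper(1_000.0 * math.e, 1_000.0))  # about 2000
```

Keeping `taper` as a pure function of the stored weekly total is what lets the threshold be retuned at rollup time without touching raw events.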
7. The other board: measured health improvement
The "Most Improved" board ranks public repos by AI-Code Health Score gain, and it is equally closed against gaming:
- High-water mark (HWM). counted_improvement = max(0, best_score_this_week − HWM_at_week_start). Poison-then-fix banks 0 by construction — self-sabotage is pointless, not forbidden, and no intent adjudication exists.
- Scoring epochs. The score formula is actively evolving. Each history row records its scoring_epoch; HWM comparisons are intra-epoch only. A formula change never mints fake improvements.
- 1,000-line floor. Repos under 1,000 non-blank lines are ineligible — stated on the board, not hidden.
- Diff-scoped scores never count. PR-mode scan scores are labeled diff-scoped and never compared to full-repo scores.
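A small sketch of the high-water-mark rule with per-epoch tracking follows. The `run_weeks` helper and its input shape are hypothetical, and the handling of the first week in a new epoch (it seeds the baseline and counts 0) is an assumption not stated above.

```python
def counted_improvement(best_score_this_week: float, hwm_at_week_start: float) -> float:
    """The formula from the list above."""
    return max(0.0, best_score_this_week - hwm_at_week_start)


def run_weeks(weekly_best: list[tuple[int, float]]) -> list[float]:
    """weekly_best: (scoring_epoch, best full-repo score) per week, in order.

    Keeps one HWM per epoch so comparisons stay intra-epoch. Assumption:
    the first week seen in an epoch only seeds the HWM and counts 0.
    """
    hwm: dict[int, float] = {}
    counted = []
    for epoch, best in weekly_best:
        if epoch not in hwm:
            hwm[epoch] = best
            counted.append(0.0)
            continue
        counted.append(counted_improvement(best, hwm[epoch]))
        hwm[epoch] = max(hwm[epoch], best)
    return counted


# Epoch 1: score 70, then poisoned to 40, then fixed to 72. Epoch 2: 90, then 93.
weeks = [(1, 70.0), (1, 40.0), (1, 72.0), (2, 90.0), (2, 93.0)]
print(run_weeks(weeks))  # [0.0, 0.0, 2.0, 0.0, 3.0]
```

In the example, dropping to 40 and recovering to 72 banks only the 2 points above the earlier mark of 70, and the jump to 90 under a new scoring epoch banks nothing.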
8. Rebuild cadence
Both boards rebuild once daily at 00:00 UTC — one cron computes tiers, streaks, achievements, rivalries, and the published JSON. The cadence is itself a published rule and a daily check-in ritual. If a rebuild fails, boards serve the previous day's data and the countdown shows "rebuilding…" — stale beats blank beats wrong.