Your AI says done.
Now it has to prove it.
Coding agents declare victory you can’t check. Harness.WTF is an evidence-gated harness: every finding ships a re-checkable proof bundle, a second independent method re-derives it, and the gate refuses “verified” until the proof actually re-runs. The brain is any model; the harness is the proof.
Works with Claude Code, Codex, and anything that reads AGENTS.md · MIT · zero dependencies
The three lies every agent tells
If you’ve run an AI on a real project, you’ve met all three.
LIE №1 — “almost there”
It runs forever.
2 a.m. The token meter is spinning. It’s “just refactoring one more thing” — on the task you asked for at 9. There is no finish line because nothing defines one.
LIE №2 — “all tests passing”
It says done. It isn’t.
Confident summary, green checkmarks in prose. You find out the way your users do — after it shipped. The tests were never run. The claim was free; the proof was optional.
LIE №3 — “trust me”
It asserts; it can’t prove.
A finding is prose you must re-verify by hand — and agents are measurably overconfident (some succeed 22% of the time yet predict 77%), even confabulating internally-consistent claims. The claim is free; the proof is the product.
The fix
We made lying mechanically impossible.
Every claim runs through one gate that no model can talk its way past. This is the real, unedited refusal when an agent submits a confident finding backed only by its own say-so:
verify_claim severity=critical confidence=0.99 evidence=[assertion: "trust me, it's a bug"] REFUSED: no grounded (re-checkable) evidence — only assertions. status → disputed, not verified. verified requires: re-runnable proof · independent second-method re-derivation · a logged disproof attempt
Honest statuses (disputed · rejected · partial) are always allowed. “Verified” is earned with proof, never declared.
How it works
The model proposes. The harness proves.
Any model, any CLI — one endpoint
Claude Code, Codex, or any agent calls the same harness over MCP (or the CLI). The brain is pluggable and swappable; the proof layer doesn’t care which model found it. Designed to survive model death.
A finding only reaches “verified” through the gate
The gate re-runs the proof in a fresh process (re-execution), demands an independent re-derivation by a different method (e.g. taint vs. cross-file data-flow vs. reachability — or a different model family when the brain is an LLM), requires a logged attempt to disprove it, and checks every piece of evidence traces to a real command. No proof → not verified.
Every finding ships a re-checkable evidence bundle
file:line · a failing→passing reproduction · the diff · what is not claimed · a confidence score — and a full, replayable audit log of every action. You verify it without trusting the model.
It gets better from usage — on its own
Every run deposits a lesson. Patterns proven across projects compound into the shared core automatically; project-specific learnings stay local. The engine improves between model releases — and is never allowed to rewrite its own grader unsupervised.
We eat what we cook
It proves — and disproves — its own work.
FALSIFIED
we killed our own founding claim — on the record
We theorized agents “rot” on long projects and our harness prevents it. A pre-registered, blind A/B (run-0003) showed a plain agent finished a 6-part build 100% with zero regressions. We published the falsification and pivoted. A proof page that only shows wins isn’t proof.
5 → 0
fair A/B tests, zero coding-speed wins — so we changed the product
Across five honest, pre-registered runs a wrapper didn’t make a frontier model code better — at equal compute, structured decomposition even lost. The model isn’t the bottleneck. Proof is. So we built the proof layer, not a faster coder. Raw data →
63 ✓
63 capabilities — every one backed by a passing proof
Not a list of promises: a capability genome where each entry points to a test that exists and runs. 275 tests across the un-bypassable gate, the Tool plane (bandit + semgrep across 5 languages, data-flow taint, cross-file source→sink, reachability), and the self-improvement loop. A capability you can’t point to a proof for isn’t one.
0.909 / 1.0
honest recall / precision — on a reproducible corpus
On 15 realistic, framework-grade repos (Python · JS · Java · Ruby): recall 0.909, precision 1.0
(one known gap: SSTI) — we don’t hide what we miss. And ~42% of raw-scanner noise cut: true
findings kept, dead-code and look-alikes dropped. Reproduce it: harness scorecard.
Get it
Open today. Substrate in active build.
Today (open source): the protocol + closeout gate — clone, install, your agent follows AGENTS.md. In active build: the v2 evidence substrate above — the verification gate, the model-agnostic MCP, the Tool plane (bandit + semgrep across 5 languages, data-flow taint, cross-file source→sink, reachability), auto-remediation, and the self-improvement loop — 63 capabilities each backed by a passing proof, 275 tests green. (Every claim on this page comes from it; the brain runs deterministic today — live-model autonomy needs your own API keys.)
git clone https://github.com/LookNoHandsMom/harness-wtf && export PATH="$PWD/harness-wtf:$PATH" cd your-project && harness-wtf init . Harness.WTF installed · tell your agent: “follow AGENTS.md” · it does the rest harness-wtf doctor . doctor: healthy