We trusted a second reader until we realised we’d never seen it refuse anything
For weeks we treated an automated reviewer like a trusted colleague. It sat between draft and publish, gave a score, and we let the scores guide decisions. Then someone asked a small, cold question: “When was the last time you saw it say no?” We couldn’t answer.
The test we ran
We decided to stop assuming and start proving. We fed the reviewer drafts we had deliberately written to be bad. One draft had invented numbers: a confidently precise paragraph that cited “42% growth” with no source. Another read like brochure copy, full of superlatives and no evidence. A third made a clear claim without any finding to support it.
The experiment was simple. If the reviewer truly checked for accuracy, clarity, and evidence, these drafts should fail. If it didn’t, we would see what it missed and why.
What happened, in order
Most of the bad drafts were flagged. The reviewer lowered scores on the brochure-like piece. It penalised the draft that had no finding. That felt like progress.
Then came the one that surprised us: the draft with an invented number. It was written with confidence. It quoted a precise statistic—no source, no method. The reviewer gave it a high score. It passed.
That was the moment it clicked. We could point to the exact draft, the line with “42% growth,” the timestamp in the log, and the approval. The machine had treated the confident-sounding falsehood as if it were legitimate.
The rule that emerged
A reviewer you have never seen say no is not a reviewer; it is a rubber stamp. You only get a reviewer when you have watched it reject something it should.
How we proved the rule
We repeated the test after adjusting the reviewer. Each time we changed thresholds or added new checks, we fed the same known-bad drafts through again. When the reviewer actually refused one of those drafts, we marked that configuration as usable.
We made it a habit. Whenever the reviewer was updated, we ran the bad-draft set. It became our smoke-alarm test: press the button to be sure the alarm works.
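A minimal sketch of that habit in Python follows. It is an illustration only: the `review` callable, the 0-to-1 score scale, the `PASS_THRESHOLD` value, and the stand-in `tone_only_reviewer` are all assumptions made for the example, not the reviewer described in this post.

```python
from typing import Callable, Dict, List

# Hypothetical cutoff: scores at or above this count as "approved".
PASS_THRESHOLD = 0.7


def find_wrongly_approved(
    review: Callable[[str], float],
    seeded_failures: Dict[str, str],
    threshold: float = PASS_THRESHOLD,
) -> List[str]:
    """Return the names of known-bad drafts that the reviewer approved."""
    wrongly_approved = []
    for name, text in seeded_failures.items():
        score = review(text)
        if score >= threshold:
            wrongly_approved.append(name)
    return wrongly_approved


def tone_only_reviewer(text: str) -> float:
    """Stand-in reviewer that only penalises superlatives (illustrative)."""
    superlatives = ("best", "amazing", "unrivalled")
    hits = sum(word in text.lower() for word in superlatives)
    return max(0.0, 0.9 - 0.3 * hits)


# Two seeded failures in the spirit of the drafts described above.
SEEDED_FAILURES = {
    "brochure_copy": "The best, most amazing, unrivalled product ever.",
    "invented_number": "Revenue saw 42% growth last quarter.",
}

if __name__ == "__main__":
    missed = find_wrongly_approved(tone_only_reviewer, SEEDED_FAILURES)
    print("Wrongly approved:", missed)
```

With this stand-in, the brochure copy is rejected but the invented number is approved, which is the same shape of gap we found. An empty result is the only outcome that counts as a working alarm.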
Three concrete findings from our work
1) As far as we had seen, the reviewer we relied on had never rejected anything until we forced it to. We had trusted scores for weeks without ever watching a single rejection happen.
2) Deliberately bad drafts exposed gaps quickly. The brochure-like draft and the one with no finding were caught, showing the reviewer could handle some problems but not all.
3) Confident false facts can slip through. The invented-number draft passed with a high score because the reviewer did not check facts against a known source or fixed database, only against tone and structure.
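One way to narrow the gap in the third finding is a crude pre-check that flags statistics with no visible source. The sketch below is a heuristic under stated assumptions, not a fact checker: the citation markers it looks for (a bracketed number, “according to”, “source:”, or a URL) are hypothetical conventions, and it can only detect a missing citation, not whether a number is true.

```python
import re
from typing import List

# Matches percentages such as "42%" or "3.5 %".
PERCENT_RE = re.compile(r"\b\d+(?:\.\d+)?\s?%")

# Hypothetical citation markers; adapt these to your own house style.
CITATION_RE = re.compile(
    r"\[\d+\]|according to|source:|https?://", re.IGNORECASE
)


def unsourced_statistics(text: str) -> List[str]:
    """Return sentences that contain a percentage but no citation marker."""
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [
        s for s in sentences
        if PERCENT_RE.search(s) and not CITATION_RE.search(s)
    ]


if __name__ == "__main__":
    draft = (
        "We saw 42% growth last year. "
        "Churn fell 3.5% according to the annual report."
    )
    for sentence in unsourced_statistics(draft):
        print("Needs a source:", sentence)
```

A flag from a check like this is a prompt for a human to ask where the number came from. It does not replace comparing the claim against a known source.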
A fair counter-example
We also saw cases where the reviewer refused drafts that were actually fine. A well-researched piece with unusual wording was downgraded because the check weighed style against a rigid template. That taught us the reviewer can be overzealous; it will say no to things that actually deserve a yes.
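Because a reviewer can be wrong in both directions, it can help to run known-good drafts alongside the known-bad ones and count each kind of error separately. A minimal sketch, assuming a hypothetical `review` callable that returns a score and a hypothetical approval threshold of 0.7:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class CalibrationResult:
    false_approvals: int   # known-bad drafts the reviewer approved
    false_rejections: int  # known-good drafts the reviewer rejected


def calibrate(
    review: Callable[[str], float],
    labelled_drafts: Iterable[Tuple[str, bool]],
    threshold: float = 0.7,  # hypothetical approval cutoff
) -> CalibrationResult:
    """Count errors in both directions over (text, is_good) pairs."""
    false_approvals = 0
    false_rejections = 0
    for text, is_good in labelled_drafts:
        approved = review(text) >= threshold
        if approved and not is_good:
            false_approvals += 1
        elif not approved and is_good:
            false_rejections += 1
    return CalibrationResult(false_approvals, false_rejections)
```

Tracking the two counts separately matters when tuning: tightening a threshold to catch more bad drafts may push more good ones into the rejected pile, and a single combined score would hide that trade.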
What changed after we accepted the rule
We stopped treating scores as final. Scores became signals that needed verification. We built a short set of seeded failures—bad drafts we keep in a folder—and ran them whenever the reviewer was tweaked. If the seeded failures passed, we didn’t push the new version live.
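A release gate along these lines could look something like the following. It is a sketch under assumptions: the folder layout (one `.txt` file per seeded failure), the `review` callable, and the threshold are all hypothetical, and a real pipeline would wire the return value into its own deploy step.

```python
from pathlib import Path
from typing import Callable, List


def seeded_failures_that_pass(
    folder: Path,
    review: Callable[[str], float],
    threshold: float = 0.7,  # hypothetical approval cutoff
) -> List[str]:
    """Return file names of seeded bad drafts that the reviewer approved."""
    passed = []
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if review(text) >= threshold:
            passed.append(path.name)
    return passed


def gate_release(folder: Path, review: Callable[[str], float]) -> int:
    """Return 0 if every seeded failure is rejected, 1 otherwise."""
    if not any(folder.glob("*.txt")):
        # An empty folder proves nothing, so treat it as a failure too.
        print("No seeded failures found; refusing to release.")
        return 1
    passed = seeded_failures_that_pass(folder, review)
    for name in passed:
        print(f"Seeded failure was approved: {name}")
    return 1 if passed else 0
```

Passing the result to `sys.exit` in a deploy script would stop the release whenever a known-bad draft gets through. Treating an empty folder as a failure is a deliberate choice here: a gate with no seeded drafts would otherwise pass without testing anything.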
We also documented the kinds of checks the reviewer actually did. It turned out to be strong on structure and tone, weak on factual cross-checks. That clarity changed how we used it: for style triage, yes; for fact verification, no.
Why calibration has a cost
Building bad drafts on purpose takes time. We created plausible-sounding false numbers, realistic brochure language, and hollow findings. That effort slowed us down. We suspect many teams skip that work, and that this is why an automated check can quietly pass everything: nobody ever forces it to say no.
How we know this is generally useful
We repeated the process across several review cycles and different content types. The pattern held. A check that we had never seen reject anything rarely rejected anything meaningful once we tested it. When we audited a check with seeded failures, the drafts it missed told us what it could and couldn’t do. The logs, timestamps, and the single invented-number draft are our proof.
FAQ
Why not just trust the scores if they look reasonable?
Because reasonable-looking scores can hide systematic gaps. A score measures what the reviewer checks. If it doesn’t check facts against fixed references, a high score doesn’t mean the facts are right.
How many seeded failures do you need?
Enough to exercise the kinds of checks you rely on. We used five to start: tone, structure, missing finding, invented number, and bad sourcing. Add or swap cases as your content changes.
Isn’t this extra work slowing us down?
Yes. It costs time to build and run seeded failures. But it’s cheaper than publishing false claims or letting a rubber stamp approve bad work.
Trust tools only after you’ve watched them refuse something they should. Scores are helpful. A refusal you have seen is indispensable.
Sources: our content pipeline tests and repeated reviewer audits in ebizapple’s content operations.
